RAG from 0 to 1: Let LLMs Answer Based on Your Data
Enterprise AI applications are 90% RAG. This article walks you through building a runnable RAG system—from chunking to deployment.
L0-04 covered why RAG is the most practical antidote to LLM hallucination, and the RAG visualization showed you the full process.
This article opens up each step so you can build a working system from scratch in about 100 lines of Python.
Why RAG
A bare LLM has two big problems:
- Knowledge cutoff: training data ends at some date, and the model knows nothing after it
- No access to your private data: your company docs, your customer emails, your wiki
How does RAG solve this?
Retrieval → Augmented → Generation
Treat the LLM as a capable intern who reads the materials and then writes in plain language. Don't ask it to answer from memory; hand it the materials so it reads first, then answers.
The Full Process (2 phases)
1. Indexing phase (offline, rerun when the docs change)
Your docs → chunks → vectorize → store in vector DB
2. Query phase (every time the user asks)
User question → vectorize → find similar in vector DB → take top-K → pack into prompt → LLM answers
Step by Step
Step 1: Prepare Documents
Assume you have an enterprise wiki:
docs/
├── product_intro.md
├── pricing.md
├── faq.md
├── tutorial.md
└── ...
Read into Python:
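```python
import os

documents = []
for filename in sorted(os.listdir('docs/')):
    path = os.path.join('docs', filename)
    if not os.path.isfile(path):
        continue  # skip subdirectories
    with open(path, 'r', encoding='utf-8') as f:
        documents.append({
            'source': filename,
            'text': f.read()
        })
```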
Step 2: Chunking
Why not just send the whole article to the LLM?
- LLM context is limited (128k tokens for GPT-4 Turbo, for example)
- Too much content wastes money per call
- “Finding precise small passages” beats “sending entire articles” for retrieval accuracy
Chunking strategies (simplest to hardest):
A. Fixed-length chunking
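```python
def fixed_chunk(text, chunk_size=500, overlap=50):
    """Split text into chunk_size-character windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
    return chunks
```
Simple, but it can cut paragraphs and sentences in half.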
B. Paragraph chunking
```python
def paragraph_chunk(text):
    # Split on blank lines and drop empty pieces
    return [p.strip() for p in text.split('\n\n') if p.strip()]
```
Smarter but paragraphs aren’t uniform length.
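One way to even out paragraph lengths is to pack short neighbors together until a size budget is reached. The sketch below is only an illustration of that idea; `pack_paragraphs` and `max_chars` are names made up here, not part of any library.

```python
def pack_paragraphs(text, max_chars=500):
    """Greedily merge consecutive paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first paragraph of a new chunk
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Oversized paragraphs still come through unsplit, which is the gap recursive chunking closes.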
C. Recursive chunking (practical choice)
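LangChain's RecursiveCharacterTextSplitter tries paragraph breaks first, then line breaks, sentences, words, and finally single characters.
```python
# Newer LangChain releases ship splitters in the langchain-text-splitters package;
# older ones use `from langchain.text_splitter import RecursiveCharacterTextSplitter`
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)

# Split every document and collect the chunks in document order
chunks = []
for doc in documents:
    chunks.extend(splitter.split_text(doc['text']))
```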
D. Semantic chunking (advanced)
Use embeddings to determine “which sentences belong together.” Best quality, hardest to implement.
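To make the idea concrete, here is a minimal sketch: embed each sentence, then start a new chunk wherever the similarity between neighbors drops below a threshold. Everything here is an assumption for illustration. `toy_embed` is a bag-of-words stand-in so the example runs without a model, and the `threshold` value is arbitrary; in a real system you would pass in a real embedding function and tune the threshold.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Stand-in embedding: bag-of-words counts (demo only, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunk(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk whenever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) >= threshold:
            current.append(cur)
        else:
            chunks.append(" ".join(current))
            current = [cur]
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "The basic plan costs ten dollars.",
    "The pro plan costs thirty dollars.",
    "Reset your password from the login page.",
]
chunks_demo = semantic_chunk(sentences)  # pricing sentences stay together, password help splits off
```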
As a rough rule of thumb, about 70% of a RAG system's quality comes from the chunking strategy, and it is the most overlooked engineering detail.
Step 3: Embedding
Turn each chunk into a vector.
With OpenAI Embeddings
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text):
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return resp.data[0].embedding  # 1536-dim vector

vectors = [embed(chunk) for chunk in chunks]
```
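With Open-Source (no API cost)
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-large-en-v1.5')
# encode() returns a NumPy array; convert to plain lists to match the OpenAI path
vectors = model.encode(chunks).tolist()
```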
Top embedding models (2026):
| Model | Dim | Strength |
|---|---|---|
| OpenAI text-embedding-3 | 1536/3072 | General, strongest in English |
| Cohere embed-multilingual-v3.0 | 1024 | Multilingual |
| BGE (BAAI) | 1024 | Best open-source |
| Voyage AI | 1024 | Long text |
Choosing the embedding model matters more than choosing the LLM: a RAG system's retrieval precision depends first on the embedding model.
Step 4: Store in Vector DB
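Scoring every stored vector with numpy on every query costs time proportional to corpus size. That is fine for a small prototype, but it gets too slow as the corpus grows. At that point you need a vector database.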
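To make the cost concrete, here is a minimal brute-force sketch in pure Python. The function names are illustrative, not from any library. Every query touches every stored vector, which is exactly the linear scan that a vector database's index is built to avoid.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def brute_force_top_k(query_vec, vectors, k=5):
    """Score every stored vector against the query: O(N * dim) work per query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]  # indices of the k most similar vectors

stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nearest = brute_force_top_k([1.0, 0.1], stored, k=2)
```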
Choices
| DB | Type | When to use |
|---|---|---|
| FAISS (Facebook) | Local lib | Prototype, single machine |
| ChromaDB | Local/remote | Small to medium |
| Pinecone | Cloud | Production, managed |
| Weaviate | Self/cloud | Complex queries |
| Qdrant (Rust) | Self/cloud | Best performance |
| Milvus | Self-host | Massive scale |
| pgvector (Postgres extension) | Existing PG | No new dependency |
Get Started with ChromaDB
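```python
import chromadb

# Named chroma_client so it does not overwrite the OpenAI `client` from Step 3
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("my_docs")

def chunks_of(doc):
    """Chunks produced from one document, using the Step 2 splitter."""
    return splitter.split_text(doc['text'])

# Add chunks, vectors, and one metadata entry per chunk.
# Assumes `chunks` holds every document's chunks in document order.
collection.add(
    documents=chunks,
    embeddings=vectors,
    metadatas=[{'source': doc['source']} for doc in documents for _ in chunks_of(doc)],
    ids=[f"chunk_{i}" for i in range(len(chunks))]
)
```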
Step 5: Retrieve
When the user asks:
```python
def retrieve(query, k=5):
    query_vec = embed(query)
    results = collection.query(
        query_embeddings=[query_vec],
        n_results=k
    )
    # One result list per query embedding; we sent one, so take the first
    return results['documents'][0]
```
Step 6: (Optional, highly recommended) Rerank
Vector search ranking is approximate: the top-1 hit may be less relevant than the fifth.
Add a rerank model: retrieve a larger top-N, score each candidate with a more precise (and slower) model, then keep the most relevant top-K:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('BAAI/bge-reranker-large')

def rerank(query, candidates, top_k=3):
    # Score each (query, candidate) pair jointly, then sort by score
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```
The classic setup: retrieve top-20 → rerank → keep top-3. This often improves RAG accuracy noticeably.
Step 7: Build prompt + Call LLM
```python
def rag_answer(query):
    candidates = retrieve(query, k=20)
    top_chunks = rerank(query, candidates, top_k=3)
    # Number the chunks so the model can refer to them
    context = "\n\n".join(f"[{i+1}] {chunk}" for i, chunk in enumerate(top_chunks))
    prompt = f"""Please answer the user's question based on the materials below.
If the materials don't contain relevant information, clearly say "I don't know."
【Materials】
{context}
【User Question】
{query}
【Your Answer】"""
    # `client` is the OpenAI client from Step 3, so use a chat model that endpoint serves
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```
Complete RAG—about 100 lines of code.
Real Deployment “Gotchas”
RAG is easy in theory; in production it is all details:
1. Decomposing Complex Queries
A user asks "compare product A and product B on value for money." Retrieving for "product A" or "product B" alone is not enough. Decompose the query with an LLM first, retrieve for each sub-query separately, then synthesize.
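The decompose-then-retrieve flow can be sketched roughly like this. The `decompose` and `retrieve` callables are passed in, so the function does not care how they work. In a real system `decompose` would be an LLM call; here both are hypothetical stubs with made-up data so the example runs on its own.

```python
def retrieve_decomposed(query, decompose, retrieve, k=3):
    """Split a query into sub-queries, retrieve for each, and merge without duplicates."""
    merged, seen = [], set()
    for sub_query in decompose(query):
        for chunk in retrieve(sub_query, k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

def fake_decompose(query):
    # Hypothetical stand-in for an LLM call that returns sub-questions
    return ["Product A pricing", "Product B pricing"]

FAKE_INDEX = {
    "Product A pricing": ["A costs $10/month.", "Both plans include support."],
    "Product B pricing": ["B costs $25/month.", "Both plans include support."],
}

def fake_retrieve(sub_query, k):
    # Hypothetical stand-in for the vector search step
    return FAKE_INDEX.get(sub_query, [])[:k]

context_chunks = retrieve_decomposed(
    "Compare product A and product B on value for money",
    fake_decompose,
    fake_retrieve,
)
```

The merged chunks then go into a single prompt, so the model sees material on both products when it writes the comparison.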
2. Metadata Filtering
"What was our 2024 policy?" should not pull in results from 2023 or 2022. The vector DB needs to support metadata filtering (by year, department, and so on).
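Conceptually, metadata filtering means "drop what doesn't match, then rank what's left." A minimal in-memory sketch, where the record layout and the precomputed `similarity` field are assumptions made for the demo:

```python
def search_with_filter(records, score, k=3, **required):
    """Drop records whose metadata does not match, then rank the rest by score."""
    allowed = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in required.items())
    ]
    return sorted(allowed, key=score, reverse=True)[:k]

records = [
    {"text": "2023 travel policy", "metadata": {"year": 2023, "department": "HR"}, "similarity": 0.91},
    {"text": "2024 travel policy", "metadata": {"year": 2024, "department": "HR"}, "similarity": 0.88},
    {"text": "2024 security policy", "metadata": {"year": 2024, "department": "IT"}, "similarity": 0.40},
]

# The 2023 record has the highest similarity but is excluded by the filter
hits = search_with_filter(records, score=lambda r: r["similarity"], year=2024)
```

In practice the database does this for you. Chroma's `collection.query`, for example, accepts a `where` argument such as `where={"year": 2024}`, provided you stored that field in `metadatas` at indexing time.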
3. Hybrid Search
Vector search is great at semantic similarity but weak on exact keywords (SKUs, names, jargon). Fix: run BM25 (term matching) alongside vector search and fuse the results.
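The text above doesn't prescribe a fusion method. One common choice is Reciprocal Rank Fusion (RRF), which only needs each retriever's ranked list, not comparable scores. A minimal sketch, with made-up document ids and the commonly used constant `k=60` as an assumption:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids; higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

bm25_ranking = ["sku-123", "faq", "pricing"]         # hypothetical keyword results
vector_ranking = ["pricing", "sku-123", "tutorial"]  # hypothetical semantic results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever still makes the fused list.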
4. Chunks Too Small or Too Big
Too small (100 chars) → context lost. Too big (2000 chars) → key info diluted. A common starting point: 300-800 chars with a 50-char overlap, then tune on your own data.
5. Citations and Traceability
After the LLM answers, show "this answer comes from chunk X" so users can trace it back.
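Assuming the prompt asks the model to tag claims with source numbers such as [1] [2], a small sketch like this can map the tags back to source files. `extract_citations` and the sample data are illustrative names, not part of any library.

```python
import re

def extract_citations(answer, sources):
    """Map [n] tags in the answer to entries of `sources` (1-based), skipping invalid numbers."""
    cited = []
    for match in re.findall(r"\[(\d+)\]", answer):
        index = int(match)
        if 1 <= index <= len(sources) and sources[index - 1] not in cited:
            cited.append(sources[index - 1])
    return cited

sources = ["faq.md", "pricing.md", "tutorial.md"]  # one entry per chunk placed in the prompt
answer = "The pro plan is $30 [2]. Refunds take 5 days [1][2]. See [7]."
cited_sources = extract_citations(answer, sources)
```

Out-of-range numbers like `[7]` are dropped rather than trusted, since the model can invent citation tags.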
# In prompt, require model to cite
"Please tag claims with source numbers like [1] [2]"
6. Performance
- Cache: repeated queries
- Async: parallel retrieval and prompt assembly
- Streaming: LLM streams as it generates
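The first two items can be sketched with the standard library alone. `slow_rag_answer` below is a hypothetical stand-in for the full pipeline, and the call counter exists only to make cache hits visible; streaming depends on your LLM client and is left out.

```python
import asyncio
from functools import lru_cache

CALLS = {"count": 0}

def slow_rag_answer(query):
    """Hypothetical stand-in for the full retrieve + rerank + LLM pipeline."""
    CALLS["count"] += 1
    return f"answer to: {query}"

@lru_cache(maxsize=1024)
def cached_rag_answer(normalized_query):
    return slow_rag_answer(normalized_query)

def answer(query):
    # Normalize case and whitespace so near-identical questions share one cache entry
    return cached_rag_answer(" ".join(query.lower().split()))

async def retrieve_parallel(sub_queries, retrieve_async):
    """Run several retrievals concurrently and return results in input order."""
    return await asyncio.gather(*(retrieve_async(q) for q in sub_queries))
```

An exact-match cache like this only helps with literally repeated questions; when the underlying documents change, clear it with `cached_rag_answer.cache_clear()`.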
7. Evaluation
How do you know your RAG is good?
- Recall: are relevant chunks retrieved (need ground truth)
- Generation quality: human eval / LLM eval (use GPT-4 as judge)
- Response time: how long from question to answer
- Cost: $ per query
Recommended: RAGAS, an open-source RAG evaluation framework.
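The recall metric is simple enough to compute by hand before reaching for a framework. A minimal sketch, assuming you have labeled which chunk ids are relevant for each test question (the ids below are made up):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)

def mean_recall_at_k(eval_set, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    if not eval_set:
        return 0.0
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in eval_set) / len(eval_set)

eval_set = [
    (["c3", "c7", "c1", "c9"], ["c1", "c2"]),  # finds 1 of 2 relevant chunks in the top 3
    (["c5", "c4"], ["c5"]),                    # finds the only relevant chunk
]
score = mean_recall_at_k(eval_set, k=3)
```

Tracking this number while you change chunk size, embedding model, or reranker tells you whether retrieval actually improved.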
Tooling Ecosystem
Don't want to write it from scratch? Use a framework:
| Tool | Note |
|---|---|
| LangChain | Most popular, biggest ecosystem, but verbose |
| LlamaIndex | RAG-focused, elegant API |
| Haystack (deepset) | Mature, enterprise-grade |
| Verba (Weaviate) | UI included |
| Dify | Low-code, visual |
Getting started: write it once with LangChain or LlamaIndex to understand each step. Production: pick by team and scenario; there is no one-size-fits-all.
RAG in Real Business
Enterprise AI apps are 90% RAG.
- Customer service bots: retrieve FAQs + ticket history
- Legal assistants: retrieve statutes + case law
- Medical consultation: retrieve clinical guidelines + records
- Company AI knowledge base: employees ask “how do we do X here”
- AI writing assistants: retrieve brand guidelines + past articles
From POC to launch, roughly 90% of the engineering effort goes into RAG: chunking strategy, embedding choice, rerank, prompt tuning, evaluation.
In 2026:
- Almost all enterprise AI projects are RAG-based
- Almost all AI startups are building "RAG for industry X"
- Big companies use more complex variants (GraphRAG, Agentic RAG)
Learning RAG = learning LLM application engineering.
Next: “LoRA Fine-tuning Basics” — RAG lets LLMs “know” your data; fine-tuning makes the LLM “become” what you want.