L4 Chapter 3 🐣 🕒 14 min

RAG from 0 to 1: Let LLMs Answer Based on Your Data

Enterprise AI applications are 90% RAG. This article walks you through building a runnable RAG system—from chunking to deployment.

HelloAI Editors

7/5/2026

L0-04 covered: RAG is the most practical antidote to LLM hallucination.

The RAG visualization showed you the full process. This article opens up each step—build a runnable RAG system from scratch.

You can use this to build one in 100 lines of Python.

Why RAG

A bare LLM has two big problems:

Knowledge cutoff: training data ends at some date, and the model knows nothing after it
No access to your private data: your company docs, your customer emails, your wiki

How does RAG solve this?

Retrieval → Augmented → Generation

Treat the LLM as a “capable intern” who can read materials and write fluent prose. Don’t have it answer from memory; give it the materials so it reads first, then answers.

The Full Process (2 phases)

1. Indexing phase (offline, once)
   Your docs → chunks → vectorize → store in vector DB

2. Query phase (every time user asks)
   User question → vectorize → find similar in vector DB → take top-K → 
   pack into prompt → LLM answers

Step by Step

Step 1: Prepare Documents

Assume you have an enterprise wiki:

docs/
├── product_intro.md
├── pricing.md
├── faq.md
├── tutorial.md
└── ...

Read into Python:

import os

documents = []
for filename in sorted(os.listdir('docs/')):
    path = os.path.join('docs', filename)
    if not os.path.isfile(path):
        continue  # skip subdirectories
    with open(path, 'r', encoding='utf-8') as f:
        documents.append({
            'source': filename,
            'text': f.read()
        })

Step 2: Chunking

Why not just send the whole article to the LLM?

LLM context windows are limited (for example, 128k tokens for GPT-4 Turbo)
More tokens per call means more cost per call
“Finding precise small passages” beats “sending entire articles” for retrieval accuracy

Chunking strategies (simplest to hardest):

A. Fixed-length chunking

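def fixed_chunk(text, chunk_size=500, overlap=50):
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each chunk starts `step` characters after the previous one
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
    return chunks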

Simple, but it can cut sentences and paragraphs in the middle.

B. Paragraph chunking

def paragraph_chunk(text):
    return [p.strip() for p in text.split('\n\n') if p.strip()]

Smarter, but paragraph lengths vary widely, so some chunks end up tiny and others huge.
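One way to even out paragraph lengths is to merge short neighbors until a size budget is reached. The sketch below is only an illustration: `pack_paragraphs` and `max_chars` are hypothetical names, not from any library, and a paragraph longer than the budget is kept whole rather than split further.

```python
def pack_paragraphs(text, max_chars=500):
    """Greedily merge consecutive paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole here; a real
    pipeline would split it further (for example with fixed-length chunking).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        candidate = f"{current}\n\n{p}" if current else p
        if len(candidate) <= max_chars or not current:
            current = candidate        # still fits (or is the first paragraph)
        else:
            chunks.append(current)     # budget exceeded: close the chunk
            current = p
    if current:
        chunks.append(current)
    return chunks


sample = "Short intro.\n\nPricing starts at $10.\n\n" + "x" * 40
for chunk in pack_paragraphs(sample, max_chars=40):
    print(repr(chunk))
```

With a 40-character budget, the first two short paragraphs are merged into one chunk and the long third paragraph becomes its own chunk.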

C. Recursive chunking (practical choice)

LangChain’s RecursiveCharacterTextSplitter: try paragraphs first, then sentences, then characters.

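# pip install langchain-text-splitters
# (older LangChain releases exposed this class as langchain.text_splitter)
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)

# Split every document and collect the chunks in document order
chunks = []
for doc in documents:
    chunks.extend(splitter.split_text(doc['text']))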

D. Semantic chunking (advanced)

Use embeddings to determine “which sentences belong together.” Best quality, hardest to implement.
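The idea can be sketched in a few lines: embed each sentence, compare neighbors, and start a new chunk where similarity drops. Everything below is an illustrative assumption. `toy_embed` is a bag-of-words stand-in so the sketch runs offline (a real system would pass in an actual embedding model), and the `threshold` value is arbitrary.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding: a bag-of-words count vector (NOT a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunk(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk wherever adjacent sentences are less similar than threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev_vec, vec) < threshold:
            chunks.append([])          # topic shift: open a new chunk
        chunks[-1].append(sentence)
        prev_vec = vec
    return [" ".join(c) for c in chunks]


sentences = [
    "The pro plan costs 20 dollars per month.",
    "The team plan costs 50 dollars per month.",
    "To reset your password open the settings page.",
    "The settings page also lets you change your password.",
]
for chunk in semantic_chunk(sentences):
    print("-", chunk)
```

Here the two pricing sentences end up in one chunk and the two password sentences in another, because the only word shared across the boundary is “the.”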

As a rule of thumb, around 70% of a RAG system’s quality comes from the chunking strategy, yet it is the most overlooked engineering detail.

Step 3: Embedding

Turn each chunk into a vector.

With OpenAI Embeddings

from openai import OpenAI
client = OpenAI()

def embed(text):
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return resp.data[0].embedding   # 1536-dim vector

vectors = [embed(chunk) for chunk in chunks]

With Open-Source (no $ cost)

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
# 1024-dim vectors; queries must be embedded with this same model at retrieval time
vectors = model.encode(chunks)

Top embedding models (2026):

Model	Dim	Strength
OpenAI text-embedding-3	1536/3072	General, strongest in English
Cohere embed-multilingual-v3	1024	Multilingual
BGE (BAAI)	1024	Best open-source
Voyage AI	1024	Long text

Choosing the embedding model matters more than choosing the LLM: a RAG system’s retrieval precision depends on the embedding model first.

Step 4: Store in Vector DB

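Brute-force cosine similarity in numpy is fine for a prototype with tens of thousands of chunks, but it scans every vector on every query, so it slows down as the corpus grows, and it gives you no persistence or metadata filtering. That is why you need a vector database.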
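To make concrete what a vector database replaces, here is a minimal sketch of exhaustive search in plain Python. The function names and the three-dimensional vectors are made up for illustration. The cost grows linearly with the number of stored vectors, which is the part that approximate nearest-neighbor indexes in vector databases avoid.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_search(query_vec, vectors, k=5):
    """Score the query against every stored vector; return the k best (index, score) pairs."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Tiny made-up 3-dimensional vectors; real embeddings have hundreds of dimensions.
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(brute_force_search([1.0, 0.0, 0.0], toy_vectors, k=2))
```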

Choices

DB	Type	When to use
FAISS (Facebook)	Local lib	Prototype, single machine
ChromaDB	Local/remote	Small to medium
Pinecone	Cloud	Production, managed
Weaviate	Self/cloud	Complex queries
Qdrant (Rust)	Self/cloud	Best performance
Milvus	Self-host	Massive scale
pgvector (Postgres extension)	Existing PG	No new dependency

Get Started with ChromaDB

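import chromadb

# Named chroma_client so it does not overwrite the OpenAI `client` used by embed()
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection("my_docs")

# One metadata entry per chunk; assumes `chunks` came from running
# `splitter` over `documents` in this same order
metadatas = [
    {'source': doc['source']}
    for doc in documents
    for _ in splitter.split_text(doc['text'])
]

# Add
collection.add(
    documents=chunks,
    embeddings=vectors,
    metadatas=metadatas,
    ids=[f"chunk_{i}" for i in range(len(chunks))]
)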

Step 5: Retrieve

When user asks:

def retrieve(query, k=5):
    query_vec = embed(query)
    results = collection.query(
        query_embeddings=[query_vec],
        n_results=k
    )
    return results['documents'][0]

Step 6: (Optional, highly recommended) Rerank

The retrieved top-K is not always well ordered: the result ranked first may be less relevant than the one ranked fifth.

Add a rerank model—use a more precise model to score top-N, pick the truly most relevant top-K:

from sentence_transformers import CrossEncoder
reranker = CrossEncoder('BAAI/bge-reranker-large')

def rerank(query, candidates, top_k=3):
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]

A classic setup: retrieve the top 20, rerank, keep the top 3. This usually improves RAG accuracy noticeably.

Step 7: Build prompt + Call LLM

def rag_answer(query):
    candidates = retrieve(query, k=20)
    top_chunks = rerank(query, candidates, top_k=3)

    context = "\n\n".join([f"[{i+1}] {chunk}" for i, chunk in enumerate(top_chunks)])
    prompt = f"""Please answer the user's question based on the materials below.
If the materials don't contain relevant information, clearly say "I don't know."

【Materials】
{context}

【User Question】
{query}

【Your Answer】"""

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model your OpenAI `client` can access
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

Complete RAG—about 100 lines of code.

Real Deployment “Gotchas”

RAG is easy in theory; in production it is all details:

1. Decomposing Complex Queries

A user asks “compare the price-performance of product A and product B.” Retrieving for “product A” or “product B” alone is not enough. Decompose the query with an LLM first, retrieve for each sub-query separately, then synthesize.
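The decompose, retrieve, merge flow can be sketched as follows. The names `decompose_and_retrieve`, `split_fn`, and `retrieve_fn` are hypothetical, and the two `fake_*` functions are offline stand-ins: in a real system `split_fn` would prompt an LLM and `retrieve_fn` would be the vector search from Step 5.

```python
def decompose_and_retrieve(question, split_fn, retrieve_fn, k=3):
    """Split a complex question into sub-questions, retrieve for each, merge without duplicates."""
    merged, seen = [], set()
    for sub_question in split_fn(question):
        for chunk in retrieve_fn(sub_question, k):
            if chunk not in seen:      # the same chunk can match several sub-questions
                seen.add(chunk)
                merged.append(chunk)
    return merged


# Stand-ins so the sketch runs offline (made-up data).
def fake_split(question):
    return ["What is the price and performance of product A?",
            "What is the price and performance of product B?"]


def fake_retrieve(sub_question, k):
    corpus = {
        "A": ["A costs $10 per seat.", "Shared pricing FAQ."],
        "B": ["B costs $25 per seat.", "Shared pricing FAQ."],
    }
    key = "A" if "product A" in sub_question else "B"
    return corpus[key][:k]


print(decompose_and_retrieve("Compare A vs B price-performance", fake_split, fake_retrieve))
```

The merged chunks would then go into a single prompt so the LLM can compare both products in one answer.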

2. Metadata Filtering

“What was our 2024 policy?” should not return 2023 or 2022 results. The vector DB needs to support metadata filtering (year, department).
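The logic of metadata filtering can be sketched without any database. The helper names and the sample hits (including their similarity scores) are made up for illustration. Real vector databases apply the filter inside the search itself; ChromaDB, for example, accepts a `where` argument on `collection.query`.

```python
def matches(metadata, where):
    """True if every key in the filter has the same value in the chunk's metadata."""
    return all(metadata.get(key) == value for key, value in where.items())


def filtered_top_k(hits, where, k=3):
    """hits: list of (text, metadata, score). Keep only matching hits, then rank by score."""
    kept = [hit for hit in hits if matches(hit[1], where)]
    kept.sort(key=lambda hit: hit[2], reverse=True)
    return [text for text, _, _ in kept[:k]]


# Made-up hits with made-up similarity scores, for illustration only.
hits = [
    ("2023 travel policy: economy class only.", {"year": 2023, "department": "HR"}, 0.91),
    ("2024 travel policy: business class over 8 hours.", {"year": 2024, "department": "HR"}, 0.88),
    ("2024 security policy: rotate keys quarterly.", {"year": 2024, "department": "IT"}, 0.52),
]
print(filtered_top_k(hits, where={"year": 2024}, k=2))
```

Note that the 2023 chunk has the highest similarity score, so without the filter it would be ranked first.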

3. Hybrid Search

Vector search is great at “semantic similarity,” but bad at exact keywords (SKUs, names, jargon). Fix: run BM25 (term matching) and vector search side by side, then fuse the two result lists.
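The article does not prescribe a fusion method; one common choice is reciprocal rank fusion (RRF), which needs only the rank positions from each retriever, not comparable scores. A minimal sketch, assuming each retriever returns a ranked list of document ids (the ids below are made up, and `k=60` is the constant commonly used with RRF):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids. Each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["sku_123", "faq_7", "pricing_2"]      # made-up keyword results
vector_ranking = ["pricing_2", "sku_123", "intro_1"]  # made-up semantic results
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

Documents that appear high in both lists rise to the top, while a document found by only one retriever still survives with a lower score.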

4. Chunks Too Small or Too Big

Too small (100 chars) → context lost. Too big (2000 chars) → diluted key info. Sweet spot: 300-800 chars + 50-char overlap.

5. Citations and Traceability

After the LLM answers, show “this answer comes from chunk X” so users can trace each claim back to its source.

# In prompt, require model to cite
"Please tag claims with source numbers like [1] [2]"

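Once the model tags its claims with numbers, those markers can be mapped back to source files. A minimal sketch, assuming the prompt numbered the chunks from 1 and that `sources` lists one source per chunk in the same order (`extract_citations` and the sample answer are hypothetical):

```python
import re


def extract_citations(answer, sources):
    """Map [n] markers in the answer to sources; sources[0] corresponds to [1]."""
    cited = []
    for match in re.findall(r"\[(\d+)\]", answer):
        index = int(match) - 1
        # Ignore numbers the model invented that point at no chunk
        if 0 <= index < len(sources) and sources[index] not in cited:
            cited.append(sources[index])
    return cited


sources = ["pricing.md", "faq.md", "tutorial.md"]   # one entry per chunk in the prompt
answer = "The Pro plan is billed monthly [1]. Refunds take 5 days [2][1]."
print(extract_citations(answer, sources))
```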
6. Performance

Cache: reuse answers for repeated queries
Async: run independent retrievals (sub-queries, BM25 plus vector search) in parallel
Streaming: show tokens to the user as the LLM generates them
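The first two items can be sketched with the standard library alone. Both functions below are stand-ins with made-up bodies: `cached_answer` represents the full RAG pipeline and `retrieve_async` represents an I/O-bound retrieval call.

```python
import asyncio
from functools import lru_cache

calls = {"count": 0}   # counts how often the pipeline body actually runs


@lru_cache(maxsize=1024)
def cached_answer(query):
    """Stand-in for the full RAG pipeline; the body runs only on a cache miss."""
    calls["count"] += 1
    return f"answer to: {query}"


async def retrieve_async(sub_query):
    """Stand-in for an I/O-bound retrieval call (vector DB, BM25 service, ...)."""
    await asyncio.sleep(0.01)          # simulated network latency
    return f"chunks for: {sub_query}"


async def retrieve_all(sub_queries):
    # gather runs the retrievals concurrently and preserves input order
    return await asyncio.gather(*(retrieve_async(q) for q in sub_queries))


cached_answer("What is the refund policy?")
cached_answer("What is the refund policy?")   # served from the cache
print(calls["count"])
print(asyncio.run(retrieve_all(["product A price", "product B price"])))
```

Note that `lru_cache` only matches identical query strings; catching paraphrased questions would need a semantic cache keyed on embeddings, which this sketch does not attempt.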

7. Evaluation

How do you know your RAG is good?

Recall: are the relevant chunks retrieved? (needs labeled ground truth)
Generation quality: human eval / LLM eval (use GPT-4 as judge)
Response time: how long from question to answer
Cost: $ per query

Recommended: RAGAS, an open-source RAG evaluation framework.
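Before reaching for a framework, retrieval recall can be computed by hand on a small labeled set. A minimal sketch of recall@k, using a made-up evaluation set where each entry lists the chunk ids the retriever returned and the ids a human marked as relevant:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# A tiny hand-labeled evaluation set (made up for illustration).
eval_set = [
    {"retrieved": ["c3", "c7", "c1", "c9"], "relevant": ["c1", "c3"]},
    {"retrieved": ["c2", "c5", "c8", "c4"], "relevant": ["c4", "c6"]},
]
scores = [recall_at_k(e["retrieved"], e["relevant"], k=3) for e in eval_set]
print(sum(scores) / len(scores))
```

Tracking this number while changing chunk size, embedding model, or reranker shows whether a change actually helped retrieval.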

Tooling Ecosystem

Don’t want to write it from scratch? Use a framework:

Tool	Note
LangChain	Most popular, biggest ecosystem, but verbose
LlamaIndex	RAG-focused, elegant API
Haystack (Deepset)	Mature, enterprise-grade
Verba (Weaviate)	UI included
Dify	Low-code, visual

To get started: build it once with LangChain or LlamaIndex so you understand each step. For production: choose by team and scenario; there is no one-size-fits-all.

RAG in Real Business

Enterprise AI apps are 90% RAG.

Customer service bots: retrieve FAQs + ticket history
Legal assistants: retrieve statutes + case law
Medical consultation: retrieve clinical guidelines + records
Company AI knowledge base: employees ask “how do we do X here”
AI writing assistants: retrieve brand guidelines + past articles

Between “POC” and “launch,” about 90% of the engineering effort goes into RAG: chunking strategy, embedding choice, rerank, prompt tuning, evaluation.

💡 RAG is the de facto standard for LLM applications

In 2026:

Almost all enterprise AI projects are RAG-based
Almost all AI startups are doing “X industry’s RAG”
Big companies use more complex variants (GraphRAG, Agentic RAG)

Learning RAG = learning LLM application engineering.

Next: “LoRA Fine-tuning Basics” — RAG lets LLMs “know” your data; fine-tuning makes the LLM “become” what you want.