HelloAI
L4 Chapter 3 🐣 🕒 17 min

RAG from 0 to 1: Let LLMs Answer Based on Your Data

Enterprise AI applications are 90% RAG. This article walks you through building a runnable RAG system—from chunking to deployment.

HelloAI Editors
7/5/2026

L0-04 covered: RAG is the most practical antidote to LLM hallucination.

The RAG visualization showed you the full process. This article opens up each step—build a runnable RAG system from scratch.

By the end, you can build one in about 100 lines of Python.

Why RAG

Two big problems with naked LLMs:

  1. Knowledge cutoff: training data ends at some date, and the model knows nothing about what happened after it
  2. No access to your private data: your company docs, your customer emails, your wiki

How does RAG solve this?

Retrieval → Augmented → Generation

Treat the LLM as a capable intern who can read materials and write fluent prose. Don’t have it answer from memory; give it the materials so it reads first, then answers.

The Full Process (2 phases)

1. Indexing phase (offline, once)
   Your docs → chunks → vectorize → store in vector DB

2. Query phase (every time user asks)
   User question → vectorize → find similar in vector DB → take top-K → 
   pack into prompt → LLM answers
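
The two phases above can be sketched end to end without any external service. This is a toy illustration only: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, a Python list stands in for the vector DB, and the sample chunks are made up.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Indexing phase (offline, once): chunks -> vectorize -> store
toy_chunks = [
    "The Pro plan costs 49 dollars per month.",
    "Refunds are available within 30 days of purchase.",
    "The API rate limit is 100 requests per minute.",
]
toy_index = [(chunk, toy_embed(chunk)) for chunk in toy_chunks]

# Query phase (every question): vectorize -> find similar -> take top-K
def toy_retrieve(question, k=2):
    q = toy_embed(question)
    ranked = sorted(toy_index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Pack the retrieved chunks into the prompt that would be sent to the LLM
def toy_prompt(question, k=2):
    materials = "\n".join(toy_retrieve(question, k))
    return f"Materials:\n{materials}\n\nQuestion: {question}"

print(toy_prompt("How much does the Pro plan cost?"))
```

The steps below replace each toy piece with a real component: a proper chunker, an embedding model, a vector DB, and an LLM call.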

Step by Step

Step 1: Prepare Documents

Assume you have an enterprise wiki:

docs/
├── product_intro.md
├── pricing.md
├── faq.md
├── tutorial.md
└── ...

Read into Python:

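import os

documents = []
# sorted() keeps the document order stable between runs
for filename in sorted(os.listdir('docs/')):
    path = os.path.join('docs', filename)
    # Skip subdirectories; only regular files are read
    if not os.path.isfile(path):
        continue
    with open(path, 'r', encoding='utf-8') as f:
        documents.append({
            'source': filename,
            'text': f.read()
        })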

Step 2: Chunking

Why not just send the whole article to the LLM?

  • LLM context is limited (for example, GPT-4 Turbo accepts 128k tokens)
  • More content per call means more tokens billed per call
  • For retrieval accuracy, finding precise small passages beats sending entire articles

Chunking strategies (simplest to hardest):

A. Fixed-length chunking

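def fixed_chunk(text, chunk_size=500, overlap=50):
    # overlap must be smaller than chunk_size, otherwise the window never advances
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
    return chunks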

Simple, but it can cut paragraphs and sentences in the middle.
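
To make the overlap arithmetic concrete, here is a small worked example. It redefines `fixed_chunk` in compact form so the snippet runs on its own; the 1200-character inputs are made-up test data.

```python
def fixed_chunk(text, chunk_size=500, overlap=50):
    # Each chunk starts (chunk_size - overlap) characters after the previous one
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

demo_text = "x" * 1200
demo_chunks = fixed_chunk(demo_text)

starts = list(range(0, len(demo_text), 500 - 50))
print(starts)                         # [0, 450, 900]
print([len(c) for c in demo_chunks])  # [500, 500, 300]

# The last 50 characters of one chunk reappear at the start of the next
sample = "".join(chr(97 + i % 26) for i in range(1200))
first, second = fixed_chunk(sample)[:2]
print(first[-50:] == second[:50])     # True
```

With a stride of 450, a 1200-character text yields three chunks, and every boundary is covered twice, which is what keeps a sentence that straddles a cut retrievable from at least one side.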

B. Paragraph chunking

def paragraph_chunk(text):
    return [p.strip() for p in text.split('\n\n') if p.strip()]

Smarter, but paragraph lengths vary widely, so chunk sizes are uneven.
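
One way to even out the sizes is to pack consecutive paragraphs greedily up to a size limit. The sketch below is an illustration, not part of any library; `merged_paragraph_chunk` and `max_chars` are hypothetical names.

```python
def merged_paragraph_chunk(text, max_chars=500):
    """Greedy packing: keep appending paragraphs until the next one would overflow."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        candidate = f"{current}\n\n{p}" if current else p
        if len(candidate) <= max_chars or not current:
            # Fits, or it is a single oversized paragraph that is kept whole
            current = candidate
        else:
            chunks.append(current)
            current = p
    if current:
        chunks.append(current)
    return chunks

demo = "aaa\n\nbbb\n\n" + "c" * 20
print(merged_paragraph_chunk(demo, max_chars=10))
```

Short paragraphs get merged, while a paragraph longer than `max_chars` stays intact as its own chunk. Splitting such oversized paragraphs further is what the recursive strategy handles.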

C. Recursive chunking (practical choice)

LangChain’s RecursiveCharacterTextSplitter: try paragraphs first, then sentences, then characters.

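# Older LangChain releases expose this class as langchain.text_splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)
# Split every document and collect the chunks in one flat list
chunks = [chunk for doc in documents for chunk in splitter.split_text(doc['text'])]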

D. Semantic chunking (advanced)

Use embeddings to determine “which sentences belong together.” Best quality, hardest to implement.
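
A minimal sketch of the idea, under stated assumptions: sentences are split with a simple regex, adjacent sentences are compared by cosine similarity, and a new chunk starts where similarity drops below a threshold. `bow_embed` is a toy bag-of-words stand-in; in practice you would pass a real embedding function as `embed_fn` and tune `threshold` for it.

```python
import math
import re
from collections import Counter

def bow_embed(sentence):
    """Toy stand-in for a real embedding model (bag-of-words counts)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunk(text, embed_fn=bow_embed, threshold=0.2):
    """Start a new chunk wherever adjacent sentences are less similar than `threshold`."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    vectors = [embed_fn(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:
            # Topic shift detected: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

demo = ("The Pro plan costs 49 dollars. The Pro plan includes support. "
        "Refunds take 30 days. Refunds need a receipt.")
for chunk in semantic_chunk(demo):
    print(chunk)
```

On this made-up text the two pricing sentences end up in one chunk and the two refund sentences in another, because the similarity between "includes support" and "Refunds take 30 days" is zero.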

As a rule of thumb, about 70% of a RAG system’s quality comes down to chunking strategy, and it is the most overlooked engineering detail.

Step 3: Embedding

Turn each chunk into a vector.

With OpenAI Embeddings

from openai import OpenAI
client = OpenAI()

def embed(text):
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return resp.data[0].embedding   # 1536-dim vector

vectors = [embed(chunk) for chunk in chunks]

With Open-Source Models (no API cost)

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
vectors = model.encode(chunks)

Top embedding models (2026):

Model | Dim | Strength
OpenAI text-embedding-3 | 1536/3072 | General, strongest in English
Cohere embed-multilingual-v3 | 1024 | Multilingual
BGE (BAAI) | 1024 | Best open-source
Voyage AI | 1024 | Long text

Choosing the embedding model matters more than choosing the LLM: a RAG system’s retrieval precision depends first on the embedding model.

Step 4: Store in Vector DB

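Brute-force cosine similarity in numpy is fine for a prototype with tens of thousands of chunks, but it scans every vector on every query and gets too slow as the corpus grows. A vector database adds an approximate nearest-neighbor index, persistence, and metadata filtering.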

Choices

DB | Type | When to use
FAISS (Facebook) | Local lib | Prototype, single machine
ChromaDB | Local/remote | Small to medium
Pinecone | Cloud | Production, managed
Weaviate | Self/cloud | Complex queries
Qdrant (Rust) | Self/cloud | Best performance
Milvus | Self-host | Massive scale
pgvector (Postgres extension) | Existing PG | No new dependency

Get Started with ChromaDB

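import chromadb

# Named chroma_client so it does not overwrite the OpenAI `client` from Step 3
chroma_client = chromadb.Client()  # in-memory; data is lost when the process exits
collection = chroma_client.create_collection("my_docs")

# One source per chunk, in the same order the chunks were produced
sources = [doc['source'] for doc in documents for _ in splitter.split_text(doc['text'])]

# Add
collection.add(
    documents=chunks,
    embeddings=vectors,
    metadatas=[{'source': s} for s in sources],
    ids=[f"chunk_{i}" for i in range(len(chunks))]
)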

Step 5: Retrieve

When user asks:

def retrieve(query, k=5):
    query_vec = embed(query)
    results = collection.query(
        query_embeddings=[query_vec],
        n_results=k
    )
    return results['documents'][0]

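Step 6: Rerank

The raw top-K order from vector search is not always reliable: the most relevant chunk may be ranked fifth rather than first.

Add a rerank model: use a more precise model to score the top-N candidates, then keep the top-K that are truly most relevant: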

from sentence_transformers import CrossEncoder
reranker = CrossEncoder('BAAI/bge-reranker-large')

def rerank(query, candidates, top_k=3):
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]

Classic: retrieve top-20 → rerank → top-3. Significantly improves RAG accuracy.

Step 7: Build prompt + Call LLM

import anthropic

# claude-haiku-4-5 is an Anthropic model, so it is called through the Anthropic SDK
# (pip install anthropic; the client reads ANTHROPIC_API_KEY from the environment)
llm = anthropic.Anthropic()

def rag_answer(query):
    candidates = retrieve(query, k=20)
    top_chunks = rerank(query, candidates, top_k=3)

    # Number the chunks so the model can refer to them as [1], [2], ...
    context = "\n\n".join([f"[{i+1}] {chunk}" for i, chunk in enumerate(top_chunks)])
    prompt = f"""Please answer the user's question based on the materials below.
If the materials don't contain relevant information, clearly say "I don't know."

【Materials】
{context}

【User Question】
{query}

【Your Answer】"""

    resp = llm.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.content[0].text

Complete RAG—about 100 lines of code.

Real Deployment “Gotchas”

RAG is easy in theory; in production it is all details:

1. Decomposing Complex Queries

A user asks “compare the price-performance of our product A and product B.” Retrieving for “product A” or “product B” alone is not enough. Decompose the query with an LLM first, retrieve for each sub-question separately, then synthesize.
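
The decompose, retrieve, merge flow can be sketched as follows. `decomposed_retrieve` is a hypothetical helper, and `fake_decompose` and `fake_retrieve` are made-up stand-ins for an LLM call and a vector search so the snippet runs on its own.

```python
def decomposed_retrieve(query, decompose_fn, retrieve_fn, k=3):
    """Retrieve for each sub-question separately, then merge without duplicates."""
    merged, seen = [], set()
    for sub_query in decompose_fn(query):
        for chunk in retrieve_fn(sub_query, k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

# Hypothetical stand-in: a real system would ask an LLM to split the question
def fake_decompose(query):
    return [
        "What is the price and performance of product A?",
        "What is the price and performance of product B?",
    ]

# Hypothetical stand-in: a real system would query the vector DB
def fake_retrieve(sub_query, k):
    corpus = {
        "product A": ["A costs $10.", "A handles 100 req/s."],
        "product B": ["B costs $15.", "B handles 300 req/s."],
    }
    hits = [c for key, chunks in corpus.items() if key in sub_query for c in chunks]
    return hits[:k]

merged_chunks = decomposed_retrieve(
    "Compare product A vs product B price-performance",
    fake_decompose,
    fake_retrieve,
)
print(merged_chunks)
```

The merged list is what gets packed into the final prompt, so the model sees material about both products before it writes the comparison.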

2. Metadata Filtering

For “What was our 2024 policy?”, results from 2023 or 2022 should not be mixed in. The vector DB needs to support metadata filtering (year, department).
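
Conceptually, a metadata filter narrows the candidate set before similarity ranking. The sketch below shows that idea in plain Python; `filter_by_metadata` and the sample records are hypothetical, and a real vector DB would apply the filter inside the index.

```python
def filter_by_metadata(records, where):
    """Keep only records whose metadata matches every key/value pair in `where`."""
    return [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]

records = [
    {"text": "2022 travel policy: economy class only.",
     "metadata": {"year": 2022, "department": "HR"}},
    {"text": "2024 travel policy: premium economy allowed.",
     "metadata": {"year": 2024, "department": "HR"}},
    {"text": "2024 security policy: rotate keys quarterly.",
     "metadata": {"year": 2024, "department": "IT"}},
]

policy_2024 = filter_by_metadata(records, {"year": 2024})
hr_2024 = filter_by_metadata(records, {"year": 2024, "department": "HR"})
print([r["text"] for r in hr_2024])
```

With ChromaDB, the same kind of filter can be expressed by passing `where={"year": 2024}` to `collection.query`, provided the year was stored in `metadatas` at indexing time.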

3. Hybrid Search

Vector search is great at “semantic similarity” but bad at exact keywords (SKUs, names, jargon). Fix: run BM25 (term matching) alongside vector search, then fuse the results.
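
One common way to fuse the two result lists is Reciprocal Rank Fusion (RRF), which needs only ranks and not comparable scores. This is a sketch of one fusion option, not the only one; the document ids and both input rankings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["sku_123", "faq_7", "pricing_2"]      # hypothetical keyword results
vector_ranking = ["pricing_2", "sku_123", "intro_1"]  # hypothetical vector results

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['sku_123', 'pricing_2', 'faq_7', 'intro_1']
```

Ids that appear high in both lists rise to the top, while an id found by only one retriever still survives further down. The constant `k=60` is a conventional default that dampens the influence of any single top rank.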

4. Chunks Too Small or Too Big

Too small (100 chars) → context lost. Too big (2000 chars) → diluted key info. Sweet spot: 300-800 chars + 50-char overlap.

5. Citations and Traceability

After the LLM answers, show “this answer comes from chunk X” so users can trace it back.
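
Assuming the prompt numbers its materials as [1], [2], ... (as `rag_answer` does) and the model tags its claims with those numbers, the markers can be mapped back to sources with a small helper. `extract_citations`, the sample answer, and the file names are hypothetical.

```python
import re

def extract_citations(answer, top_chunks, sources):
    """Map [n] markers in the answer back to the numbered chunks from the prompt."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    return [
        {"marker": n, "source": sources[n - 1], "chunk": top_chunks[n - 1]}
        for n in cited
        if 1 <= n <= len(top_chunks)  # ignore markers that match no chunk
    ]

top_chunks = [
    "The Pro plan costs 49 dollars per month.",
    "The API rate limit is 100 requests per minute.",
    "Refunds are available within 30 days of purchase.",
]
sources = ["pricing.md", "faq.md", "faq.md"]
answer = "The Pro plan costs 49 dollars [1]. Refunds take up to 30 days [3]. See also [9]."

for citation in extract_citations(answer, top_chunks, sources):
    print(citation["marker"], citation["source"])
```

Dropping out-of-range markers such as [9] matters in practice, because a model can emit a citation number that was never in the prompt.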

# In prompt, require model to cite
"Please tag claims with source numbers like [1] [2]"

6. Performance

  • Cache: repeated queries
  • Async: run independent retrievals (sub-queries, BM25 and vector search) in parallel
  • Streaming: LLM streams as it generates

7. Evaluation

How do you know your RAG is good?

  • Recall: are relevant chunks retrieved (need ground truth)
  • Generation quality: human eval / LLM eval (use GPT-4 as judge)
  • Response time: how long from question to answer
  • Cost: $ per query

Recommend RAGAS (open-source RAG evaluation).
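
Before reaching for a framework, retrieval recall can be measured with a few lines. The sketch below computes recall@k over a small labeled set; the chunk ids and relevance labels are made-up ground truth.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved chunks."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)

# Hypothetical ground truth: for each question, which chunks a human marked relevant
eval_set = [
    {"retrieved": ["c4", "c1", "c9", "c2"], "relevant": ["c1", "c2"]},
    {"retrieved": ["c7", "c3", "c5", "c8"], "relevant": ["c5"]},
]

scores = [recall_at_k(q["retrieved"], q["relevant"], k=3) for q in eval_set]
mean_recall = sum(scores) / len(scores)
print(scores, mean_recall)  # [0.5, 1.0] 0.75
```

Tracking this number while changing chunk size, embedding model, or reranker shows whether a change actually helped retrieval, independent of how the LLM phrases the final answer.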

Tooling Ecosystem

Don’t want to write everything from scratch? Use a framework:

Tool | Note
LangChain | Most popular, biggest ecosystem, but verbose
LlamaIndex | RAG-focused, elegant API
Haystack (Deepset) | Old, enterprise-grade
Verba (Weaviate) | UI included
Dify | Low-code, visual

Get started: write once with LangChain / LlamaIndex—understand each step. Production: pick by team and scenario—no one-size-fits-all.

RAG in Real Business

Enterprise AI apps are 90% RAG.

  • Customer service bots: retrieve FAQs + ticket history
  • Legal assistants: retrieve statutes + case law
  • Medical consultation: retrieve clinical guidelines + records
  • Company AI knowledge base: employees ask “how do we do X here”
  • AI writing assistants: retrieve brand guidelines + past articles

From “POC” to “launch”, 90% of engineering is on RAG—chunking strategy, embedding choice, rerank, prompt tuning, evaluation.

💡 RAG is the de facto standard for LLM applications

In 2026:

  • Almost all enterprise AI projects are RAG-based
  • Almost all AI startups are doing “X industry’s RAG”
  • Big companies use more complex variants (GraphRAG, Agentic RAG)

Learning RAG = learning LLM application engineering.

Next: “LoRA Fine-tuning Basics” — RAG lets LLMs “know” your data; fine-tuning makes the LLM “become” what you want.