How I Built an AI That Reads Any Document and Answers Your Questions
Most AI chatbots only know what they were trained on. This one reads YOUR files. Here is how RAG works — explained with visuals simple enough for a 10-year-old.
Imagine you have a magic robot friend. You give it your 500-page textbook before a test. It reads the whole thing in 2 seconds. Now you ask: 'What does Chapter 7 say about photosynthesis?' — and it gives you the exact right answer, straight from YOUR book, with the page number. That is exactly what this project does.
The Problem: AI Does Not Know Your Files
ChatGPT, Claude, and other AI models were trained on billions of web pages — but they stop learning at a certain date. They have never seen YOUR documents: your company policy manual, last quarter's financial report, or the 200-page contract you need to understand. Ask them about it and they either say 'I don't know' — or worse, confidently make something up.
Without RAG (The Old Problem)
User: 'What does our refund policy say?' → AI: 'Most companies allow 30-day returns...' — completely invented, because it never read your actual policy document.
The Solution: Give the AI a Library Card 📚
RAG — Retrieval-Augmented Generation — fixes this with a clever trick. Before answering, the AI first searches your actual documents, reads the most relevant parts, and then answers from what it found. Think of it like an open-book exam: instead of relying purely on memory, the AI gets to look at the book first.
How RAG Works — The Full Pipeline
📄 Your PDF / Word Doc / Text File
│
▼
✂️ CHUNKING → Split into 500-word pieces
│
▼
🧮 EMBEDDING → Convert each piece into numbers
│
▼
🗃️ VECTOR DB → Store all number-pieces (ChromaDB)
❓ You ask a question
│
▼
🔍 SEARCH → Find the 3 most relevant pieces
│
▼
🤖 LLM → 'Answer ONLY from these pieces'
│
▼
✅ Accurate answer + source citationStep 1 — Chunking (Slice the Book)
The document is cut into small, overlapping pieces of about 500 words each. Think of cutting a pizza — each slice is small enough to handle, but still captures the full flavour. Overlap between chunks means no meaning falls through the cracks.
Step 2 — Embedding (Words → Numbers)
Each chunk is converted into a list of ~1,500 numbers called a vector. Similar meanings get similar numbers — so 'dog' and 'puppy' end up close together in number-space. We use Google's text-embedding-004 model for this.
Step 3 — Vector Database (The Smart Filing Cabinet)
All the number-vectors are stored in ChromaDB. Unlike a regular database that searches by exact words, ChromaDB searches by meaning. So 'cost' matches 'price' and 'expense' even if the exact word is different.
Step 4 — Semantic Search (Find the Right Pieces)
Your question is converted into numbers too. ChromaDB finds the stored chunks whose numbers are closest to your question's numbers — the most relevant paragraphs from your document.
Step 5 — LLM Generation (The Final Answer)
The top 3–5 chunks are handed to Google Gemini alongside your question, with one instruction: answer ONLY from what you have been given. No guessing allowed. The result is accurate, grounded, and citable.
80%+
Faster than manual search
Zero
Hallucinated answers
Any
PDF size supported
Live
Publicly deployed
The engineering challenge is not assembling the pipeline — LangChain makes that fast. The hard part is tuning: the right chunk size, the right overlap percentage, the right number of retrieved chunks, and whether to add a re-ranker on top. A working demo takes a day. A production-grade system that handles edge cases reliably takes weeks. If you need one, I build it right.