AI Series Interview Question 11: How to Optimize RAG?
Optimizing RAG is not a single-step adjustment but a full-chain optimization process. Below are systematic optimization strategies across four dimensions (data indexing, retrieval, generation, and evaluation), along with practical experience worth mentioning in interviews.
1. Data Indexing Side Optimization (Improve "Knowledge Base" Quality)
This is the most overlooked yet most effective area.
| Optimization Point | Problem Phenomenon | Specific Approach | Effect Metric |
|---|---|---|---|
| Document Parsing | Tables and flowcharts in PDFs are ignored, or text is garbled and out of order. | Use better parsing libraries (e.g., unstructured, pypdf with layout-preservation mode); for tables, extract with pandas and convert to Markdown. | Recall +5~15% |
| Text Chunk Size | Small chunks lose context (e.g., "his revenue growth this year", where the referent of "his" is lost); large chunks introduce retrieval noise. | Experiment with different chunk sizes (256/512/768 tokens) with overlap set to 10~20%; for long documents, split on semantic boundaries (paragraphs/headings) rather than fixed lengths (see the sketch after this table). | Hit Rate / Faithfulness |
| Metadata Attachment | Relevant paragraphs are retrieved but cannot be traced back to their source or time, or filtering by domain is needed. | Add metadata to each chunk: source (filename/URL), timestamp, page_num, doc_type. Apply filters during retrieval (e.g., doc_type == 'legal'). | Filter Precision |
| Embedding Model Selection | Generic embeddings perform poorly in vertical domains (medical, code, legal). | Use domain fine-tuned models (BGE-large-zh, GTE-Qwen2-7B-instruct), or fine-tune your own embedding model (e.g., with a triplet loss). | Retrieval MRR@10 +10~20% |
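To make the chunk-size row concrete, here is a minimal sketch of heading/paragraph-based chunking with overlap. It is illustrative only: the whitespace word count stands in for a real tokenizer (e.g., the embedding model's own), and `max_tokens` / `overlap_ratio` are assumed parameter names matching the 256~768-token and 10~20% ranges above.

```python
# A minimal sketch of semantic chunking: split on headings/blank lines first,
# then pack paragraphs into chunks near a target token budget with overlap.
# Token counting here is naive whitespace splitting; a real pipeline would
# use the embedding model's tokenizer.
import re

def semantic_chunks(text: str, max_tokens: int = 512,
                    overlap_ratio: float = 0.15) -> list[str]:
    # Split before Markdown headings or at blank lines, keeping paragraphs intact.
    paragraphs = [p.strip() for p in re.split(r"\n(?=#)|\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        para_len = len(para.split())  # crude token estimate
        if current and current_len + para_len > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the tail of the flushed chunk forward as overlap (10~20%).
            tail_tokens = int(max_tokens * overlap_ratio)
            tail = " ".join("\n\n".join(current).split()[-tail_tokens:])
            current, current_len = [tail], len(tail.split())
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```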
2. Retrieval Side Optimization (Make "Book Searching" More Accurate)
Retrieval determines the quality of "reference materials" fed to the LLM.
| Optimization Point | Problem Phenomenon | Specific Approach | Effect |
|---|---|---|---|
| Hybrid Retrieval | Vector retrieval cannot match exact terms (e.g., product model ABC-123); keyword retrieval cannot understand synonyms. | Run vector retrieval (semantic) and BM25 (keyword) together, then fuse via weighting (e.g., 0.7×vector + 0.3×BM25) or reranking (see the fusion sketch after this table). | Recall +10~25% |
| Reranking | The top results from vector retrieval are not necessarily the most relevant; the 10th result might be the best. | Use a cross-encoder model (e.g., BGE-reranker-v2, Cohere Rerank) to rescore the candidate set (e.g., top 20) and keep the top-K. | Significant hit-rate improvement (especially top-1) |
| Query Rewriting | User questions are vague or carry unresolved references in multi-turn dialogue (e.g., "What's its price?"). | Use an LLM to rewrite the original question into a more retrieval-friendly form (e.g., "What is the price of the iPhone 15?"), or complete it with dialogue history. | Recall +5~15% |
| HyDE | User questions are too short or abstract (e.g., "Tell me about photosynthesis"), so direct retrieval performs poorly. | First have the LLM generate a hypothetical answer, then use that answer to retrieve documents. | Suits open-domain QA; less so for precise factual QA |
| Top-K Adjustment | Too small a K may miss key information; too large a K increases token consumption and noise. | Experiment with K = 3/5/10 and observe the balance between recall and answer faithfulness. | Efficiency vs. effect trade-off |
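To make the hybrid-retrieval row concrete, below is a minimal score-fusion sketch using the 0.7/0.3 weighting from the table. The document IDs and scores are invented for illustration; production systems often use reciprocal rank fusion or a cross-encoder reranker instead of simple weighted sums.

```python
# A minimal sketch of hybrid-retrieval fusion: min-max normalize each
# retriever's scores onto [0, 1], then combine with fixed weights.
def normalize(scores: dict[str, float]) -> dict[str, float]:
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_fuse(vector_scores: dict[str, float],
                bm25_scores: dict[str, float],
                w_vec: float = 0.7, w_bm25: float = 0.3,
                top_k: int = 5) -> list[tuple[str, float]]:
    vec_n, bm25_n = normalize(vector_scores), normalize(bm25_scores)
    fused = {
        doc: w_vec * vec_n.get(doc, 0.0) + w_bm25 * bm25_n.get(doc, 0.0)
        for doc in set(vec_n) | set(bm25_n)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Example: the exact-model-number doc enters the candidate set only because
# BM25 matched "ABC-123"; vector search alone would have missed it.
print(hybrid_fuse(
    {"intro": 0.82, "pricing": 0.78},
    {"abc-123-manual": 12.4, "pricing": 3.1},
))
```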
3. Generation Side Optimization (Make LLM Use Reference Material Well)
Even when retrieval is accurate, poor prompts or a weak model will still undermine the final answer.
| Optimization Point | Problem Phenomenon | Specific Approach | Effect |
|---|---|---|---|
| Prompt Engineering | LLM ignores retrieved content or fabricates information. | Give a clear instruction: "Answer based only on the provided reference materials. If the information is insufficient or irrelevant, respond with 'Not enough information'." Add few-shot examples showing how to cite sources. | Faithfulness +20~40% |
| Context Compression | Retrieved content is too long (exceeds the model's context window) or mostly noise. | Use LLMLingua or Selective Context to compress, keeping the most relevant sentences before feeding them to the LLM. | Reduced risk of losing information |
| LLM Model Upgrade | Small models (7B) cannot perform complex reasoning or track long contexts. | Switch to stronger models (GPT-4o, Claude 3.5 Sonnet, Qwen2.5-72B). | Significant improvement in reasoning accuracy |
| Streaming and Citations | Users cannot verify answer credibility. | During generation, have the LLM output [citation:1] markers corresponding to retrieved document IDs; attach the original links on the backend. | User trust + debuggability |
| Refusal Answer Calibration | The model fabricates when it shouldn't, or says "I don't know" when it should answer. | Set a similarity threshold: if the cosine similarity between the top-1 retrieved chunk and the question is below 0.7, tell the LLM the information is irrelevant (see the gate sketch after this table). | Reduced hallucination rate |
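The refusal-calibration row can be implemented as a simple gate in front of the LLM call. Everything below is a sketch under stated assumptions: `llm()` is a hypothetical stand-in for your actual model client, embeddings are assumed precomputed, and 0.7 is the example threshold from the table.

```python
# A minimal sketch of refusal calibration: if even the best retrieved chunk
# is not similar enough to the question, short-circuit to a fixed refusal
# instead of letting the LLM guess (and possibly hallucinate).
import math

def llm(prompt: str) -> str:
    # Placeholder: swap in your actual LLM client call.
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def answer(question: str, q_emb: list[float],
           chunks: list[tuple[str, list[float]]],  # (text, embedding) pairs
           threshold: float = 0.7) -> str:
    scored = sorted(((cosine(q_emb, emb), text) for text, emb in chunks),
                    reverse=True)
    if not scored or scored[0][0] < threshold:
        return "Not enough information"
    context = "\n\n".join(text for _, text in scored[:3])
    prompt = (
        "Answer based only on the provided reference materials. "
        "If the information is insufficient or irrelevant, respond with "
        f"'Not enough information'.\n\nReferences:\n{context}\n\nQ: {question}"
    )
    return llm(prompt)
```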
4. Evaluation and Iteration Side (Know Where to Tune)
Without measurement, there is no optimization.
| Optimization Point | Approach | Notes |
|---|---|---|
| Build Evaluation Set | Prepare 100~300 real user questions + reference answers + the IDs of the documents that should be retrieved (see the sketch after this table). | Cover different difficulty levels and intents. |
| Automated Evaluation | Use RAGAS or TruLens to score each run. | Three core metrics: Faithfulness, Answer Relevance, Context Recall. |
| Human Evaluation | Sample 20 bad cases weekly and classify the error type (retrieval failure / generation error / missing from knowledge base). | Prioritize fixes accordingly. |
| A/B Testing | Bucket-test different retrieval strategies in production (e.g., BM25 vs. hybrid retrieval). | Online metrics: user satisfaction, no-answer rate. |
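As noted in the evaluation-set row, a retrieval hit-rate harness can be a few lines of code. This sketch assumes each eval item records the ID of the document that should be retrieved; `retrieve` is a placeholder wrapping whichever strategy is under test (BM25, hybrid, hybrid + rerank, ...).

```python
# A minimal sketch of offline retrieval evaluation: the fraction of eval
# questions whose gold document appears in the retrieved top-K. Running it
# before and after each change catches regressions, much like the RAGAS
# pipeline described in the interview answer below.
from typing import Callable

def hit_rate(eval_set: list[dict],
             retrieve: Callable[[str, int], list[str]],
             k: int = 5) -> float:
    hits = 0
    for item in eval_set:  # item: {"question": ..., "gold_doc_id": ...}
        hits += item["gold_doc_id"] in retrieve(item["question"], k)
    return hits / len(eval_set)
```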
5. "Practical Experience" to Mention in Interviews (Bonus Points)
"In my RAG project, the baseline hit rate was initially only 67%. I did three things:
1. Changed from fixed 1024-token chunks to dynamic semantic chunking (by heading + paragraph), increasing the hit rate to 74%;
2. Added hybrid retrieval (vector + BM25) and a small rerank model, raising the hit rate to 83%;
3. Optimized prompts and enforced a "[No relevant information found]" requirement, reducing the hallucination rate from 22% to below 5%.

Additionally, we built a continuous evaluation pipeline, running RAGAS scores on 200 questions before each change to ensure no regression."
Final Summary: A Complete RAG Optimization Roadmap
Data Layer → Document cleaning, chunk optimization, metadata enhancement, domain embedding
Retrieval Layer → Hybrid retrieval, rerank, query rewriting, HyDE, Top-K tuning
Generation Layer → Prompt reinforcement, instruction requirements, compression, citation, refusal threshold
Evaluation Layer → Evaluation set, RAGAS, human analysis, A/B experiments