AI Series Interview Question 11: How to Optimize RAG?
Optimizing RAG is not a single-step adjustment but a full-chain optimization process. Below are systematic optimization strategies across four dimensions (data indexing, retrieval, generation, and evaluation), along with practical experience worth mentioning in interviews.
1. Data Indexing Side Optimization (Improve "Knowledge Base" Quality)
This is the most overlooked yet most effective area.
| Optimization Point | Problem Phenomenon | Specific Approach | Effect Metric |
|---|---|---|---|
| Document Parsing | Tables and flowcharts in PDFs are ignored, or text is garbled and out of order. | Use better parsing libraries (e.g., unstructured, pypdf with layout-preservation mode); for tables, extract with pandas and convert to Markdown. | Recall +5~15% |
| Text Chunk Size | Small chunks lose context (e.g., "his revenue growth this year", where the referent of "his" is lost); large chunks introduce retrieval noise. | Experiment with different chunk sizes (256/512/768 tokens) with overlap set to 10~20%; for long documents, split on semantic boundaries (paragraphs/headings) rather than fixed lengths (see the sketch after this table). | Hit Rate / Faithfulness |
| Metadata Attachment | Relevant paragraphs are retrieved but cannot be traced back to their source or time, or filtering by domain is needed. | Add metadata to each chunk: source (filename/URL), timestamp, page_num, doc_type. Apply filters during retrieval (e.g., doc_type == 'legal'). | Filter Precision |
| Embedding Model Selection | Generic embeddings perform poorly in vertical domains (medical, code, legal). | Use domain fine-tuned models (BGE-large-zh, GTE-Qwen2-7B-instruct), or fine-tune your own embedding model (e.g., with a triplet loss). | Retrieval MRR@10 +10~20% |
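To make the chunk-size row concrete, here is a minimal sketch of heading/paragraph-based chunking with overlap. It is illustrative only: the whitespace word count stands in for a real tokenizer (e.g., the embedding model's own), and `max_tokens` / `overlap_ratio` are assumed parameter names matching the 256~768-token and 10~20% ranges above.

```python
# A minimal sketch of semantic chunking: split on headings/blank lines first,
# then pack paragraphs into chunks near a target token budget with overlap.
# Token counting here is naive whitespace splitting; a real pipeline would
# use the embedding model's tokenizer.
import re

def semantic_chunks(text: str, max_tokens: int = 512,
                    overlap_ratio: float = 0.15) -> list[str]:
    # Split before Markdown headings or at blank lines, keeping paragraphs intact.
    paragraphs = [p.strip() for p in re.split(r"\n(?=#)|\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        para_len = len(para.split())  # crude token estimate
        if current and current_len + para_len > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the tail of the flushed chunk forward as overlap (10~20%).
            tail_tokens = int(max_tokens * overlap_ratio)
            tail = " ".join("\n\n".join(current).split()[-tail_tokens:])
            current, current_len = [tail], len(tail.split())
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```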
2. Retrieval Side Optimization (Make "Book Searching" More Accurate)
Retrieval determines the quality of "reference materials" fed to the LLM.
| Optimization Point | Problem Phenomenon | Specific Approach | Effect |
|---|---|---|---|
| Hybrid Retrieval | Vector retrieval cannot match exact terms (e.g., product model ABC-123); keyword retrieval cannot understand synonyms. | Run vector retrieval (semantic) and BM25 (keyword) together, then fuse via weighting (e.g., 0.7×vector + 0.3×BM25) or reranking (see the fusion sketch after this table). | Recall +10~25% |
| Reranking | The top results from vector retrieval are not necessarily the most relevant; the 10th result might be the best. | Use a cross-encoder model (e.g., BGE-reranker-v2, Cohere Rerank) to rescore the candidate set (e.g., top 20) and keep the top-K. | Significant hit-rate improvement (especially top-1) |
| Query Rewriting | User questions are vague or carry unresolved references in multi-turn dialogue (e.g., "What's its price?"). | Use an LLM to rewrite the original question into a more retrieval-friendly form (e.g., "What is the price of the iPhone 15?"), or complete it with dialogue history. | Recall +5~15% |
| HyDE | User questions are too short or abstract (e.g., "Tell me about photosynthesis"), so direct retrieval performs poorly. | First have the LLM generate a hypothetical answer, then use that answer to retrieve documents. | Suits open-domain QA; less so for precise factual QA |
| Top-K Adjustment | Too small a K may miss key information; too large a K increases token consumption and noise. | Experiment with K = 3/5/10 and observe the balance between recall and answer faithfulness. | Efficiency vs. effect trade-off |
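To make the hybrid-retrieval row concrete, below is a minimal score-fusion sketch using the 0.7/0.3 weighting from the table. The document IDs and scores are invented for illustration; production systems often use reciprocal rank fusion or a cross-encoder reranker instead of simple weighted sums.

```python
# A minimal sketch of hybrid-retrieval fusion: min-max normalize each
# retriever's scores onto [0, 1], then combine with fixed weights.
def normalize(scores: dict[str, float]) -> dict[str, float]:
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_fuse(vector_scores: dict[str, float],
                bm25_scores: dict[str, float],
                w_vec: float = 0.7, w_bm25: float = 0.3,
                top_k: int = 5) -> list[tuple[str, float]]:
    vec_n, bm25_n = normalize(vector_scores), normalize(bm25_scores)
    fused = {
        doc: w_vec * vec_n.get(doc, 0.0) + w_bm25 * bm25_n.get(doc, 0.0)
        for doc in set(vec_n) | set(bm25_n)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Example: the exact-model-number doc enters the candidate set only because
# BM25 matched "ABC-123"; vector search alone would have missed it.
print(hybrid_fuse(
    {"intro": 0.82, "pricing": 0.78},
    {"abc-123-manual": 12.4, "pricing": 3.1},
))
```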
3. Generation Side Optimization (Make LLM Use Reference Material Well)
Even when retrieval is accurate, poor prompts or a weak model will still undermine the final answer.
| Optimization Point | Problem Phenomenon | Specific Approach | Effect |
|---|---|---|---|
| Prompt Engineering | LLM ignores retrieved content or fabricates information. | Give a clear instruction: "Answer based only on the provided reference materials. If the information is insufficient or irrelevant, respond with 'Not enough information'." Add few-shot examples showing how to cite sources. | Faithfulness +20~40% |
| Context Compression | Retrieved content is too long (exceeds the model's context window) or mostly noise. | Use LLMLingua or Selective Context to compress, keeping the most relevant sentences before feeding them to the LLM. | Reduced risk of losing information |
| LLM Model Upgrade | Small models (7B) cannot perform complex reasoning or track long contexts. | Switch to stronger models (GPT-4o, Claude 3.5 Sonnet, Qwen2.5-72B). | Significant improvement in reasoning accuracy |
| Streaming and Citations | Users cannot verify answer credibility. | During generation, have the LLM output [citation:1] markers corresponding to retrieved document IDs; attach the original links on the backend. | User trust + debuggability |
| Refusal Answer Calibration | The model fabricates when it shouldn't, or says "I don't know" when it should answer. | Set a similarity threshold: if the cosine similarity between the top-1 retrieved chunk and the question is below 0.7, tell the LLM the information is irrelevant (see the gate sketch after this table). | Reduced hallucination rate |
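The refusal-calibration row can be implemented as a simple gate in front of the LLM call. Everything below is a sketch under stated assumptions: `llm()` is a hypothetical stand-in for your actual model client, embeddings are assumed precomputed, and 0.7 is the example threshold from the table.

```python
# A minimal sketch of refusal calibration: if even the best retrieved chunk
# is not similar enough to the question, short-circuit to a fixed refusal
# instead of letting the LLM guess (and possibly hallucinate).
import math

def llm(prompt: str) -> str:
    # Placeholder: swap in your actual LLM client call.
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def answer(question: str, q_emb: list[float],
           chunks: list[tuple[str, list[float]]],  # (text, embedding) pairs
           threshold: float = 0.7) -> str:
    scored = sorted(((cosine(q_emb, emb), text) for text, emb in chunks),
                    reverse=True)
    if not scored or scored[0][0] < threshold:
        return "Not enough information"
    context = "\n\n".join(text for _, text in scored[:3])
    prompt = (
        "Answer based only on the provided reference materials. "
        "If the information is insufficient or irrelevant, respond with "
        f"'Not enough information'.\n\nReferences:\n{context}\n\nQ: {question}"
    )
    return llm(prompt)
```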
4. Evaluation and Iteration Side (Know Where to Tune)
Without measurement, there is no optimization.
| Optimization Point | Approach | Notes |
|---|---|---|
| Build Evaluation Set | Prepare 100~300 real user questions + reference answers + the IDs of the documents that should be retrieved (see the sketch after this table). | Cover different difficulty levels and intents. |
| Automated Evaluation | Use RAGAS or TruLens to score each run. | Three core metrics: Faithfulness, Answer Relevance, Context Recall. |
| Human Evaluation | Sample 20 bad cases weekly and classify the error type (retrieval failure / generation error / missing from knowledge base). | Prioritize fixes accordingly. |
| A/B Testing | Bucket-test different retrieval strategies in production (e.g., BM25 vs. hybrid retrieval). | Online metrics: user satisfaction, no-answer rate. |
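As noted in the evaluation-set row, a retrieval hit-rate harness can be a few lines of code. This sketch assumes each eval item records the ID of the document that should be retrieved; `retrieve` is a placeholder wrapping whichever strategy is under test (BM25, hybrid, hybrid + rerank, ...).

```python
# A minimal sketch of offline retrieval evaluation: the fraction of eval
# questions whose gold document appears in the retrieved top-K. Running it
# before and after each change catches regressions, much like the RAGAS
# pipeline described in the interview answer below.
from typing import Callable

def hit_rate(eval_set: list[dict],
             retrieve: Callable[[str, int], list[str]],
             k: int = 5) -> float:
    hits = 0
    for item in eval_set:  # item: {"question": ..., "gold_doc_id": ...}
        hits += item["gold_doc_id"] in retrieve(item["question"], k)
    return hits / len(eval_set)
```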
5. "Practical Experience" to Mention in Interviews (Bonus Points)
"In my RAG project, the baseline hit rate was initially only 67%. I did three things:
1. Changed from fixed 1024-token chunks to dynamic semantic chunking (by heading + paragraph), increasing the hit rate to 74%;
2. Added hybrid retrieval (vector + BM25) and a small rerank model, raising the hit rate to 83%;
3. Optimized prompts and enforced a "[No relevant information found]" requirement, reducing the hallucination rate from 22% to below 5%.

Additionally, we built a continuous evaluation pipeline, running RAGAS scores on 200 questions before each change to ensure no regression."
Final Summary: A Complete RAG Optimization Roadmap
Data Layer → Document cleaning, chunk optimization, metadata enhancement, domain embedding
Retrieval Layer → Hybrid retrieval, rerank, query rewriting, HyDE, Top-K tuning
Generation Layer → Prompt reinforcement, instruction requirements, compression, citation, refusal threshold
Evaluation Layer → Evaluation set, RAGAS, human analysis, A/B experiments