
AI Interview Series 9: How to Evaluate the Accuracy of Knowledge Question Answering Systems?

Accuracy is the core lifeline of a knowledge question answering system, especially when it is applied to serious scenarios (such as medical, legal, or internal enterprise support). My view can be summarized as follows: accuracy is a multi-dimensional concept; you cannot judge it by a single number, but must evaluate it holistically against system capabilities, task difficulty, and the cost of errors.

Let's expand on this at four levels:


1. Accuracy is Not Simply "Right or Wrong"

For traditional classification problems (e.g., image recognition), accuracy is clear. But knowledge QA systems are different. Common dimensions include:

  • Retrieval Hit Rate: can the system retrieve document chunks that contain the correct answer from the knowledge base?
    Example: a user asks for "Company A's 2024 revenue"; can the system retrieve the segment of the financial report that contains this figure?
  • Generation Fidelity: is the model's answer strictly based on the retrieved content, rather than made up?
    Example: the retrieved material never mentions a "growth rate", yet the model says "grew by 5%" → unfaithful.
  • Answer Correctness: is the final answer consistent with the facts (or the reference answer)?
    Example: the reference answer is "4.2 billion"; either "4.2 billion" or "about 4.2 billion RMB" can be counted as correct.
  • Refusal Rate: when the knowledge base lacks relevant information, does the system proactively say "I don't know" instead of guessing?
    Example: when retrieval returns nothing or confidence is low, output "Sorry, no relevant information found."

A system may have a very high retrieval hit rate (always finding relevant paragraphs) but very low generation fidelity (always embellishing), resulting in poor final accuracy. Therefore, when evaluating accuracy, you must first clarify which stage you are measuring.
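
To make the stage-by-stage distinction concrete, here is a minimal evaluation sketch in Python. The `retrieve` and `generate` callables stand in for your own pipeline components, and the string-match correctness check is deliberately crude; generation fidelity usually needs an LLM judge or a tool such as RAGAS (see section 4), so it is omitted here.

```python
from dataclasses import dataclass

@dataclass
class LabeledCase:
    question: str
    gold_answer: str | None   # None means the knowledge base has no answer
    gold_doc_ids: set[str]    # IDs of chunks that contain the answer

def evaluate_sample(cases, retrieve, generate,
                    refusal_marker="no relevant information"):
    """retrieve(q) -> list of (doc_id, text); generate(q, chunks) -> answer string.
    Both are stand-ins for your own pipeline components."""
    hits = correct = refusals = answerable = unanswerable = 0
    for case in cases:
        chunks = retrieve(case.question)
        answer = generate(case.question, chunks)
        if case.gold_answer is None:
            unanswerable += 1
            if refusal_marker in answer.lower():
                refusals += 1  # correctly said "I don't know"
            continue
        answerable += 1
        if case.gold_doc_ids & {doc_id for doc_id, _ in chunks}:
            hits += 1          # retrieval stage surfaced a gold chunk
        if case.gold_answer.lower() in answer.lower():
            correct += 1       # crude string match; use fuzzy or LLM judging in practice
    return {
        "retrieval_hit_rate": hits / max(answerable, 1),
        "answer_correctness": correct / max(answerable, 1),
        "refusal_rate": refusals / max(unanswerable, 1),
    }
```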


2. Under Current Technology, What Accuracy Can RAG Systems Achieve?

There is no unified number, but we can refer to some public research and practice:

  • Simple factual QA (single-hop, answer appears directly in one piece of material):
    Retrieval hit rate can reach 90-98% (depending on knowledge base quality and retriever), generation fidelity can be 95%+ with well-designed prompts, and overall accuracy can be between 85-95%.
  • Multi-hop reasoning (requires combining information from two or more different pieces of material):
    Retrieval accuracy can plummet to 50-70%, and answer correctness may be only 40-60%. This is a major challenge for current RAG.
  • Open-domain + noisy knowledge base (e.g., massive web pages):
    Accuracy drops significantly because retrieval may introduce noise, and models are easily distracted.

Conclusion: In controlled environments (a clean, well-structured knowledge base with appropriate document granularity), RAG can achieve over 90% accuracy; but in complex, open, multi-step reasoning scenarios, accuracy is often unsatisfactory and requires extensive optimization.


3. Core Factors Affecting Accuracy

If your RAG system's accuracy is unsatisfactory, work through the following four stages:

  1. Knowledge Base Itself
     • Is the data outdated, incomplete, or even erroneous?
     • Are the documents messy (e.g., scanned files that were never OCRed, tables broken into garbled text)?

  2. Chunking and Indexing
     • Text chunks that are too short lose context; chunks that are too long introduce noise.
     • Is the embedding model suited to your domain (general-purpose models may perform poorly on legal terminology)?

  3. Retrieval Strategy
     • Vector retrieval alone may miss exact keywords (e.g., product model numbers); combining it with keyword retrieval helps (see the sketch after this list).
     • Without re-ranking, irrelevant content can slip into the top results.

  4. Generation Stage
     • Does the prompt explicitly require "answer only based on the provided materials; if they are insufficient, refuse"? (Also sketched below.)
     • Is the model capable enough (small models easily overlook details in long contexts)?
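
To make items 3 and 4 concrete, here is a minimal sketch, assuming Python and hypothetical helper names: (a) reciprocal-rank fusion, one simple way to merge vector and keyword results, and (b) a grounded prompt with an explicit refusal clause. The prompt wording and function names are illustrative, not a standard.

```python
def fuse(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion: merge two ranked lists of doc IDs into one."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded-QA prompt that forbids answering beyond the given material."""
    context = "\n\n".join(f"[Source {i}]\n{c}" for i, c in enumerate(chunks, 1))
    return (
        "Answer the question using ONLY the sources below. "
        "If the sources do not contain the answer, reply exactly: "
        '"Sorry, no relevant information found."\n\n'
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```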

A common misconception: attributing low accuracy directly to insufficient LLM capability, when actually most problems lie in "retrieval" and "prompt design".


4. How to Think About Accuracy Properly: Key Attitudes in Practice

1. Set Reasonable Benchmarks and Expectations

  • For high-risk domains (medical diagnosis, legal advice), even 90% accuracy is far from enough; human review or multi-source verification must be introduced.
  • For low-risk scenarios (customer service fallback, internal knowledge search), 80% accuracy plus a friendly "I don't know" response may already significantly improve efficiency.

2. Don't Pursue 100%, Pursue "Verifiable Accuracy"

  • Make the system automatically attach citation sources (which document, which paragraph).
    Users can check the original text to verify; even if the answer occasionally errs, transparency builds trust.
  • Add confidence scores, and when confidence is low, proactively indicate: "This answer has low reliability; we recommend checking the original document." (A minimal sketch of both ideas follows.)
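
A minimal sketch of what "verifiable accuracy" can look like at the API boundary. The `Answer` shape and the 0.6 threshold are illustrative assumptions, not a standard interface:

```python
from dataclasses import dataclass, field

@dataclass
class Citation:
    doc_id: str
    paragraph: int
    snippet: str          # quoted original text the user can check

@dataclass
class Answer:
    text: str
    confidence: float     # e.g., derived from retrieval scores or an LLM self-check
    citations: list[Citation] = field(default_factory=list)

    def render(self, threshold: float = 0.6) -> str:
        out = self.text
        for i, c in enumerate(self.citations, 1):
            out += f"\n[{i}] {c.doc_id}, para. {c.paragraph}: \"{c.snippet}\""
        if self.confidence < threshold:
            out += "\n(Note: this answer has low reliability; please check the cited documents.)"
        return out
```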

3. Treat Accuracy as a Continuous Optimization Target, Not a One-Time Goal

  • Establish an evaluation pipeline: regularly sample a set of human-annotated questions, and automatically evaluate retrieval hit rate and generation fidelity.
  • Use tools like RAGAS or TruLens for systematic evaluation, rather than relying on gut feelings from a handful of cases (a hedged RAGAS sketch follows this list).
  • Continuously adjust based on bad cases: chunking method, retriever parameters, re-ranking model, prompts.
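
For instance, a minimal RAGAS run might look like the following. This reflects the 0.1-era API (`evaluate`, a HuggingFace `Dataset`, metric objects); column and metric names have changed across RAGAS versions, so treat it as illustrative rather than copy-paste ready.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness, context_recall

# A few human-annotated cases sampled from real traffic (values are placeholders).
data = {
    "question": ["What was Company A's 2024 revenue?"],
    "answer": ["About 4.2 billion RMB."],                              # system output
    "contexts": [["...2024 annual report: revenue of 4.2 billion RMB..."]],
    "ground_truth": ["4.2 billion RMB"],
}

# LLM-based metrics require a judge model to be configured (e.g., an OpenAI key).
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_correctness, context_recall],
)
print(result)  # per-metric scores; track them across releases to catch regressions
```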

4. Distinguish "System Errors" from "Inconsistent Human Standards"

  • Sometimes the system's answer differs from user expectations, but it is actually correct according to the knowledge base (because the knowledge base itself has limitations or controversies).
    In such cases, you need to define whether accuracy is based on "knowledge base facts" or "externally recognized facts."

Final Summary

The accuracy of a knowledge question answering system is not a static perfect score, but a comprehensive capability indicator reflecting "knowledge coverage + retrieval precision + generation fidelity + the ability to refuse." When evaluating it, recognize rationally that current technology cannot achieve perfection, while still realizing its practical value in business through designs such as citation tracing, confidence indicators, and human-in-the-loop collaboration.
