RAG 2.0 Japanese Search Evaluation Guide (Detailed)
Version: 1.0
Date: 2026-05-20
Status: Production-Ready
Document ID: RAG-JP-EVAL-GUIDE-001
Table of Contents
- Overview
- Quick Start
- Detailed Usage
- Evaluation Criteria & Improvement Guide
- Test Set Customization
- Dashboard Implementation
- Troubleshooting
- Best Practices
Overview
This document explains how to evaluate the search accuracy of the EvoSpikeNet RAG system on Japanese-language queries.
Target Audience
- Developers: Measure search accuracy for Phase 1-5
- Operations: Continuous quality monitoring and alerting
- Researchers: Algorithm improvement verification
Evaluation Flow
Evaluation Workflow:
Test cases → Execution (Variation eval + Entity eval) → Metrics aggregation → Report generation → Improvement proposals
Quick Start
1. Install Dependencies
cd /path/to/rag-system
pip install pandas numpy pyyaml sudachipy sudachidict_small elasticsearch pymilvus transformers tner
2. Run Evaluation
# Full evaluation
python backend/evaluate_robustness.py
# Variation only
python backend/evaluate_robustness.py --variation-only
# Entity only
python backend/evaluate_robustness.py --entity-only
3. Check Results
Generated Files:
backend/
├── variation_results.csv # Variation evaluation results
├── entity_results.csv # Entity evaluation results
└── evaluation_report.txt # Summary report
Detailed Usage
A. Text Variation Evaluation
A.1 Concept
Text variation refers to different representations of the same concept (common in Japanese):
Example: "盆休み", "お盆休み", "盆期間"
→ All should find the same document
Example: "サーバー" vs "サーバ"
→ Long vowel variation
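The long-vowel case above can be handled before indexing and querying with a small normalization step. The following is a minimal sketch, not the system's actual normalizer: the function name `normalize_variant` and the "keep at least three characters" rule are assumptions chosen for illustration. It folds width variants with NFKC and drops one trailing long-vowel mark. Variations such as "盆休み" and "お盆休み" are not covered by this heuristic and would need a synonym dictionary instead.

```python
import unicodedata

LONG_VOWEL = "ー"  # KATAKANA-HIRAGANA PROLONGED SOUND MARK (U+30FC)

def normalize_variant(term: str) -> str:
    """Fold width variants with NFKC, then drop one trailing long-vowel mark."""
    folded = unicodedata.normalize("NFKC", term)
    # Heuristic (assumption): strip only when at least three characters remain,
    # so short words such as "キー" keep their long vowel.
    if len(folded) > 3 and folded.endswith(LONG_VOWEL):
        folded = folded[:-1]
    return folded

print(normalize_variant("サーバー"))  # サーバ
print(normalize_variant("サーバ"))    # サーバ
print(normalize_variant("キー"))      # キー
```

Applying the same function to both the indexed text and the query keeps the two sides consistent, which is what makes the variations resolve to the same document.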
A.2 Evaluation Metrics
| Metric | Definition | Formula | Target |
|---|---|---|---|
| MRR | Reciprocal rank of correct document | 1/rank or 0 | > 0.7 |
| found_rate | Ratio of variations finding correct doc | found/total | > 0.8 |
| MRR_std | Standard deviation of MRR | std(MRRs) | < 0.15 |
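To make the three metrics concrete, here is a small sketch that computes them for one variation case from ranked result lists. It is an illustration under stated assumptions, not the code inside `evaluate_robustness.py`: the function names are hypothetical, and the standard deviation is taken as the population form (`pstdev`), whereas a pandas-based implementation would default to the sample form.

```python
from statistics import mean, pstdev

def reciprocal_rank(ranked_doc_ids, ground_truth_doc_id):
    """Return 1/rank of the correct document, or 0.0 when it is absent."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id == ground_truth_doc_id:
            return 1.0 / rank
    return 0.0

def variation_metrics(results_by_variation, ground_truth_doc_id):
    """Aggregate MRR, found_rate and MRR_std over all variations of one case."""
    rrs = [reciprocal_rank(ranked, ground_truth_doc_id)
           for ranked in results_by_variation.values()]
    found = sum(1 for rr in rrs if rr > 0)
    return {
        "mrr_mean": mean(rrs),
        "found_rate": found / len(rrs),
        "mrr_std": pstdev(rrs),  # population std; an assumption
    }

# Hypothetical search results for three variations of the same concept
results = {
    "盆休み": ["DOC_001", "DOC_007"],    # rank 1 -> 1.0
    "お盆休み": ["DOC_003", "DOC_001"],  # rank 2 -> 0.5
    "盆期間": ["DOC_009", "DOC_004"],    # not found -> 0.0
}
metrics = variation_metrics(results, "DOC_001")
print(metrics)
```

In this example the mean MRR is 0.5 and the found rate is 2/3, so the case misses both targets even though two of the three variations succeed.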
A.3 Implementation
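import yaml
from backend.evaluate_robustness import RAGRobustnessEvaluator
# EvoRAG must also be imported; its module path depends on your project layout.
rag = EvoRAG(llm_type="huggingface")
evaluator = RAGRobustnessEvaluator(rag)
# Load the variation and entity test cases
with open("backend/test_cases.yaml", encoding="utf-8") as f:
    cases = yaml.safe_load(f)
# Evaluate every variation case and save the results
results_df = evaluator.variation_eval.evaluate_variation_set(cases["variation"])
results_df.to_csv("variation_results.csv", index=False)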
A.4 Result Interpretation
MRR Value │ Rank │ Assessment
───────────┼───────┼─────────────
1.0 │ 1st │ ✓ Perfect
0.5 │ 2nd │ ✓ Good
0.33 │ 3rd │ △ Needs work
0.0 │ None │ ✗ Not found
B. Named Entity Evaluation
B.1 Concept
Named entities are meaningful noun phrases: person names, organizations, project IDs, product names.
Example: Project ID "EV-2024-001"
→ Should match exactly when in query
Example: Person name "John Doe"
→ Should not confuse with "Jane Doe"
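Both examples come down to matching an entity as a whole unit rather than as a loose substring. The sketch below shows one way to check this in test code. The ID pattern (two uppercase letters, a four-digit year, a three-digit serial) is inferred from the single example "EV-2024-001" and is an assumption, as are the function names. ASCII-only boundaries are used on purpose, since Japanese text has no whitespace between words and an entity may be directly followed by kana or kanji.

```python
import re

# Hypothetical ID format inferred from "EV-2024-001"
PROJECT_ID = re.compile(r"(?<![A-Za-z0-9])[A-Z]{2}-\d{4}-\d{3}(?![A-Za-z0-9])")

def extract_project_ids(text):
    """Return every project ID that appears as a complete token."""
    return PROJECT_ID.findall(text)

def mentions_entity(text, entity):
    """True when `entity` occurs whole, not as part of a longer ASCII word."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(entity) + r"(?![A-Za-z0-9])"
    return re.search(pattern, text) is not None

print(extract_project_ids("Project EV-2024-001 progress"))   # ['EV-2024-001']
print(extract_project_ids("EV-2024-001の進捗"))               # ['EV-2024-001']
print(extract_project_ids("ticket EV-2024-0012"))            # []
print(mentions_entity("Report by Jane Doe", "John Doe"))     # False
print(mentions_entity("John Doe氏の報告", "John Doe"))        # True
```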
B.2 Evaluation Metrics
| Metric | Definition | Formula | Target |
|---|---|---|---|
| Recall | Correct docs found / total correct | found/total_correct | > 0.8 |
| Precision | Correct results / all results | correct/returned | > 0.8 |
| F1 Score | Harmonic mean | 2(P*R)/(P+R) | > 0.75 |
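The formulas in the table translate directly into set operations on document IDs. The following sketch assumes each test case yields a list of returned IDs and a list of ground-truth IDs; the function name is hypothetical, and the zero-division handling (returning 0.0) is a convention chosen here rather than something the guide specifies.

```python
def entity_metrics(returned_doc_ids, ground_truth_doc_ids):
    """Compute recall, precision and F1 for one entity query."""
    returned = set(returned_doc_ids)
    relevant = set(ground_truth_doc_ids)
    hits = len(returned & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(returned) if returned else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1_score": f1}

# One of two correct documents found, plus one unrelated document
print(entity_metrics(["DOC_A", "DOC_X"], ["DOC_A", "DOC_B"]))
# {'recall': 0.5, 'precision': 0.5, 'f1_score': 0.5}
```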
B.3 Implementation
entity_cases = cases["entity"]
results_df = evaluator.entity_eval.evaluate_entity_set(entity_cases)
# Summary by entity type
summary = results_df.groupby("entity_type")[["recall", "precision", "f1_score"]].mean()
print(summary)
results_df.to_csv("entity_results.csv", index=False)
Evaluation Criteria & Improvement Guide
Variation: MRR < 0.5
Symptoms:
Query: "お盆休みの規定"
Expected: DOC_001 appears in top-3
Actual: DOC_001 not found (MRR = 0.0)
Improvement Strategy:
- Upgrade Sudachi (Priority: High)
  Sudachi handles complex compound words better than standard tokenizers.
  pip install --upgrade sudachipy sudachidict_small
- Add Normalization Mappings (Priority: Medium)
  {
    "char_filter": {
      "mapping": {
        "type": "mapping",
        "mappings": ["サーバー => サーバ", "データベース => DB"]
      }
    }
  }
- Increase Vector Search Weight (Priority: Medium)
  final_score = 0.3 * bm25_score + 0.7 * vector_score
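The weighted sum above only behaves as intended when both scores are on a comparable scale, and raw BM25 scores are unbounded while vector similarities usually are not. A minimal sketch of the fusion follows, assuming per-query min-max normalization before weighting. The normalization step and the function names are assumptions added for illustration; the guide itself only specifies the 0.3 / 0.7 weights.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range 0..1."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_rank(bm25_scores, vector_scores, bm25_weight=0.3, vector_weight=0.7):
    """Fuse normalized BM25 and vector scores; missing scores count as 0."""
    bm25_n = min_max(bm25_scores) if bm25_scores else {}
    vec_n = min_max(vector_scores) if vector_scores else {}
    fused = {
        doc_id: bm25_weight * bm25_n.get(doc_id, 0.0)
                + vector_weight * vec_n.get(doc_id, 0.0)
        for doc_id in set(bm25_n) | set(vec_n)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores for one query
bm25 = {"DOC_001": 12.0, "DOC_002": 8.0, "DOC_003": 4.0}
vector = {"DOC_001": 0.60, "DOC_002": 0.30, "DOC_003": 0.90}
ranked = hybrid_rank(bm25, vector)
print(ranked)
```

Here DOC_003 ranks first despite the lowest BM25 score, because the vector side carries 70% of the weight. That is the effect this improvement relies on when lexical matching fails on a spelling variation.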
Entity: Recall < 0.8
Symptoms:
Query: "Project EV-2024-001 progress"
Expected: DOC_A, DOC_B found
Actual: Only DOC_A found (Recall = 50%)
Improvement Strategy:
- Add NER Layer (Priority: High)
  from transformers import pipeline
  ner = pipeline("ner", model="tner/roberta-large-japanese-char-luw-ner")
  # Detected entity spans can be kept whole instead of being split by the tokenizer
  entities = ner("Project EV-2024-001")
- Increase Entity Boost (Priority: High)
  es_query = {
      "term": {
          "entity_project": {
              "value": "EV-2024-001",
              "boost": 5.0  # Increase boost multiplier
          }
      }
  }
- Build Internal Dictionary (Priority: Medium)
  Document all organizational entities (projects, departments, products).
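An internal dictionary becomes useful at query time once aliases are mapped back to one canonical entity. The sketch below shows one possible shape for such a dictionary and a simple lookup. Every entry, alias, and function name here is hypothetical, and plain substring matching is used only to keep the example short; a production matcher would also need the boundary checks that keep similar IDs apart.

```python
# Hypothetical dictionary: canonical name -> type and known aliases
ENTITY_DICTIONARY = {
    "EV-2024-001": {"entity_type": "PROJECT", "aliases": ["EV2024-001"]},
    "Platform Engineering": {"entity_type": "DEPARTMENT",
                             "aliases": ["プラットフォーム開発部"]},
}

def build_alias_index(dictionary):
    """Map every lower-cased surface form to its canonical entity name."""
    index = {}
    for canonical, entry in dictionary.items():
        for surface in [canonical, *entry.get("aliases", [])]:
            index[surface.lower()] = canonical
    return index

def find_entities(query, alias_index):
    """Return canonical names of all dictionary entities found in the query."""
    lowered = query.lower()
    found = []
    # Longest surface forms first, so a long alias wins over a shorter one inside it
    for surface in sorted(alias_index, key=len, reverse=True):
        if surface in lowered:
            canonical = alias_index[surface]
            if canonical not in found:
                found.append(canonical)
            lowered = lowered.replace(surface, " ")
    return found

alias_index = build_alias_index(ENTITY_DICTIONARY)
print(find_entities("プラットフォーム開発部のEV2024-001進捗", alias_index))
```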
Test Set Customization
Add Organization-Specific Terms
Step 1: Edit test_cases.yaml
variation:
  - case_id: "var_custom_001"
    canonical_form: "your_org_term"
    variations:
      - "variation1"
      - "variation2"
    ground_truth_doc_id: "DOC_CUSTOM_001"
    variation_type: "custom"
entity:
  - case_id: "ent_custom_001"
    query: "Query containing your custom entity"
    entity: "your_entity"
    entity_type: "CUSTOM_TYPE"
    ground_truth_doc_ids: ["DOC_A", "DOC_B"]
    should_not_match: ["DOC_NEG"]
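Hand-edited test files are easy to break, so it can help to validate the loaded structure before running an evaluation. The sketch below checks the fields shown in the example above. Treating all of those fields except `should_not_match` as required is an assumption, and `validate_cases` is a hypothetical helper, not part of the evaluator. It takes the dictionary that `yaml.safe_load` would return.

```python
REQUIRED_VARIATION = {"case_id", "canonical_form", "variations",
                      "ground_truth_doc_id", "variation_type"}
REQUIRED_ENTITY = {"case_id", "query", "entity", "entity_type",
                   "ground_truth_doc_ids"}

def validate_cases(cases):
    """Return a list of human-readable problems; an empty list means OK."""
    problems, seen_ids = [], set()
    sections = (("variation", REQUIRED_VARIATION), ("entity", REQUIRED_ENTITY))
    for section, required in sections:
        for index, case in enumerate(cases.get(section) or []):
            label = case.get("case_id", f"{section}[{index}]")
            missing = required - case.keys()
            if missing:
                problems.append(f"{label}: missing {sorted(missing)}")
            if label in seen_ids:
                problems.append(f"{label}: duplicate case_id")
            seen_ids.add(label)
            if section == "variation" and "variations" in case and not case["variations"]:
                problems.append(f"{label}: variations is empty")
            if section == "entity":
                overlap = (set(case.get("ground_truth_doc_ids", []))
                           & set(case.get("should_not_match", [])))
                if overlap:
                    problems.append(f"{label}: {sorted(overlap)} is both expected and forbidden")
    return problems

cases = {
    "variation": [{"case_id": "var_custom_001", "canonical_form": "your_org_term",
                   "variations": ["variation1", "variation2"],
                   "ground_truth_doc_id": "DOC_CUSTOM_001", "variation_type": "custom"}],
    "entity": [{"case_id": "ent_custom_001", "query": "Query containing your custom entity",
                "entity": "your_entity", "entity_type": "CUSTOM_TYPE",
                "ground_truth_doc_ids": ["DOC_A", "DOC_B"], "should_not_match": ["DOC_NEG"]}],
}
print(validate_cases(cases))  # []
```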
Step 2: Add Test Documents
from backend.rag_milvus import add_user_text
# `rag` is the EvoRAG instance created earlier; `source` must match ground_truth_doc_id
add_user_text(
    collection=rag.collection,
    text="Document about your custom term...",
    source="DOC_CUSTOM_001"
)
Step 3: Run Evaluation
results = evaluator.run_full_evaluation(cases["variation"], cases["entity"])
print(results[results["case_id"].str.contains("custom")])
Dashboard Implementation
Streamlit Dashboard
Installation
pip install streamlit matplotlib plotly
app.py
import pandas as pd
import streamlit as st
import yaml
from backend.evaluate_robustness import RAGRobustnessEvaluator
# EvoRAG must also be imported; its module path depends on your project layout.
st.set_page_config(page_title="RAG Evaluation Dashboard", layout="wide")
st.title("🔍 RAG 2.0 Search Evaluation Dashboard")
if st.button("🚀 Run Evaluation"):
    with st.spinner("Evaluating..."):
        rag = EvoRAG(llm_type="huggingface")
        evaluator = RAGRobustnessEvaluator(rag)
        with open("backend/test_cases.yaml", encoding="utf-8") as f:
            cases = yaml.safe_load(f)
        var_results = evaluator.variation_eval.evaluate_variation_set(cases["variation"])
        ent_results = evaluator.entity_eval.evaluate_entity_set(cases["entity"])
    # Display metrics
    col1, col2, col3, col4 = st.columns(4)
    with col1:
        mrr = var_results["mrr_mean"].mean()
        st.metric("Variation MRR", f"{mrr:.1%}", "Target: 70%")
    with col2:
        recall = ent_results["recall"].mean()
        st.metric("Entity Recall", f"{recall:.1%}", "Target: 80%")
    # Display tables
    st.subheader("📊 Variation Results")
    st.dataframe(var_results, use_container_width=True)
    st.subheader("📊 Entity Results")
    st.dataframe(ent_results, use_container_width=True)
Run
streamlit run app.py
# Opens at http://localhost:8501
Troubleshooting
Q: Evaluation takes > 5 minutes
A: Optimize indexes
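from elasticsearch import Elasticsearch
from pymilvus import Collection
collection = Collection("rag_kb")
# metric_type is an assumption; use the metric your embeddings were indexed with
collection.create_index("embedding", {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}})
# Elasticsearch optimization (force merge replaces the removed optimize API)
es = Elasticsearch("http://localhost:9200")
es.indices.forcemerge(index="rag_kb_index_v2", max_num_segments=1)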
Q: NER model not found
A: Install TNER with model
pip install tner
python -c "from transformers import pipeline; \
ner = pipeline('ner', model='tner/roberta-large-japanese-char-luw-ner')"
Q: Sudachi installation fails
A: Install with dependencies
pip install sudachipy sudachidict_small
Q: Elasticsearch connection error
A: Verify Elasticsearch is running
curl -X GET "localhost:9200/_cluster/health?pretty"
# If not running (Docker; security is disabled here for local testing only)
docker run -p 9200:9200 -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.6.2
Best Practices
1. Weekly Evaluation
Cron Job:
# Add to crontab (every Monday 9am)
0 9 * * 1 cd /path/to/rag && python backend/evaluate_robustness.py
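A scheduled run is most useful for monitoring when something checks the output against the targets. The sketch below compares column means with the thresholds from the metric tables in this guide and reports every miss. It assumes the result CSVs contain columns named `mrr_mean`, `found_rate`, `mrr_std`, `recall`, `precision`, and `f1_score`. Only some of those names appear elsewhere in this guide, so adjust them to the actual files. The helper names are hypothetical.

```python
import csv
from statistics import mean

# (column, comparison, target) taken from the metric tables in this guide
VARIATION_TARGETS = [("mrr_mean", ">", 0.7), ("found_rate", ">", 0.8), ("mrr_std", "<", 0.15)]
ENTITY_TARGETS = [("recall", ">", 0.8), ("precision", ">", 0.8), ("f1_score", ">", 0.75)]

def load_rows(path):
    """Read a results CSV into a list of dicts, one per test case."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))

def check_targets(rows, targets):
    """Return a message for every metric whose mean misses its target."""
    failures = []
    for column, op, target in targets:
        value = mean(float(row[column]) for row in rows)
        passed = value > target if op == ">" else value < target
        if not passed:
            failures.append(f"{column}={value:.3f} (target {op} {target})")
    return failures

# In-memory rows standing in for load_rows("backend/variation_results.csv")
sample_rows = [
    {"mrr_mean": "1.0", "found_rate": "1.0", "mrr_std": "0.0"},
    {"mrr_mean": "0.2", "found_rate": "0.5", "mrr_std": "0.4"},
]
failures = check_targets(sample_rows, VARIATION_TARGETS)
for message in failures:
    print("BELOW TARGET:", message)
```

In a cron wrapper, exiting with a non-zero status when `failures` is non-empty lets the scheduler or an external monitor raise the alert.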
2. User Feedback Collection
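from datetime import datetime, timezone
# `app` (a FastAPI-style application) and `db` (a MongoDB-style handle) are assumed to be defined elsewhere
@app.post("/api/v2/rag/feedback")
def record_feedback(query: str, doc_id: str, rating: int):
    """
    rating: 1 (not relevant) to 5 (highly relevant)
    """
    db.feedbacks.insert_one({
        "query": query,
        "doc_id": doc_id,
        "rating": rating,
        "timestamp": datetime.now(timezone.utc)
    })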
3. Monthly Improvement Cycle
Week 1 (Mon) → Run evaluation → Aggregate results
Week 2 (Mon) → Analyze low-rated cases
Week 3 (Mon) → Implement improvements
Week 4 (Mon) → Regression testing
End of month → Production deployment
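For the Week 2 step, the collected feedback has to be reduced to a short list of cases worth investigating. The sketch below groups feedback records by query and document and flags pairs with a low mean rating. The record fields match the feedback endpoint in this guide; the thresholds (`max_mean_rating=2.5`, `min_votes=2`) and the function name are assumptions to tune.

```python
from collections import defaultdict
from statistics import mean

def low_rated_cases(feedbacks, max_mean_rating=2.5, min_votes=2):
    """Return (query, doc_id, mean_rating, votes) tuples, worst first."""
    ratings = defaultdict(list)
    for record in feedbacks:
        ratings[(record["query"], record["doc_id"])].append(record["rating"])
    flagged = []
    for (query, doc_id), values in ratings.items():
        # Skip pairs with too few votes to avoid reacting to a single rating
        if len(values) >= min_votes and mean(values) <= max_mean_rating:
            flagged.append((query, doc_id, mean(values), len(values)))
    return sorted(flagged, key=lambda item: item[2])

# Hypothetical feedback records
feedbacks = [
    {"query": "お盆休みの規定", "doc_id": "DOC_001", "rating": 1},
    {"query": "お盆休みの規定", "doc_id": "DOC_001", "rating": 2},
    {"query": "Project EV-2024-001 progress", "doc_id": "DOC_A", "rating": 5},
    {"query": "Project EV-2024-001 progress", "doc_id": "DOC_A", "rating": 4},
]
print(low_rated_cases(feedbacks))
# [('お盆休みの規定', 'DOC_001', 1.5, 2)]
```

Flagged queries are natural candidates for new entries in test_cases.yaml, which closes the loop between user feedback and the regression tests in Week 4.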
For questions: rag-team@evospikenet.com