RAG 2.0 Japanese Search Evaluation Guide (Detailed)

Version: 1.0
Date: 2026-05-20
Status: Production-Ready
Document ID: RAG-JP-EVAL-GUIDE-001

Overview

This document explains how to evaluate the search accuracy of the EvoSpikeNet RAG system on Japanese-language text.

Target Audience

Developers: Measure search accuracy for Phase 1-5
Operations: Continuous quality monitoring and alerting
Researchers: Algorithm improvement verification

Evaluation Flow

Evaluation Workflow:
  Test cases → Execution (Variation eval + Entity eval) → Metrics aggregation
             → Report generation → Improvement proposals

Quick Start

1. Install Dependencies

cd /path/to/rag-system
pip install pandas numpy pyyaml sudachipy sudachidict_small elasticsearch pymilvus transformers tner

2. Run Evaluation

# Full evaluation
python backend/evaluate_robustness.py

# Variation only
python backend/evaluate_robustness.py --variation-only

# Entity only
python backend/evaluate_robustness.py --entity-only

3. Check Results

Generated Files:

backend/
├── variation_results.csv      # Variation evaluation results
├── entity_results.csv         # Entity evaluation results
└── evaluation_report.txt      # Summary report

Detailed Usage

A. Text Variation Evaluation

A.1 Concept

Text variation refers to different representations of the same concept (common in Japanese):

Example: "盆休み", "お盆休み", "盆期間" (three ways of referring to the Obon holiday period)
→ All should find the same document

Example: "サーバー" vs "サーバ" (both mean "server")
→ Long vowel variation

A.2 Evaluation Metrics

Metric	Definition	Formula	Target
MRR	Reciprocal rank of the correct document, averaged over variations	1/rank, or 0 if not found	> 0.7
found_rate	Share of variations that retrieve the correct document	found/total	> 0.8
MRR_std	Standard deviation of MRR	std(MRRs)	< 0.15
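
The three metrics in the table can be computed from per-variation ranks roughly as follows. This is a minimal sketch, not the code in `evaluate_robustness.py`: the function names and the input convention (a 1-based rank, or `None` when the correct document is not returned) are assumptions, and the population standard deviation is used where the project may use a different estimator.

```python
from statistics import mean, pstdev
from typing import Dict, Optional, Sequence


def reciprocal_rank(rank: Optional[int]) -> float:
    """Return 1/rank for a 1-based rank, or 0.0 when the document was not found."""
    return 1.0 / rank if rank is not None else 0.0


def variation_metrics(ranks: Sequence[Optional[int]]) -> Dict[str, float]:
    """Aggregate MRR, found_rate and MRR_std over the variations of one test case."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    rr = [reciprocal_rank(r) for r in ranks]
    return {
        "mrr": mean(rr),
        "found_rate": sum(r is not None for r in ranks) / len(ranks),
        "mrr_std": pstdev(rr),
    }


# Hypothetical ranks of the ground-truth document for three variations of one term:
# found 1st, found 2nd, and not found at all.
metrics = variation_metrics([1, 2, None])
```

With these inputs the case falls short of all three targets, which is the kind of case the improvement guide later in this document addresses.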

A.3 Implementation

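import yaml

from backend.evaluate_robustness import RAGRobustnessEvaluator
# EvoRAG must also be imported from the project's RAG module
# (its import path is not given in this guide).

# Build the RAG system and wrap it in the evaluator
rag = EvoRAG(llm_type="huggingface")
evaluator = RAGRobustnessEvaluator(rag)

# Read the test cases as UTF-8 so Japanese queries load correctly on every platform
with open("backend/test_cases.yaml", encoding="utf-8") as f:
    cases = yaml.safe_load(f)

# Evaluate every variation case and save the per-case results
results_df = evaluator.variation_eval.evaluate_variation_set(cases["variation"])
results_df.to_csv("variation_results.csv", index=False)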

A.4 Result Interpretation

MRR Value  │ Rank  │ Assessment
───────────┼───────┼─────────────
1.0        │ 1st   │ ✓ Perfect
0.5        │ 2nd   │ ✓ Good
0.33       │ 3rd   │ △ Needs work
0.0        │ None  │ ✗ Not found

B. Named Entity Evaluation

B.1 Concept

Named entities are proper nouns and identifiers that denote one specific thing: person names, organizations, project IDs, product names.

Example: Project ID "EV-2024-001"
→ Should match exactly when it appears in the query

Example: Person name "John Doe"
→ Should not be confused with "Jane Doe"

B.2 Evaluation Metrics

Metric	Definition	Formula	Target
Recall	Correct docs found / total correct	found/total_correct	> 0.8
Precision	Correct results / all results	correct/returned	> 0.8
F1 Score	Harmonic mean	2(P*R)/(P+R)	> 0.75
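
The formulas in the table translate directly into set operations on document IDs. The sketch below assumes each result is identified by a unique document ID and that an empty result list scores 0 rather than raising an error; the function name `entity_metrics` is illustrative and not part of the project API.

```python
from typing import Dict, Iterable


def entity_metrics(returned: Iterable[str], relevant: Iterable[str]) -> Dict[str, float]:
    """Compute precision, recall and F1 for one entity query."""
    returned_set, relevant_set = set(returned), set(relevant)
    hits = len(returned_set & relevant_set)
    precision = hits / len(returned_set) if returned_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}


# Hypothetical query: two documents are relevant, the search returns one of them
# plus one unrelated document.
m = entity_metrics(returned=["DOC_A", "DOC_X"], relevant=["DOC_A", "DOC_B"])
```

Here precision, recall and F1 all come out at 0.5, below every target in the table.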

B.3 Implementation

entity_cases = cases["entity"]
results_df = evaluator.entity_eval.evaluate_entity_set(entity_cases)

# Summary by entity type
summary = results_df.groupby("entity_type")[["recall", "precision", "f1_score"]].mean()
print(summary)

results_df.to_csv("entity_results.csv", index=False)

Evaluation Criteria & Improvement Guide

Variation: MRR < 0.5

Symptoms:

Query: "お盆休みの規定" ("Obon holiday rules")
Expected: DOC_001 appears in top-3
Actual: DOC_001 not found (MRR = 0.0)

Improvement Strategy:

Upgrade Sudachi (Priority: High)
```
pip install --upgrade sudachipy sudachidict_small
```
Sudachi's split modes (A, B, C) and normalized forms make it better suited to compound words and spelling variants than a fixed-granularity tokenizer.

Add Normalization Mappings (Priority: Medium)

{
  "char_filter": {
    "mapping": {
      "type": "mapping",
      "mappings": [
        "サーバー => サーバ",
        "データベース => DB"
      ]
    }
  }
}
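
The effect of such mappings can be checked outside Elasticsearch with a small Python sketch. It is only an approximation of the character filter: the NFKC step (which folds half-width katakana into full-width) and the longest-key-first replacement order are assumptions added here, and the `normalize` helper is hypothetical.

```python
import unicodedata
from typing import Dict

# The same two rules as in the mapping example.
MAPPINGS: Dict[str, str] = {
    "サーバー": "サーバ",
    "データベース": "DB",
}


def normalize(text: str, mappings: Dict[str, str] = MAPPINGS) -> str:
    """Apply Unicode NFKC folding, then replace each mapped surface form."""
    text = unicodedata.normalize("NFKC", text)
    # Replace longer keys first so a short key never pre-empts a longer one.
    for source in sorted(mappings, key=len, reverse=True):
        text = text.replace(source, mappings[source])
    return text


normalized = normalize("サーバーのデータベース")
```

Whatever normalization is chosen, it needs to be applied identically at index time and at query time, otherwise the two sides stop matching.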

Increase Vector Search Weight (Priority: Medium)

final_score = 0.3 * bm25_score + 0.7 * vector_score
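
BM25 and vector similarity scores live on different scales, so a weighted sum is usually only meaningful after both are brought into a common range. The sketch below assumes min-max normalization per query and treats a document missing from one result list as scoring 0 there; neither choice is stated in this guide, and the function names are illustrative.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; identical scores all map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(
    bm25: Dict[str, float], vector: Dict[str, float], vector_weight: float = 0.7
) -> Dict[str, float]:
    """Combine normalized BM25 and vector scores with the given vector weight."""
    bm25_n, vector_n = min_max(bm25), min_max(vector)
    return {
        doc: (1 - vector_weight) * bm25_n.get(doc, 0.0)
        + vector_weight * vector_n.get(doc, 0.0)
        for doc in set(bm25) | set(vector)
    }


# Hypothetical raw scores for three documents.
scores = hybrid_scores(
    bm25={"DOC_A": 12.0, "DOC_B": 8.0, "DOC_C": 4.0},
    vector={"DOC_A": 0.5, "DOC_B": 0.9, "DOC_C": 0.7},
)
ranked = sorted(scores, key=scores.get, reverse=True)
```

With a 0.7 vector weight, the document that leads on vector similarity (DOC_B) overtakes the BM25 leader (DOC_A), which is the intended effect of raising the weight.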

Entity: Recall < 0.8

Symptoms:

Query: "Project EV-2024-001 progress"
Expected: DOC_A, DOC_B found
Actual: Only DOC_A found (Recall = 50%)

Improvement Strategy:

Add NER Layer (Priority: High)

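from transformers import pipeline

# Load the NER model and merge sub-token predictions into whole entity spans
ner = pipeline("ner", model="tner/roberta-large-japanese-char-luw-ner", aggregation_strategy="simple")
entities = ner("Project EV-2024-001")
# Index or query each detected span as a single term so the tokenizer does not split it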

Increase Entity Boost (Priority: High)

es_query = {
    "term": {
        "entity_project": {
            "value": "EV-2024-001",
            "boost": 5.0  # Increase boost multiplier
        }
    }
}
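
A boosted term clause is typically combined with the ordinary full-text match so that exact entity hits rank higher without excluding other documents. The helper below is a sketch of that idea: `build_entity_query` and the `content` text field are hypothetical names, while `entity_project` and the boost of 5.0 come from the example above.

```python
from typing import Any, Dict


def build_entity_query(
    text: str,
    entity: str,
    entity_field: str = "entity_project",
    text_field: str = "content",  # hypothetical name of the full-text field
    boost: float = 5.0,
) -> Dict[str, Any]:
    """Build a bool query: full-text match is required, exact entity match is boosted."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {text_field: text}}],
                "should": [
                    {"term": {entity_field: {"value": entity, "boost": boost}}}
                ],
            }
        }
    }


es_body = build_entity_query("Project EV-2024-001 progress", "EV-2024-001")
```

Placing the term clause under `should` keeps it as a ranking signal; moving it to `must` or `filter` would instead restrict results to documents tagged with the entity.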

Build Internal Dictionary (Priority: Medium)

Document all organizational entities (projects, departments, products).
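
An internal dictionary can be applied with plain string and pattern matching before any NER model is involved. The sketch below is illustrative only: the project ID pattern is inferred from the single example "EV-2024-001", and the dictionary entry and `find_entities` helper are hypothetical. Lookarounds are used instead of `\b` because Japanese characters count as word characters in Python regular expressions, so `\b` would not fire between Japanese text and an ID.

```python
import re
from typing import Dict, List, Tuple

# Two uppercase letters, four digits, three digits, not embedded in a longer
# ASCII alphanumeric token (pattern assumed from the example "EV-2024-001").
PROJECT_ID = re.compile(r"(?<![A-Za-z0-9])[A-Z]{2}-\d{4}-\d{3}(?![A-Za-z0-9])")


def find_entities(text: str, dictionary: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (surface form, entity type) pairs found in the text."""
    found = [(m.group(0), "PROJECT_ID") for m in PROJECT_ID.finditer(text)]
    for surface, entity_type in dictionary.items():
        if surface in text:
            found.append((surface, entity_type))
    return found


# Hypothetical dictionary with one person entry.
internal_dictionary = {"John Doe": "PERSON"}
entities = find_entities("John Doe reported on プロジェクトEV-2024-001の進捗", internal_dictionary)
```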

Test Set Customization

Add Organization-Specific Terms

Step 1: Edit test_cases.yaml

variation:
  - case_id: "var_custom_001"
    canonical_form: "your_org_term"
    variations:
      - "variation1"
      - "variation2"
    ground_truth_doc_id: "DOC_CUSTOM_001"
    variation_type: "custom"

entity:
  - case_id: "ent_custom_001"
    query: "Query containing your custom entity"
    entity: "your_entity"
    entity_type: "CUSTOM_TYPE"
    ground_truth_doc_ids: ["DOC_A", "DOC_B"]
    should_not_match: ["DOC_NEG"]
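
Custom cases are easy to get subtly wrong, so it can help to check the loaded YAML before running an evaluation. The validator below is a sketch: the required keys are taken from the example above, `should_not_match` is assumed to be optional, and `validate_cases` is a hypothetical helper that is not part of the project.

```python
from typing import Dict, List, Set

# Keys each case is expected to carry, based on the example test_cases.yaml.
REQUIRED_KEYS: Dict[str, Set[str]] = {
    "variation": {"case_id", "canonical_form", "variations",
                  "ground_truth_doc_id", "variation_type"},
    "entity": {"case_id", "query", "entity", "entity_type", "ground_truth_doc_ids"},
}


def validate_cases(cases: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the file looks sane."""
    errors: List[str] = []
    seen_ids: Set[str] = set()
    for section, required in REQUIRED_KEYS.items():
        for index, case in enumerate(cases.get(section) or []):
            label = str(case.get("case_id", f"{section}[{index}]"))
            missing = sorted(required - set(case))
            if missing:
                errors.append(f"{label}: missing keys {missing}")
            if label in seen_ids:
                errors.append(f"{label}: duplicate case_id")
            seen_ids.add(label)
            # A document cannot be both an expected hit and a forbidden hit.
            overlap = set(case.get("ground_truth_doc_ids", [])) & set(
                case.get("should_not_match", [])
            )
            if overlap:
                errors.append(f"{label}: {sorted(overlap)} is both expected and should_not_match")
    return errors
```

A typical use is to call `validate_cases(yaml.safe_load(f))` right after loading the file and stop if the returned list is not empty.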

Step 2: Add Test Documents

from backend.rag_milvus import add_user_text

add_user_text(
    collection=rag.collection,
    text="Document about your custom term...",
    source="DOC_CUSTOM_001"
)

Step 3: Run Evaluation

results = evaluator.run_full_evaluation(cases["variation"], cases["entity"])
print(results[results["case_id"].str.contains("custom")])

Dashboard Implementation

Streamlit Dashboard

Installation

pip install streamlit matplotlib plotly

app.py

import pandas as pd
import streamlit as st
import yaml

from backend.evaluate_robustness import RAGRobustnessEvaluator
# EvoRAG must also be imported from the project's RAG module
# (its import path is not given in this guide).

st.set_page_config(page_title="RAG Evaluation Dashboard", layout="wide")
st.title("🔍 RAG 2.0 Search Evaluation Dashboard")

if st.button("🚀 Run Evaluation"):
    with st.spinner("Evaluating..."):
        rag = EvoRAG(llm_type="huggingface")
        evaluator = RAGRobustnessEvaluator(rag)

        # Read the test cases as UTF-8 so Japanese queries load correctly
        with open("backend/test_cases.yaml", encoding="utf-8") as f:
            cases = yaml.safe_load(f)

        var_results = evaluator.variation_eval.evaluate_variation_set(cases["variation"])
        ent_results = evaluator.entity_eval.evaluate_entity_set(cases["entity"])

    # Display headline metrics next to their targets
    # (delta_color="off" keeps the target text from being styled as a change)
    col1, col2 = st.columns(2)
    with col1:
        mrr = var_results["mrr_mean"].mean()
        st.metric("Variation MRR", f"{mrr:.2f}", "Target: > 0.70", delta_color="off")
    with col2:
        recall = ent_results["recall"].mean()
        st.metric("Entity Recall", f"{recall:.1%}", "Target: > 80%", delta_color="off")

    # Display tables
    st.subheader("📊 Variation Results")
    st.dataframe(var_results, use_container_width=True)

    st.subheader("📊 Entity Results")
    st.dataframe(ent_results, use_container_width=True)

Run

streamlit run app.py
# Opens at http://localhost:8501

Troubleshooting

Q: Evaluation takes > 5 minutes

A: Optimize indexes

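from elasticsearch import Elasticsearch
from pymilvus import Collection

# Milvus: metric_type must match the one used at search time ("L2" is an assumption)
collection = Collection("rag_kb")
collection.create_index("embedding", {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}})

# Elasticsearch: _forcemerge replaces the removed _optimize API
es = Elasticsearch("http://localhost:9200")
es.indices.forcemerge(index="rag_kb_index_v2", max_num_segments=1)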

Q: NER model not found

A: Install TNER with model

pip install tner
python -c "from transformers import pipeline; \
           ner = pipeline('ner', model='tner/roberta-large-japanese-char-luw-ner')"

Q: Sudachi installation fails

A: Install SudachiPy together with a dictionary package

pip install sudachipy sudachidict_small

Q: Elasticsearch connection error

A: Verify Elasticsearch is running

curl -X GET "localhost:9200/_cluster/health?pretty"

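# If not running (Docker, local testing only: security is disabled so plain HTTP on port 9200 works)
docker run -p 9200:9200 -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.6.2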

Best Practices

1. Weekly Evaluation

Cron Job:

# Add to crontab (every Monday 9am)
0 9 * * 1 cd /path/to/rag && python backend/evaluate_robustness.py
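
For alerting, the scheduled run can be followed by a small check that compares the generated CSV files with the targets in this guide and returns a non-zero status when a target is missed. This is a sketch under assumptions: the column names `mrr_mean` and `recall` are the ones used in the dashboard example, the file paths follow the "Generated Files" listing, and the helper names are hypothetical.

```python
import csv
from statistics import mean
from typing import Dict, List

# Targets copied from the metric tables in this guide.
TARGETS: Dict[str, float] = {"mrr_mean": 0.7, "recall": 0.8}


def column_mean(csv_path: str, column: str) -> float:
    """Average one numeric column of a results CSV."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return mean(float(row[column]) for row in csv.DictReader(f))


def failed_targets(observed: Dict[str, float], targets: Dict[str, float] = TARGETS) -> List[str]:
    """List every metric that does not exceed its target."""
    return [
        f"{name}={observed[name]:.3f} (target > {target})"
        for name, target in targets.items()
        if observed[name] <= target
    ]


def main(
    variation_csv: str = "backend/variation_results.csv",
    entity_csv: str = "backend/entity_results.csv",
) -> int:
    """Return 0 when all targets are met, 1 otherwise."""
    observed = {
        "mrr_mean": column_mean(variation_csv, "mrr_mean"),
        "recall": column_mean(entity_csv, "recall"),
    }
    failures = failed_targets(observed)
    for line in failures:
        print("BELOW TARGET:", line)
    return 1 if failures else 0
```

Ending the script with `sys.exit(main())` lets cron wrappers or CI jobs react to the exit status.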

2. User Feedback Collection

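from datetime import datetime, timezone

# Assumes `app` is the project's FastAPI application and `db` a MongoDB (pymongo) database handle.
@app.post("/api/v2/rag/feedback")
def record_feedback(query: str, doc_id: str, rating: int):
    """
    rating: 1 (not relevant) to 5 (highly relevant)
    """
    db.feedbacks.insert_one({
        "query": query,
        "doc_id": doc_id,
        "rating": rating,
        "timestamp": datetime.now(timezone.utc)  # timezone-aware UTC timestamp
    })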

3. Monthly Improvement Cycle

Week 1 (Mon) → Run evaluation → Aggregate results
Week 2 (Mon) → Analyze low-rated cases
Week 3 (Mon) → Implement improvements
Week 4 (Mon) → Regression testing
End of month  → Production deployment
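
The Week 2 analysis of low-rated cases can start from a simple aggregation of the collected feedback. The sketch below assumes feedback records shaped like the ones stored by the feedback endpoint (`query` and `rating` fields); the thresholds and the `low_rated_queries` helper are hypothetical and would need tuning.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def low_rated_queries(
    feedbacks: Iterable[Dict],
    max_mean_rating: float = 2.5,  # hypothetical cut-off on the 1 to 5 scale
    min_votes: int = 2,            # ignore queries with too little feedback
) -> List[Tuple[str, float, int]]:
    """Return (query, mean rating, vote count) for poorly rated queries, worst first."""
    ratings_by_query: Dict[str, List[int]] = defaultdict(list)
    for feedback in feedbacks:
        ratings_by_query[feedback["query"]].append(feedback["rating"])
    flagged = [
        (query, mean(ratings), len(ratings))
        for query, ratings in ratings_by_query.items()
        if len(ratings) >= min_votes and mean(ratings) <= max_mean_rating
    ]
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical feedback records.
sample = [
    {"query": "q1", "doc_id": "DOC_A", "rating": 1},
    {"query": "q1", "doc_id": "DOC_B", "rating": 2},
    {"query": "q2", "doc_id": "DOC_A", "rating": 5},
    {"query": "q2", "doc_id": "DOC_C", "rating": 4},
    {"query": "q3", "doc_id": "DOC_D", "rating": 1},
]
flagged = low_rated_queries(sample)
```

Queries flagged this way are natural candidates for new entries in test_cases.yaml, so that the next evaluation run covers them.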

For questions: rag-team@evospikenet.com