
RAG 2.0 Japanese Search Evaluation Guide (Detailed)

Version: 1.0
Date: 2026-05-20
Status: Production-Ready
Document ID: RAG-JP-EVAL-GUIDE-001


Table of Contents

  1. Overview
  2. Quick Start
  3. Detailed Usage
  4. Evaluation Criteria & Improvement Guide
  5. Test Set Customization
  6. Dashboard Implementation
  7. Troubleshooting
  8. Best Practices

Overview

This document explains how to evaluate the search accuracy of the EvoSpikeNet RAG system on Japanese-language queries.

Target Audience

  • Developers: Measure search accuracy for Phase 1-5
  • Operations: Continuous quality monitoring and alerting
  • Researchers: Algorithm improvement verification

Evaluation Flow

Evaluation Workflow:
  Test cases → Variation eval + Entity eval → Metrics aggregation
              ↓                              ↓
              Execution                      Report generation
              ↓
              Improvement proposals

Quick Start

1. Install Dependencies

cd /path/to/rag-system
pip install pandas numpy pyyaml sudachipy sudachidict_small elasticsearch pymilvus transformers tner

2. Run Evaluation

# Full evaluation
python backend/evaluate_robustness.py

# Variation only
python backend/evaluate_robustness.py --variation-only

# Entity only
python backend/evaluate_robustness.py --entity-only

3. Check Results

Generated Files:

backend/
├── variation_results.csv      # Variation evaluation results
├── entity_results.csv         # Entity evaluation results
└── evaluation_report.txt      # Summary report


Detailed Usage

A. Text Variation Evaluation

A.1 Concept

Text variation refers to different representations of the same concept (common in Japanese):

Example: "盆休み", "お盆休み", "盆期間"
→ All should find the same document

Example: "サーバー" vs "サーバ"
→ Long vowel variation
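
One way to make such variants comparable before indexing or querying is to normalize both sides. The sketch below is an illustration only, not the EvoSpikeNet implementation: the function name normalize_variation and the four-character threshold are assumptions. It covers width variants and trailing long-vowel marks, not synonym-like variants such as "盆休み" and "お盆休み".

```python
import re
import unicodedata

# Runs of katakana, including the prolonged sound mark "ー".
_KATAKANA_RUN = re.compile(r"[ァ-ヴー]+")


def _strip_trailing_long_vowel(match: re.Match) -> str:
    word = match.group(0)
    # Heuristic (assumption): only trim words of 4+ characters so that
    # short words such as "コピー" are left untouched.
    if len(word) >= 4 and word.endswith("ー"):
        return word[:-1]
    return word


def normalize_variation(text: str) -> str:
    """Fold width variants (NFKC) and drop trailing long-vowel marks."""
    text = unicodedata.normalize("NFKC", text)
    return _KATAKANA_RUN.sub(_strip_trailing_long_vowel, text)


# Both spellings collapse to the same normalized form.
print(normalize_variation("サーバー") == normalize_variation("サーバ"))
```

Applying the same function to documents at index time and to queries at search time keeps the two sides consistent.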

A.2 Evaluation Metrics

Metric     │ Definition                              │ Formula     │ Target
───────────┼─────────────────────────────────────────┼─────────────┼────────
MRR        │ Reciprocal rank of correct document     │ 1/rank or 0 │ > 0.7
found_rate │ Ratio of variations finding correct doc │ found/total │ > 0.8
MRR_std    │ Standard deviation of MRR               │ std(MRRs)   │ < 0.15
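
As a worked illustration of these three metrics, the sketch below computes them from the rank at which each variation retrieved the correct document. The function name variation_metrics is hypothetical, and the use of the population standard deviation over per-variation reciprocal ranks is an assumption; the evaluator in backend/evaluate_robustness.py may aggregate differently.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def variation_metrics(ranks: Sequence[Optional[int]]) -> dict:
    """ranks[i] is the 1-based rank of the correct document for
    variation i, or None when the document was not retrieved."""
    if not ranks:
        raise ValueError("at least one variation is required")
    # Reciprocal rank: 1/rank when found, 0 otherwise.
    reciprocal = [1.0 / r if r is not None else 0.0 for r in ranks]
    return {
        "mrr": mean(reciprocal),
        "found_rate": sum(r is not None for r in ranks) / len(ranks),
        # Assumption: population standard deviation.
        "mrr_std": pstdev(reciprocal),
    }


# Three variations: found at rank 1, found at rank 2, not found.
print(variation_metrics([1, 2, None]))
```

With ranks of 1, 2 and "not found", the reciprocal ranks are 1.0, 0.5 and 0.0, so the MRR is 0.5 and the found_rate is 2/3, both below target.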

A.3 Implementation

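import yaml

from backend.evaluate_robustness import RAGRobustnessEvaluator
# EvoRAG must also be imported; this guide does not state which module defines it.

rag = EvoRAG(llm_type="huggingface")
evaluator = RAGRobustnessEvaluator(rag)

# Read as UTF-8 explicitly so Japanese test cases load on every platform
with open("backend/test_cases.yaml", encoding="utf-8") as f:
    cases = yaml.safe_load(f)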

results_df = evaluator.variation_eval.evaluate_variation_set(cases["variation"])
results_df.to_csv("variation_results.csv", index=False)

A.4 Result Interpretation

MRR Value  │ Rank  │ Assessment
───────────┼───────┼─────────────
1.0        │ 1st   │ ✓ Perfect
0.5        │ 2nd   │ ✓ Good
0.33       │ 3rd   │ △ Needs work
0.0        │ None  │ ✗ Not found

B. Named Entity Evaluation

B.1 Concept

Named entities are meaningful noun phrases: person names, organizations, project IDs, product names.

Example: Project ID "EV-2024-001"
→ Should match exactly when in query

Example: Person name "John Doe"
→ Should not confuse with "Jane Doe"

B.2 Evaluation Metrics

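Metric    │ Definition                         │ Formula             │ Target
──────────┼────────────────────────────────────┼─────────────────────┼────────
Recall    │ Correct docs found / total correct │ found/total_correct │ > 0.8
Precision │ Correct results / all results      │ correct/returned    │ > 0.8
F1 Score  │ Harmonic mean of P and R           │ 2(P*R)/(P+R)        │ > 0.75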
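
The formulas above can be sketched for a single test case as follows. The function name entity_metrics is hypothetical, and treating results as sets of document IDs (ignoring rank and duplicates) is an assumption about how the evaluator counts matches.

```python
def entity_metrics(returned_ids, ground_truth_ids) -> dict:
    """Precision, recall and F1 for one entity test case."""
    returned = set(returned_ids)
    truth = set(ground_truth_ids)
    hits = len(returned & truth)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(truth) if truth else 0.0
    # F1 is the harmonic mean of precision and recall; 0 when nothing matched.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}


# Only one of the two correct documents is returned.
print(entity_metrics(["DOC_A"], ["DOC_A", "DOC_B"]))
```

In this example precision is 1.0 but recall is 0.5, so the F1 score is about 0.67 and the case misses both the recall and F1 targets.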

B.3 Implementation

entity_cases = cases["entity"]
results_df = evaluator.entity_eval.evaluate_entity_set(entity_cases)

# Summary by entity type
summary = results_df.groupby("entity_type")[["recall", "precision", "f1_score"]].mean()
print(summary)

results_df.to_csv("entity_results.csv", index=False)

Evaluation Criteria & Improvement Guide

Variation: MRR < 0.5

Symptoms:

Query: "お盆休みの規定"
Expected: DOC_001 appears in top-3
Actual: DOC_001 not found (MRR = 0.0)

Improvement Strategy:

  1. Upgrade Sudachi (Priority: High)

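    pip install --upgrade sudachipy sudachidict_small
    
    Sudachi offers three split modes (A: short units, B: intermediate, C: long units close to named entities) and a normalized form for each token, so compound words and spelling variants are handled more consistently than with a single-granularity tokenizer.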

  2. Add Normalization Mappings (Priority: Medium)

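    {
      "char_filter": {
        "variation_mapping": {
          "type": "mapping",
          "mappings": [
            "サーバー => サーバ",
            "データベース => DB"
          ]
        }
      }
    }
    
    A mapping character filter requires "type": "mapping". The filter name variation_mapping is illustrative; place this fragment under settings.analysis of the index and reference the filter from the analyzer.
    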

  3. Increase Vector Search Weight (Priority: Medium)

    final_score = 0.3 * bm25_score + 0.7 * vector_score
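
BM25 and vector similarity scores usually live on different scales, so a weighted sum is easier to reason about after normalization. The sketch below is one possible way to apply the 0.3 / 0.7 weighting. The min-max normalization step and the function names are assumptions added for illustration; the original formula adds the raw scores.

```python
def min_max(scores: dict) -> dict:
    """Scale scores to [0, 1]; a constant score list maps to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(bm25: dict, vector: dict,
                  bm25_weight: float = 0.3, vector_weight: float = 0.7) -> dict:
    """Weighted sum of normalized scores; a missing score counts as 0."""
    bm25_n = min_max(bm25) if bm25 else {}
    vector_n = min_max(vector) if vector else {}
    return {
        doc_id: bm25_weight * bm25_n.get(doc_id, 0.0)
        + vector_weight * vector_n.get(doc_id, 0.0)
        for doc_id in set(bm25_n) | set(vector_n)
    }


scores = hybrid_scores(
    bm25={"DOC_001": 12.0, "DOC_002": 4.0},
    vector={"DOC_001": 0.2, "DOC_003": 0.9},
)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Raising vector_weight favors documents that are semantically close to the query even when the surface form differs, which is the case the variation test set probes.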
    


Entity: Recall < 0.8

Symptoms:

Query: "Project EV-2024-001 progress"
Expected: DOC_A, DOC_B found
Actual: Only DOC_A found (Recall = 50%)

Improvement Strategy:

  1. Add NER Layer (Priority: High)

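    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word tokens into whole entity spans
    ner = pipeline("ner", model="tner/roberta-large-japanese-char-luw-ner",
                   aggregation_strategy="simple")
    entities = ner("Project EV-2024-001")
    # Index and query each detected span as a single, unsplit term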
    

  2. Increase Entity Boost (Priority: High)

    es_query = {
        "term": {
            "entity_project": {
                "value": "EV-2024-001",
                "boost": 5.0  # Increase boost multiplier
            }
        }
    }
    

  3. Build Internal Dictionary (Priority: Medium)

    Document all organizational entities (projects, departments, products).
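
The dictionary and boost strategies above can be combined: look up known entities in the query, then add a boosted term clause for each one. The sketch below only builds the query body and does not call Elasticsearch. The dictionary entries, the entity_person field, the content field, and both function names are hypothetical; only entity_project and the boost of 5.0 come from the example above.

```python
# Surface form -> index field that stores this entity type (illustrative).
ENTITY_DICTIONARY = {
    "EV-2024-001": "entity_project",
    "John Doe": "entity_person",   # hypothetical field name
    "Jane Doe": "entity_person",   # hypothetical field name
}


def find_entities(text: str, dictionary: dict) -> list:
    """Return (surface, field) pairs found in text, preferring longer entries."""
    taken = [False] * len(text)
    found = []
    for surface in sorted(dictionary, key=len, reverse=True):
        start = text.find(surface)
        while start != -1:
            end = start + len(surface)
            # Skip matches that overlap an already accepted entity.
            if not any(taken[start:end]):
                taken[start:end] = [True] * (end - start)
                found.append((start, surface, dictionary[surface]))
            start = text.find(surface, start + 1)
    return [(surface, field) for _, surface, field in sorted(found)]


def build_entity_query(query_text: str, dictionary: dict, boost: float = 5.0) -> dict:
    """Full-text match plus one boosted exact-term clause per known entity."""
    should = [
        {"term": {field: {"value": surface, "boost": boost}}}
        for surface, field in find_entities(query_text, dictionary)
    ]
    return {"bool": {"must": [{"match": {"content": query_text}}], "should": should}}


print(build_entity_query("Project EV-2024-001 progress", ENTITY_DICTIONARY))
```

Because the lookup is an exact string match, "John Doe" in a query never adds a clause for "Jane Doe", which is the confusion the entity test cases check for.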


Test Set Customization

Add Organization-Specific Terms

Step 1: Edit test_cases.yaml

variation:
  - case_id: "var_custom_001"
    canonical_form: "your_org_term"
    variations:
      - "variation1"
      - "variation2"
    ground_truth_doc_id: "DOC_CUSTOM_001"
    variation_type: "custom"

entity:
  - case_id: "ent_custom_001"
    query: "Query containing your custom entity"
    entity: "your_entity"
    entity_type: "CUSTOM_TYPE"
    ground_truth_doc_ids: ["DOC_A", "DOC_B"]
    should_not_match: ["DOC_NEG"]
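
Before running the evaluator on a customized file, it can help to check that every case carries the fields shown above. The sketch below works on the dictionary returned by yaml.safe_load. The function name validate_cases is hypothetical, and treating should_not_match as optional is an assumption.

```python
VARIATION_KEYS = {"case_id", "canonical_form", "variations",
                  "ground_truth_doc_id", "variation_type"}
ENTITY_KEYS = {"case_id", "query", "entity", "entity_type",
               "ground_truth_doc_ids"}


def validate_cases(cases: dict) -> list:
    """Return a list of human-readable problems; empty means the file looks OK."""
    problems = []
    seen = set()
    for section, required in (("variation", VARIATION_KEYS), ("entity", ENTITY_KEYS)):
        for i, case in enumerate(cases.get(section) or []):
            label = case.get("case_id", f"{section}[{i}]")
            missing = sorted(required - case.keys())
            if missing:
                problems.append(f"{label}: missing {', '.join(missing)}")
            if label in seen:
                problems.append(f"{label}: duplicate case_id")
            seen.add(label)
    return problems


example = {
    "variation": [{"case_id": "var_custom_001", "canonical_form": "your_org_term",
                   "variations": ["variation1"], "ground_truth_doc_id": "DOC_CUSTOM_001",
                   "variation_type": "custom"}],
    "entity": [{"case_id": "ent_custom_001", "query": "q", "entity": "your_entity"}],
}
print(validate_cases(example))
```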

Step 2: Add Test Documents

from backend.rag_milvus import add_user_text

add_user_text(
    collection=rag.collection,
    text="Document about your custom term...",
    source="DOC_CUSTOM_001"
)

Step 3: Run Evaluation

results = evaluator.run_full_evaluation(cases["variation"], cases["entity"])
print(results[results["case_id"].str.contains("custom")])

Dashboard Implementation

Streamlit Dashboard

Installation

pip install streamlit matplotlib plotly

app.py

import pandas as pd
import streamlit as st
import yaml

from backend.evaluate_robustness import RAGRobustnessEvaluator
# EvoRAG must also be imported; this guide does not state which module defines it.

st.set_page_config(page_title="RAG Evaluation Dashboard", layout="wide")
st.title("🔍 RAG 2.0 Search Evaluation Dashboard")

if st.button("🚀 Run Evaluation"):
    with st.spinner("Evaluating..."):
        rag = EvoRAG(llm_type="huggingface")
        evaluator = RAGRobustnessEvaluator(rag)

        with open("backend/test_cases.yaml", encoding="utf-8") as f:
            cases = yaml.safe_load(f)

        var_results = evaluator.variation_eval.evaluate_variation_set(cases["variation"])
        ent_results = evaluator.entity_eval.evaluate_entity_set(cases["entity"])

    # Display headline metrics next to their targets
    col1, col2 = st.columns(2)
    with col1:
        mrr = var_results["mrr_mean"].mean()
        st.metric("Variation MRR", f"{mrr:.2f}", "Target: > 0.70", delta_color="off")
    with col2:
        recall = ent_results["recall"].mean()
        st.metric("Entity Recall", f"{recall:.1%}", "Target: > 80%", delta_color="off")

    # Display tables
    st.subheader("📊 Variation Results")
    st.dataframe(var_results, use_container_width=True)

    st.subheader("📊 Entity Results")
    st.dataframe(ent_results, use_container_width=True)

Run

streamlit run app.py
# Opens at http://localhost:8501

Troubleshooting

Q: Evaluation takes > 5 minutes

A: Optimize indexes

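from elasticsearch import Elasticsearch
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")  # adjust to your Milvus endpoint
collection = Collection("rag_kb")
# metric_type is an assumption; it must match the metric used at search time
index_params = {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}}
collection.create_index("embedding", index_params)

# Elasticsearch optimization (the old "optimize" API was replaced by force merge)
es = Elasticsearch("http://localhost:9200")  # adjust to your cluster
es.indices.forcemerge(index="rag_kb_index_v2")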

Q: NER model not found

A: Install TNER with model

pip install tner
python -c "from transformers import pipeline; \
           ner = pipeline('ner', model='tner/roberta-large-japanese-char-luw-ner')"

Q: Sudachi installation fails

A: Install with dependencies

pip install sudachipy sudachidict_small

Q: Elasticsearch connection error

A: Verify Elasticsearch is running

curl -X GET "localhost:9200/_cluster/health?pretty"

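# If not running (Docker). Security is disabled here for local testing only,
# because Elasticsearch 8 enables TLS and authentication by default.
docker run -p 9200:9200 -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.6.2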

Best Practices

1. Weekly Evaluation

Cron Job:

# Add to crontab (every Monday 9am)
0 9 * * 1 cd /path/to/rag && python backend/evaluate_robustness.py
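
For the scheduled run to support alerting, something has to compare the generated CSV files with the targets. The sketch below shows one possible check. The targets come from the metric tables in this guide and the column names from its code samples, but the helper names and the assumption that each CSV holds one row per case are illustrative.

```python
import csv
import io

# Targets from the metric tables; a metric fails when it does not exceed its target.
TARGETS = {"mrr_mean": 0.7, "recall": 0.8, "precision": 0.8, "f1_score": 0.75}


def column_mean(csv_file, column: str) -> float:
    """Average one numeric column of an open CSV file."""
    values = [float(row[column]) for row in csv.DictReader(csv_file)]
    if not values:
        raise ValueError(f"no rows found for column {column!r}")
    return sum(values) / len(values)


def failed_targets(means: dict) -> list:
    """Describe every metric in means that is at or below its target."""
    return [
        f"{name}: {means[name]:.3f} <= target {target}"
        for name, target in TARGETS.items()
        if name in means and means[name] <= target
    ]


# In-memory stand-in for backend/variation_results.csv
sample = io.StringIO("case_id,mrr_mean\nvar_001,1.0\nvar_002,0.5\n")
means = {"mrr_mean": column_mean(sample, "mrr_mean"), "recall": 0.5}
print(failed_targets(means))
```

A wrapper script could exit with a non-zero status when failed_targets returns anything, so that the cron job's failure handling or a monitoring agent picks it up.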

2. User Feedback Collection

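from datetime import datetime, timezone

from fastapi import FastAPI  # assumption: the app.post decorator is FastAPI's

app = FastAPI()
# db is assumed to be a pymongo Database handle created elsewhere

@app.post("/api/v2/rag/feedback")
def record_feedback(query: str, doc_id: str, rating: int):
    """
    rating: 1 (not relevant) to 5 (highly relevant)
    """
    db.feedbacks.insert_one({
        "query": query,
        "doc_id": doc_id,
        "rating": rating,
        "timestamp": datetime.now(timezone.utc)  # utcnow() is deprecated since Python 3.12
    })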

3. Monthly Improvement Cycle

Week 1 (Mon) → Run evaluation → Aggregate results
Week 2 (Mon) → Analyze low-rated cases
Week 3 (Mon) → Implement improvements
Week 4 (Mon) → Regression testing
End of month  → Production deployment
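
The Week 2 step, analyzing low-rated cases, can start from a simple aggregation of the collected feedback. The sketch below groups records shaped like the feedback documents above by query and document. The function name and both thresholds (a mean rating of 2.5 or lower, at least 2 votes) are hypothetical starting points, not values from this guide.

```python
from collections import defaultdict


def low_rated_pairs(feedbacks, max_mean_rating: float = 2.5, min_votes: int = 2) -> list:
    """Return (query, doc_id, mean_rating, votes) tuples, worst first."""
    ratings = defaultdict(list)
    for fb in feedbacks:
        ratings[(fb["query"], fb["doc_id"])].append(fb["rating"])
    rows = [
        (query, doc_id, sum(values) / len(values), len(values))
        for (query, doc_id), values in ratings.items()
        if len(values) >= min_votes and sum(values) / len(values) <= max_mean_rating
    ]
    return sorted(rows, key=lambda row: row[2])


feedbacks = [
    {"query": "お盆休みの規定", "doc_id": "DOC_001", "rating": 1},
    {"query": "お盆休みの規定", "doc_id": "DOC_001", "rating": 2},
    {"query": "Project EV-2024-001 progress", "doc_id": "DOC_A", "rating": 5},
    {"query": "Project EV-2024-001 progress", "doc_id": "DOC_A", "rating": 4},
]
print(low_rated_pairs(feedbacks))
```

Queries that surface here are natural candidates for new entries in test_cases.yaml, which ties the feedback loop back to the Week 4 regression run.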


For questions: rag-team@evospikenet.com