
# RAG Search Japanese Particle Problem Verification Report

> [!NOTE] For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).

Implementation notes (artifacts): See docs/implementation/ARTIFACT_MANIFESTS.md for the artifact_manifest.json output by the training script and recommended CLI flags.

Creation date: December 13, 2025

## Purpose and use of this document

  • Purpose: Organize the causes and countermeasures for the particle mixing problem in RAG Japanese search, and provide guidelines for correction.
  • Target audience: RAG implementation/search engineers, QA.
  • First reading order: Overview → Problem details → Verification/reproduction steps → Countermeasure proposal.
  • Related links: Distributed brain script in examples/run_zenoh_distributed_brain.py, PFC/Zenoh/Executive details in implementation/PFC_ZENOH_EXECUTIVE.md.

## Overview

We investigated and verified the problem of particles (は, が, を, に, で, etc.) being included in search results and token selection when searching Japanese text with the RAG (Retrieval-Augmented Generation) system.


## Problem details

### Issues discovered

1. **Janome tokenizer particle handling**
   - `wakati=True` mode returns all tokens, including particles
   - `wakati=False` is required to obtain part-of-speech information
   - The current implementation does not use part-of-speech information
2. **Tokenize function inside the RAG**
   - The `tokens_of` function in the `_extractive_answer` method does not filter particles
   - Because Janome's `wakati=True` is used, particles are included as-is
   - Particles are not removed by the regular-expression fallback either
3. **Impact on TF-IDF scoring**
   - Particles appear frequently, so their TF (Term Frequency) is high
   - Their IDF (Inverse Document Frequency) is low because they appear in nearly all documents
   - Even so, matching can be affected when the query contains particles
4. **Elasticsearch kuromoji filter**
   - In the index settings, the `kuromoji_part_of_speech` filter performs particle removal
   - In environments without the kuromoji plugin, particles are not removed; requires verification in a real environment
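
The regex-fallback issue above can be seen directly: because Japanese text contains no inter-word spaces, a `\w+` scan (which matches kana and kanji in Python 3) does not isolate particles at all; it returns whole clauses with the particles embedded. A minimal sketch, independent of the actual implementation:

```python
import re

# Japanese has no spaces between words, so \w+ groups an entire clause,
# particles included, into a single "token"; only punctuation splits it.
text = "これが問題です。私を助けてください。"
tokens = re.findall(r"\w+", text)
print(tokens)  # ['これが問題です', '私を助けてください']
```

This is why the fallback path needs stop-word filtering at minimum, and why real tokenization requires a morphological analyzer such as Janome.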

## Target code parts

### 1. evospikenet/rag_milvus.py

NOTE: The implementation referenced below has been moved to the rag-system/ directory. For runtime usage prefer the RAG API exposed by rag-system and evospikenet.rag_client.

File: rag_milvus.py

Problematic code (before fix):

```python
def tokens_of(text):
    if not text:
        return []
    if self.janome_tokenizer:
        try:
            # janome Tokenizer().tokenize(..., wakati=True) yields surface strings
            return [t for t in self.janome_tokenizer.tokenize(text, wakati=True) if t.strip()]
        except Exception:
            pass
    # Fallback: simple word regex (works for Latin scripts and basic tokenization)
    return re.findall(r"\w+", text.lower())
```

**Problems**:
- Part-of-speech information cannot be obtained with `wakati=True`
- Particles are added to the token list as-is
- No particle removal in the fallback path either

### 2. evospikenet/elasticsearch_client.py

**File**: `elasticsearch_client.py`

**kuromoji settings**:

```python
"analyzer": {
    "ja_kuromoji_analyzer": {
        "type": "custom",
        "tokenizer": "kuromoji_tokenizer",
        "filter": [
            "kuromoji_baseform",  # Convert to dictionary form
            "kuromoji_part_of_speech",  # Remove particles and auxiliary verbs
            "cjk_width",  # Normalize full-width and half-width characters
            "ja_stop",  # Remove Japanese stop words
            "lowercase"  # Lowercase English text
        ]
    }
}
```

**Notes**:
- The `kuromoji_part_of_speech` filter is responsible for particle removal
- Behavior changes depending on whether the kuromoji plugin is installed
- Requires verification in a real environment
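
One way to verify the analyzer against a live cluster is Elasticsearch's `_analyze` API, which returns the tokens the analyzer actually produces. A sketch in console syntax; the index name `docs_ja` is hypothetical, and this assumes the kuromoji plugin is installed:

```json
POST /docs_ja/_analyze
{
  "analyzer": "ja_kuromoji_analyzer",
  "text": "EvoSpikeNetは分散脳シミュレーションです"
}
```

If the plugin is active, particles such as は and auxiliary verbs such as です should be absent from the returned token list; if they appear, the filter chain is not being applied.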


## Implemented solution

### Solution: particle filtering at tokenization time

**Approach**: Combination of solution A (use part-of-speech information) and solution B (stop word list)

Code after modification:

```python
def tokens_of(text):
    if not text:
        return []

    # Japanese particles and auxiliary verbs to filter out
    japanese_stopwords = {
        # particles
        'は', 'が', 'を', 'に', 'へ', 'で', 'と', 'から', 'まで', 'より',
        'の', 'や', 'か', 'も', 'など', 'さえ', 'こそ', 'って', 'つつ',
        'ながら', 'たり', 'だけ', 'ばかり', 'くらい', 'ほど', 'し',
        'だの', 'やら', 'なり', 'とか', 'ね', 'よ', 'な', 'ぞ', 'ぜ', 'わ',
        'のに', 'ので', 'けど', 'けれど', 'けれども', 'て',
        # auxiliary verbs
        'です', 'ます', 'だ', 'である', 'でした', 'ました', 'た', 'う',
        'よう', 'そう', 'れる', 'られる', 'せる', 'させる', 'ない',
        # other function words
        'こと', 'もの', 'ため', 'ところ', 'はず', 'わけ'
    }

    if self.janome_tokenizer:
        try:
            # Use wakati=False (the default) to get POS (part-of-speech) information
            tokens = []
            for token in self.janome_tokenizer.tokenize(text):
                parts = str(token).split('\t')
                if len(parts) >= 2:
                    surface = parts[0]
                    info = parts[1].split(',')
                    # Extract the POS tag
                    if len(info) >= 1:
                        pos = info[0]
                        # Filter out particles, auxiliary verbs, and symbols
                        if pos not in ['助詞', '助動詞', '記号'] and surface.strip():
                            # Additional check: filter known stopwords
                            if surface not in japanese_stopwords:
                                tokens.append(surface)
            return tokens
        except Exception:
            pass

    # Fallback: simple word regex with stopword filtering
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in japanese_stopwords]
```
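
The POS-parsing branch above can be exercised in isolation by feeding it Janome-formatted lines (`surface\tPOS,...`; the format is documented in the reference materials). A standalone sketch that reproduces the same POS gate without requiring Janome, using the documented example lines:

```python
def pos_of(line):
    """Extract (surface, top-level POS tag) from a Janome-style 'surface\\tPOS,...' line."""
    surface, info = line.split('\t', 1)
    return surface, info.split(',')[0]

# The three example lines from the Janome output format section
lines = [
    '走る\t動詞,自立,*,*,五段・ラ行,基本形,走る,ハシル,ハシル',
    'は\t助詞,係助詞,*,*,*,*,は,ハ,ワ',
    'です\t助動詞,*,*,*,特殊・デス,基本形,です,デス,デス',
]
# Keep only content words: drop particles (助詞), auxiliary verbs (助動詞), symbols (記号)
content = [s for s, pos in map(pos_of, lines) if pos not in ('助詞', '助動詞', '記号')]
print(content)  # ['走る']
```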

### Improvements

1. **Use of part-of-speech information**
   - Obtain POS tags by tokenizing with `wakati=False`
   - Exclude particles (助詞), auxiliary verbs (助動詞), and symbols (記号)
2. **Stop word list**
   - Comprehensive list of common particles and auxiliary verbs
   - Works even in environments where Janome is not available
3. **Fallback protection**
   - Filter by stop words even after regular-expression tokenization
   - Environment-independent behavior
4. **Double check**
   - After filtering by POS tag, check against the stop word list
   - More reliable particle removal
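
The double check matters because some function words in the stop word list (こと, もの, ため) are tagged 名詞 (noun) by morphological analyzers, so the POS gate alone would keep them. A minimal sketch with hand-made (surface, POS) pairs; the stop word set is abbreviated for illustration:

```python
STOPWORDS = {'こと', 'もの', 'ため', 'は', 'です'}  # abbreviated list

def double_check(parsed):
    """parsed: list of (surface, pos) pairs, e.g. from a morphological analyzer."""
    kept = []
    for surface, pos in parsed:
        if pos in ('助詞', '助動詞', '記号'):
            continue  # gate 1: POS filter
        if surface in STOPWORDS:
            continue  # gate 2: catches noun-tagged function words like こと
        kept.append(surface)
    return kept

tokens = double_check([('走る', '動詞'), ('こと', '名詞'), ('は', '助詞'), ('重要', '名詞')])
print(tokens)  # ['走る', '重要'] -- 'こと' passed gate 1 but was caught by gate 2
```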

## Verification results

### Corrections

| File | Lines | Change |
|------|-------|--------|
| rag_milvus.py | 271-320 | Modified the `tokens_of` function to implement particle filtering |
### Expected effect

#### Before modification

```python
# Text: "EvoSpikeNetは分散脳シミュレーションです" (EvoSpikeNet is a distributed brain simulation)
tokens = ['EvoSpikeNet', 'は', '分散', '脳', 'シミュレーション', 'です']
# Contains the particle 'は' and the auxiliary verb 'です' ❌
```

#### After modification

```python
# Text: "EvoSpikeNetは分散脳シミュレーションです" (EvoSpikeNet is a distributed brain simulation)
tokens = ['EvoSpikeNet', '分散', '脳', 'シミュレーション']
# Particles and auxiliary verbs are excluded ✅
```
### Impact on TF-IDF

**Before modification**:
- The particle is included in the token
- Particles in the query affect matching
- Similarity calculations are contaminated with meaningless words

**After modification**:
- TF-IDF calculation using only content words (nouns, verbs, etc.)
- More accurate document similarity
- Improving the quality of extractive answers
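
The TF-IDF effect can be sketched numerically with a toy pre-tokenized corpus (illustrative only; documents, query, and the smoothed IDF formula are constructed for this example and do not come from the implementation):

```python
import math
from collections import Counter

def tfidf_cosine(query, doc, corpus):
    """Cosine similarity of TF-IDF vectors over a tiny pre-tokenized corpus."""
    n = len(corpus)
    idf = lambda t: math.log((n + 1) / (1 + sum(t in d for d in corpus))) + 1.0
    vec = lambda toks: {t: c * idf(t) for t, c in Counter(toks).items()}
    q, d = vec(query), vec(doc)
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm(q) * norm(d)) if q and d else 0.0

d1 = ['EvoSpikeNet', 'は', '分散', '脳', 'シミュレーション', 'です']  # relevant doc
d2 = ['これ', 'は', '別', 'の', '話題', 'です']                        # unrelated doc
raw = ['分散', '脳', 'と', 'は', '何', 'です', 'か']                   # query with particles
filtered = ['分散', '脳', '何']                                        # particles removed

# With particles, the unrelated doc still scores > 0 via は/です alone;
# after filtering, its score correctly drops to zero.
print(tfidf_cosine(raw, d2, [d1, d2]) > 0)        # True
print(tfidf_cosine(filtered, d2, [d1, d2]) == 0)  # True
```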

---

## Test cases

### 1. Basic particle filtering

**Input text**:
```
"EvoSpikeNetは分散脳シミュレーションのフレームワークです。"
```

**Expected output**:
```python
['EvoSpikeNet', '分散', '脳', 'シミュレーション', 'フレームワーク']
```

**Excluded words**:
- は (particle)
- の (particle)
- です (auxiliary verb)

### 2. Processing complex statements

**Input text**:
```
"これが問題です。私を助けてください。"
```

**Expected output**:
```python
['これ', '問題', '私', '助け', 'ください']
```

**Excluded words**:
- が (particle)
- です (auxiliary verb)
- を (particle)
- て (particle)

### 3. Query tokenization

**Input query**:
```
"分散脳シミュレーションとは何ですか"
```

**Expected output**:
```python
['分散', '脳', 'シミュレーション', '何']
```

**Excluded words**:
- と (particle)
- は (particle)
- です (auxiliary verb)
- か (particle)

---

## Known limitations

### 1. Janome not installed environment

**Situation**: If Janome is not installed

**Behavior**:
- Fallback to regex-based tokenization
- Exclude particles only in stop word list
- Part-of-speech information is not available

**Limitations**:
- Unknown particles and new expressions may not be excluded
- No normalization to dictionary form

### 2. Comprehensiveness of stop word list

**Current coverage**:
- List of common particles and auxiliary verbs (more than 40)
- Includes some colloquial expressions

**Limitations**:
- Does not cover all particles and auxiliary verbs
- Does not cover dialects or classical Japanese

### 3. Impact on English text

**Situation**: When English text contains Japanese stop words

**Impact**:
- Almost no effect (Japanese particle strings rarely appear as standalone tokens in English text)
- Regular expression tokenization also applies to English

---

## Future improvement ideas

### Short-term improvements

1. **Expansion of stop word list**
   - Added more particles and auxiliary verbs
   - Enhancement of colloquial expressions
   - Support for dialects

2. **Check consistency with Elasticsearch**
   - Operation verification of kuromoji settings
   - Comparison of results with Milvus

3. **Adding unit tests**
   - Unit test for `tokens_of` function
   - Test with various Japanese patterns
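
As a starting point for the proposed unit tests, the Janome-free fallback path can be pinned down with pytest-style tests (a hypothetical test module; the stop word set is abbreviated, and the second test assumes pre-segmented input since the regex cannot split unspaced Japanese):

```python
import re

STOPWORDS = {'は', 'が', 'を', 'の', 'と', 'か', 'です', 'ます'}  # abbreviated

def fallback_tokens(text):
    """Regex fallback: lowercased \\w+ tokens minus Japanese stop words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]

def test_english_text_is_unaffected():
    assert fallback_tokens("EvoSpikeNet RAG search") == ['evospikenet', 'rag', 'search']

def test_presegmented_japanese_loses_particles():
    # assumes an upstream wakati-style pass has already inserted spaces
    assert fallback_tokens("分散 は 脳 です") == ['分散', '脳']
```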

### Long-term improvements

1. **Adoption of advanced morphological analysis**
   - MeCab integration study
   - Consider adopting SudachiPy
   - More accurate part-of-speech tagging

2. **Improved language detection**
   - Sentence-by-sentence language detection
   - Support for mixed text

3. **Machine learning based filtering**
   - Utilization of important word extraction model
   - Contextual filtering

---

## Reference materials

### Japanese parts of speech system

| Part of speech | Explanation | Examples | Excluded? |
|----------------|-------------|----------|-----------|
| Nouns (名詞) | Names of things | 分散, 脳, シミュレーション | ❌ |
| Verbs (動詞) | Actions or states | 走る, 食べる, ある | ❌ |
| Adjectives (形容詞) | Qualities or conditions | 大きい, 美しい, 新しい | ❌ |
| Adverbs (副詞) | Modifiers | とても, 速く, ゆっくり | ❌ |
| Particles (助詞) | Mark grammatical relationships | は, が, を, に, で | ✅ |
| Auxiliary verbs (助動詞) | Attach to verbs and adjectives | です, ます, た, ない | ✅ |
| Conjunctions (接続詞) | Connect sentences | そして, しかし, または | △ |
| Interjections (感動詞) | Express emotion | ああ, おお, えっ | △ |
| Symbols (記号) | Punctuation marks, etc. | 。, !, ? | ✅ |

### Janome part-of-speech tags

Reference: [Janome official documentation](https://mocobeta.github.io/janome/)

**Output format** (surface form, then POS and conjugation fields):

```
表層形\t品詞,品詞細分類1,品詞細分類2,品詞細分類3,活用型,活用形,原形,読み,発音
```

**Example**:

```
走る\t動詞,自立,*,*,五段・ラ行,基本形,走る,ハシル,ハシル
は\t助詞,係助詞,*,*,*,*,は,ハ,ワ
です\t助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
```

### Elasticsearch kuromoji settings

Reference: Elasticsearch kuromoji plugin

`kuromoji_part_of_speech` filter:
- Removes tokens by part of speech, specified with the `stoptags` parameter
- Removes particles and auxiliary verbs by default

`kuromoji_baseform` filter:
- Converts verbs and adjectives to their dictionary form
- e.g. 走っ → 走る


## Summary

### Measures taken

1. **Use of Janome part-of-speech information**: Obtain POS tags with `wakati=False` and exclude particles, auxiliary verbs, and symbols
2. **Comprehensive stop word list**: Covers over 40 particles and auxiliary verbs
3. **Fallback protection**: Stop word removal even on the regular-expression path
4. **Double-check mechanism**: Ensure exclusion using POS tags plus the stop word list

### Problems solved

  • ✅ Particles are excluded from RAG search results
  • ✅ TF-IDF calculation is performed only on content words
  • ✅ Improves the quality of extractive answers
  • ✅ More accurate document similarity calculation

### Continuous monitoring

We will continue to monitor the following and make improvements as needed:

  1. Frequency of particles in actual RAG queries
  2. Quality evaluation of extracted answers
  3. User feedback
  4. Consistency of results with Elasticsearch

| File | Role |
|------|------|
| rag_milvus.py | Main implementation of the RAG system |
| elasticsearch_client.py | Elasticsearch integration |
| test_japanese_rag_particle_issue.py | Verification script |
| test_rag_debug.py | RAG debug script |

Last updated: December 13, 2025