# RAG Search Japanese Particle Problem Verification Report

> [!NOTE]
> For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).

**Implementation notes (artifacts)**: See `docs/implementation/ARTIFACT_MANIFESTS.md` for the `artifact_manifest.json` output by the training script and recommended CLI flags.

**Creation date**: December 13, 2025
## Purpose and use of this document

- **Purpose**: Organize the causes of and countermeasures for the particle-mixing problem in RAG Japanese search, and provide guidelines for the fix.
- **Target audience**: RAG implementation/search engineers, QA.
- **Suggested reading order**: Overview → Problem details → Verification/reproduction steps → Countermeasures.
- **Related links**: Distributed brain script in `examples/run_zenoh_distributed_brain.py`; PFC/Zenoh/Executive details in `implementation/PFC_ZENOH_EXECUTIVE.md`.
## Overview

We investigated and verified the problem of Japanese particles (は, が, を, に, で, etc.) being included in search results and token selection when performing Japanese-language search with the RAG (Retrieval-Augmented Generation) system.
## Problem details

### Issues discovered

1. **Janome tokenizer particle handling**
   - `wakati=True` mode returns all tokens, including particles
   - `wakati=False` is required to obtain part-of-speech information
   - The current implementation does not use part-of-speech information (see the sketch after this list)
2. **Tokenization inside the RAG pipeline**
   - The `tokens_of` function in the `_extractive_answer` method does not filter particles
   - Since Janome's `wakati=True` is used, particles are included as-is
   - Particles are not removed by the regex fallback either
3. **Impact on TF-IDF scoring**
   - Particles appear frequently, so their TF (term frequency) is high
   - Their IDF (inverse document frequency) tends to be low because they appear in nearly every document
   - Even so, matching can be skewed when the query itself contains particles
4. **Elasticsearch kuromoji filter**
   - In the index settings, the `kuromoji_part_of_speech` filter performs particle removal
   - Particles are not removed in environments without the kuromoji plugin
   - Requires confirmation against a live deployment
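As a reference for item 1, here is a minimal sketch (assuming `janome` is installed via `pip install janome`) contrasting the two tokenization modes; the sample sentence is taken from the examples later in this report, and the output comments are indicative:

```python
# Minimal sketch contrasting Janome's two tokenization modes.
from janome.tokenizer import Tokenizer

tokenizer = Tokenizer()
text = "EvoSpikeNetは分散脳シミュレーションです"

# wakati=True: surface strings only -- particles pass through unfiltered
print(list(tokenizer.tokenize(text, wakati=True)))
# e.g. ['EvoSpikeNet', 'は', '分散', '脳', 'シミュレーション', 'です']

# wakati=False (the default): Token objects that expose POS information
for token in tokenizer.tokenize(text):
    print(token.surface, token.part_of_speech)
# e.g. は 助詞,係助詞,*,*
```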
## Target code

### 1. evospikenet/rag_milvus.py

> [!NOTE]
> The implementation referenced below has been moved to the `rag-system/` directory. For runtime usage, prefer the RAG API exposed by `rag-system` and `evospikenet.rag_client`.

**File**: `rag_milvus.py`

**Problematic code (before fix)**:

```python
def tokens_of(text):
    if not text:
        return []
    if self.janome_tokenizer:
        try:
            # janome Tokenizer().tokenize(..., wakati=True) yields tokens
            return [t for t in self.janome_tokenizer.tokenize(text, wakati=True) if t.strip()]
        except Exception:
            pass
    # Fallback: simple word regex (works for Latin scripts and basic tokenization)
    return re.findall(r"\w+", text.lower())
```
**Problems**:
- Part-of-speech information cannot be obtained with `wakati=True`
- Particles are added to the token list unchanged
- The fallback performs no particle removal either
### 2. evospikenet/elasticsearch_client.py
**File**: `elasticsearch_client.py`
**kuromoji settings**:

```python
"analyzer": {
"ja_kuromoji_analyzer": {
"type": "custom",
"tokenizer": "kuromoji_tokenizer",
"filter": [
"kuromoji_baseform", # Convert to dictionary form
"kuromoji_part_of_speech", # Remove particles and auxiliary verbs
"cjk_width", # Normalize full-width and half-width characters
"ja_stop", # Remove Japanese stop words
"lowercase" # Lowercase for English text
]
}
}
```

**Notes**:

- The `kuromoji_part_of_speech` filter is responsible for particle removal
- Behavior changes depending on whether the kuromoji plugin is present
- Requires verification in a live environment
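To verify the analyzer in a live cluster, the standard `_analyze` endpoint can be used. A hedged sketch follows; the index name `my_index` and the host URL are placeholders for your deployment:

```python
# Sketch: check whether ja_kuromoji_analyzer actually strips particles.
import requests

resp = requests.post(
    "http://localhost:9200/my_index/_analyze",
    json={
        "analyzer": "ja_kuromoji_analyzer",
        "text": "EvoSpikeNetは分散脳シミュレーションです",
    },
    timeout=10,
)
resp.raise_for_status()
tokens = [t["token"] for t in resp.json()["tokens"]]
print(tokens)  # particles such as 'は' should be absent if kuromoji is active
```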
## Implemented solution

### Solution: Particle filtering at tokenization time

**Approach**: Combination of solution A (using part-of-speech information) and solution B (stop word list)
**Code after modification**:

```python
def tokens_of(text):
    if not text:
        return []

    # Japanese particles and auxiliary verbs to filter out
    japanese_stopwords = {
        # particles
        'は', 'が', 'を', 'に', 'へ', 'で', 'と', 'から', 'まで', 'より',
        'の', 'や', 'か', 'も', 'など', 'さえ', 'こそ', 'って', 'つつ',
        'ながら', 'たり', 'だけ', 'ばかり', 'くらい', 'ほど', 'し',
        'だの', 'やら', 'なり', 'とか', 'ね', 'よ', 'な', 'ぞ', 'ぜ', 'わ',
        'のに', 'ので', 'けど', 'けれど', 'けれども', 'て',
        # auxiliary verbs
        'です', 'ます', 'だ', 'である', 'でした', 'ました', 'た', 'う',
        'よう', 'そう', 'れる', 'られる', 'せる', 'させる', 'ない',
        # Other function words
        'こと', 'もの', 'ため', 'ところ', 'はず', 'わけ'
    }

    if self.janome_tokenizer:
        try:
            # Use wakati=False (the default) to get POS (part-of-speech) information
            tokens = []
            for token in self.janome_tokenizer.tokenize(text):
                # str(token) is "surface\tPOS,sub-POS,..." in Janome's output format
                parts = str(token).split('\t')
                if len(parts) >= 2:
                    surface = parts[0]
                    info = parts[1].split(',')
                    # Extract the top-level POS tag
                    if len(info) >= 1:
                        pos = info[0]
                        # Filter out particles (助詞), auxiliary verbs (助動詞), and symbols (記号)
                        if pos not in ['助詞', '助動詞', '記号'] and surface.strip():
                            # Additional check: filter known stopwords
                            if surface not in japanese_stopwords:
                                tokens.append(surface)
            return tokens
        except Exception:
            pass

    # Fallback: simple word regex with stopword filtering
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in japanese_stopwords]
```
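As a design note, the same filtering can be expressed against Janome's `Token` attributes (`surface` and `part_of_speech`) instead of parsing `str(token)`, which avoids depending on the tab-separated string format. A sketch with a hypothetical helper name:

```python
# Sketch of attribute-based filtering (the helper name is hypothetical).
def filter_content_words(tokenizer, text, stopwords):
    tokens = []
    for token in tokenizer.tokenize(text):
        # part_of_speech is a comma-separated string; the first field is the top-level POS
        pos = token.part_of_speech.split(',')[0]
        if pos not in ('助詞', '助動詞', '記号') and token.surface.strip():
            if token.surface not in stopwords:
                tokens.append(token.surface)
    return tokens
```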
### Improvements

1. **Use of part-of-speech information**
   - Obtain POS tags via `wakati=False`
   - Exclude particles (助詞), auxiliary verbs (助動詞), and symbols (記号)
2. **Stop word list**
   - Comprehensive list of common particles and auxiliary verbs
   - Works even in environments where Janome is not available
3. **Fallback protection**
   - Stop words are filtered even after regex tokenization
   - Environment-independent behavior
4. **Double check**
   - After filtering by POS tag, tokens are checked against the stop word list
   - More reliable particle removal
## Verification results

### Corrections

| File | Lines | Changes |
|---|---|---|
| `rag_milvus.py` | 271-320 | Modified the `tokens_of` function to implement particle filtering |

### Expected effect

**Before (before modification)**:

```python
# Text: "EvoSpikeNetは分散脳シミュレーションです"
tokens = ['EvoSpikeNet', 'は', '分散', '脳', 'シミュレーション', 'です']
# Contains the particle 'は' and the auxiliary verb 'です' ❌
```

**After (after modification)**:

```python
# Text: "EvoSpikeNetは分散脳シミュレーションです"
tokens = ['EvoSpikeNet', '分散', '脳', 'シミュレーション']
# Particles and auxiliary verbs are excluded ✅
```
### Impact on TF-IDF

**Before modification**:
- Particles are included in the token list
- Particles in the query affect matching
- Similarity calculations are contaminated by semantically empty words

**After modification**:
- TF-IDF is computed over content words only (nouns, verbs, etc.)
- More accurate document similarity
- Higher-quality extractive answers
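The contamination effect can be illustrated with a toy cosine-similarity calculation (not the project's actual scoring code): two sentences that share nothing but a particle and an auxiliary verb still score well above zero before the fix.

```python
# Toy illustration: shared function words inflate similarity between
# otherwise unrelated token lists. Not the project's scoring code.
import math
from collections import Counter

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# Before the fix: particles/auxiliaries survive tokenization
print(cosine(['脳', 'は', 'です'], ['猫', 'は', 'です']))  # ≈ 0.67
# After the fix: only content words remain
print(cosine(['脳'], ['猫']))                               # 0.0
```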
---
## Test cases

### 1. Basic particle filtering

**Input text**:

```
"EvoSpikeNetは分散脳シミュレーションのフレームワークです。"
```

**Expected output**:

```python
['EvoSpikeNet', '分散', '脳', 'シミュレーション', 'フレームワーク']
```

**Excluded words**:
- は (particle)
- の (particle)
- です (auxiliary verb)

### 2. Processing complex sentences

**Input text**:

```
"これが問題です。私を助けてください。"
```

**Expected output**:

```python
['これ', '問題', '私', '助け', 'ください']
```

**Excluded words**:
- が (particle)
- です (auxiliary verb)
- を (particle)
- て (particle)

### 3. Query tokenization

**Input query**:

```
"分散脳シミュレーションとは何ですか"
```

**Expected output**:

```python
['分散', '脳', 'シミュレーション', '何']
```

**Excluded words**:
- と (particle)
- は (particle)
- です (auxiliary verb)
- か (particle)
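The cases above translate directly into a parameterized pytest sketch. Note that `tokens_of` is currently a nested function inside `_extractive_answer`, so exposing it at module level is an assumption here, as is the import path:

```python
# Sketch of unit tests for the cases above (see also "Add unit tests"
# under Future improvement ideas).
import pytest
from evospikenet.rag_milvus import tokens_of  # assumed import path; adjust to the real layout

CASES = [
    ("EvoSpikeNetは分散脳シミュレーションのフレームワークです。",
     ['EvoSpikeNet', '分散', '脳', 'シミュレーション', 'フレームワーク']),
    ("これが問題です。私を助けてください。",
     ['これ', '問題', '私', '助け', 'ください']),
    ("分散脳シミュレーションとは何ですか",
     ['分散', '脳', 'シミュレーション', '何']),
]

@pytest.mark.parametrize("text,expected", CASES)
def test_tokens_of_filters_particles(text, expected):
    assert tokens_of(text) == expected
```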
---
## Known limitations

### 1. Environments without Janome

**Situation**: Janome is not installed

**Behavior**:
- Falls back to regex-based tokenization
- Particles are excluded only via the stop word list
- Part-of-speech information is not available

**Limitations**:
- Unlisted particles and novel expressions may not be excluded
- No normalization to dictionary form

### 2. Coverage of the stop word list

**Current state**:
- List of common particles and auxiliary verbs (40+ entries)
- Includes some colloquial expressions

**Limitations**:
- Does not cover all particles and auxiliary verbs
- No support for dialects or archaic forms

### 3. Impact on English text

**Situation**: English text that happens to contain Japanese stop words

**Impact**:
- Practically none (Japanese particle characters rarely appear as standalone tokens in English)
- Regex tokenization applies to English as well
---
## Future improvement ideas

### Short-term improvements

1. **Expand the stop word list**
   - Add more particles and auxiliary verbs
   - Broaden coverage of colloquial expressions
   - Support dialects
2. **Check consistency with Elasticsearch**
   - Verify the kuromoji settings in operation
   - Compare results against Milvus
3. **Add unit tests**
   - Unit tests for the `tokens_of` function
   - Tests covering a variety of Japanese patterns

### Long-term improvements

1. **Adopt more advanced morphological analysis**
   - Evaluate MeCab integration
   - Consider adopting SudachiPy
   - More accurate part-of-speech tagging
2. **Improve language detection**
   - Sentence-by-sentence language detection
   - Support for mixed-language text
3. **Machine-learning-based filtering**
   - Use a keyword-importance extraction model
   - Context-aware filtering
---
## Reference materials

### Japanese part-of-speech system

| Part of speech | Role | Examples | Excluded? |
|-----|------|-----|---------|
| Nouns (名詞) | Names of things | 分散, 脳, シミュレーション | ❌ |
| Verbs (動詞) | Actions or states | 走る, 食べる, ある | ❌ |
| Adjectives (形容詞) | Qualities or conditions | 大きい, 美しい, 新しい | ❌ |
| Adverbs (副詞) | Modifiers | とても, 速く, ゆっくり | ❌ |
| Particles (助詞) | Mark grammatical relationships | は, が, を, に, で | ✅ |
| Auxiliary verbs (助動詞) | Attach to verbs and adjectives | です, ます, た, ない | ✅ |
| Conjunctions (接続詞) | Connect sentences | そして, しかし, または | △ |
| Interjections (感動詞) | Express emotion | ああ, おお, えっ | △ |
| Symbols (記号) | Punctuation etc. | 。, !, ? | ✅ |
### Janome part-of-speech tags

Reference: [Janome official documentation](https://mocobeta.github.io/janome/)

**Example output**:

```
走る\t動詞,自立,*,*,五段・ラ行,基本形,走る,ハシル,ハシル
は\t助詞,係助詞,*,*,*,*,は,ハ,ワ
です\t助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
```
### Elasticsearch kuromoji settings

Reference: Elasticsearch kuromoji plugin documentation

**`kuromoji_part_of_speech` filter**:
- Removes tokens by POS tag, specified via the `stoptags` parameter
- Removes particles and auxiliary verbs by default

**`kuromoji_baseform` filter**:
- Converts verbs and adjectives to dictionary form (e.g. 走っ → 走る)
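For reference, `stoptags` can be customized when the default removal is too broad or too narrow. A hedged sketch of such a settings fragment, expressed as the Python dict that would be passed at index creation; the filter name `ja_pos_filter` is a placeholder, and the tag strings follow the IPADIC POS naming used by kuromoji:

```python
# Sketch: custom kuromoji_part_of_speech filter with explicit stoptags.
ja_analysis_settings = {
    "analysis": {
        "filter": {
            "ja_pos_filter": {
                "type": "kuromoji_part_of_speech",
                "stoptags": [
                    "助詞-格助詞-一般",  # case-marking particles
                    "助詞-終助詞",       # sentence-final particles
                ],
            }
        }
    }
}
```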
## Summary

### Measures taken

- ✅ **Use of Janome part-of-speech information**: Obtain POS tags with `wakati=False` and exclude particles, auxiliary verbs, and symbols
- ✅ **Comprehensive stop word list**: Covers 40+ particles and auxiliary verbs
- ✅ **Fallback protection**: Stop word removal applies to the regex path as well
- ✅ **Double-check mechanism**: Exclusion enforced via POS tags plus stop words

### Problems solved

- ✅ Particles are excluded from RAG search results
- ✅ TF-IDF is computed over content words only
- ✅ Improved quality of extractive answers
- ✅ More accurate document similarity calculation
### Continuous monitoring

We will continue to monitor the following and make improvements as needed:

- Frequency of particles in actual RAG queries
- Quality evaluation of extractive answers
- User feedback
- Consistency of results with Elasticsearch
### Related files

| File | Role |
|---|---|
| `rag_milvus.py` | Main implementation of the RAG system |
| `elasticsearch_client.py` | Elasticsearch integration |
| `test_japanese_rag_particle_issue.py` | Verification script |
| `test_rag_debug.py` | RAG debug script |
Last updated: December 13, 2025