EvoSpikeNet: Data handling guide
Copyright: 2026 Moonlight Technologies Inc. All Rights Reserved.
Author: Masahiro Aoki
Last updated: January 8, 2026
Purpose and use of this document
- Purpose: To provide an overview of the creation, formatting, and verification procedures of various data (spike/text/RAG/multimodal).
- Target audience: Data pipeline personnel, research/learning operations personnel.
- First reading order: Data upload → Spike data → Text corpus → Knowledge base → Multimodal dataset.
Related links:
- `examples/run_zenoh_distributed_brain.py` for the distributed brain script; `docs/implementation/PFC_ZENOH_EXECUTIVE.md` for PFC/Zenoh/Executive details.
- Implementation notes (artifacts): see `docs/implementation/ARTIFACT_MANIFESTS.md` for the `artifact_manifest.json` generated by the training job and the rules for uploading it.
This document details how to create, format, and validate various AI data used in the EvoSpikeNet framework, including spike data, text corpora, RAG knowledge bases, and multimodal datasets.
0. Data upload function
EvoSpikeNet provides the ability to upload training data to an API server and enable sharing in a distributed environment. This allows LLM training to be performed using uploaded datasets as well as local files.
Data formats that can be uploaded
- Multimodal data: image and caption pairs (`captions.csv` + `images/` directory)
- Future expansion planned: support for audio data, text corpora, etc.
Upload steps
1. Data preparation: create the training data locally.
2. Run script: start training with the `--upload-data` flag.
3. API upload: the data is automatically zipped and uploaded to the API server.
4. Shared use: the uploaded data can be reused by other users and systems.
Usage example
```bash
# Training with data upload
python examples/train_multi_modal_lm.py \
  --mode train \
  --dataset custom \
  --data-dir your_data_dir \
  --run-name your_run \
  --upload-data \
  --data-name your_dataset_name \
  --epochs 10 \
  --batch-size 4
```
If the API server is not available
If the API server is unavailable, it automatically falls back to training on local data. Data upload will be skipped and training will continue using local files.
1. Spike data generation and formatting
Spike data, the direct input to the SNN model, is represented as a `torch.Tensor`.
- Format: `torch.Tensor`
- Shape: 2D tensor of shape `(time_steps, num_input_neurons)`
- dtype: `torch.int8`
- Values: `0` (no spike) or `1` (spike)
Artificial spike data for testing can be generated on the "SNN Models" page under the "Data Generation" menu in the UI.
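Artificial spike data matching this format can also be sketched directly in code; the `generate_spike_data` helper below is illustrative and not part of the framework:

```python
import torch

def generate_spike_data(time_steps: int, num_input_neurons: int,
                        firing_rate: float = 0.1) -> torch.Tensor:
    """Generate random spike data in the (time_steps, num_input_neurons) int8 format."""
    # Each element fires (value 1) independently with probability `firing_rate`.
    return (torch.rand(time_steps, num_input_neurons) < firing_rate).to(torch.int8)

spikes = generate_spike_data(time_steps=100, num_input_neurons=784)
print(spikes.shape)  # torch.Size([100, 784])
print(spikes.dtype)  # torch.int8
```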
2. Text corpus
For training language models (EvoSpikeNetLM, SpikingEvoSpikeNetLM), various data sources supported by the evospikenet/dataloaders.py module can be used.
- `WikipediaLoader`: dynamically loads Wikipedia articles.
- `AozoraBunkoLoader`: extracts text from Aozora Bunko HTML pages.
- `LocalFileLoader`: loads a local text file.
These loaders are utilized in training scripts such as examples/train_spiking_evospikenet_lm.py.
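As a rough illustration of what such a loader does, here is a simplified stand-in (this is not the actual `LocalFileLoader` from `evospikenet/dataloaders.py`, whose interface may differ):

```python
from pathlib import Path

class SimpleLocalFileLoader:
    """Minimal illustrative loader: reads a UTF-8 text file into a corpus string."""

    def __init__(self, path: str):
        self.path = Path(path)

    def load(self) -> str:
        # Returns the full file contents; a real loader may also clean or chunk the text.
        return self.path.read_text(encoding="utf-8")

# Usage (assuming corpus.txt exists):
# text = SimpleLocalFileLoader("corpus.txt").load()
```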
3. RAG Knowledge Base Management
The Retrieval-Augmented Generation (RAG) feature stores external knowledge in Milvus and Elasticsearch. Data management is mainly done from the UI.
- Data structure: each document consists of the following fields: `id` (unique), `embedding` (vector), `text` (body), and `source`.
- CRUD operations via UI: the Data Management tab on the RAG System page is the main interface for directly managing the knowledge base.
  - Create: add a row with the `add row` button and enter `text` and `source`; the `embedding` is generated and saved automatically.
  - Read: all data in Milvus is displayed in a table.
  - Update: editing a table cell directly updates the database in real time.
  - Delete: select a row and remove it with the `delete row` button.
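Putting the fields together, a single knowledge-base entry can be pictured as the dictionary below (the embedding dimension of 384 is an arbitrary placeholder; the real dimension depends on the configured embedding model):

```python
import random

# Illustrative RAG document matching the fields described above.
doc = {
    "id": 1,                                             # unique identifier
    "embedding": [random.random() for _ in range(384)],  # vector (dimension is model-dependent)
    "text": "EvoSpikeNet stores external knowledge in Milvus and Elasticsearch.",
    "source": "docs/DATA_HANDLING.md",
}

required_fields = {"id", "embedding", "text", "source"}
assert required_fields <= doc.keys()
```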
4. Multimodal dataset
MultiModalEvoSpikeNetLM is trained on image and caption pairs.
- Directory structure:
  ```
  data/multi_modal_dataset/
  ├── images/          (stores image files)
  └── captions.csv     (maps image paths to captions)
  ```
- `captions.csv` format:
  ```csv
  image_path,caption
  images/image_0.png,"Caption 1"
  images/image_1.jpg,"Caption 2"
  ```

This dataset is used for model training with the `examples/train_multi_modal_lm.py` script.
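The layout can be sanity-checked with a short standard-library script (the `validate_captions` helper is illustrative, not part of the framework):

```python
import csv
from pathlib import Path

def validate_captions(dataset_dir: str) -> list:
    """Read captions.csv and check that each referenced image file exists."""
    root = Path(dataset_dir)
    rows = []
    with open(root / "captions.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            assert {"image_path", "caption"} <= row.keys(), "missing required columns"
            assert (root / row["image_path"]).exists(), f"missing image: {row['image_path']}"
            rows.append(row)
    return rows

# Usage: rows = validate_captions("data/multi_modal_dataset")
```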
5. Data for visualization
For interactive analysis in the UI and detailed offline visualization, the framework saves neuron activity data in .pt format.
- Data structure: all files are saved as dictionaries with keys such as `spikes`, `membrane_potential`, etc.
- Where they are generated:
  - RAG Chat: can save neuron data when the SNN backend is selected.
  - Spiking LM Chat: can save neuron data when generating text.
  - SNN Models: generated when running a 4-layer SNN simulation (e.g. `4_layer_snn_data_lif.pt`).
- How to use: the generated `.pt` file can be uploaded to the "Generic Visualization" page in the "Data Analysis" menu for re-visualization, or analyzed offline with the `examples/visualize_*.py` scripts.
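Offline, such a file can be created and inspected with `torch.save`/`torch.load`; the snippet below fabricates a small activity dictionary in the same format (keys in real files may vary by simulation):

```python
import torch

# Build a small example activity file in the dictionary format described above.
example = {
    "spikes": (torch.rand(50, 4) < 0.2).to(torch.int8),  # (time_steps, neurons)
    "membrane_potential": torch.randn(50, 4),
}
torch.save(example, "example_snn_data.pt")

# Reload it the same way an offline analysis script would.
data = torch.load("example_snn_data.pt")
print(sorted(data.keys()))  # ['membrane_potential', 'spikes']
```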
6. Synthetic data generation (Data Distillation)
The `evospikenet/distillation.py` module provides functionality to generate high-quality synthetic data using an LLM (e.g. OpenAI). This helps you efficiently build datasets for specific tasks (sentiment analysis, QA-pair generation, etc.).
You can run it with the `Distill Data` command on the System Utilities page of the System Settings menu, specifying the task type, number of samples, and prompt.
7. Audio data
Multimodal models support voice input.
- Format: standard audio file formats supported by `torchaudio`, such as `.wav`, `.mp3`, and `.flac`.
- Available from the UI: audio files can be uploaded from the "Brain Simulation" tab on the "Distributed Brain" page and used as simulation input alongside text prompts and images.
- Data processing: on the backend, uploaded files are loaded into a waveform and sample rate with `torchaudio.load` and preprocessed into a format that the model's `SpikingAudioEncoder` can process.
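To illustrate the waveform-to-spike step without a real audio file, the sketch below substitutes a synthetic sine wave for the output of `torchaudio.load` and applies a naive threshold encoding (the real `SpikingAudioEncoder` pipeline is more involved; only the tensor shapes here are representative):

```python
import torch

# Stand-in for `waveform, sample_rate = torchaudio.load("input.wav")`:
sample_rate = 16_000
t = torch.arange(sample_rate) / sample_rate
waveform = torch.sin(2 * torch.pi * 440.0 * t).unsqueeze(0)  # (channels=1, samples)

def amplitude_to_spikes(waveform: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Naive threshold encoding: emit a spike wherever |amplitude| exceeds the threshold."""
    return (waveform.abs() > threshold).to(torch.int8)

spikes = amplitude_to_spikes(waveform)
print(spikes.shape, spikes.dtype)  # torch.Size([1, 16000]) torch.int8
```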
8. Federated training dataset
With federated learning, each client maintains an independent local dataset.
- Format: CSV (`.csv`) file.
- Data structure: the current implementation assumes a text classification task and requires each row to consist of two columns, `text` and `label`.
- How to use: specify the path to the local CSV file with the `--data-path` argument when running the `examples/run_fl_client.py` script.
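A minimal client dataset in this two-column format can be generated as follows (the texts and labels are placeholders, and whether labels are strings or integers depends on the client implementation):

```python
import csv

rows = [
    {"text": "This framework is great!", "label": "positive"},
    {"text": "The upload failed again.", "label": "negative"},
]

# Write the two-column CSV expected by the federated learning client.
with open("client_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Then: python examples/run_fl_client.py --data-path client_data.csv
```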
9. Data flow of distributed brain simulation
Distributed brain simulations exchange several kinds of data between the UI, the API, and the simulation processes.
- Input data (UI → API → Simulation):
  - Users enter a text prompt and upload image and audio files in the Distributed Brain UI.
  - Pressing the "Execute Query" button Base64-encodes the media files and sends them, together with the text, to the API endpoint `/api/distributed_brain/prompt`.
  - The API writes the received prompt data to the server's `/tmp` directory as a JSON file with a unique ID, alongside the related media files.
  - Simulation processes (in particular the Rank 0 PFC) periodically scan (poll) this `/tmp` directory, detect new prompt files, and start processing them.
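The prompt hand-off can be sketched like this (the file-name pattern and JSON field names are illustrative assumptions; the actual schema is defined by the API implementation):

```python
import base64
import json
import uuid
from pathlib import Path
from typing import Optional

def write_prompt_file(tmp_dir: str, text: str,
                      image_bytes: Optional[bytes] = None) -> Path:
    """Write a prompt JSON with a unique ID and Base64-encoded media, as the API does to /tmp."""
    prompt_id = uuid.uuid4().hex
    payload = {"id": prompt_id, "text": text}
    if image_bytes is not None:
        payload["image_b64"] = base64.b64encode(image_bytes).decode("ascii")
    path = Path(tmp_dir) / f"prompt_{prompt_id}.json"
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path

# A Rank-0 poller would then scan tmp_dir for prompt_*.json files and process them.
```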
- Output data (Simulation → API → UI):
  - Status: the Rank 0 process periodically POSTs the current simulation state (per-node status, edge activity, PFC entropy, etc.) to the API endpoint `/api/distributed_brain/status`. The UI polls this endpoint to update the display in real time.
  - Result: when the simulation completes a task, it writes the final text result as a result file in the `/tmp` directory. The UI polls the API endpoint `/api/distributed_brain/result`, which reads the corresponding result file, returns its contents to the UI, and deletes the file.
  - Logs: each simulation process (Rank 0, 1, ...) writes its log to a file named `/tmp/sim_rank_{rank}.log`. The UI reads and displays the logs for the selected node (optionally via the API).
  - Artifacts: during the simulation, each process can upload internal-state tensors (spikes, membrane potentials, etc.) to the database as `.pt` artifact files via the `upload_artifact` method of `EvoSpikeNetAPIClient`.
Test validation
Test file
- File: `tests/unit/test_data*.py` (37 test cases)
- Test contents:
- Verification of data upload function
- Spike data generation and format confirmation
- Text corpus processing and validation
- RAG knowledge base construction test
- Integrated processing of multimodal datasets
- Distributed management of federated learning data
Test results
✅ All tests passed (37/37)