EvoSpikeNet: Data handling guide
Copyright: 2026 Moonlight Technologies Inc. All Rights Reserved.
Author: Masahiro Aoki
Last updated: January 8, 2026
Purpose and use of this document
- Purpose: To provide an overview of the creation, formatting, and verification procedures of various data (spike/text/RAG/multimodal).
- Target audience: Data pipeline personnel, research/learning operations personnel.
- First reading order: Data upload → Spike data → Text corpus → Knowledge base → Multimodal dataset.
Related links:
- `examples/run_zenoh_distributed_brain.py` for the distributed brain script; `docs/implementation/PFC_ZENOH_EXECUTIVE.md` for PFC/Zenoh/Executive details.
- Implementation notes (artifacts): see `docs/implementation/ARTIFACT_MANIFESTS.md` for the `artifact_manifest.json` generated by the training job and the rules for uploading it.
This document details how to create, format, and validate various AI data used in the EvoSpikeNet framework, including spike data, text corpora, RAG knowledge bases, and multimodal datasets.
0. Data upload function
EvoSpikeNet provides the ability to upload training data to an API server and enable sharing in a distributed environment. This allows LLM training to be performed using uploaded datasets as well as local files.
Data formats that can be uploaded
- Multimodal data: image and caption pairs (`captions.csv` + `images/` directory)
- Future expansion planned: support for audio data, text corpora, etc.
Upload steps
1. Data preparation: create the training data locally.
2. Run script: start training with the `--upload-data` flag.
3. API upload: the data is automatically zipped and uploaded to the API server.
4. Shared use: the uploaded data can be reused by other users and systems.
Usage example
```bash
# Training with data upload
python examples/train_multi_modal_lm.py \
  --mode train \
  --dataset custom \
  --data-dir your_data_dir \
  --run-name your_run \
  --upload-data \
  --data-name your_dataset_name \
  --epochs 10 \
  --batch-size 4
```
If the API server is not available
If the API server is unavailable, it automatically falls back to training on local data. Data upload will be skipped and training will continue using local files.
1. Spike data generation and formatting
Spike data, the direct input to the SNN model, is represented as a `torch.Tensor`.
- Format: `torch.Tensor`
- Shape: 2D tensor of shape `(time_steps, num_input_neurons)`
- dtype: `torch.int8`
- Values: `0` (no spike) or `1` (spike)
Artificial spike data for testing can be generated on the "SNN Models" page under the "Data Generation" menu in the UI.
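Artificial spike data matching this format can also be sketched directly in code; the `generate_spike_data` helper below is illustrative and not part of the framework:

```python
import torch

def generate_spike_data(time_steps: int, num_input_neurons: int,
                        firing_rate: float = 0.1) -> torch.Tensor:
    """Generate random spike data in the (time_steps, num_input_neurons) int8 format."""
    # Each element fires (value 1) independently with probability `firing_rate`.
    return (torch.rand(time_steps, num_input_neurons) < firing_rate).to(torch.int8)

spikes = generate_spike_data(time_steps=100, num_input_neurons=784)
print(spikes.shape)  # torch.Size([100, 784])
print(spikes.dtype)  # torch.int8
```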
2. Text corpus
For training language models (EvoSpikeNetLM, SpikingEvoSpikeNetLM), various data sources supported by the evospikenet/dataloaders.py module can be used.
- `WikipediaLoader`: dynamically loads Wikipedia articles.
- `AozoraBunkoLoader`: extracts text from Aozora Bunko HTML pages.
- `LocalFileLoader`: loads a local text file.
These loaders are utilized in training scripts such as examples/train_spiking_evospikenet_lm.py.
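As a rough illustration of what such a loader does, here is a simplified stand-in (this is not the actual `LocalFileLoader` from `evospikenet/dataloaders.py`, whose interface may differ):

```python
from pathlib import Path

class SimpleLocalFileLoader:
    """Minimal illustrative loader: reads a UTF-8 text file into a corpus string."""

    def __init__(self, path: str):
        self.path = Path(path)

    def load(self) -> str:
        # Returns the full file contents; a real loader may also clean or chunk the text.
        return self.path.read_text(encoding="utf-8")

# Usage (assuming corpus.txt exists):
# text = SimpleLocalFileLoader("corpus.txt").load()
```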
3. RAG Knowledge Base Management
The Retrieval-Augmented Generation (RAG) feature stores external knowledge in Milvus and Elasticsearch. Data management is mainly done from the UI.
- Data structure: each document consists of the following fields: `id` (unique), `embedding` (vector), `text` (body), and `source`.
- CRUD operations via UI: the Data Management tab on the RAG System page is the main interface for directly managing the knowledge base.
  - Create: add a row with the `add row` button and enter `text` and `source`; the `embedding` is generated and saved automatically.
  - Read: all data in Milvus is displayed in a table.
  - Update: editing a table cell directly updates the database in real time.
  - Delete: select a row and remove it with the `delete row` button.
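Putting the fields together, a single knowledge-base entry can be pictured as the dictionary below (the embedding dimension of 384 is an arbitrary placeholder; the real dimension depends on the configured embedding model):

```python
import random

# Illustrative RAG document matching the fields described above.
doc = {
    "id": 1,                                             # unique identifier
    "embedding": [random.random() for _ in range(384)],  # vector (dimension is model-dependent)
    "text": "EvoSpikeNet stores external knowledge in Milvus and Elasticsearch.",
    "source": "docs/DATA_HANDLING.md",
}

required_fields = {"id", "embedding", "text", "source"}
assert required_fields <= doc.keys()
```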
4. Multimodal dataset
MultiModalEvoSpikeNetLM is trained on image and caption pairs.
- Directory structure:
  ```
  data/multi_modal_dataset/
  ├── images/          (stores image files)
  └── captions.csv     (maps image paths to captions)
  ```
- `captions.csv` format:
  ```csv
  image_path,caption
  images/image_0.png,"Caption 1"
  images/image_1.jpg,"Caption 2"
  ```

This dataset is used for model training with the `examples/train_multi_modal_lm.py` script.
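The layout can be sanity-checked with a short standard-library script (the `validate_captions` helper is illustrative, not part of the framework):

```python
import csv
from pathlib import Path

def validate_captions(dataset_dir: str) -> list:
    """Read captions.csv and check that each referenced image file exists."""
    root = Path(dataset_dir)
    rows = []
    with open(root / "captions.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            assert {"image_path", "caption"} <= row.keys(), "missing required columns"
            assert (root / row["image_path"]).exists(), f"missing image: {row['image_path']}"
            rows.append(row)
    return rows

# Usage: rows = validate_captions("data/multi_modal_dataset")
```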
5. Data for visualization
For interactive analysis in the UI and detailed offline visualization, the framework saves neuron activity data in .pt format.
- Data structure: all files are saved as dictionaries with keys such as `spikes`, `membrane_potential`, etc.
- Where they are generated:
  - RAG Chat: can save neuron data when the SNN backend is selected.
  - Spiking LM Chat: can save neuron data when generating text.
  - SNN Models: generated when running a 4-layer SNN simulation (e.g. `4_layer_snn_data_lif.pt`).
- How to use: the generated `.pt` file can be uploaded to the "Generic Visualization" page in the "Data Analysis" menu for re-visualization, or analyzed offline with the `examples/visualize_*.py` scripts.
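Offline, such a file can be created and inspected with `torch.save`/`torch.load`; the snippet below fabricates a small activity dictionary in the same format (keys in real files may vary by simulation):

```python
import torch

# Build a small example activity file in the dictionary format described above.
example = {
    "spikes": (torch.rand(50, 4) < 0.2).to(torch.int8),  # (time_steps, neurons)
    "membrane_potential": torch.randn(50, 4),
}
torch.save(example, "example_snn_data.pt")

# Reload it the same way an offline analysis script would.
data = torch.load("example_snn_data.pt")
print(sorted(data.keys()))  # ['membrane_potential', 'spikes']
```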
6. Synthetic data generation (Data Distillation)
The `evospikenet/distillation.py` module provides functionality to generate high-quality synthetic data using an LLM (e.g. OpenAI). This helps you efficiently build datasets for specific tasks (sentiment analysis, QA-pair generation, etc.).
You can run it with the `Distill Data` command on the System Utilities page of the System Settings menu, specifying the task type, number of samples, and prompt.
7. Audio data
Multimodal models support voice input.
- Format: standard audio file formats supported by `torchaudio`, such as `.wav`, `.mp3`, and `.flac`.
- Available from the UI: audio files can be uploaded from the "Brain Simulation" tab on the "Distributed Brain" page and used as simulation input alongside text prompts and images.
- Data processing: on the backend, uploaded files are loaded into a waveform and sample rate with `torchaudio.load` and preprocessed into a format that the model's `SpikingAudioEncoder` can process.
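To illustrate the waveform-to-spike step without a real audio file, the sketch below substitutes a synthetic sine wave for the output of `torchaudio.load` and applies a naive threshold encoding (the real `SpikingAudioEncoder` pipeline is more involved; only the tensor shapes here are representative):

```python
import torch

# Stand-in for `waveform, sample_rate = torchaudio.load("input.wav")`:
sample_rate = 16_000
t = torch.arange(sample_rate) / sample_rate
waveform = torch.sin(2 * torch.pi * 440.0 * t).unsqueeze(0)  # (channels=1, samples)

def amplitude_to_spikes(waveform: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Naive threshold encoding: emit a spike wherever |amplitude| exceeds the threshold."""
    return (waveform.abs() > threshold).to(torch.int8)

spikes = amplitude_to_spikes(waveform)
print(spikes.shape, spikes.dtype)  # torch.Size([1, 16000]) torch.int8
```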
8. Federated training dataset
With federated learning, each client maintains an independent local dataset.
- Format: CSV (`.csv`) file.
- Data structure: the current implementation assumes a text classification task and requires each row to consist of two columns, `text` and `label`.
- How to use: specify the path to the local CSV file with the `--data-path` argument when running the `examples/run_fl_client.py` script.
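A minimal client dataset in this two-column format can be generated as follows (the texts and labels are placeholders, and whether labels are strings or integers depends on the client implementation):

```python
import csv

rows = [
    {"text": "This framework is great!", "label": "positive"},
    {"text": "The upload failed again.", "label": "negative"},
]

# Write the two-column CSV expected by the federated learning client.
with open("client_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Then: python examples/run_fl_client.py --data-path client_data.csv
```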
9. Data flow of distributed brain simulation
Distributed brain simulations exchange several kinds of data between the UI, the API, and the simulation processes.
- Input data (UI → API → Simulation):
  - Users enter a text prompt and upload image and audio files in the Distributed Brain UI.
  - Pressing the "Execute Query" button Base64-encodes the media files and sends them, together with the text, to the API endpoint `/api/distributed_brain/prompt`.
  - The API writes the received prompt data to the server's `/tmp` directory as a JSON file with a unique ID, alongside the related media files.
  - Simulation processes (in particular the Rank 0 PFC) periodically scan (poll) this `/tmp` directory, detect new prompt files, and start processing them.
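The prompt hand-off can be sketched like this (the file-name pattern and JSON field names are illustrative assumptions; the actual schema is defined by the API implementation):

```python
import base64
import json
import uuid
from pathlib import Path
from typing import Optional

def write_prompt_file(tmp_dir: str, text: str,
                      image_bytes: Optional[bytes] = None) -> Path:
    """Write a prompt JSON with a unique ID and Base64-encoded media, as the API does to /tmp."""
    prompt_id = uuid.uuid4().hex
    payload = {"id": prompt_id, "text": text}
    if image_bytes is not None:
        payload["image_b64"] = base64.b64encode(image_bytes).decode("ascii")
    path = Path(tmp_dir) / f"prompt_{prompt_id}.json"
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path

# A Rank-0 poller would then scan tmp_dir for prompt_*.json files and process them.
```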
- Output data (Simulation → API → UI):
  - Status: the Rank 0 process periodically POSTs the current simulation state (per-node status, edge activity, PFC entropy, etc.) to the API endpoint `/api/distributed_brain/status`. The UI polls this endpoint to update the display in real time.
  - Result: when the simulation completes a task, it writes the final text result as a result file in the `/tmp` directory. The UI polls the API endpoint `/api/distributed_brain/result`, which reads the corresponding result file, returns its contents to the UI, and deletes the file.
  - Logs: each simulation process (Rank 0, 1, ...) writes its log to a file named `/tmp/sim_rank_{rank}.log`. The UI reads and displays the logs for the selected node (optionally via the API).
  - Artifacts: during the simulation, each process can upload internal-state tensors (spikes, membrane potentials, etc.) to the database as `.pt` artifact files via the `upload_artifact` method of `EvoSpikeNetAPIClient`.
Test validation
Test file
- File: `tests/unit/test_data*.py` (37 test cases)
- Test contents:
- Verification of data upload function
- Spike data generation and format confirmation
- Text corpus processing and validation
- RAG knowledge base construction test
- Integrated processing of multimodal datasets
- Distributed management of federated learning data
Test results
✅ All tests passed (37/37)