Test data management guide
> [!NOTE]
> For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).
Overview
The EvoSpikeNet project employs the following strategies for managing test data:
- Dynamic generation: small, simple test data is generated on the fly by fixtures
- Git LFS: large test data is versioned with Git LFS
- Automatic setup: the test environment is built automatically by conftest.py
Classification of test data
Dynamic generation (recommended)
Target:
- File size < 1 MB
- Random data or simple patterns
- Generation time < 100 ms
- Data that does not require reproducibility
Benefits:
- Small repository size
- Always up-to-date format
- No external dependencies
- Easy maintenance
Example:
```python
import pytest


@pytest.fixture
def dummy_audio_data(test_data_dir):
    """Generate dummy audio data dynamically."""
    import struct
    import wave

    import numpy as np

    audio_file = test_data_dir / "generated_audio.wav"

    # Generate 1 second of audio at 16 kHz
    sample_rate = 16000
    duration = 1.0
    frequency = 440.0  # A4 note
    num_samples = int(sample_rate * duration)

    # Generate a sine wave
    samples = []
    for i in range(num_samples):
        value = int(32767.0 * 0.3 * np.sin(2.0 * np.pi * frequency * i / sample_rate))
        samples.append(value)

    # Write the WAV file
    with wave.open(str(audio_file), 'w') as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack('h' * len(samples), *samples))

    return audio_file
```
Git LFS management
Target:
- File size > 1 MB
- Real data and real samples
- Data that takes time to generate
- Data that requires reproducibility
Benefits:
- Realistic test cases
- Consistent testing
- Faster CI/CD (no generation required)
Disadvantages:
- Requires LFS storage
- Setup required
- Requires update management
Example:
```python
import inspect
from pathlib import Path

import pytest


@pytest.fixture
def large_pretrained_model():
    """Load a large pre-trained model from Git LFS."""
    model_file = Path(__file__).parent / "data" / "large_model_weights.pt"
    if not model_file.exists():
        pytest.skip("Large model file not available. Run: git lfs pull")

    import torch

    # Prefer weights_only=True when the installed torch version supports it
    load_kwargs = {"map_location": "cpu"}
    try:
        sig = inspect.signature(torch.load)
        if "weights_only" in sig.parameters:
            load_kwargs["weights_only"] = True
    except Exception:
        pass

    weights = torch.load(model_file, **load_kwargs)
    return weights
```
Available Fixtures
Session scope
test_data_dir
Provides a temporary test data directory. Created for each session and cleaned up automatically.
```python
def test_example(test_data_dir):
    # test_data_dir is a Path object
    my_file = test_data_dir / "mydata.txt"
    my_file.write_text("test content")
```
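This guide only shows usage; as a rough illustration, a session-scoped test_data_dir fixture could be built on pytest's built-in tmp_path_factory (a minimal sketch, not the project's actual conftest.py):
```python
import pytest


@pytest.fixture(scope="session")
def test_data_dir(tmp_path_factory):
    """Session-wide temporary directory for generated test data (sketch)."""
    # tmp_path_factory creates the directory once per session, and pytest
    # removes old temporary base directories automatically.
    return tmp_path_factory.mktemp("test_data")
```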
setup_test_environment
Automatically sets up the test environment (autouse=True). It creates the following directories:
- models/: for model files
- artifacts/: for artifacts
- logs/: for log files
- cache/: for cache files
It also sets the following environment variables:
- TEST_DATA_ROOT: Root path of test data
- TEST_MODE: set to "true"
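The actual implementation lives in conftest.py; a minimal sketch of such an autouse fixture, assuming the directory names and environment variables listed above, might look like this:
```python
import os

import pytest


@pytest.fixture(scope="session", autouse=True)
def setup_test_environment(test_data_dir):
    """Prepare the shared test environment (sketch; the real conftest.py may differ)."""
    # Create the standard sub-directories
    for name in ("models", "artifacts", "logs", "cache"):
        (test_data_dir / name).mkdir(parents=True, exist_ok=True)

    # Expose the layout to the code under test
    os.environ["TEST_DATA_ROOT"] = str(test_data_dir)
    os.environ["TEST_MODE"] = "true"
    yield
```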
Function scope
Audio/Video
dummy_audio_data(test_data_dir)
- 1 second of a 440 Hz sine wave (A4 note)
- 16 kHz sampling rate
- Mono, 16-bit
- WAV format
```python
def test_audio_processing(dummy_audio_data):
    import wave

    with wave.open(str(dummy_audio_data), 'r') as f:
        assert f.getnchannels() == 1
        assert f.getframerate() == 16000
```
Text
dummy_text_corpus(test_data_dir)
- Multi-line text corpus
- UTF-8 encoding
- Saved as a plain text file
```python
def test_text_tokenization(dummy_text_corpus):
    with open(dummy_text_corpus, 'r', encoding='utf-8') as f:
        text = f.read()
    # Test text processing on the corpus
```
Images
dummy_image_data(test_data_dir)
- MNIST-style images (28x28, grayscale)
- Batch size: 4
- PyTorch tensor or NumPy array
- Also saved to disk (.pt or .npy)
```python
@pytest.mark.requires_torch
def test_image_classification(dummy_image_data):
    images = dummy_image_data['tensor']
    assert images.shape == (4, 1, 28, 28)
```
Neural Data
dummy_spike_train(test_data_dir)
- 1000 timesteps x 100 neurons (array shape (1000, 100))
- Sparse (5% firing probability)
- Binary (0/1)
- NumPy uint8 format
```python
import numpy as np


def test_spike_analysis(dummy_spike_train):
    spikes = dummy_spike_train['data']
    assert spikes.shape == (1000, 100)
    assert spikes.dtype == np.uint8
```
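For illustration, such a fixture could be generated with NumPy along these lines (a minimal sketch assuming the shape and sparsity above; the returned 'path' key is hypothetical):
```python
import numpy as np
import pytest


@pytest.fixture
def dummy_spike_train(test_data_dir):
    """Sparse binary spike train: 1000 timesteps x 100 neurons (sketch)."""
    rng = np.random.default_rng(seed=42)
    spikes = (rng.random((1000, 100)) < 0.05).astype(np.uint8)  # ~5% firing rate
    spike_file = test_data_dir / "spike_train.npy"
    np.save(spike_file, spikes)
    return {"data": spikes, "path": spike_file}
```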
dummy_embeddings(test_data_dir)
- 100 samples x 128 dimensions
- L2-normalized vectors (unit length)
- NumPy float32 format
```python
import numpy as np


def test_similarity_search(dummy_embeddings):
    embeddings = dummy_embeddings['data']
    # Each vector is normalized to unit length
    norms = np.linalg.norm(embeddings, axis=1)
    assert np.allclose(norms, 1.0, atol=1e-5)
```
Structured Data
dummy_csv_data(test_data_dir)
- Simple CSV data
- With a header row
- Numerical and categorical columns
```python
def test_data_loading(dummy_csv_data):
    import pandas as pd

    df = pd.read_csv(dummy_csv_data)
    assert len(df) == 4
    assert 'label' in df.columns
```
Model Weights
dummy_model_weights(test_data_dir)
- Simple neural-network weights
- 2 layers (64 → 32 units, 128-dimensional input)
- PyTorch state_dict or NumPy dict
```python
@pytest.mark.requires_torch
def test_model_loading(dummy_model_weights):
    weights = dummy_model_weights['weights']
    assert 'layer1.weight' in weights
    assert weights['layer1.weight'].shape == (64, 128)
```
Multimodal
mock_multimodal_data(dummy_image_data, dummy_audio_data, dummy_text_corpus)
- Combines the image, audio, and text fixtures
- For multimodal tests
```python
def test_multimodal_processing(mock_multimodal_data):
    image = mock_multimodal_data['image']
    audio = mock_multimodal_data['audio']
    text = mock_multimodal_data['text']
    # Test multimodal processing on the combined inputs
```
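As an illustration, the combining fixture could simply bundle the three underlying fixtures into a dict (a minimal sketch, not necessarily the project's implementation):
```python
import pytest


@pytest.fixture
def mock_multimodal_data(dummy_image_data, dummy_audio_data, dummy_text_corpus):
    """Bundle image, audio, and text fixtures for multimodal tests (sketch)."""
    return {
        "image": dummy_image_data,
        "audio": dummy_audio_data,
        "text": dummy_text_corpus,
    }
```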
Dynamic size support
large_test_dataset(test_data_dir, request)
- Size is selected with a marker
- Git LFS aware (uses the LFS file if it exists)
- Generated dynamically otherwise
```python
@pytest.mark.dataset_size("medium")
def test_with_medium_dataset(large_test_dataset):
    data = large_test_dataset['data']
    # data contains 1000 samples
    assert len(data) == 1000


@pytest.mark.dataset_size("large")
def test_with_large_dataset(large_test_dataset):
    data = large_test_dataset['data']
    # data contains 10000 samples (loaded from Git LFS or generated)
    assert len(data) == 10000
```
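A minimal sketch of how such a marker-driven fixture could be implemented, assuming NumPy, the size names above, and the tests/data/large_dataset_<size>.npy naming used later in this guide (the sample counts in _SIZES are illustrative):
```python
from pathlib import Path

import numpy as np
import pytest

# Illustrative mapping from marker argument to sample count
_SIZES = {"small": 100, "medium": 1000, "large": 10000}


@pytest.fixture
def large_test_dataset(test_data_dir, request):
    """Load a dataset from Git LFS if available, otherwise generate it (sketch)."""
    marker = request.node.get_closest_marker("dataset_size")
    size = marker.args[0] if marker else "small"
    num_samples = _SIZES[size]

    # Prefer a committed Git LFS file when it has been pulled
    lfs_file = Path(__file__).parent / "data" / f"large_dataset_{size}.npy"
    if lfs_file.exists():
        data = np.load(lfs_file)
    else:
        # Fall back to cheap dynamic generation
        data = np.random.randn(num_samples, 10).astype(np.float32)

    return {"data": data, "size": size}
```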
Git LFS setup
Initial setup
```bash
# Install Git LFS
brew install git-lfs          # macOS
sudo apt-get install git-lfs  # Ubuntu/Debian

# Enable Git LFS for your repository
git lfs install

# Download LFS files
git lfs pull
```
Add a new file to LFS
```bash
# Create large test data
python scripts/generate_test_data.py --size large --output tests/data/large_dataset_large.npy

# Track with Git LFS
git lfs track "tests/data/large_dataset_large.npy"

# Commit (along with .gitattributes)
git add .gitattributes tests/data/large_dataset_large.npy
git commit -m "Add large test dataset (Git LFS)"
git push
```
Check LFS files
```bash
# List files tracked by LFS
git lfs ls-files

# Show the status of LFS files in the working tree
git lfs status
```
Best practices
1. Dynamically generate small data
```python
# ✅ Good example: dynamic generation
@pytest.fixture
def small_test_data():
    return np.random.randn(100, 10)

# ❌ Bad example: saving small data to a file
@pytest.fixture
def small_test_data():
    return np.load("tests/data/small_data.npy")  # Unnecessary
```
2. Git LFS for large data
```python
# ✅ Good example: load from Git LFS
@pytest.fixture
def large_pretrained_model():
    model_file = Path(__file__).parent / "data" / "pretrained_model.pt"
    if model_file.exists():
        import inspect

        # Use weights_only=True when the installed torch version supports it
        load_kwargs = {"map_location": "cpu"}
        try:
            sig = inspect.signature(torch.load)
            if "weights_only" in sig.parameters:
                load_kwargs["weights_only"] = True
        except Exception:
            pass
        return torch.load(model_file, **load_kwargs)
    pytest.skip("Model not available")

# ❌ Bad example: dynamically generating large data (slow)
@pytest.fixture
def large_pretrained_model():
    # Trains from scratch every time (takes minutes)
    return train_large_model()
```
3. Use temporary directory
```python
# ✅ Good example: use test_data_dir
def test_file_writing(test_data_dir):
    output_file = test_data_dir / "output.txt"
    output_file.write_text("test")
    # Cleaned up automatically

# ❌ Bad example: write to the current directory
def test_file_writing():
    with open("output.txt", "w") as f:
        f.write("test")
    # Cleanup is easy to forget
```
4. Reusing Fixtures
```python
# ✅ Good example: combine existing fixtures
@pytest.fixture
def prepared_dataset(dummy_image_data, test_data_dir):
    images = dummy_image_data['tensor']
    # Preprocessing
    normalized = (images - images.mean()) / images.std()
    return normalized

# ❌ Bad example: implementing everything from scratch
@pytest.fixture
def prepared_dataset(test_data_dir):
    # Regenerates the images from scratch
    images = torch.rand(4, 1, 28, 28)  # Duplicates dummy_image_data
    normalized = (images - images.mean()) / images.std()
    return normalized
```
5. Conditional Skip
```python
# ✅ Good example: skip if the LFS file is missing
@pytest.fixture
def real_world_data():
    data_file = Path(__file__).parent / "data" / "real_data.npy"
    if not data_file.exists():
        pytest.skip("Real data not available (run 'git lfs pull')")
    return np.load(data_file)

# ❌ Bad example: error if the file does not exist
@pytest.fixture
def real_world_data():
    data_file = Path(__file__).parent / "data" / "real_data.npy"
    return np.load(data_file)  # Raises FileNotFoundError
```
CI/CD considerations
GitHub Actions
```yaml
name: Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout (no automatic LFS download)
        uses: actions/checkout@v3
        with:
          lfs: false

      - name: Pull LFS files (selective)
        run: |
          # Fetch only small files (to reduce LFS bandwidth cost)
          git lfs pull --include="tests/data/small_*.npy"

      - name: Run tests
        run: |
          # Skip tests that require LFS files
          pytest -m "not requires_lfs" tests/
```
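Filtering with -m "not requires_lfs" assumes the custom markers used in this guide (requires_lfs, requires_torch, dataset_size) are registered. A minimal sketch of that registration in conftest.py (the project may instead declare them in pytest.ini or pyproject.toml):
```python
# conftest.py (sketch): register the custom markers used in this guide
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "requires_lfs: test needs files fetched with 'git lfs pull'")
    config.addinivalue_line(
        "markers", "requires_torch: test needs PyTorch installed")
    config.addinivalue_line(
        "markers", "dataset_size(size): dataset size for large_test_dataset")
```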
Local development
```bash
# Clone without downloading LFS files (faster)
GIT_LFS_SKIP_SMUDGE=1 git clone <repo>
cd <repo>

# Fetch only the necessary LFS files
git lfs pull --include="tests/data/specific_file.npy"

# ...or fetch all LFS files
git lfs pull
```
Troubleshooting
LFS files are not downloaded
```bash
# Check the status of LFS files
git lfs ls-files

# Force a re-download
git lfs fetch --all
git lfs checkout
```
Test fails with "file not found"
```python
# Implement a skip in the fixture
@pytest.fixture
def optional_large_data():
    data_file = Path(__file__).parent / "data" / "large_data.npy"
    if not data_file.exists():
        pytest.skip("Large data not available. Run: git lfs pull")
    return np.load(data_file)
```
Storage quota exceeded
```bash
# Remove old LFS objects from the local cache
git lfs prune

# Stop tracking a specific file
git lfs untrack "tests/data/old_file.npy"
git rm tests/data/old_file.npy
git add .gitattributes
git commit -m "Remove old LFS file"
```