Test data management guide
> [!NOTE]
> For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).
Overview
The EvoSpikeNet project employs the following strategies for managing test data:
- Dynamic generation: small, simple test data is generated on the fly by fixtures
- Git LFS: large test data is versioned with Git LFS
- Automatic setup: the test environment is built automatically by conftest.py
Classification of test data
Dynamic generation (recommended)
Target:
- File size < 1 MB
- Random data or simple patterns
- Generation time < 100 ms
- Data that does not require reproducibility
Benefits:
- Small repository size
- Always up-to-date format
- No external dependencies
- Easy maintenance
Example:
```python
import pytest


@pytest.fixture
def dummy_audio_data(test_data_dir):
    """Generate dummy audio data dynamically."""
    import struct
    import wave

    import numpy as np

    audio_file = test_data_dir / "generated_audio.wav"

    # Generate 1 second of audio at 16 kHz
    sample_rate = 16000
    duration = 1.0
    frequency = 440.0  # A4 note
    num_samples = int(sample_rate * duration)

    # Generate a sine wave
    samples = []
    for i in range(num_samples):
        value = int(32767.0 * 0.3 * np.sin(2.0 * np.pi * frequency * i / sample_rate))
        samples.append(value)

    # Write the WAV file
    with wave.open(str(audio_file), 'w') as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack('h' * len(samples), *samples))

    return audio_file
```
Git LFS management
Target:
- File size > 1 MB
- Real data and real samples
- Data that takes time to generate
- Data that requires reproducibility
Benefits:
- Realistic test cases
- Consistent testing
- Faster CI/CD (no generation required)
Disadvantages:
- Requires LFS storage
- Setup required
- Requires update management
Example:
```python
import inspect
from pathlib import Path

import pytest


@pytest.fixture
def large_pretrained_model():
    """Load a large pre-trained model from Git LFS."""
    model_file = Path(__file__).parent / "data" / "large_model_weights.pt"
    if not model_file.exists():
        pytest.skip("Large model file not available. Run: git lfs pull")

    import torch

    # Prefer weights_only=True when the installed torch version supports it
    load_kwargs = {"map_location": "cpu"}
    try:
        sig = inspect.signature(torch.load)
        if "weights_only" in sig.parameters:
            load_kwargs["weights_only"] = True
    except Exception:
        pass

    weights = torch.load(model_file, **load_kwargs)
    return weights
```
Available Fixtures
Session scope
test_data_dir
Provides a temporary test data directory. Created for each session and cleaned up automatically.
```python
def test_example(test_data_dir):
    # test_data_dir is a Path object
    my_file = test_data_dir / "mydata.txt"
    my_file.write_text("test content")
```
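This guide only shows usage; as a rough illustration, a session-scoped test_data_dir fixture could be built on pytest's built-in tmp_path_factory (a minimal sketch, not the project's actual conftest.py):
```python
import pytest


@pytest.fixture(scope="session")
def test_data_dir(tmp_path_factory):
    """Session-wide temporary directory for generated test data (sketch)."""
    # tmp_path_factory creates the directory once per session, and pytest
    # removes old temporary base directories automatically.
    return tmp_path_factory.mktemp("test_data")
```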
setup_test_environment
Automatically sets up the test environment (autouse=True). It creates the following directories:
- models/: for model files
- artifacts/: for artifacts
- logs/: for log files
- cache/: for cache files
It also sets the following environment variables:
- TEST_DATA_ROOT: Root path of test data
- TEST_MODE: set to "true"
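The actual implementation lives in conftest.py; a minimal sketch of such an autouse fixture, assuming the directory names and environment variables listed above, might look like this:
```python
import os

import pytest


@pytest.fixture(scope="session", autouse=True)
def setup_test_environment(test_data_dir):
    """Prepare the shared test environment (sketch; the real conftest.py may differ)."""
    # Create the standard sub-directories
    for name in ("models", "artifacts", "logs", "cache"):
        (test_data_dir / name).mkdir(parents=True, exist_ok=True)

    # Expose the layout to the code under test
    os.environ["TEST_DATA_ROOT"] = str(test_data_dir)
    os.environ["TEST_MODE"] = "true"
    yield
```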
Function scope
Audio/Video
dummy_audio_data(test_data_dir)
- 1 second of a 440 Hz sine wave (A4 note)
- 16 kHz sampling rate
- Mono, 16-bit
- WAV format
```python
def test_audio_processing(dummy_audio_data):
    import wave

    with wave.open(str(dummy_audio_data), 'r') as f:
        assert f.getnchannels() == 1
        assert f.getframerate() == 16000
```
Text
dummy_text_corpus(test_data_dir)
- Multi-line text corpus
- UTF-8 encoding
- Saved as a plain text file
```python
def test_text_tokenization(dummy_text_corpus):
    with open(dummy_text_corpus, 'r', encoding='utf-8') as f:
        text = f.read()
    # Test text processing on the corpus
```
Images
dummy_image_data(test_data_dir)
- MNIST-style images (28x28, grayscale)
- Batch size: 4
- PyTorch tensor or NumPy array
- Also saved to disk (.pt or .npy)
```python
@pytest.mark.requires_torch
def test_image_classification(dummy_image_data):
    images = dummy_image_data['tensor']
    assert images.shape == (4, 1, 28, 28)
```
Neural Data
dummy_spike_train(test_data_dir)
- 1000 timesteps x 100 neurons (array shape (1000, 100))
- Sparse (5% firing probability)
- Binary (0/1)
- NumPy uint8 format
```python
import numpy as np


def test_spike_analysis(dummy_spike_train):
    spikes = dummy_spike_train['data']
    assert spikes.shape == (1000, 100)
    assert spikes.dtype == np.uint8
```
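For illustration, such a fixture could be generated with NumPy along these lines (a minimal sketch assuming the shape and sparsity above; the returned 'path' key is hypothetical):
```python
import numpy as np
import pytest


@pytest.fixture
def dummy_spike_train(test_data_dir):
    """Sparse binary spike train: 1000 timesteps x 100 neurons (sketch)."""
    rng = np.random.default_rng(seed=42)
    spikes = (rng.random((1000, 100)) < 0.05).astype(np.uint8)  # ~5% firing rate
    spike_file = test_data_dir / "spike_train.npy"
    np.save(spike_file, spikes)
    return {"data": spikes, "path": spike_file}
```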
dummy_embeddings(test_data_dir)
- 100 samples x 128 dimensions
- L2-normalized vectors (unit length)
- NumPy float32 format
```python
import numpy as np


def test_similarity_search(dummy_embeddings):
    embeddings = dummy_embeddings['data']
    # Each vector is normalized to unit length
    norms = np.linalg.norm(embeddings, axis=1)
    assert np.allclose(norms, 1.0, atol=1e-5)
```
Structured Data
dummy_csv_data(test_data_dir)
- Simple CSV data
- With a header row
- Numerical and categorical columns
```python
def test_data_loading(dummy_csv_data):
    import pandas as pd

    df = pd.read_csv(dummy_csv_data)
    assert len(df) == 4
    assert 'label' in df.columns
```
Model Weights
dummy_model_weights(test_data_dir)
- Simple neural-network weights
- 2 layers (64 → 32 units, 128-dimensional input)
- PyTorch state_dict or NumPy dict
```python
@pytest.mark.requires_torch
def test_model_loading(dummy_model_weights):
    weights = dummy_model_weights['weights']
    assert 'layer1.weight' in weights
    assert weights['layer1.weight'].shape == (64, 128)
```
Multimodal
mock_multimodal_data(dummy_image_data, dummy_audio_data, dummy_text_corpus)
- Combines the image, audio, and text fixtures
- For multimodal tests
```python
def test_multimodal_processing(mock_multimodal_data):
    image = mock_multimodal_data['image']
    audio = mock_multimodal_data['audio']
    text = mock_multimodal_data['text']
    # Test multimodal processing on the combined inputs
```
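As an illustration, the combining fixture could simply bundle the three underlying fixtures into a dict (a minimal sketch, not necessarily the project's implementation):
```python
import pytest


@pytest.fixture
def mock_multimodal_data(dummy_image_data, dummy_audio_data, dummy_text_corpus):
    """Bundle image, audio, and text fixtures for multimodal tests (sketch)."""
    return {
        "image": dummy_image_data,
        "audio": dummy_audio_data,
        "text": dummy_text_corpus,
    }
```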
Dynamic size support
large_test_dataset(test_data_dir, request)
- Size is selected with a marker
- Git LFS aware (uses the LFS file if it exists)
- Generated dynamically otherwise
```python
@pytest.mark.dataset_size("medium")
def test_with_medium_dataset(large_test_dataset):
    data = large_test_dataset['data']
    # data contains 1000 samples
    assert len(data) == 1000


@pytest.mark.dataset_size("large")
def test_with_large_dataset(large_test_dataset):
    data = large_test_dataset['data']
    # data contains 10000 samples (loaded from Git LFS or generated)
    assert len(data) == 10000
```
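A minimal sketch of how such a marker-driven fixture could be implemented, assuming NumPy, the size names above, and the tests/data/large_dataset_<size>.npy naming used later in this guide (the sample counts in _SIZES are illustrative):
```python
from pathlib import Path

import numpy as np
import pytest

# Illustrative mapping from marker argument to sample count
_SIZES = {"small": 100, "medium": 1000, "large": 10000}


@pytest.fixture
def large_test_dataset(test_data_dir, request):
    """Load a dataset from Git LFS if available, otherwise generate it (sketch)."""
    marker = request.node.get_closest_marker("dataset_size")
    size = marker.args[0] if marker else "small"
    num_samples = _SIZES[size]

    # Prefer a committed Git LFS file when it has been pulled
    lfs_file = Path(__file__).parent / "data" / f"large_dataset_{size}.npy"
    if lfs_file.exists():
        data = np.load(lfs_file)
    else:
        # Fall back to cheap dynamic generation
        data = np.random.randn(num_samples, 10).astype(np.float32)

    return {"data": data, "size": size}
```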
Git LFS setup
Initial setup
```bash
# Install Git LFS
brew install git-lfs          # macOS
sudo apt-get install git-lfs  # Ubuntu/Debian

# Enable Git LFS for your repository
git lfs install

# Download LFS files
git lfs pull
```
Add a new file to LFS
```bash
# Create large test data
python scripts/generate_test_data.py --size large --output tests/data/large_dataset_large.npy

# Track with Git LFS
git lfs track "tests/data/large_dataset_large.npy"

# Commit (along with .gitattributes)
git add .gitattributes tests/data/large_dataset_large.npy
git commit -m "Add large test dataset (Git LFS)"
git push
```
Check LFS files
```bash
# List files tracked by LFS
git lfs ls-files

# Show the status of LFS files in the working tree
git lfs status
```
Best practices
1. Dynamically generate small data
```python
# ✅ Good example: dynamic generation
@pytest.fixture
def small_test_data():
    return np.random.randn(100, 10)

# ❌ Bad example: saving small data to a file
@pytest.fixture
def small_test_data():
    return np.load("tests/data/small_data.npy")  # Unnecessary
```
2. Git LFS for large data
```python
# ✅ Good example: load from Git LFS
@pytest.fixture
def large_pretrained_model():
    model_file = Path(__file__).parent / "data" / "pretrained_model.pt"
    if model_file.exists():
        import inspect

        # Use weights_only=True when the installed torch version supports it
        load_kwargs = {"map_location": "cpu"}
        try:
            sig = inspect.signature(torch.load)
            if "weights_only" in sig.parameters:
                load_kwargs["weights_only"] = True
        except Exception:
            pass
        return torch.load(model_file, **load_kwargs)
    pytest.skip("Model not available")

# ❌ Bad example: dynamically generating large data (slow)
@pytest.fixture
def large_pretrained_model():
    # Trains from scratch every time (takes minutes)
    return train_large_model()
```
3. Use temporary directory
```python
# ✅ Good example: use test_data_dir
def test_file_writing(test_data_dir):
    output_file = test_data_dir / "output.txt"
    output_file.write_text("test")
    # Cleaned up automatically

# ❌ Bad example: write to the current directory
def test_file_writing():
    with open("output.txt", "w") as f:
        f.write("test")
    # Cleanup is easy to forget
```
4. Reusing Fixtures
```python
# ✅ Good example: combine existing fixtures
@pytest.fixture
def prepared_dataset(dummy_image_data, test_data_dir):
    images = dummy_image_data['tensor']
    # Preprocessing
    normalized = (images - images.mean()) / images.std()
    return normalized

# ❌ Bad example: implementing everything from scratch
@pytest.fixture
def prepared_dataset(test_data_dir):
    # Regenerates the images from scratch
    images = torch.rand(4, 1, 28, 28)  # Duplicates dummy_image_data
    normalized = (images - images.mean()) / images.std()
    return normalized
```
5. Conditional Skip
```python
# ✅ Good example: skip if the LFS file is missing
@pytest.fixture
def real_world_data():
    data_file = Path(__file__).parent / "data" / "real_data.npy"
    if not data_file.exists():
        pytest.skip("Real data not available (run 'git lfs pull')")
    return np.load(data_file)

# ❌ Bad example: error if the file does not exist
@pytest.fixture
def real_world_data():
    data_file = Path(__file__).parent / "data" / "real_data.npy"
    return np.load(data_file)  # Raises FileNotFoundError
```
CI/CD considerations
GitHub Actions
```yaml
name: Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout (no automatic LFS download)
        uses: actions/checkout@v3
        with:
          lfs: false

      - name: Pull LFS files (selective)
        run: |
          # Fetch only small files (to reduce LFS bandwidth cost)
          git lfs pull --include="tests/data/small_*.npy"

      - name: Run tests
        run: |
          # Skip tests that require LFS files
          pytest -m "not requires_lfs" tests/
```
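Filtering with -m "not requires_lfs" assumes the custom markers used in this guide (requires_lfs, requires_torch, dataset_size) are registered. A minimal sketch of that registration in conftest.py (the project may instead declare them in pytest.ini or pyproject.toml):
```python
# conftest.py (sketch): register the custom markers used in this guide
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "requires_lfs: test needs files fetched with 'git lfs pull'")
    config.addinivalue_line(
        "markers", "requires_torch: test needs PyTorch installed")
    config.addinivalue_line(
        "markers", "dataset_size(size): dataset size for large_test_dataset")
```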
Local development
```bash
# Clone without downloading LFS files (faster)
GIT_LFS_SKIP_SMUDGE=1 git clone <repo>
cd <repo>

# Fetch only the necessary LFS files
git lfs pull --include="tests/data/specific_file.npy"

# ...or fetch all LFS files
git lfs pull
```
Troubleshooting
LFS files are not downloaded
```bash
# Check the status of LFS files
git lfs ls-files

# Force a re-download
git lfs fetch --all
git lfs checkout
```
Test fails with "file not found"
```python
# Implement a skip in the fixture
@pytest.fixture
def optional_large_data():
    data_file = Path(__file__).parent / "data" / "large_data.npy"
    if not data_file.exists():
        pytest.skip("Large data not available. Run: git lfs pull")
    return np.load(data_file)
```
Storage quota exceeded
```bash
# Remove old LFS objects from the local cache
git lfs prune

# Stop tracking a specific file
git lfs untrack "tests/data/old_file.npy"
git rm tests/data/old_file.npy
git add .gitattributes
git commit -m "Remove old LFS file"
```