
# Feature 36: Automatic Recovery System

Author: Masahiro Aoki

Implementation date: February 20, 2026 Version: 1.0.0 Status: ✅ Implemented

## Overview

EvoSpikeNet's Automated Recovery System (Feature 36) combines AI-based anomaly detection and automated playbook execution to reduce mean time to recovery (MTTR) for system failures by 80%.

It continuously monitors system metrics in the background; when an anomaly is detected, it analyzes the root cause and automatically executes a predefined recovery playbook.


## Architecture

```
System metrics
      │
      ▼
┌─────────────────┐
│ AnomalyDetector  │  ← anomaly detection via Z-score + EWMA
│ (per metric)     │
└────────┬────────┘
         │ anomaly detected
         ▼
┌─────────────────┐
│ RootCauseAnalyzer│  ← rule-based root cause analysis
└────────┬────────┘
         │ FailureCategory
         ▼
┌─────────────────┐
│ Recovery         │  ← playbook selection & execution
│ Playbooks        │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Incident         │  ← incident recording & tracking
│ Tracking         │
└─────────────────┘
```

## Core components

### AnomalyDetector

Detects anomalies in metric streams using Z-scores and an EWMA (exponentially weighted moving average). The constructor takes only the window length; the thresholds and other parameters are class constants that can be overridden by subclassing.

```python
from evospikenet.auto_recovery import AnomalyDetector

# Use the default window of 60 samples.
# Z_THRESHOLD=3.0, EWMA_ALPHA=0.15, MIN_SAMPLES=10 are class attributes;
# subclass and override them if necessary.
detector = AnomalyDetector(window=60)

# Feed a metric value; returns True if the value is anomalous.
is_anomaly = detector.update("cpu_percent", 95.0)
```

Key attributes:

| Attribute | Default | Description |
|---|---|---|
| `AnomalyDetector.Z_THRESHOLD` | 3.0 | Z-score threshold for flagging an anomaly |
| `AnomalyDetector.EWMA_ALPHA` | 0.15 | EWMA smoothing factor |
| `AnomalyDetector.MIN_SAMPLES` | 10 | Minimum number of samples before anomalies are reported |
| `window` (constructor argument) | 60 | Number of samples kept in history |
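The Z-score + EWMA rule described above can be sketched independently. The class below is an illustrative stand-in for how such a detector might combine the two signals (EWMA-smoothed expectation, Z-score against recent history); it is not the actual `AnomalyDetector` source:

```python
import math
from collections import defaultdict, deque

class ZScoreEWMADetector:
    """Illustrative sketch: per-metric anomaly detection via Z-score + EWMA."""
    Z_THRESHOLD = 3.0
    EWMA_ALPHA = 0.15
    MIN_SAMPLES = 10

    def __init__(self, window=60):
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.ewma = {}

    def update(self, metric, value):
        hist = self.history[metric]
        # Smooth the expected value with an EWMA.
        prev = self.ewma.get(metric, value)
        self.ewma[metric] = self.EWMA_ALPHA * value + (1 - self.EWMA_ALPHA) * prev
        anomaly = False
        if len(hist) >= self.MIN_SAMPLES:
            mean = sum(hist) / len(hist)
            std = math.sqrt(sum((x - mean) ** 2 for x in hist) / len(hist))
            if std > 0:
                # Z-score of the deviation from the smoothed expectation.
                anomaly = abs(value - self.ewma[metric]) / std > self.Z_THRESHOLD
        hist.append(value)
        return anomaly
```

With a stable stream around 50%, a sudden jump to 95% produces a Z-score far above the threshold and is flagged; the first `MIN_SAMPLES` updates are never flagged because the baseline is not yet established.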

### RootCauseAnalyzer

Determines the failure category, a confidence score, and an explanation from the collected metrics.

```python
from evospikenet.auto_recovery import RootCauseAnalyzer, FailureCategory

analyzer = RootCauseAnalyzer()
category, confidence, explanation = analyzer.analyze({
    "cpu_percent": 95.0,
    "memory_percent": 45.0,
    "db_connected": True,
    "error_rate": 0.05,
})
# → (FailureCategory.CPU_OVERLOAD, 0.80, "CPU at 95.0%")
```

Failure categories and their criteria:

| Category | Description | Condition (`analyze` implementation) |
|---|---|---|
| `memory_exhaustion` | Memory exhaustion | `memory_percent` ≥ 85% |
| `oom_kill` | OOM-kill risk | `memory_percent` ≥ 95% |
| `cpu_overload` | CPU overload | `cpu_percent` ≥ 95% |
| `disk_full` | Insufficient disk space | `disk_percent` ≥ 90% |
| `database_error` | DB connection failure | `db_connected == False` |
| `zenoh_disconnect` | Zenoh disconnected | `zenoh_connected == False` |
| `model_crash` | Model crash | `model_ready == False` |
| `unknown` | Other/unknown | None of the above (e.g. elevated error rate) |
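A rule-based analyzer matching this table can be sketched as below. The rule order and all confidence values except the 0.80 shown in the example above are illustrative assumptions, not the actual implementation:

```python
from enum import Enum

class FailureCategory(Enum):
    MEMORY_EXHAUSTION = "memory_exhaustion"
    OOM_KILL = "oom_kill"
    CPU_OVERLOAD = "cpu_overload"
    DISK_FULL = "disk_full"
    DATABASE_ERROR = "database_error"
    ZENOH_DISCONNECT = "zenoh_disconnect"
    MODEL_CRASH = "model_crash"
    UNKNOWN = "unknown"

def analyze(metrics):
    """Apply the rules from the table, most severe condition first."""
    mem = metrics.get("memory_percent", 0.0)
    if mem >= 95:
        return FailureCategory.OOM_KILL, 0.90, f"Memory at {mem}%"
    if mem >= 85:
        return FailureCategory.MEMORY_EXHAUSTION, 0.80, f"Memory at {mem}%"
    cpu = metrics.get("cpu_percent", 0.0)
    if cpu >= 95:
        return FailureCategory.CPU_OVERLOAD, 0.80, f"CPU at {cpu}%"
    if metrics.get("disk_percent", 0.0) >= 90:
        return FailureCategory.DISK_FULL, 0.80, "Disk nearly full"
    if metrics.get("db_connected", True) is False:
        return FailureCategory.DATABASE_ERROR, 0.90, "DB connection lost"
    if metrics.get("zenoh_connected", True) is False:
        return FailureCategory.ZENOH_DISCONNECT, 0.90, "Zenoh connection lost"
    if metrics.get("model_ready", True) is False:
        return FailureCategory.MODEL_CRASH, 0.90, "Model not ready"
    return FailureCategory.UNKNOWN, 0.50, "No rule matched"
```

Checking the most severe condition first matters: a machine at 96% memory satisfies both the `oom_kill` and `memory_exhaustion` rules and should be classified as the former.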

### Recovery Playbooks

Defines an ordered list of automatic recovery actions to be attempted for each failure category.

| Recovery action | Description |
|---|---|
| `RESTART_SERVICE` | Restart the service |
| `RELOAD_MODEL` | Reload the model |
| `RESTORE_SNAPSHOT` | Restore from a snapshot |
| `SCALE_DOWN` | Reduce resource usage |
| `CLEAR_CACHE` | Clear caches |
| `RECONNECT_ZENOH` | Reconnect to Zenoh |
| `RECONNECT_DB` | Reconnect to the database |
| `FREE_DISK` | Free up disk space |
| `NOTIFY_ONLY` | Notify an operator only (for emergencies) |
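A playbook is simply an ordered list of these actions per failure category, tried in sequence. The sketch below shows the idea; only the `database_error` and `memory_exhaustion` sequences mirror the response examples in this document, the remaining entries and the stop-at-first-success policy are assumptions:

```python
# Illustrative playbook table (category -> ordered action list).
PLAYBOOKS = {
    "database_error": ["RECONNECT_DB", "NOTIFY_ONLY"],
    "memory_exhaustion": ["CLEAR_CACHE", "NOTIFY_ONLY"],
    "cpu_overload": ["SCALE_DOWN", "NOTIFY_ONLY"],
    "disk_full": ["FREE_DISK", "NOTIFY_ONLY"],
    "zenoh_disconnect": ["RECONNECT_ZENOH", "NOTIFY_ONLY"],
    "model_crash": ["RELOAD_MODEL", "RESTORE_SNAPSHOT", "NOTIFY_ONLY"],
    "unknown": ["NOTIFY_ONLY"],
}

def run_playbook(category, execute):
    """Try each action in order; stop at the first one that succeeds.

    `execute` is a callable(action) -> bool supplied by the caller.
    Returns the list of actions actually attempted.
    """
    taken = []
    for action in PLAYBOOKS.get(category, ["NOTIFY_ONLY"]):
        taken.append(action)
        if execute(action):
            break
    return taken
```

For example, if `RECONNECT_DB` fails, the runner falls through to `NOTIFY_ONLY`, which matches the `actions_taken` list in the trigger-endpoint response below.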

### AutoRecoveryEngine

Monitors metrics and manages the incident lifecycle in a background thread.

```python
from evospikenet.auto_recovery import auto_recovery_engine

# Start background monitoring (run automatically when the API server starts)
auto_recovery_engine.start()

# Report metrics manually (returns an incident if one was created)
incident = auto_recovery_engine.report_metrics(
    cpu_percent=45.0,
    memory_percent=60.0,
    disk_percent=55.0,
    db_connected=True,
    zenoh_connected=True,
    model_ready=True,
    error_rate=0.01,
)

# List incidents
incidents = auto_recovery_engine.get_incidents()

# Change incident status
auto_recovery_engine.acknowledge_incident("incident-id")
auto_recovery_engine.resolve_incident("incident-id", "Manually restarted the process")
```
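The monitor → diagnose → recover cycle that the engine runs in its background thread can be sketched as a minimal stand-in. All names here are illustrative, not EvoSpikeNet's API; the sampler, diagnoser, and playbook table are injected so the loop itself stays trivial:

```python
import uuid

class MiniRecoveryEngine:
    """Illustrative sketch of the monitor -> diagnose -> recover loop."""
    MONITORING_INTERVAL = 30  # seconds between samples in the real engine

    def __init__(self, sample_metrics, diagnose, playbooks):
        self.sample_metrics = sample_metrics  # () -> dict of current metrics
        self.diagnose = diagnose              # dict -> category string or None
        self.playbooks = playbooks            # category -> ordered action list
        self.incidents = []

    def tick(self):
        """One monitoring cycle: sample, diagnose, record an incident."""
        metrics = self.sample_metrics()
        category = self.diagnose(metrics)
        if category is None:
            return None
        incident = {
            "id": str(uuid.uuid4()),
            "category": category,
            "status": "open",
            "actions_taken": list(self.playbooks.get(category, ["NOTIFY_ONLY"])),
            "metrics_snapshot": metrics,
        }
        self.incidents.append(incident)
        return incident

    def run(self, stop_event, interval=None):
        # The real engine runs this loop in a daemon thread;
        # stop_event is a threading.Event used for clean shutdown.
        while not stop_event.is_set():
            self.tick()
            stop_event.wait(interval or self.MONITORING_INTERVAL)
```

A single `tick()` with an overloaded-CPU sampler produces an open incident carrying the playbook's action list and a snapshot of the triggering metrics.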

## REST API

### GET `/api/recovery/status`

Returns the current status of the recovery engine.

**Response example:**

```json
{
  "total_incidents": 12,
  "open_incidents": 1,
  "acknowledged_incidents": 0,
  "resolved_incidents": 9,
  "auto_resolved_incidents": 2,
  "mttr_seconds": 145.2,
  "monitoring_interval_seconds": 30,
  "enabled_categories": ["cpu_overload", "memory_exhaustion", "database_error"]
}
```

### GET `/api/recovery/incidents`

**Query parameters:** `status`, `severity`, `limit`

### GET `/api/recovery/incidents/{id}`

### POST `/api/recovery/incidents/{id}/acknowledge`

### POST `/api/recovery/incidents/{id}/resolve`

**Request body (optional):**

```json
{ "resolution_note": "Manually restarted the service" }
```

### POST `/api/recovery/trigger`

Manually trigger anomaly diagnosis based on the supplied metrics.

**Request body:**

```json
{
  "cpu_percent": 95.0,
  "memory_percent": 80.0,
  "disk_percent": 60.0,
  "db_connected": false,
  "zenoh_connected": true,
  "model_ready": true,
  "error_rate": 0.15
}
```

**Response (incident created):**

```json
{
  "status": "incident_created",
  "incident": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "category": "database_error",
    "severity": "critical",
    "status": "open",
    "actions_taken": ["RECONNECT_DB", "NOTIFY_ONLY"]
  }
}
```

**Response (no anomaly):**

```json
{ "status": "no_anomaly_detected" }
```

---

## Configuration

`config/settings.yaml` (optionally read by each project):

```yaml
auto_recovery:
  enabled: true
  monitoring_interval_seconds: 30   # AutoRecoveryEngine.MONITORING_INTERVAL
  state_file: "data/recovery/auto_recovery_state.json"  # AutoRecoveryEngine.STATE_FILE
  detector_window: 60              # AnomalyDetector history length
  thresholds:
    cpu_percent: 90.0
    memory_percent: 85.0
    disk_percent: 90.0
    error_rate: 0.1
```


## Incident structure

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "category": "memory_exhaustion",
  "severity": "high",
  "status": "open",
  "message": "Memory usage abnormally high: 92.3%",
  "detected_at": "2026-02-20T10:00:00Z",
  "resolved_at": null,
  "ttd_seconds": null,
  "ttr_seconds": null,
  "actions_taken": ["CLEAR_CACHE", "NOTIFY_ONLY"],
  "root_cause": "memory_exhaustion",
  "metrics_snapshot": {
    "memory_percent": 92.3,
    "cpu_percent": 45.0
  }
}
```

Status transitions:

```
open → acknowledged → resolved
  └─────────────────→ auto_resolved
```
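The `mttr_seconds` figure reported by the status endpoint can be derived from the `ttr_seconds` field of the incident records. A minimal sketch, assuming (as the field names above suggest) that resolved incidents carry a populated `ttr_seconds`:

```python
def mttr_seconds(incidents):
    """Mean time to recovery over resolved incidents.

    Illustrative sketch: averages ttr_seconds across incidents whose
    status is resolved or auto_resolved, skipping unresolved ones.
    """
    ttrs = [
        inc["ttr_seconds"]
        for inc in incidents
        if inc.get("status") in ("resolved", "auto_resolved")
        and inc.get("ttr_seconds") is not None
    ]
    return sum(ttrs) / len(ttrs) if ttrs else None
```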

---

## Tests

```bash
# unit test
pytest tests/unit/test_auto_recovery.py -v

# Integration test
pytest tests/integration/test_features_36_39_40_integration.py::TestAutoRecoveryEndpoints -v

# system test
pytest tests/system/test_features_36_39_40_system.py::TestE2EIncidentAuditFlow -v

# performance test
pytest tests/performance/test_features_36_39_40_performance.py::TestAutoRecoveryEnginePerformance -v
```