
# Feature 36: Automatic Recovery System

Author: Masahiro Aoki

Implementation date: February 20, 2026 Version: 1.0.0 Status: ✅ Implemented

## Overview

EvoSpikeNet's Automated Recovery System (Feature 36) combines AI-based anomaly detection and automated playbook execution to reduce mean time to recovery (MTTR) for system failures by 80%.

It continuously monitors system metrics in the background; when an anomaly is detected, it analyzes the root cause and automatically executes a predefined recovery playbook.


## Architecture

```
System metrics
      │
      ▼
┌─────────────────┐
│ AnomalyDetector  │  ← anomaly detection via Z-score + EWMA
│ (per metric)     │
└────────┬────────┘
         │ anomaly detected
         ▼
┌─────────────────┐
│ RootCauseAnalyzer│  ← rule-based root cause analysis
└────────┬────────┘
         │ FailureCategory
         ▼
┌─────────────────┐
│ Recovery         │  ← playbook selection & execution
│ Playbooks        │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Incident         │  ← incident recording & tracking
│ Tracking         │
└─────────────────┘
```

## Core components

### AnomalyDetector

Detects anomalies in metric streams using Z-scores and an EWMA (exponentially weighted moving average). The constructor takes only the window length; the thresholds and other parameters are class constants that can be overridden by subclassing.

```python
from evospikenet.auto_recovery import AnomalyDetector

# Use the default window of 60 samples.
# Z_THRESHOLD=3.0, EWMA_ALPHA=0.15, MIN_SAMPLES=10 are class attributes;
# subclass and override them if necessary.
detector = AnomalyDetector(window=60)

# Feed a metric value; returns True if the value is anomalous.
is_anomaly = detector.update("cpu_percent", 95.0)
```

Key attributes:

| Attribute | Default | Description |
|---|---|---|
| `AnomalyDetector.Z_THRESHOLD` | 3.0 | Z-score threshold for flagging an anomaly |
| `AnomalyDetector.EWMA_ALPHA` | 0.15 | EWMA smoothing factor |
| `AnomalyDetector.MIN_SAMPLES` | 10 | Minimum number of samples before anomalies are reported |
| `window` (constructor argument) | 60 | Number of samples kept in history |
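The Z-score + EWMA rule described above can be sketched independently. The class below is an illustrative stand-in for how such a detector might combine the two signals (EWMA-smoothed expectation, Z-score against recent history); it is not the actual `AnomalyDetector` source:

```python
import math
from collections import defaultdict, deque

class ZScoreEWMADetector:
    """Illustrative sketch: per-metric anomaly detection via Z-score + EWMA."""
    Z_THRESHOLD = 3.0
    EWMA_ALPHA = 0.15
    MIN_SAMPLES = 10

    def __init__(self, window=60):
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.ewma = {}

    def update(self, metric, value):
        hist = self.history[metric]
        # Smooth the expected value with an EWMA.
        prev = self.ewma.get(metric, value)
        self.ewma[metric] = self.EWMA_ALPHA * value + (1 - self.EWMA_ALPHA) * prev
        anomaly = False
        if len(hist) >= self.MIN_SAMPLES:
            mean = sum(hist) / len(hist)
            std = math.sqrt(sum((x - mean) ** 2 for x in hist) / len(hist))
            if std > 0:
                # Z-score of the deviation from the smoothed expectation.
                anomaly = abs(value - self.ewma[metric]) / std > self.Z_THRESHOLD
        hist.append(value)
        return anomaly
```

With a stable stream around 50%, a sudden jump to 95% produces a Z-score far above the threshold and is flagged; the first `MIN_SAMPLES` updates are never flagged because the baseline is not yet established.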

### RootCauseAnalyzer

Determines the failure category, a confidence score, and an explanation from the collected metrics.

```python
from evospikenet.auto_recovery import RootCauseAnalyzer, FailureCategory

analyzer = RootCauseAnalyzer()
category, confidence, explanation = analyzer.analyze({
    "cpu_percent": 95.0,
    "memory_percent": 45.0,
    "db_connected": True,
    "error_rate": 0.05,
})
# → (FailureCategory.CPU_OVERLOAD, 0.80, "CPU at 95.0%")
```

Failure categories and their criteria:

| Category | Description | Condition (`analyze` implementation) |
|---|---|---|
| `memory_exhaustion` | Memory exhaustion | `memory_percent` ≥ 85% |
| `oom_kill` | OOM-kill risk | `memory_percent` ≥ 95% |
| `cpu_overload` | CPU overload | `cpu_percent` ≥ 95% |
| `disk_full` | Insufficient disk space | `disk_percent` ≥ 90% |
| `database_error` | DB connection failure | `db_connected == False` |
| `zenoh_disconnect` | Zenoh disconnected | `zenoh_connected == False` |
| `model_crash` | Model crash | `model_ready == False` |
| `unknown` | Other/unknown | None of the above (e.g. elevated error rate) |
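A rule-based analyzer matching this table can be sketched as below. The rule order and all confidence values except the 0.80 shown in the example above are illustrative assumptions, not the actual implementation:

```python
from enum import Enum

class FailureCategory(Enum):
    MEMORY_EXHAUSTION = "memory_exhaustion"
    OOM_KILL = "oom_kill"
    CPU_OVERLOAD = "cpu_overload"
    DISK_FULL = "disk_full"
    DATABASE_ERROR = "database_error"
    ZENOH_DISCONNECT = "zenoh_disconnect"
    MODEL_CRASH = "model_crash"
    UNKNOWN = "unknown"

def analyze(metrics):
    """Apply the rules from the table, most severe condition first."""
    mem = metrics.get("memory_percent", 0.0)
    if mem >= 95:
        return FailureCategory.OOM_KILL, 0.90, f"Memory at {mem}%"
    if mem >= 85:
        return FailureCategory.MEMORY_EXHAUSTION, 0.80, f"Memory at {mem}%"
    cpu = metrics.get("cpu_percent", 0.0)
    if cpu >= 95:
        return FailureCategory.CPU_OVERLOAD, 0.80, f"CPU at {cpu}%"
    if metrics.get("disk_percent", 0.0) >= 90:
        return FailureCategory.DISK_FULL, 0.80, "Disk nearly full"
    if metrics.get("db_connected", True) is False:
        return FailureCategory.DATABASE_ERROR, 0.90, "DB connection lost"
    if metrics.get("zenoh_connected", True) is False:
        return FailureCategory.ZENOH_DISCONNECT, 0.90, "Zenoh connection lost"
    if metrics.get("model_ready", True) is False:
        return FailureCategory.MODEL_CRASH, 0.90, "Model not ready"
    return FailureCategory.UNKNOWN, 0.50, "No rule matched"
```

Checking the most severe condition first matters: a machine at 96% memory satisfies both the `oom_kill` and `memory_exhaustion` rules and should be classified as the former.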

### Recovery Playbooks

Defines an ordered list of automatic recovery actions to be attempted for each failure category.

| Recovery action | Description |
|---|---|
| `RESTART_SERVICE` | Restart the service |
| `RELOAD_MODEL` | Reload the model |
| `RESTORE_SNAPSHOT` | Restore from a snapshot |
| `SCALE_DOWN` | Reduce resource usage |
| `CLEAR_CACHE` | Clear caches |
| `RECONNECT_ZENOH` | Reconnect to Zenoh |
| `RECONNECT_DB` | Reconnect to the database |
| `FREE_DISK` | Free up disk space |
| `NOTIFY_ONLY` | Notify an operator only (for emergencies) |
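A playbook is simply an ordered list of these actions per failure category, tried in sequence. The sketch below shows the idea; only the `database_error` and `memory_exhaustion` sequences mirror the response examples in this document, the remaining entries and the stop-at-first-success policy are assumptions:

```python
# Illustrative playbook table (category -> ordered action list).
PLAYBOOKS = {
    "database_error": ["RECONNECT_DB", "NOTIFY_ONLY"],
    "memory_exhaustion": ["CLEAR_CACHE", "NOTIFY_ONLY"],
    "cpu_overload": ["SCALE_DOWN", "NOTIFY_ONLY"],
    "disk_full": ["FREE_DISK", "NOTIFY_ONLY"],
    "zenoh_disconnect": ["RECONNECT_ZENOH", "NOTIFY_ONLY"],
    "model_crash": ["RELOAD_MODEL", "RESTORE_SNAPSHOT", "NOTIFY_ONLY"],
    "unknown": ["NOTIFY_ONLY"],
}

def run_playbook(category, execute):
    """Try each action in order; stop at the first one that succeeds.

    `execute` is a callable(action) -> bool supplied by the caller.
    Returns the list of actions actually attempted.
    """
    taken = []
    for action in PLAYBOOKS.get(category, ["NOTIFY_ONLY"]):
        taken.append(action)
        if execute(action):
            break
    return taken
```

For example, if `RECONNECT_DB` fails, the runner falls through to `NOTIFY_ONLY`, which matches the `actions_taken` list in the trigger-endpoint response below.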

### AutoRecoveryEngine

Monitors metrics and manages the incident lifecycle in a background thread.

```python
from evospikenet.auto_recovery import auto_recovery_engine

# Start background monitoring (run automatically when the API server starts)
auto_recovery_engine.start()

# Report metrics manually (returns an incident if one was created)
incident = auto_recovery_engine.report_metrics(
    cpu_percent=45.0,
    memory_percent=60.0,
    disk_percent=55.0,
    db_connected=True,
    zenoh_connected=True,
    model_ready=True,
    error_rate=0.01,
)

# List incidents
incidents = auto_recovery_engine.get_incidents()

# Change incident status
auto_recovery_engine.acknowledge_incident("incident-id")
auto_recovery_engine.resolve_incident("incident-id", "Manually restarted the process")
```
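The monitor → diagnose → recover cycle that the engine runs in its background thread can be sketched as a minimal stand-in. All names here are illustrative, not EvoSpikeNet's API; the sampler, diagnoser, and playbook table are injected so the loop itself stays trivial:

```python
import uuid

class MiniRecoveryEngine:
    """Illustrative sketch of the monitor -> diagnose -> recover loop."""
    MONITORING_INTERVAL = 30  # seconds between samples in the real engine

    def __init__(self, sample_metrics, diagnose, playbooks):
        self.sample_metrics = sample_metrics  # () -> dict of current metrics
        self.diagnose = diagnose              # dict -> category string or None
        self.playbooks = playbooks            # category -> ordered action list
        self.incidents = []

    def tick(self):
        """One monitoring cycle: sample, diagnose, record an incident."""
        metrics = self.sample_metrics()
        category = self.diagnose(metrics)
        if category is None:
            return None
        incident = {
            "id": str(uuid.uuid4()),
            "category": category,
            "status": "open",
            "actions_taken": list(self.playbooks.get(category, ["NOTIFY_ONLY"])),
            "metrics_snapshot": metrics,
        }
        self.incidents.append(incident)
        return incident

    def run(self, stop_event, interval=None):
        # The real engine runs this loop in a daemon thread;
        # stop_event is a threading.Event used for clean shutdown.
        while not stop_event.is_set():
            self.tick()
            stop_event.wait(interval or self.MONITORING_INTERVAL)
```

A single `tick()` with an overloaded-CPU sampler produces an open incident carrying the playbook's action list and a snapshot of the triggering metrics.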

## REST API

### GET `/api/recovery/status`

Returns the current status of the recovery engine.

**Response example:**

```json
{
  "total_incidents": 12,
  "open_incidents": 1,
  "acknowledged_incidents": 0,
  "resolved_incidents": 9,
  "auto_resolved_incidents": 2,
  "mttr_seconds": 145.2,
  "monitoring_interval_seconds": 30,
  "enabled_categories": ["cpu_overload", "memory_exhaustion", "database_error"]
}
```

### GET `/api/recovery/incidents`

**Query parameters:** `status`, `severity`, `limit`

### GET `/api/recovery/incidents/{id}`

### POST `/api/recovery/incidents/{id}/acknowledge`

### POST `/api/recovery/incidents/{id}/resolve`

**Request body (optional):**

```json
{ "resolution_note": "Manually restarted the service" }
```

### POST `/api/recovery/trigger`

Manually trigger anomaly diagnosis based on the supplied metrics.

**Request body:**

```json
{
  "cpu_percent": 95.0,
  "memory_percent": 80.0,
  "disk_percent": 60.0,
  "db_connected": false,
  "zenoh_connected": true,
  "model_ready": true,
  "error_rate": 0.15
}
```

**Response (incident created):**

```json
{
  "status": "incident_created",
  "incident": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "category": "database_error",
    "severity": "critical",
    "status": "open",
    "actions_taken": ["RECONNECT_DB", "NOTIFY_ONLY"]
  }
}
```

**Response (no anomaly):**

```json
{ "status": "no_anomaly_detected" }
```

---

## Configuration

`config/settings.yaml` (optionally read by each project):

```yaml
auto_recovery:
  enabled: true
  monitoring_interval_seconds: 30   # AutoRecoveryEngine.MONITORING_INTERVAL
  state_file: "data/recovery/auto_recovery_state.json"  # AutoRecoveryEngine.STATE_FILE
  detector_window: 60              # AnomalyDetector history length
  thresholds:
    cpu_percent: 90.0
    memory_percent: 85.0
    disk_percent: 90.0
    error_rate: 0.1
```


## Incident structure

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "category": "memory_exhaustion",
  "severity": "high",
  "status": "open",
  "message": "Memory usage abnormally high: 92.3%",
  "detected_at": "2026-02-20T10:00:00Z",
  "resolved_at": null,
  "ttd_seconds": null,
  "ttr_seconds": null,
  "actions_taken": ["CLEAR_CACHE", "NOTIFY_ONLY"],
  "root_cause": "memory_exhaustion",
  "metrics_snapshot": {
    "memory_percent": 92.3,
    "cpu_percent": 45.0
  }
}
```

Status transitions:

```
open → acknowledged → resolved
  └─────────────────→ auto_resolved
```
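The `mttr_seconds` figure reported by the status endpoint can be derived from the `ttr_seconds` field of the incident records. A minimal sketch, assuming (as the field names above suggest) that resolved incidents carry a populated `ttr_seconds`:

```python
def mttr_seconds(incidents):
    """Mean time to recovery over resolved incidents.

    Illustrative sketch: averages ttr_seconds across incidents whose
    status is resolved or auto_resolved, skipping unresolved ones.
    """
    ttrs = [
        inc["ttr_seconds"]
        for inc in incidents
        if inc.get("status") in ("resolved", "auto_resolved")
        and inc.get("ttr_seconds") is not None
    ]
    return sum(ttrs) / len(ttrs) if ttrs else None
```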

---

## Tests

```bash
# unit test
pytest tests/unit/test_auto_recovery.py -v

# Integration test
pytest tests/integration/test_features_36_39_40_integration.py::TestAutoRecoveryEndpoints -v

# system test
pytest tests/system/test_features_36_39_40_system.py::TestE2EIncidentAuditFlow -v

# performance test
pytest tests/performance/test_features_36_39_40_performance.py::TestAutoRecoveryEnginePerformance -v
```