# Feature 36: Automatic Recovery System

Copyright: 2026 Moonlight Technologies Inc.
Author: Masahiro Aoki
Implementation date: February 20, 2026
Version: 1.0.0
Status: ✅ Implemented

## Overview
EvoSpikeNet's Automatic Recovery System (Feature 36) combines AI-based anomaly detection with automated playbook execution to reduce the mean time to recovery (MTTR) for system failures by 80%.
It continuously monitors system metrics in the background. When an anomaly is detected, it analyzes the root cause and automatically executes a predefined recovery playbook.
## Architecture

```
   System metrics
          │
          ▼
┌───────────────────┐
│  AnomalyDetector  │ ← anomaly detection via Z-score + EWMA
│   (per metric)    │
└─────────┬─────────┘
          │ anomaly detected
          ▼
┌───────────────────┐
│ RootCauseAnalyzer │ ← rule-based root cause analysis
└─────────┬─────────┘
          │ FailureCategory
          ▼
┌───────────────────┐
│     Recovery      │ ← playbook selection and execution
│     Playbooks     │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Incident      │ ← incident recording and tracking
│     Tracking      │
└───────────────────┘
```
## Core components

### AnomalyDetector

Detects anomalies in metrics using Z-scores and an EWMA (exponentially weighted moving average). The constructor takes only the window length; thresholds and other parameters are class constants.

```python
from evospikenet.auto_recovery import AnomalyDetector

# Use the default window of 60 samples.
# Z_THRESHOLD=3.0, EWMA_ALPHA=0.15, and MIN_SAMPLES=10 are class attributes;
# subclass AnomalyDetector and override them if necessary.
detector = AnomalyDetector(window=60)

# Feed a metric value; update() returns True if the value is anomalous.
is_anomaly = detector.update("cpu_percent", 95.0)
```
Key properties:

| Attribute | Default | Description |
|---|---|---|
| `AnomalyDetector.Z_THRESHOLD` | 3.0 | Z-score threshold for flagging an anomaly |
| `AnomalyDetector.EWMA_ALPHA` | 0.15 | EWMA smoothing factor |
| `AnomalyDetector.MIN_SAMPLES` | 10 | Minimum number of samples required before anomalies are flagged |
| `window` (constructor argument) | 60 | Number of samples kept in the history |
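
To make the detection rule concrete, here is a minimal sketch of a per-metric detector that combines an EWMA baseline with a Z-score test. It is not the actual `AnomalyDetector` implementation. The class name `SketchDetector`, the choice to measure each new value against the EWMA, and the use of the window's population standard deviation are all assumptions; only the three constants, the `window` argument, and the `update(name, value)` call shape come from the description above.

```python
from collections import defaultdict, deque
from statistics import pstdev


class SketchDetector:
    """Hypothetical illustration; not the real AnomalyDetector."""

    Z_THRESHOLD = 3.0   # flag values more than 3 standard deviations away
    EWMA_ALPHA = 0.15   # weight given to the newest sample
    MIN_SAMPLES = 10    # no verdict until this many samples are stored

    def __init__(self, window: int = 60) -> None:
        # One bounded history and one EWMA value per metric name.
        self._history = defaultdict(lambda: deque(maxlen=window))
        self._ewma = {}

    def update(self, name: str, value: float) -> bool:
        history = self._history[name]
        is_anomaly = False
        if len(history) >= self.MIN_SAMPLES:
            spread = pstdev(history)
            # Assumption: a perfectly flat history never flags an anomaly.
            if spread > 0:
                z_score = abs(value - self._ewma[name]) / spread
                is_anomaly = z_score > self.Z_THRESHOLD
        # Fold the new sample into the baseline only after it was judged.
        previous = self._ewma.get(name, value)
        self._ewma[name] = self.EWMA_ALPHA * value + (1 - self.EWMA_ALPHA) * previous
        history.append(value)
        return is_anomaly


detector = SketchDetector(window=60)
baseline = [50.0, 51.0, 49.0, 50.5, 49.5] * 4  # 20 stable samples
warmup_flags = [detector.update("cpu_percent", v) for v in baseline]
spike_flag = detector.update("cpu_percent", 95.0)
```

In this sketch the stable samples are never flagged, while the jump to 95.0 is, because it lies far more than three standard deviations from the smoothed baseline.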
### RootCauseAnalyzer

Determines the failure category, a confidence level, and a description from the collected metrics.

```python
from evospikenet.auto_recovery import RootCauseAnalyzer, FailureCategory

analyzer = RootCauseAnalyzer()
category, confidence, explanation = analyzer.analyze({
    "cpu_percent": 95.0,
    "memory_percent": 45.0,
    "db_connected": True,
    "error_rate": 0.05,
})
# → (FailureCategory.CPU_OVERLOAD, 0.80, "CPU at 95.0%")
```
Failure categories and their detection criteria:

| Category | Description | Condition (as implemented in `analyze`) |
|---|---|---|
| `memory_exhaustion` | Memory exhaustion | `memory_percent` ≥ 85% |
| `oom_kill` | OOM kill risk | `memory_percent` ≥ 95% |
| `cpu_overload` | CPU overload | `cpu_percent` ≥ 95% |
| `disk_full` | Insufficient disk space | `disk_percent` ≥ 90% |
| `database_error` | DB connection failure | `db_connected == False` |
| `zenoh_disconnect` | Zenoh disconnect | `zenoh_connected == False` |
| `model_crash` | Model crash | `model_ready == False` |
| `unknown` | Other/Unknown | None of the above, for example when only the error rate is high |
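
The table can be read as an ordered rule list. The sketch below is one way to express it, not the actual `RootCauseAnalyzer`. The `Category` enum, the `classify` function, and in particular the rule order are assumptions: connectivity flags are checked first and `oom_kill` before `memory_exhaustion`, so that the 95% rule is not shadowed by the 85% rule. That order happens to agree with the two examples in this document (CPU at 95% yields `cpu_overload`; a lost DB connection with CPU at 95% yields `database_error`), but the real precedence is not stated.

```python
from enum import Enum
from typing import Any, Callable, Dict, List, Tuple


class Category(Enum):
    """Hypothetical stand-in for FailureCategory."""
    MEMORY_EXHAUSTION = "memory_exhaustion"
    OOM_KILL = "oom_kill"
    CPU_OVERLOAD = "cpu_overload"
    DISK_FULL = "disk_full"
    DATABASE_ERROR = "database_error"
    ZENOH_DISCONNECT = "zenoh_disconnect"
    MODEL_CRASH = "model_crash"
    UNKNOWN = "unknown"


Metrics = Dict[str, Any]

# Assumed precedence: the first matching rule wins.
RULES: List[Tuple[Category, Callable[[Metrics], bool]]] = [
    (Category.DATABASE_ERROR, lambda m: m.get("db_connected", True) is False),
    (Category.ZENOH_DISCONNECT, lambda m: m.get("zenoh_connected", True) is False),
    (Category.MODEL_CRASH, lambda m: m.get("model_ready", True) is False),
    (Category.OOM_KILL, lambda m: m.get("memory_percent", 0.0) >= 95.0),
    (Category.MEMORY_EXHAUSTION, lambda m: m.get("memory_percent", 0.0) >= 85.0),
    (Category.CPU_OVERLOAD, lambda m: m.get("cpu_percent", 0.0) >= 95.0),
    (Category.DISK_FULL, lambda m: m.get("disk_percent", 0.0) >= 90.0),
]


def classify(metrics: Metrics) -> Category:
    """Return the first matching category, or UNKNOWN if no rule matches."""
    for category, matches in RULES:
        if matches(metrics):
            return category
    return Category.UNKNOWN


cpu_case = classify({"cpu_percent": 95.0, "memory_percent": 45.0,
                     "db_connected": True, "error_rate": 0.05})
db_case = classify({"cpu_percent": 95.0, "db_connected": False})
```

Missing keys are treated as healthy here, which keeps partial metric reports from raising errors; a stricter analyzer might reject them instead.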
### Recovery Playbooks

A playbook defines the ordered list of automatic recovery actions to attempt for each failure category.

| Recovery action | Description |
|---|---|
| `RESTART_SERVICE` | Restart the service |
| `RELOAD_MODEL` | Reload the model |
| `RESTORE_SNAPSHOT` | Restore from a snapshot |
| `SCALE_DOWN` | Scale down resources |
| `CLEAR_CACHE` | Clear the cache |
| `RECONNECT_ZENOH` | Reconnect to Zenoh |
| `RECONNECT_DB` | Reconnect to the database |
| `FREE_DISK` | Free up disk space |
| `NOTIFY_ONLY` | Notify the operator only (for emergencies) |
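
As an illustration of "ordered list of actions per category", the following sketch runs a playbook until one action reports success. It is a hypothetical model, not the shipped playbook engine. The two playbook entries mirror the `actions_taken` lists in the incident examples in this document; the `PLAYBOOKS` and `DEFAULT_PLAYBOOK` names, the handler interface (a callable returning `True` on success), and the stop-at-first-success policy are assumptions.

```python
from typing import Callable, Dict, List

# Hypothetical playbook table keyed by failure category.
PLAYBOOKS: Dict[str, List[str]] = {
    "database_error": ["RECONNECT_DB", "NOTIFY_ONLY"],
    "memory_exhaustion": ["CLEAR_CACHE", "NOTIFY_ONLY"],
}
# Assumed fallback for categories without a dedicated playbook.
DEFAULT_PLAYBOOK: List[str] = ["NOTIFY_ONLY"]


def run_playbook(category: str, handlers: Dict[str, Callable[[], bool]]) -> List[str]:
    """Attempt actions in order, stop after the first success, return what was tried."""
    attempted: List[str] = []
    for action in PLAYBOOKS.get(category, DEFAULT_PLAYBOOK):
        attempted.append(action)
        handler = handlers.get(action)
        # An action without a registered handler counts as a failed attempt.
        if handler is not None and handler():
            break
    return attempted


handlers = {
    "RECONNECT_DB": lambda: False,  # simulate a reconnect that fails
    "NOTIFY_ONLY": lambda: True,    # notifying the operator always succeeds
}
actions_taken = run_playbook("database_error", handlers)
```

With the failing reconnect simulated above, both actions are attempted in order, which is the shape of the `actions_taken` field in the incident record.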
### AutoRecoveryEngine

Monitors metrics and manages incident lifecycles in a background thread.

```python
# Assumed import path (same module as AnomalyDetector); adjust to your package layout.
from evospikenet.auto_recovery import auto_recovery_engine

# Start the engine (runs automatically when the API server starts)
auto_recovery_engine.start()

# Report metrics manually (returns an incident if one is created)
incident = auto_recovery_engine.report_metrics(
    cpu_percent=45.0,
    memory_percent=60.0,
    disk_percent=55.0,
    db_connected=True,
    zenoh_connected=True,
    model_ready=True,
    error_rate=0.01,
)

# Get the incident list
incidents = auto_recovery_engine.get_incidents()

# Change an incident's status
auto_recovery_engine.acknowledge_incident("incident-id")
auto_recovery_engine.resolve_incident("incident-id", "Restarted the process manually")
```
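
For readers unfamiliar with the background-thread pattern, here is a minimal sketch of a polling loop that can be started and stopped cleanly. It does not reproduce `AutoRecoveryEngine`: the `SketchEngine` class, its `collect` and `on_metrics` callbacks, and the `stop()` method are hypothetical, and the 30-second default simply echoes the documented monitoring interval.

```python
import threading
from typing import Any, Callable, Dict, List, Optional

Metrics = Dict[str, Any]


class SketchEngine:
    """Hypothetical polling loop; not the real AutoRecoveryEngine."""

    def __init__(self, collect: Callable[[], Metrics],
                 on_metrics: Callable[[Metrics], None],
                 interval: float = 30.0) -> None:
        self._collect = collect
        self._on_metrics = on_metrics
        self._interval = interval
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def start(self) -> None:
        # Calling start() twice must not spawn a second loop.
        if self._thread is not None and self._thread.is_alive():
            return
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            self._on_metrics(self._collect())
            # Event.wait acts as a sleep that stop() can interrupt.
            self._stop.wait(self._interval)


seen: List[Metrics] = []
first_sample = threading.Event()


def record(metrics: Metrics) -> None:
    seen.append(metrics)
    first_sample.set()


engine = SketchEngine(collect=lambda: {"cpu_percent": 45.0},
                      on_metrics=record, interval=0.01)
engine.start()
first_sample.wait(timeout=2.0)
engine.stop()
```

Using `Event.wait` instead of `time.sleep` lets `stop()` return promptly even when the interval is long.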
## REST API

### GET `/api/recovery/status`

Returns the current status of the recovery engine.

**Response example:**

```json
{
  "total_incidents": 12,
  "open_incidents": 1,
  "acknowledged_incidents": 0,
  "resolved_incidents": 9,
  "auto_resolved_incidents": 2,
  "mttr_seconds": 145.2,
  "monitoring_interval_seconds": 30,
  "enabled_categories": ["cpu_overload", "memory_exhaustion", "database_error"]
}
```

### GET `/api/recovery/incidents`
**Query parameters:** `status`, `severity`, `limit`
### GET `/api/recovery/incidents/{id}`
### POST `/api/recovery/incidents/{id}/acknowledge`
### POST `/api/recovery/incidents/{id}/resolve`
**Request body (optional):**

```json
{ "resolution_note": "Restarted the service manually" }
```

### POST `/api/recovery/trigger`

Manually triggers anomaly diagnosis based on the supplied metrics.

**Request body:**

```json
{
  "cpu_percent": 95.0,
  "memory_percent": 80.0,
  "disk_percent": 60.0,
  "db_connected": false,
  "zenoh_connected": true,
  "model_ready": true,
  "error_rate": 0.15
}
```

**Response (with incident):**

```json
{
  "status": "incident_created",
  "incident": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "category": "database_error",
    "severity": "critical",
    "status": "open",
    "actions_taken": ["RECONNECT_DB", "NOTIFY_ONLY"]
  }
}
```

**Response (normal):**

```json
{ "status": "no_anomaly_detected" }
```

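
A client call to the trigger endpoint might look like the following sketch, which uses only the Python standard library. The base URL `http://localhost:8000` and the helper names `build_trigger_request` and `trigger` are assumptions; only the path, the HTTP method, and the JSON field names come from the endpoint description above.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:8000"  # hypothetical; use your deployment's address


def build_trigger_request(metrics: Dict[str, Any],
                          base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/recovery/trigger."""
    body = json.dumps(metrics).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/recovery/trigger",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger(metrics: Dict[str, Any], base_url: str = BASE_URL,
            timeout: float = 5.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response (requires a running server)."""
    request = build_trigger_request(metrics, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request needs no network access.
request = build_trigger_request({"cpu_percent": 95.0, "db_connected": False})
```

A caller would then branch on the `status` field of the result, which is either `incident_created` or `no_anomaly_detected` according to the examples above.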
---
## Configuration

`config/settings.yaml` (read optionally by each project):

```yaml
auto_recovery:
  enabled: true
  monitoring_interval_seconds: 30  # AutoRecoveryEngine.MONITORING_INTERVAL
  state_file: "data/recovery/auto_recovery_state.json"  # AutoRecoveryEngine.STATE_FILE
  detector_window: 60  # AnomalyDetector history length
  thresholds:
    cpu_percent: 90.0
    memory_percent: 85.0
    disk_percent: 90.0
    error_rate: 0.1
```

## Incident structure

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "category": "memory_exhaustion",
  "severity": "high",
  "status": "open",
  "message": "Memory usage abnormally high: 92.3%",
  "detected_at": "2026-02-20T10:00:00Z",
  "resolved_at": null,
  "ttd_seconds": null,
  "ttr_seconds": null,
  "actions_taken": ["CLEAR_CACHE", "NOTIFY_ONLY"],
  "root_cause": "memory_exhaustion",
  "metrics_snapshot": {
    "memory_percent": 92.3,
    "cpu_percent": 45.0
  }
}
```

**Status transitions:**

```
open → acknowledged → resolved
  └──────────────────→ auto_resolved
```

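
The status diagram can be captured as a small transition table. The sketch below is an illustration under assumptions, not the engine's code: the `TRANSITIONS` table follows the diagram literally (so a direct `open` to `resolved` step is rejected, which the real API may allow), and `ttr_seconds` is assumed to be the time from `detected_at` to resolution.

```python
from datetime import datetime, timezone
from typing import Any, Dict

# Allowed status changes, read directly off the diagram.
TRANSITIONS = {
    "open": {"acknowledged", "auto_resolved"},
    "acknowledged": {"resolved"},
    "resolved": set(),
    "auto_resolved": set(),
}


def transition(incident: Dict[str, Any], new_status: str, at: datetime) -> Dict[str, Any]:
    """Return an updated copy of the incident, or raise on an illegal transition."""
    current = incident["status"]
    if new_status not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new_status}")
    updated = dict(incident, status=new_status)
    if new_status in ("resolved", "auto_resolved"):
        # fromisoformat() rejects a trailing "Z" before Python 3.11, so normalize it.
        detected = datetime.fromisoformat(incident["detected_at"].replace("Z", "+00:00"))
        updated["resolved_at"] = at.isoformat().replace("+00:00", "Z")
        updated["ttr_seconds"] = (at - detected).total_seconds()  # assumed definition
    return updated


incident = {
    "status": "open",
    "detected_at": "2026-02-20T10:00:00Z",
    "resolved_at": None,
    "ttr_seconds": None,
}
acknowledged = transition(incident, "acknowledged",
                          datetime(2026, 2, 20, 10, 1, 0, tzinfo=timezone.utc))
resolved = transition(acknowledged, "resolved",
                      datetime(2026, 2, 20, 10, 2, 25, tzinfo=timezone.utc))
```

Averaging `ttr_seconds` over closed incidents would give an MTTR figure comparable to the `mttr_seconds` field of the status endpoint, assuming that field is computed the same way.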
---
## Test
```bash
# Unit tests
pytest tests/unit/test_auto_recovery.py -v

# Integration tests
pytest tests/integration/test_features_36_39_40_integration.py::TestAutoRecoveryEndpoints -v

# System tests
pytest tests/system/test_features_36_39_40_system.py::TestE2EIncidentAuditFlow -v

# Performance tests
pytest tests/performance/test_features_36_39_40_performance.py::TestAutoRecoveryEnginePerformance -v
```

## Related documents

- SDK API Reference - Feature 36 (`auto_recovery_sdk.py`)
- Audit Log
- Geographically distributed node management