Resilience & Crash Handling

The FirstBreath Vision system is designed to be resilient to network failures, camera crashes, and container restarts. This page details the crash-handling and auto-recovery mechanisms implemented in the camera-manager service.

Architecture

The resilience logic is centralized in the Camera Manager, which orchestrates the lifecycle of camera connections (threads) whether running in Batch or Distributed mode.

Key Mechanisms

1. Health Monitoring Loop

A dedicated background thread in manager.py polls the status of all active camera threads every 10 seconds. An internal state machine decides when to intervene, which prevents flapping (repeated restarts for transient issues).

  • Event-Driven Updates: To minimize database I/O, the running_scripts table is ONLY updated when the camera status changes (e.g., RUNNING -> CRASHED).
  • Timestamping: When a change occurs, status_updated_at is set to NOW(). This timestamp lets the UI show how long the camera has been in its current state.
  • No Periodic Heartbeat: We deliberately avoid writing to the DB while the status is stable ("no news is good news"), which significantly reduces database load.
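The event-driven update rule can be sketched as follows. This is a minimal illustration, not the actual manager.py code: an in-memory SQLite table stands in for the real running_scripts table, and report_status is a hypothetical helper name.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory stand-in for the real running_scripts table (assumed schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE running_scripts (camera_id TEXT PRIMARY KEY, "
           "status TEXT, status_updated_at TEXT)")
db.execute("INSERT INTO running_scripts VALUES ('cam1', 'RUNNING', NULL)")

_last_seen = {}  # camera_id -> last status observed by the monitoring loop

def report_status(camera_id, status):
    """Write to the DB only when the status actually changes."""
    if _last_seen.get(camera_id) == status:
        return False  # stable status: no DB write ("no news is good news")
    _last_seen[camera_id] = status
    db.execute(
        "UPDATE running_scripts SET status = ?, status_updated_at = ? "
        "WHERE camera_id = ?",
        (status, datetime.now(timezone.utc).isoformat(), camera_id),
    )
    return True

report_status("cam1", "RUNNING")   # first observation: one write
report_status("cam1", "RUNNING")   # unchanged: no write
report_status("cam1", "CRASHED")   # transition: one write, timestamp refreshed
```

Only transitions touch the database, so a fleet of stable cameras generates zero write traffic between polls.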

2. Auto-Restart Strategy

The system automatically recovers from various failure modes:

| Failure Mode | Detection Logic | Action |
| --- | --- | --- |
| Thread Crash | CameraReader thread catches the exception and sets status='CRASHED' | Immediate Remove + Add of the camera |
| Stalled Stream | Status is RUNNING but last_frame_time > 60s | Systematic restart (assumes a frozen connection) |
| Network Flap | Status is RECONNECTING | < 30s: no action (debounce)<br/>> 30s: mark as CRASHED in DB<br/>> 60s: force restart |
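The decision logic above can be condensed into a single pure function. This is a sketch under the thresholds stated in the table; decide_action and its parameter names are illustrative, not the actual manager.py API.

```python
STALL_LIMIT = 60          # seconds without a frame before a RUNNING camera is restarted
DEBOUNCE = 30             # reconnect window in which we take no action
FORCE_RESTART_AFTER = 60  # reconnect window after which we force a restart

def decide_action(status, seconds_in_state, seconds_since_frame):
    """Map a camera's observed state onto a recovery action."""
    if status == "CRASHED":
        return "restart"                 # immediate Remove + Add
    if status == "RUNNING" and seconds_since_frame > STALL_LIMIT:
        return "restart"                 # stalled stream: assume a frozen connection
    if status == "RECONNECTING":
        if seconds_in_state > FORCE_RESTART_AFTER:
            return "restart"
        if seconds_in_state > DEBOUNCE:
            return "mark_crashed"        # persist CRASHED so the UI reflects it
        return "wait"                    # debounce: brief network flaps heal themselves
    return "none"
```

Keeping the policy in one pure function makes the escalation path (wait -> mark_crashed -> restart) easy to unit-test against the thresholds.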

3. Startup Recovery (Persistence)

The camera-manager service is stateless in memory but stateful via the database.

  • On Startup: The service runs load_snapshot().
  • Logic: It queries the running_scripts table for any camera that is marked as running OR crashed.
  • Effect: If the container was restarted (upgrade, crash, manual restart), all previously active cameras are automatically re-initialized.
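A minimal sketch of this query, using an in-memory SQLite table as a stand-in for the real database (the ORDER BY is added here only to make the result deterministic):

```python
import sqlite3

# Stand-in for the persisted running_scripts table (assumed schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE running_scripts (camera_id TEXT, status TEXT)")
db.executemany("INSERT INTO running_scripts VALUES (?, ?)",
               [("cam1", "running"), ("cam2", "stopped"), ("cam3", "crashed")])

def load_snapshot(conn):
    """Return the cameras to re-initialize after a container restart."""
    rows = conn.execute(
        "SELECT camera_id FROM running_scripts "
        "WHERE status IN ('running', 'crashed') ORDER BY camera_id"
    )
    return [r[0] for r in rows]

load_snapshot(db)  # -> ['cam1', 'cam3']; 'cam2' was stopped by the user and stays down
```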

4. Graceful Shutdown

To support Startup Recovery effectively, we must know which cameras should be running.

  • On SIGTERM (Docker Stop): A signal handler intercepts the shutdown request.
  • Action: It executes UPDATE running_scripts SET status='crashed' WHERE status='running'.
  • Why?: Marking them as crashed ensures they are picked up by the Startup Recovery logic on the next boot. Cameras explicitly stopped by the user (stopped status) remain stopped.
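The shutdown path can be sketched like this; the table is again an in-memory SQLite stand-in, and on_sigterm is an illustrative handler name:

```python
import signal
import sqlite3

# Stand-in for the persisted running_scripts table (assumed schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE running_scripts (camera_id TEXT, status TEXT)")
db.executemany("INSERT INTO running_scripts VALUES (?, ?)",
               [("cam1", "running"), ("cam2", "stopped")])

def on_sigterm(signum=None, frame=None):
    """Demote every 'running' camera to 'crashed' so startup recovery revives it."""
    db.execute("UPDATE running_scripts SET status='crashed' WHERE status='running'")
    db.commit()

try:
    # `docker stop` sends SIGTERM; registration is only possible on the main thread.
    signal.signal(signal.SIGTERM, on_sigterm)
except ValueError:
    pass
```

Note that user-stopped cameras ('stopped') are untouched by the UPDATE, so they stay down across the restart.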

Manual Control

  • Start: API sends Redis start -> Manager adds camera -> DB set to running.
  • Stop: API sends Redis stop -> Manager removes camera -> DB set to stopped.
  • Result: A manually stopped camera will not automatically restart on container reboot, which is the expected behavior.
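The start/stop flow can be sketched as a small dispatcher. Redis, the Manager, and the DB are stubbed with plain dicts here; handle_command is an illustrative name, not the actual manager.py API.

```python
db_status = {}        # camera_id -> status persisted in running_scripts
active_threads = {}   # camera_id -> stand-in for a live CameraReader thread

def handle_command(command, camera_id):
    """React to a 'start'/'stop' Redis message as the manager would."""
    if command == "start":
        active_threads[camera_id] = object()  # stand-in for spawning a thread
        db_status[camera_id] = "running"
    elif command == "stop":
        active_threads.pop(camera_id, None)
        db_status[camera_id] = "stopped"      # 'stopped' survives reboots: no auto-restart

handle_command("start", "cam1")
handle_command("stop", "cam1")
```

Because startup recovery only re-initializes 'running'/'crashed' rows, the 'stopped' status written here is exactly what keeps a manually stopped camera down after a reboot.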