Individual Node Failures
This section covers failures confined to the host that SnowFox runs on. It describes how the node itself recovers from these failures; the system as a whole may need to take further, independent steps to recover from a lost node.
SnowFox’s process death
Component Daemon, Manager, or Spawner process.
Root cause A SnowFox process has died or been terminated, whether because of a bug, the kernel, or a human request.
Side effects
- Clients connected to the node will be disconnected.
- Cluster loses access to the node.
- Node monitoring and management through SnowFox is unavailable.
- Pending requests will time out.
- The cluster may react as if the entire host had failed.
Mitigation
- Processes started by SnowFox are not tied to the manager, so they are not affected by the failure.
- Once the node comes back online, it resumes monitoring those processes and updates its internal status.
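The recovery step above can be pictured with a minimal sketch. How SnowFox actually tracks the processes it started is not described here, so the sketch assumes a hypothetical mapping of process names to PIDs that survived the restart, and it assumes a POSIX host where signal 0 can be used to probe a PID. The names `is_alive` and `reconcile` are illustrative only.

```python
import os


def is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def reconcile(tracked: dict[str, int]) -> dict[str, str]:
    """Rebuild the status table from a (hypothetical) persisted name -> PID map."""
    return {
        name: ("running" if is_alive(pid) else "exited")
        for name, pid in tracked.items()
    }
```

A PID check alone cannot tell whether a PID has been reused by an unrelated process, so a real implementation would likely need an additional identity check, such as the process start time.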
Internal errors
Component Any
Root cause
- Bugs in the code.
- Unexpected behaviour from other components.
- User error.
- Others …
Side effects
- Features may not work as expected.
- To prevent node failures, most errors that are not explicitly handled are ignored.
- Processes may not be started or terminated.
- Requests may never return (neither successfully nor unsuccessfully).
Mitigation
- Timeouts should be used to protect against missed events from components.
- Retries and timers should be used to provide eventual consistency.
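One way to guard against requests that never return is to give every in-flight request a deadline and sweep the table on a timer. The sketch below is an illustration under assumptions, not SnowFox code: the class `PendingRequests`, its method names, and the string request IDs are all hypothetical.

```python
import time


class PendingRequests:
    """Track in-flight requests and expire those whose reply never arrives."""

    def __init__(self, timeout: float, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock  # injectable so the sweep can be tested without sleeping
        self._deadlines: dict[str, float] = {}

    def sent(self, request_id: str) -> None:
        """Record that a request was sent and when it must be answered by."""
        self._deadlines[request_id] = self._clock() + self._timeout

    def completed(self, request_id: str) -> bool:
        """Record a reply; return False if the request is unknown or already expired."""
        return self._deadlines.pop(request_id, None) is not None

    def expire(self) -> list[str]:
        """Remove and return the IDs of requests whose deadline has passed."""
        now = self._clock()
        expired = [rid for rid, deadline in self._deadlines.items() if deadline <= now]
        for rid in expired:
            del self._deadlines[rid]
        return expired
```

Calling `expire()` from a periodic timer turns a missed event into an explicit timeout error that the caller can handle or retry. A monotonic clock is used so that wall-clock adjustments do not shorten or extend deadlines.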
Local service failure
Component Agents
Root cause
- Bug in the agent.
- Unexpected response from the agent.
- User or kernel termination of the agent.
Side effects
- Features may become unavailable.
- Requests may time out.
Mitigation
- Timeouts should be used to protect against missed events from components.
- Retries and timers should be used to provide eventual consistency.
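For calls to a local agent, a timeout and a retry can be combined in a single helper. The following is a minimal sketch assuming an asyncio-based caller; `call_with_retry`, its parameters, and the choice to retry on `ConnectionError` are assumptions made for illustration and do not describe an existing SnowFox API.

```python
import asyncio


async def call_with_retry(make_request, *, timeout=1.0, attempts=3, backoff=0.1):
    """Await make_request() with a per-attempt timeout, retrying on failure.

    make_request is a zero-argument callable returning a fresh coroutine,
    because a coroutine object cannot be awaited twice.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            # wait_for cancels the request if it does not finish in time
            return await asyncio.wait_for(make_request(), timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # exponential backoff between attempts
                await asyncio.sleep(backoff * 2 ** attempt)
    raise last_error
```

Retrying is only safe when the request is idempotent, since a request that timed out may still have been processed by the agent.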