Individual Node Failures | SnowFox

This section covers failures limited to the host SnowFox runs on. It describes how the node itself recovers from these failures; the rest of the system may need to take further, independent steps to recover from a lost node.

SnowFox’s process death

Component Daemon, Manager, or Spawner process.

Root cause A SnowFox process has died or been terminated (whether by a bug, the kernel, or a human request).

Side effects

Clients connected to the node will be disconnected.
Cluster loses access to the node.
Node monitoring and management through SnowFox is unavailable.
Pending requests will time out.
The cluster may react assuming the entire host failed.

Mitigation

Processes started by SnowFox are not tied to the manager.
As a result, they are not affected by the failure.
Once the node comes back online, it resumes monitoring the processes and updates its internal status.
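
The recovery step described above can be sketched as follows. This is a minimal illustration, not SnowFox's actual implementation: it assumes a POSIX host and that the manager persisted a mapping of process names to PIDs before it died. The names `is_alive`, `reconcile`, and `tracked` are hypothetical.

```python
import os


def is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        # Signal 0 sends nothing; it only checks that the PID can be signalled.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True


def reconcile(tracked: dict[str, int]) -> dict[str, str]:
    """Rebuild the status table for processes started before the restart.

    `tracked` is an assumed persisted mapping of process name to PID.
    """
    return {
        name: "running" if is_alive(pid) else "exited"
        for name, pid in tracked.items()
    }
```

A check based only on the PID can be fooled if the operating system reuses the PID for an unrelated process, so a real implementation would likely also compare something like the process start time.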

Internal errors

Component Any

Root cause

Bugs in the code.
Unexpected behaviour from other components.
User error.
Others …

Side effects

Features may not work as expected.
To prevent node failures, most errors that are not explicitly handled are ignored.
Processes may not be started or terminated.
Requests may never return (neither successfully nor unsuccessfully).

Mitigation

Timeouts should be used to protect against missed events from components.
Retries and timers should be used to provide eventual consistency.
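
One way to combine the two mitigations above is to bound every request with a timeout and retry it a limited number of times. The sketch below is an assumption about how a caller might do this, not part of SnowFox: `request_with_retry`, `send`, and all parameter names and defaults are hypothetical.

```python
import asyncio


async def request_with_retry(send, *, timeout_s=1.0, attempts=3, backoff_s=0.1):
    """Await send() with a per-attempt timeout, retrying with exponential backoff.

    `send` is a hypothetical zero-argument coroutine function that issues
    the request and returns its response.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            # wait_for cancels the pending request if it exceeds the timeout,
            # so a request that never returns cannot block the caller forever.
            return await asyncio.wait_for(send(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Back off before the next attempt: backoff_s, 2*backoff_s, ...
                await asyncio.sleep(backoff_s * 2 ** attempt)
    raise TimeoutError(f"request failed after {attempts} attempts") from last_error
```

Retrying is only safe when the request is idempotent, because a request that timed out may still have been processed.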

Local service failure

Component Agents

Root cause

Bug in the agent.
Unexpected response from the agent.
User or kernel termination of the agent.

Side effects

Features may become unavailable.
Requests may time out.

Mitigation

Timeouts should be used to protect against missed events from components.
Retries and timers should be used to provide eventual consistency.
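
For agents, the timer side of this mitigation could look like a periodic check that keeps running even when the agent misbehaves. The sketch below is illustrative only: `run_periodic`, `check`, and the parameters are hypothetical names, and it assumes `check` returns True once the agent's state matches what the node expects.

```python
import time
from typing import Callable


def run_periodic(
    check: Callable[[], bool],
    interval_s: float,
    max_rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call check() on a fixed timer until it reports success.

    Returns the round in which check() first returned True, or -1 if it
    never did within max_rounds.
    """
    for round_no in range(1, max_rounds + 1):
        try:
            if check():
                return round_no
        except Exception:
            # A crashed or misbehaving agent must not stop the timer;
            # the check is simply attempted again on the next tick.
            pass
        sleep(interval_s)
    return -1
```

Because a missed event is picked up on a later tick, the node's view converges eventually even if individual checks fail.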