This section covers failures limited to the host that SnowFox runs on. It describes how the node itself can recover from these failures; the system as a whole may need to take further, independent steps to recover from a lost node.

SnowFox’s process death

Component Daemon, Manager, or Spawner process.

Root cause A SnowFox process has died or been terminated (whether by a bug, the kernel, or a human request).

Side effects

  • Clients connected to the node will be disconnected.
  • Cluster loses access to the node.
  • Node monitoring and management through SnowFox is unavailable.
  • Pending requests will time out.
  • The cluster may react assuming the entire host failed.

Mitigation

  • Processes started by SnowFox are not tied to the manager.
    As a result they will not be impacted by the failure.
  • Once the node comes back online, it resumes monitoring those processes and updates its internal status.
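
The two mitigation points above can be illustrated with a minimal sketch. It is not SnowFox's actual implementation: the function names (spawn_detached, is_alive, resume_monitoring) are hypothetical, the code assumes a POSIX host, and it assumes the PIDs of started processes were recorded somewhere that survives a restart. The child is placed in its own session so that it is not tied to the process that started it, and a restarted manager probes the recorded PIDs to rebuild its status.

```python
import os
import subprocess


def spawn_detached(argv):
    """Start a process in its own session so it outlives the caller.

    Hypothetical helper; POSIX only.
    """
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # runs setsid() in the child
    )


def is_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def resume_monitoring(known_pids):
    """Rebuild the status table from PIDs recorded before the failure."""
    return {
        pid: ("running" if is_alive(pid) else "exited")
        for pid in known_pids
    }
```

A PID probe alone cannot tell a surviving process from an unrelated one that reused the same PID, so a real implementation would likely record extra identity (for example the process start time) alongside the PID.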

Internal errors

Component Any

Root cause

  • Bugs in the code.
  • Unexpected behaviour from other components.
  • User error.
  • Others …

Side effects

  • Features may not work as expected.
  • To prevent node failures, most errors that are not explicitly handled are ignored.
  • Processes may not be started or terminated.
  • Requests may never return, either successfully or with an error.

Mitigation

  • Timeouts should be used to protect against missed events from components.
  • Retries and timers should be used to provide eventual consistency.
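
As a sketch of the mitigation above, the snippet below bounds the wait for a component's event with a timeout and retries the whole operation with an exponential back-off. All names here (RequestTimeout, wait_for_event, retry) and the use of an in-process queue as the event channel are assumptions made for illustration, not part of SnowFox.

```python
import queue
import time


class RequestTimeout(Exception):
    """Raised when an expected event does not arrive in time."""


def wait_for_event(events, timeout_s):
    """Wait for the next event, but never longer than timeout_s."""
    try:
        return events.get(timeout=timeout_s)
    except queue.Empty:
        raise RequestTimeout(f"no event within {timeout_s}s") from None


def retry(operation, attempts=3, base_delay_s=0.1):
    """Run operation until it succeeds or attempts are exhausted.

    Only timeouts are retried; the delay doubles after each failure.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except RequestTimeout as err:
            last_error = err
            if attempt < attempts - 1:
                time.sleep(base_delay_s * 2 ** attempt)
    raise last_error
```

Retrying is only safe when the operation is idempotent, since a timed-out request may still have been carried out by the component.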

Local service failure

Component Agents

Root cause

  • Bug in the agent.
  • Unexpected response from the agent.
  • User or kernel termination of the agent.

Side effects

  • Features may become unavailable.
  • Requests may time out.

Mitigation

  • Timeouts should be used to protect against missed events from components.
  • Retries and timers should be used to provide eventual consistency.
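
One way to read "timers for eventual consistency" in the context of agents is a periodic reconciliation loop: on every tick, the desired state is compared with what is observed, and any action that fails is simply picked up again on the next tick. The sketch below is an assumption about how this could look; reconcile, run_reconciler, and the set-of-names state model are hypothetical.

```python
import logging
import threading


def reconcile(desired, observed):
    """Return the actions that move the observed set toward the desired set."""
    to_start = sorted(desired - observed)
    to_stop = sorted(observed - desired)
    return [("start", n) for n in to_start] + [("stop", n) for n in to_stop]


def run_reconciler(get_desired, get_observed, apply, interval_s, stop_event):
    """Reconcile every interval_s seconds until stop_event is set.

    A failed action is logged and not retried immediately; the next
    tick recomputes the difference and tries again.
    """
    while not stop_event.is_set():
        for action in reconcile(get_desired(), get_observed()):
            try:
                apply(action)
            except Exception:
                logging.warning("action %r failed; retrying next tick", action)
        stop_event.wait(interval_s)  # sleep, but wake early on shutdown
```

For example, reconcile({"a", "b"}, {"b", "c"}) yields [("start", "a"), ("stop", "c")]. Because each tick starts from the observed state, a missed event or a crashed agent is corrected without any component having to remember what went wrong.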