Resiliency Levels

Created by Jason Hunt, last modified by Catherine Lefevre on Jan 25, 2021

Level Definitions

- Level 0: no redundancy
- Level 1: support manual failure detection & rerouting or recovery within a single site; tested to complete within 30 minutes
- Level 2: support automated failure detection & rerouting
  - within a single geographic site
  - stateless components: establish baseline measure of failed requests for a component failure within a site
  - stateful components: establish baseline of data loss for a component failure within a site
- Level 3: support automated failure detection & rerouting
  - across multiple sites
  - stateless components
    - improve on the Level 2 baseline number of failed requests for a component failure within a site
    - establish baseline for failed requests for site failure
  - stateful components
    - improve on the Level 2 baseline of data loss for a component failure within a site
    - establish baseline for data loss for site failure
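
The "establish baseline" and "improve on" criteria above can be made concrete with a small calculation. The following is a minimal sketch, not part of the original guidance: it assumes the metric is the fraction of requests that fail during a failure-injection test, and the names `FailureTestResult`, `failed_request_rate`, and `improves_on_baseline` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureTestResult:
    """Counts from one failure-injection run (hypothetical structure)."""
    total_requests: int
    failed_requests: int


def failed_request_rate(result: FailureTestResult) -> float:
    """Return the fraction of requests that failed during the test window."""
    if result.total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= result.failed_requests <= result.total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    return result.failed_requests / result.total_requests


def improves_on_baseline(baseline: FailureTestResult,
                         current: FailureTestResult) -> bool:
    """True if the current run has a strictly lower failure rate than the baseline."""
    return failed_request_rate(current) < failed_request_rate(baseline)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any project.
    level2_baseline = FailureTestResult(total_requests=1000, failed_requests=50)
    level3_run = FailureTestResult(total_requests=1000, failed_requests=20)
    print(improves_on_baseline(level2_baseline, level3_run))
```

The same comparison could be applied to a data loss metric for stateful components, for example records lost per failure event, if a project chooses to measure it that way.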

Minimum Levels

- Runtime Projects: Level 2 (stretch goal: Level 3)
  - NOTE: For the Dublin release, the building blocks will be put in place for Level 3 geo-redundancy, and a few projects will pilot it
- All other Projects: Level 1 (stretch goal: Level 2)

Guidance for Implementation

Level 2 resiliency within a single site can be easily implemented by project teams using OOM, Kubernetes clusters, and health checks.
CNI: OOM is introducing CNI, which will allow for multi-site Kubernetes clusters (VXLAN or BGP). Pods can be labeled by geo location and will then be scheduled to nodes carrying the corresponding label. The labels would need to be defined in the Helm charts. See OOM-1506.
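
As an illustration of the two ideas above (health checks for automated failure detection, and geo labels for site-aware scheduling), here is a minimal sketch that builds a Kubernetes Deployment manifest as a Python dictionary. It is not taken from OOM: the label key `example.org/site`, the `/healthz` path, the port, and the function name `geo_pinned_deployment` are all assumptions made for the example. In practice these values would live in the Helm charts rather than in Python.

```python
import json


def geo_pinned_deployment(name, image, site,
                          site_label_key="example.org/site",  # hypothetical label key
                          health_path="/healthz",             # hypothetical endpoint
                          port=8080):
    """Build a Deployment manifest pinned to nodes labeled with the given site."""
    labels = {"app": name, site_label_key: site}
    # HTTP probe: the kubelet restarts the container when the liveness probe
    # fails, and stops routing traffic to it when the readiness probe fails.
    probe = {"httpGet": {"path": health_path, "port": port}, "periodSeconds": 10}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{name}-{site}", "labels": labels},
        "spec": {
            "replicas": 2,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Schedule these pods only onto nodes carrying the same site label.
                    "nodeSelector": {site_label_key: site},
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": port}],
                        "livenessProbe": probe,
                        "readinessProbe": probe,
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    # One manifest per site; kubectl accepts JSON manifests as well as YAML.
    for site in ("site-a", "site-b"):
        print(json.dumps(geo_pinned_deployment("demo", "demo:1.0", site), indent=2))
```

Nodes would need a matching label for the `nodeSelector` to take effect. How nodes are labeled per site is outside the scope of this sketch.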
Integration testing details TBD.

Contacts

OOM and Integration teams.
