Levels of Redundancy

  • Level 0: no redundancy
  • Level 1: support manual failure detection & rerouting or recovery within a single site; expected to complete within 30 minutes
  • Level 2: support automated failure detection & rerouting 
    • within a single geographic site
    • stateless components: establish baseline measure of failed requests for a component failure within a site 
    • stateful components: establish baseline of data loss for a component failure within a site
  • Level 3: support automated failure detection & rerouting 
    • across multiple sites 
    • stateless components 
      • improve on # of failed requests for component failure within a site 
      • establish baseline for failed requests for site failure 
    • stateful components 
      • improve on data loss metrics for component failure within a site 
      • establish baseline for data loss for site failure
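The taxonomy above can be sketched as a small data model. This is only an illustration; the class, field, and component names are invented here, not taken from any product:

```python
from dataclasses import dataclass
from enum import IntEnum

class RedundancyLevel(IntEnum):
    """The four redundancy levels described above (illustrative names)."""
    NONE = 0       # no redundancy
    MANUAL = 1     # manual detection & recovery within a site, ~30 min
    AUTO_SITE = 2  # automated detection & rerouting within a single site
    AUTO_GEO = 3   # automated failover across sites (geo-redundancy)

@dataclass
class Component:
    name: str
    stateful: bool          # stateful components track data loss, stateless track failed requests
    level: RedundancyLevel

    def is_geo_redundant(self) -> bool:
        # Level 3 is what the notes below call geo-redundancy.
        return self.level >= RedundancyLevel.AUTO_GEO

db = Component("inventory-db", stateful=True, level=RedundancyLevel.AUTO_GEO)
print(db.is_geo_redundant())  # True
```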

Level 3 redundancy → Geo-Redundancy

Geo-redundancy types

active / standby 

cold standby

After a health-check failure is detected, the administrator manually powers on the standby components and configures all affected components. Stateful components are initialized from the latest backup.

warm standby

Resources of the standby components are allocated, and the components are periodically powered on to synchronize the stateful components. After a health-check failure is detected, the administrator manually configures all affected components.

hot standby

Resources of the standby components are allocated and powered on for continuous synchronization of the stateful components. After automatic health-check failure detection, algorithms automatically configure all affected components.
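A hot-standby controller can be sketched roughly as a polling loop: after a few consecutive health-check failures it promotes the standby automatically. This is a minimal sketch; `check_health` and `activate` are hypothetical hooks standing in for the real health-check and reconfiguration mechanisms:

```python
import time

def failover_controller(active, standby, check_health, activate,
                        interval=5.0, retries=3):
    """Poll the active member's health check; after `retries` consecutive
    failures, automatically promote the standby (hot-standby behavior).

    check_health(member) -> bool, activate(member) -> None are supplied
    by the deployment; they are placeholders here.
    """
    failures = 0
    while True:
        if check_health(active):
            failures = 0  # healthy again, reset the counter
        else:
            failures += 1
            if failures >= retries:
                activate(standby)  # reroute traffic to the standby
                return standby     # the standby is now the active member
        time.sleep(interval)
```

The consecutive-failure counter is there to avoid failing over on a single dropped health probe, which matters when the two sites are far apart.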

active / active

Cluster

High-availability clusters synchronize data between the cluster nodes in near real time. If one node fails, the cluster remains fully functional.

Such a setup requires low latency between the different geo-locations, which is unlikely in production deployments.
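A crude way to sanity-check whether two sites are close enough is to time a TCP connect between them. A minimal sketch (host and port are whatever remote endpoint you want to probe):

```python
import socket
import time

def tcp_rtt_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Time one TCP connect to a remote endpoint, in milliseconds.
    A rough proxy for the inter-site latency a cluster would experience."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # handshake completed; we only wanted the timing
    return (time.monotonic() - start) * 1000.0
```

One connect is only a rough lower bound; repeating the measurement and looking at the worst case gives a better picture of what consensus traffic would see.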

Stay independent and sync changes (???)

Let's assume there are two independent systems starting from scratch, and all the databases are filled with the same initial data. The data in the databases stay in sync as long as all updates are applied identically on both systems. For data coming from the network there are usually mechanisms to ensure this: event duplication to both systems and auto-refresh using polling. The open question is how to sync SET requests issued through portals by users or by other northbound applications (e.g. planning tools).
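One conceivable approach, sketched below, is a thin fan-out layer that sends every northbound SET to both systems and flags divergence. This is an assumption-laden illustration, not an existing component; `duplicate_set` and the callables are invented names:

```python
def duplicate_set(request, systems):
    """Apply the same SET request to every independent system and compare
    the answers; mismatched results mean the databases have diverged.

    `systems` is a list of callables, each standing in for one system's
    northbound API (hypothetical stand-ins for real clients).
    """
    results = [apply_set(request) for apply_set in systems]
    in_sync = all(r == results[0] for r in results)
    return results, in_sync

# Toy usage: both "systems" answer identically, so they stay in sync.
results, in_sync = duplicate_set(
    {"path": "/config/node/1", "value": 42},
    [lambda r: ("ok", r["value"]), lambda r: ("ok", r["value"])],
)
print(in_sync)  # True
```

The hard parts this sketch glosses over are exactly the open questions above: what to do when one system is down (queue and replay?) and how to resolve a detected divergence.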


Deployment specifics:

Valid cluster combinations, uses/pros/cons:
1-node "just akka-enabled features/bundles" (good for labs, testing, and JSON-RPC LEAP front-end)
3-node all-active, no geo.conf (recommended best practice for everything except some BGP/PCEP use-cases; can be geo-separated up to ~100-120 ms at scale)
6-node all-active, with geo.conf (allows for geo-maintenance -- but not an ODL major upgrade -- and is computationally expensive for little benefit)
6-node 3-active, with geo.conf (operationally painful, no longer recommended, but what several major deployments are still using for now)
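For orientation, the 3-node all-active case maps onto an akka.conf along these lines. This is a sketch only: the hostnames are placeholders, and the exact keys (netty.tcp vs. artery) differ between ODL/Akka versions:

```
odl-cluster-data {
  akka {
    remote {
      netty.tcp {
        hostname = "member-1.example.net"   # this member's resolvable name
        port = 2550                          # the "real" Akka bind port
      }
    }
    cluster {
      seed-nodes = [
        "akka.tcp://opendaylight-cluster-data@member-1.example.net:2550",
        "akka.tcp://opendaylight-cluster-data@member-2.example.net:2550",
        "akka.tcp://opendaylight-cluster-data@member-3.example.net:2550"
      ]
      roles = ["member-1"]                   # unique per member
    }
  }
}
```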


Valid container networking options (updated list):
A. Entirely flat network, using real "bind" port numbers + hostnames in akka.conf. Each container has a full IP stack and its own unique internal IP. Akka resolves FQDNs to these unique 3-6 internal IPs, and uses the "real" Akka port to bind each of them in full-mesh TCP connections. This mode supports consolidation in a single physical host, such as a Docker host with a shared bridge network (not the default one), in a lab.

B. NATted network, separate real "bind" ports/hostnames vs "public" ports/hostnames. Each container has a full IP stack and its own unique internal IP, *AND* must also resolve to its own unique external IP address as well. Anti-affinity must be used such that no two cluster members are ever scheduled onto the same physical host, nor resolve to the same external/internal IP. i.e., avoid any two containers resolving to the same IP address -- Akka cannot handle this.

In summary: "If they cannot all be together, then they must all be separate."
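Option B corresponds to Akka's split between the bound address and the advertised ("public") address. Roughly, with classic-remoting key names and illustrative addresses:

```
akka.remote.netty.tcp {
  hostname      = "203.0.113.10"   # "public" address the other members resolve and connect to
  port          = 2550
  bind-hostname = "10.42.0.5"      # container-internal address actually bound
  bind-port     = 2550
}
```

Each member must have its own unique pair of these addresses, which is exactly why the anti-affinity rule above matters.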


In all cases, persistent volume mounts and stateful sets must be used so cluster snapshots, journals, and particularly frontend/backend shard files maintain integrity across container lifecycle events. These are not to be considered "cloud-native," but rather "VMs that can be run at-scale in stable container environments."
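In Kubernetes terms, that combination looks roughly like the StatefulSet below: per-member persistent volumes plus required anti-affinity. The image name, mount path, and sizes are placeholders, not a tested deployment:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: odl-cluster
spec:
  serviceName: odl
  replicas: 3
  selector:
    matchLabels: {app: odl}
  template:
    metadata:
      labels: {app: odl}
    spec:
      affinity:
        podAntiAffinity:              # never two members on one physical host
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels: {app: odl}
            topologyKey: kubernetes.io/hostname
      containers:
      - name: odl
        image: example/odl:latest     # placeholder image
        volumeMounts:
        - name: data
          mountPath: /opt/opendaylight/data   # journals, snapshots, shard files
  volumeClaimTemplates:               # one persistent volume per member
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources: {requests: {storage: 10Gi}}
```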

Note: Tools like geo_config and cluster_monitor can use both HTTP and HTTPS RESTCONF ports for communicating with Jolokia; however, the HTTPS modes are very particular about FQDNs, certificates, ports, etc.

