Note: Updated for the Casablanca release.

Overview

With SDN-C deployed in a geo-redundant fashion (see Deployment of Geo-Redundant SDN-C), activity can be switched from one site to the other in one of two ways:

  • manually by the site operator
  • automatically via PROM, based on the health of the active site

In either case, after the failover has been completed and the sites have transitioned from 'standby' to 'active' and vice versa, the DNS entry for the SDN-C deployment is automatically updated (see Geo-Redundant SDN-C DnsSwitch) in order to provide clients with the correct SDN-C target for their messaging.

Manual (forced) failover

Site operators use the manual option to force activity to a particular site so that they can perform maintenance or other work on the other site without impacting service. Before doing so, determine the current role of each site (see SDN-C Site Role Detection).

From the Kubernetes master node in the site that is to become active, run the sdnc.makeActive script, passing the release name ('dev' in this example):

ubuntu@k8s-s2-master:~/oom/kubernetes/sdnc/resources/geo/bin$ ./sdnc.makeActive dev

The script uses kubectl to access the PROM pod and run promoverride.py with the parameters needed to force PROM to switch activity to the local site.
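The wrapper behaviour described above can be sketched in Python. This is an illustration only, not the contents of sdnc.makeActive: the 'onap' namespace, the '<release>-prom-' pod name prefix, and the function names are assumptions, and the /app/promoverride.py path is inferred from the PROM pod prompt shown on this page.

```python
import subprocess

NAMESPACE = "onap"  # assumption: namespace the release is deployed in


def find_prom_pod(release, pod_names):
    """Return the first pod whose name starts with '<release>-prom-' (assumed pattern)."""
    prefix = f"{release}-prom-"
    for name in pod_names:
        # 'kubectl get pods -o name' prints entries such as 'pod/dev-prom-...'
        short = name.split("/", 1)[-1]
        if short.startswith(prefix):
            return short
    raise LookupError(f"no PROM pod found for release {release!r}")


def build_override_command(pod, site_id, namespace=NAMESPACE):
    """Build the kubectl argv that runs promoverride.py inside the PROM pod."""
    return [
        "kubectl", "exec", "-n", namespace, pod, "--",
        "/app/promoverride.py", "-i", site_id,  # path inferred, not confirmed
    ]


def make_active(release, site_id, namespace=NAMESPACE):
    """List pods, locate the PROM pod, and force activity to site_id."""
    listing = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    pod = find_prom_pod(release, listing)
    subprocess.run(build_override_command(pod, site_id, namespace), check=True)
```

Calling make_active("dev", "sdnc02") on a Kubernetes master node would, under these assumptions, have the same effect as running promoverride.py by hand inside the PROM pod.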

Alternatively, promoverride.py can be executed directly from within the PROM pod:

root@dev-prom-6485f566fb-hdhzs:/app/config# ../promoverride.py -i sdnc02

Here, 'sdnc02' is the identifier, specified during deployment, of the site that should become active. When running promoverride.py directly, you can switch activity to either site; you do not need to be logged into the PROM pod on the target site.

Automatic failover

The PROM instance in each SDN-C site periodically determines the health of the local site from the health of each of its components (see SDN-C Site Health Determination). It publishes this information to MUSIC so that the remote site is also aware of it.
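As a rough illustration of rolling component health up into a site-level record, the following Python sketch may help. It is not PROM's implementation: the all-components-must-be-healthy rule, the record fields, the component names, and both function names are assumptions; the actual rules are described in SDN-C Site Health Determination, and the MUSIC write itself is not shown.

```python
import time


def site_is_healthy(component_health):
    """Assumed rule: the site is healthy only if every component is healthy."""
    return bool(component_health) and all(component_health.values())


def build_health_record(site_id, component_health, now=None):
    """Build a hypothetical record that could be shared with the remote site."""
    return {
        "site": site_id,
        "healthy": site_is_healthy(component_health),
        "components": dict(component_health),
        "timestamp": time.time() if now is None else now,
    }


# Hypothetical component names, for illustration only.
record = build_health_record("sdnc01", {"controller": True, "database": False}, now=0)
print(record["healthy"])
```

With one component reported as unhealthy, the record marks the whole site as unhealthy under the assumed rule.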

If the local PROM instance determines that the local site is currently 'standby' and the remote site has become unhealthy, it automatically initiates failover, making the local site 'active' and moving the remote site to 'standby' (provided the remote site is in a good enough state to make that transition).
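The decision described above can be expressed as a small Python function. This is a minimal sketch, not PROM's code: the function name and return shape are hypothetical, and a single remote_reachable flag stands in for "in a good enough state", which this page does not define precisely.

```python
def plan_failover(local_role, remote_healthy, remote_reachable):
    """Return the role changes to apply as a dict of site -> new role.

    An empty dict means no failover is initiated.
    """
    # Only a standby site facing an unhealthy remote site takes over.
    if local_role != "standby" or remote_healthy:
        return {}
    plan = {"local": "active"}
    # The remote site is demoted only if it can still act on the request.
    if remote_reachable:
        plan["remote"] = "standby"
    return plan


print(plan_failover("standby", remote_healthy=False, remote_reachable=True))
```

If the remote site cannot be reached, the plan contains only the local promotion, reflecting the proviso in the paragraph above.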

Earlier Releases

Beijing

The promoverride.py script for manual (forced) failover is not available in Beijing, nor is automatic failover orchestrated by PROM.

To carry out a site failover in Beijing, the operator invokes the sdnc.failover script found on the Kubernetes master in the standby site:

ubuntu@k8s-s2-master:~/oom/kubernetes/sdnc/resources/geo/bin$ ./sdnc.failover

Note: The sdnc.failover script in Beijing is limited to situations where the failure affecting the active site is not catastrophic, that is, where most components in the active site can still be reached.