Casablanca

Updated for Casablanca release.

Overview

With SDN-C deployed in a geo-redundant fashion (see Deployment of Geo-Redundant SDN-C), activity can be switched from one site to the other in one of two ways:

  • manually by the site operator
  • automatically via PROM, based on the health of the active site

In either case, once the failover has completed and the two sites have swapped roles ('standby' to 'active' and vice versa), the DNS entry for the SDN-C deployment is updated (see Geo-Redundant SDN-C DnsSwitch) so that clients send their messages to the correct SDN-C target.

Manual (forced) failover

The manual option is intended for site operators who want to force activity to a particular site so that they can perform maintenance or other work on the other site without impacting service. Before doing so, it is suggested that the current role of each site be determined (see SDN-C Site Role Detection).

From the Kubernetes master node of the site that should become active, run the sdnc.makeActive script:

sdnc.makeActive
ubuntu@k8s-s2-master:~/oom/kubernetes/sdnc/resources/geo/bin$ ./sdnc.makeActive dev

Here, 'dev' is the release name of the deployment.

The script uses kubectl to access the PROM pod and execute promoverride.py with the appropriate parameters, forcing PROM to switch activity to the local site.

Alternatively, promoverride.py can be executed directly from within the PROM pod:
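
The behaviour described above can be sketched roughly as follows. This is an illustration in Python, not the actual sdnc.makeActive script. The namespace 'onap', the helper names, the explicit site_id argument, the script path /app/promoverride.py, and the assumption that the PROM pod name starts with '<release>-prom-' are all hypothetical choices made for this example.

```python
import subprocess

NAMESPACE = "onap"  # assumption: namespace of the SDN-C deployment


def find_prom_pod(pod_names, release):
    """Return the first pod whose name starts with '<release>-prom-', else None."""
    prefix = f"{release}-prom-"
    for name in pod_names:
        # 'kubectl get pods -o name' prints entries such as 'pod/dev-prom-...'
        short = name.split("/", 1)[-1]
        if short.startswith(prefix):
            return short
    return None


def build_override_cmd(pod, site_id, namespace=NAMESPACE):
    """Build the kubectl argv that runs promoverride.py inside the PROM pod."""
    # Assumption: the script lives at /app/promoverride.py inside the pod.
    return ["kubectl", "exec", "-n", namespace, pod, "--",
            "/app/promoverride.py", "-i", site_id]


def make_active(release, site_id, namespace=NAMESPACE):
    """List pods, locate the PROM pod, and force activity to site_id."""
    listing = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    pod = find_prom_pod(listing, release)
    if pod is None:
        raise RuntimeError(f"no PROM pod found for release {release!r}")
    subprocess.run(build_override_cmd(pod, site_id, namespace), check=True)
```

Calling make_active("dev", "sdnc02") would, under these assumptions, have the same effect as running promoverride.py by hand inside the PROM pod.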

promoverride.py
root@dev-prom-6485f566fb-hdhzs:/app/config# ../promoverride.py -i sdnc02

Here, 'sdnc02' is the identifier, specified during deployment, of the site that should become active.

Automatic failover

The PROM instance in each SDN-C site periodically determines the health of the local site based on the health of each of its components (see SDN-C Site Health Determination). This health information is published to MUSIC so that the remote site is also aware of it.
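
As a rough illustration of the idea, the sketch below folds per-component health into a single site record and serializes it for publication. The component names, the record fields, and the rule that a site is healthy only when every component is healthy are assumptions made for this example. The actual rules are described in SDN-C Site Health Determination, and no real MUSIC API is used here.

```python
import json


def site_health(site_id, components):
    """Combine per-component health flags into one site-level record.

    Assumption: the site counts as healthy only if at least one component
    is reported and every reported component is healthy.
    """
    return {
        "site": site_id,
        "healthy": bool(components) and all(components.values()),
        "components": dict(components),
    }


def to_payload(record):
    """Serialize the record; a real deployment would write this to MUSIC."""
    return json.dumps(record, sort_keys=True)


# Hypothetical component names, used only to exercise the functions.
record = site_health("sdnc01", {"controller": True, "database": False})
print(to_payload(record))
```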

If the local PROM instance determines that its own site is currently 'standby' and that the remote site has become unhealthy, it automatically initiates failover, making the local site 'active' and switching the remote site to 'standby' (provided the remote site is in a good enough state to do so).
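
The decision described above can be expressed as a small function. This is a minimal sketch, not PROM's actual implementation: the function name, the action labels, the remote_can_demote flag, and the ordering of the two actions are all assumptions introduced for illustration.

```python
def failover_actions(local_role, remote_healthy, remote_can_demote):
    """Return the ordered actions a local PROM instance would take, if any.

    local_role        -- 'active' or 'standby' for the local site
    remote_healthy    -- health of the remote site as last published
    remote_can_demote -- whether the remote site is in a good enough
                         state to be switched to 'standby' (assumed flag)
    """
    if local_role != "standby" or remote_healthy:
        return []  # nothing to do: already active, or the remote site is fine
    actions = ["make_local_active"]
    if remote_can_demote:
        actions.append("make_remote_standby")
    return actions


# A standby site that sees an unhealthy but reachable remote site.
print(failover_actions("standby", remote_healthy=False, remote_can_demote=True))
```

Returning a list of actions rather than performing them keeps the decision separate from its side effects, which makes the rule easy to check in isolation.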
