Casablanca

Updated for Casablanca release.

Overview

With SDN-C deployed in a geo-redundant fashion (see Deployment of Geo-Redundant SDN-C), activity can be switched from one site to the other in one of two ways:

  • manually by the site operator
  • automatically via PROM, based on health of active site

In either case, after the failover has completed and the sites have swapped roles ('standby' becoming 'active' and vice versa), the DNS entry for the SDN-C deployment is automatically updated (see 2.1 Enable Remote Access to CoreDNS) so that clients are given the correct SDN-C target for their messaging.

Manual (forced) failover

The manual option is used by a site operator who wishes to force activity to a particular site, for example so that maintenance or other activities can be carried out on the other site without impacting service. Before forcing a failover, it is suggested that the current role of the site(s) first be determined (see SDN-C Site Role Detection).
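The geo-redundancy helper scripts, including sdnc.makeActive, can be found under the OOM tree on the Kubernetes master node (path as used in the examples on this page):

ubuntu@k8s-s2-master:~$ ls ~/oom/kubernetes/sdnc/resources/geo/bin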

From the Kubernetes master node in the site, simply run the sdnc.makeActive script:

sdnc.makeActive
ubuntu@k8s-s2-master:~/oom/kubernetes/sdnc/resources/geo/bin$ ./sdnc.makeActive dev (release name)

This script will make use of kubectl to access the PROM pod and execute promoverride.py with the appropriate parameters to force PROM to switch activity to the local site.
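Conceptually, the script does something similar to the following (a simplified sketch rather than the script's exact contents; the 'dev' release name, the 'onap' namespace and the <site-id> placeholder are assumptions used here for illustration):

# locate the PROM pod for the release and run promoverride.py inside it with the local site's identifier
ubuntu@k8s-s2-master:~$ PROM_POD=$(kubectl get pods -n onap | grep dev-prom | awk '{print $1}')
ubuntu@k8s-s2-master:~$ kubectl exec -n onap ${PROM_POD} -- /app/promoverride.py -i <site-id>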

Alternatively, the promoverride.py script could be executed directly from the PROM pod if so desired:

promoverride.py
root@dev-prom-6485f566fb-hdhzs:/app/config# ../promoverride.py -i sdnc02

Here, 'sdnc02' is the identifier specified during deployment for the site that is desired to become active. When using the promoverride.py script directly, you may switch activity to either of the sites (without having to be logged into the PROM pod on that site).
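For example, to make the other site active instead, the same command would be run with that site's identifier (here assumed to be 'sdnc01'):

root@dev-prom-6485f566fb-hdhzs:/app/config# ../promoverride.py -i sdnc01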

Automatic failover

The PROM instance in each SDN-C site is responsible for periodically ascertaining the health of the local site based on the health of each component (see SDN-C Site Health Determination). This information is published to MUSIC so that the remote site is also aware of it.

If the local PROM instance determines that the local site is currently 'standby' (see SDN-C Site Role Detection) and the remote site has become unhealthy, it will proceed to automatically initiate failover procedures, making the local site 'active' while the remote site is reverted to 'standby' (provided it is in a good enough state to do so).
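No operator action is required for this. If desired, the periodic health evaluation and any resulting failover decisions can be observed in the PROM pod's logs (the pod name and 'onap' namespace below are taken from the examples on this page):

ubuntu@k8s-s2-master:~$ kubectl logs -f dev-prom-6485f566fb-hdhzs -n onap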

Catastrophic failover

In certain circumstances, a "simple" failover – in which most components in the failed site are still available to be manipulated – cannot be performed. If the ODL cluster in the failing site cannot be contacted from the site that is to become active or if the Kubernetes master node in the remote site cannot be contacted (which will prevent communication to the managed pods), a catastrophic failover will be performed.

In a "catastrophic" failover, the healthy SDN-C site needs to be reconfigured as a standalone site (without geo-redundancy). Part of the process will involve a Helm upgrade intended to reconfigure the site to remove geo-redundancy. This will require that the SDN-C pods in the site to be restarted (for the reconfiguration to take effect).

After a site has suffered a catastrophic failure and the other site has been reconfigured as non-geo-enabled, the geo pair can be reconstituted by following the site recovery procedure.

Detecting reconfiguration to non-geo-redundancy

After a catastrophic failover, the operator may wish to confirm that the active site has actually reverted to a non-geo-redundant configuration. This can be done by connecting to an SDN-C pod in the site and looking for 'GEO' in the environment variables in use:

Non-geo deployment
ubuntu@k8s-master:~$ k8s exec dev-sdnc-0 -it sh
Defaulting container name to sdnc.
Use 'kubectl describe pod/dev-sdnc-0 -n onap' to see all of the containers in this pod.
# env | grep GEO
#
Geo deployment
ubuntu@k8s-master:~$ k8s exec dev-sdnc-0 -it sh
Defaulting container name to sdnc.
Use 'kubectl describe pod/dev-sdnc-0 -n onap' to see all of the containers in this pod.
# env | grep GEO
GEO_ENABLED=true
#

DNS updates

After a successful failover is performed, either manually or automatically, the sdnc.dnsswitch script found on the PROM pod will automatically be invoked. This script communicates with the CoreDNS pod and updates the SDN-C deployment's DNS record so that it points to the local (i.e. 'active') site. In the case where the failover was not completed successfully, this step is not carried out (since the site is very likely unable to process messaging).

The sdnc.dnsswitch script is intended to be utilized by PROM but could be run manually if so desired:

sdnc.dnsswitch
root@dev-prom-6485f566fb-hdhzs:/app/config# ./sdnc.dnsswitch
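The result can be verified by resolving the SDN-C deployment's DNS name and confirming that it now points to the newly active site (assuming the deployment's DNS name is sdnc.example.com, as in the Beijing examples further below):

ubuntu@k8s-master:~$ nslookup sdnc.example.com
# the returned address should be that of the site that has just become active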



Earlier Releases

Beijing


Manual site failover

The promoverride.py script for manual (forced) failover is not available in Beijing, nor is automatic failover orchestrated by PROM.

In order to carry out a site failover in Beijing, the operator would invoke the sdnc.failover script found on the Kubernetes master in the standby site:

sdnc.failover
ubuntu@k8s-s2-master:~/oom/kubernetes/sdnc/resources/geo/bin$ ./sdnc.failover

Note: The sdnc.failover script in Beijing is limited to situations where the failure afflicting the active site is not catastrophic, meaning that most components in the active site are still available to be communicated with.

Manual DNS update

After successfully failing over a site, the operator would then be required to update the CoreDNS configuration so that the SDN-C hostname resolves to the appropriate site. (The sdnc.dnsswitch script method is not available in the Beijing release.)

Follow the steps below to perform a manual site failover. All steps need to be run on the coredns master node.

Note the configuration used in all examples, for reference:

coredns master node IP address: 10.147.101.135

primary site (site1) master node IP address: 10.147.99.140

secondary site (site2) master node IP address: 10.147.101.23


  1.   Verify the coredns server to get the existing mapping (here it is still pointing to the primary site, site1):
#verify the address for sdnc.example.com resolves to primary site presently
root@coredns-1:/dockerdata-nfs# nslookup sdnc.example.com
Server:         10.96.0.10
Address:        10.96.0.10#53
Name:   sdnc.example.com
Address: 10.147.99.140


     2. Edit the zone file to comment out the SDNC mapping to the primary site (site1) and uncomment the mapping to the secondary site (site2):

root@coredns-1:~# vi /dockerdata-nfs/zone.db
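After editing, the SDNC records in the zone file should look similar to the following, with the site1 record commented out using ';;' and the site2 record left active (addresses as per the example configuration above):

;;site1
;;sdnc         IN  A   10.147.99.140
;;site2
sdnc           IN  A   10.147.101.23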


     3. Edit the coredns configmap to comment out the SDNC mapping to the primary site (site1) and uncomment the mapping to the secondary site (site2):

# The command below opens the coredns configmap for editing. Edit and save the file.
# Notice the A record for sdnc: "sdnc         IN  A  10.147.99.140" is commented out by prepending ';;' to the line (\n;;sdnc         IN  A  10.147.99.140\n)
# Notice the A record for sdnc: "sdnc         IN  A  10.147.101.23" is uncommented by removing ';;' from the line (\nsdnc\t\t    IN A   10.147.101.23)
root@kubefed-1:~# kubectl edit configmap coredns -n kube-system -oyaml
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        log
        health
        kubernetes cluster.local 10.96.0.0/12 {
           pods insecure
        }
        file /dockerdata-nfs/zone.db example.com
        prometheus
        proxy . /etc/resolv.conf
        cache 30
    }
  zone.db: "$ORIGIN example.com.     ; designates the start of this zone file in the
    namespace\n$TTL 1h         ; default expiration time of all resource records without
    their own TTL value\nexample.com.  IN  SOA   ns.example.com. username.example.com.
    ( 2007120710 1d 2h 4w 1h )\nexample.com.  IN  NS    ns                    ; ns.example.com
    is a nameserver for example.com\nexample.com.  IN  NS    ns.somewhere.example.
    ; ns.somewhere.example is a backup nameserver for example.com\nexample.com.  IN
    \ A     10.147.101.135             ; IPv4 address for example.com\nns            IN
    \ A     10.247.5.11             ; IPv4 address for ns.example.com\nwww           IN
    \ CNAME example.com.          ; www.example.com is an alias for example.com\nwwwtest
    \      IN  CNAME www              ; wwwtest.example.com is another alias for www.example.com\nsdnc.example.com.
    \   IN      SRV    30202 10 10 example.com.\n;;site1\n;;sdnc         IN  A  10.147.99.140\n;;site2\nsdnc\t\t
    IN A   10.147.101.23"
kind: ConfigMap
metadata:
  creationTimestamp: 2018-02-28T20:13:03Z
  name: coredns
  namespace: kube-system
  resourceVersion: "102077"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: c8489771-1cc3-11e8-a0cb-fa163eabcb60

configmap "coredns" edited


     4. Note that a cache time is configured in the configmap (30 seconds here). Wait for that period, then send a signal to coredns to refresh its settings:

#substitute the coredns pod name before execution
root@coredns-1:~# kubectl exec -n kube-system <coredns-pod-name> -- kill -SIGUSR1 1
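If needed, the coredns pod name can be found with:

root@coredns-1:~# kubectl get pods -n kube-system | grep coredns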


     5. Verify that the "sdnc.example.com" domain now points to the secondary site:

#verify the address for sdnc.example.com resolves to secondary site now
root@kubefed-1:/dockerdata-nfs# nslookup sdnc.example.com
Server:         10.96.0.10
Address:        10.96.0.10#53
Name:   sdnc.example.com
Address: 10.147.101.23

Depending on the configured cache time, it may take some time for the DNS resolver to refresh the address. If the update is not yet visible, wait a while and send the refresh signal again (as in step 4).
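As an alternative to nslookup, CoreDNS can also be queried directly with dig; the ANSWER SECTION of the output also shows the remaining TTL on the record (server address 10.96.0.10 as shown in the nslookup output above):

root@kubefed-1:~# dig @10.96.0.10 sdnc.example.com
# the ANSWER SECTION should list 10.147.101.23 (the secondary site) once the cached record has expired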
