
Introduction

The ONAP Operations Manager provides a set of capabilities that facilitate Carrier Grade deployments of ONAP.  ONAP deployments need to be capable of offering service while under adverse conditions, typically with overall availability measured at five-nines (99.999%) uptime, or about 5 minutes of downtime per year.  This requirement might seem strict for an orchestration system, but keep in mind that ONAP’s closed loop control system could be providing monitoring and control for one or more critical VNFs that need to meet stringent up-time requirements as found in the TL 9000 Quality Management System Measurements Handbook.

The Road to High Availability

The progression of the ONAP project towards a fully Carrier Grade deployment has started and will continue over the Beijing and possibly subsequent releases.  The steps along this progression correspond roughly to the sections that follow, each of which describes the requirements in more detail and the technologies used to achieve them.

Highly Available Kubernetes Deployments

There is a high degree of variability possible in the deployment of Kubernetes.  In some cases it may be installed and managed by hand, deployed with 3rd party tools like Rancher, or even provided by a cloud provider like Microsoft Azure Container Service - the Kubernetes documentation describes the options here. Kubernetes also provides guidance on creating deployments that may be suitable for carrier grade deployments of ONAP on its Building High-Availability Clusters wiki page.
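
As an illustration only, a kubeadm-based installation (just one of the options noted above) might describe a highly available control plane with a cluster configuration along the lines of the following sketch; the load-balancer endpoint, etcd hosts, file paths and API version are assumptions, not ONAP requirements.

    apiVersion: kubeadm.k8s.io/v1beta3
    kind: ClusterConfiguration
    # API servers on every control-plane node sit behind a shared endpoint,
    # typically a load balancer, so the loss of any one node is tolerated.
    controlPlaneEndpoint: "k8s-api.example.com:6443"   # illustrative address
    etcd:
      # An external three-node etcd cluster keeps cluster state highly available.
      external:
        endpoints:
        - https://etcd-0.example.com:2379
        - https://etcd-1.example.com:2379
        - https://etcd-2.example.com:2379
        caFile: /etc/kubernetes/pki/etcd/ca.crt
        certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
        keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key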

Reliable and Repeatable Deployment

During the Amsterdam release OOM provided a set of capabilities to deploy some or all of the ONAP components rapidly and efficiently as a cloud native application with the Kubernetes container orchestration system (note that DCAE is an exception here as DCAE provides its own orchestration system). Each of the components has a deployment specification that describes not only the containers and the container requirements but also the relationships or dependencies between the containers.  These dependencies dictate the order in which the containers are started for the first time, such that these dependencies are always met without arbitrary sleep times between container startups.  For example, the SDC back-end container requires the Elasticsearch, Cassandra and Kibana containers within SDC to be ready, and is also dependent on DMaaP (or the message-router) being ready before becoming fully operational.
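
As a simplified sketch in the style of these deployment specifications, the following trimmed-down Kubernetes Deployment uses an init container to hold back the SDC back-end until its dependencies are ready; the image names, tags and arguments are illustrative rather than the exact Amsterdam content.

    apiVersion: extensions/v1beta1        # Amsterdam-era API group; apps/v1 in later Kubernetes releases
    kind: Deployment
    metadata:
      name: sdc-be
      namespace: onap-sdc
    spec:
      replicas: 1
      template:
        metadata:
          labels:
            app: sdc-be
        spec:
          # The init container blocks the SDC back-end from starting until the
          # containers it depends on report ready, so no arbitrary sleeps are needed.
          initContainers:
          - name: sdc-be-readiness
            image: oomk8s/readiness-check:1.0.0          # OOM helper image; tag illustrative
            command:
            - /root/ready.py
            args:
            - --container-name
            - sdc-es
            - --container-name
            - sdc-cs
            - --container-name
            - sdc-kb
            env:
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          containers:
          - name: sdc-be
            image: nexus3.onap.org:10001/openecomp/sdc-backend:1.1-STAGING-latest   # illustrative tag
            ports:
            - containerPort: 8080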

Prior to having more advanced carrier grade features available, the ability to reliably re-deploy ONAP (or a subset of it) provides a level of confidence that, should an outage occur, the system can be brought back on-line predictably.

Backup and Restore

A critical factor in being able to recover from an ONAP outage is to ensure that critical state isn't lost after a failure.  Much like ephemeral storage on VMs, any state information stored within a container will be lost once the container is restarted - containers are managed as cattle, not pets. To ensure that critical state information is retained after a failure, the OOM deployment specifications for the ONAP components use the Kubernetes concept of Persistent Volumes, an external storage facility that has its own lifecycle. Many different types of storage are supported by this capability, such as: GCEPersistentDisk, AWSElasticBlockStore, AzureFile, AzureDisk, FC (Fibre Channel), FlexVolume, Flocker, NFS, iSCSI, RBD (Ceph Block Device), CephFS, Cinder (OpenStack block storage), Glusterfs, VsphereVolume, Quobyte Volumes, HostPath (single node testing only), VMware Photon, Portworx Volumes, ScaleIO Volumes, and StorageOS.
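
As a minimal sketch, assuming an NFS backend, a Persistent Volume and a matching claim for a hypothetical database container might look like the following; the names, sizes, server address and paths are illustrative only.

    # A pre-provisioned volume backed by NFS (any of the storage types listed
    # above could be substituted) ...
    kind: PersistentVolume
    apiVersion: v1
    metadata:
      name: mso-db-pv
    spec:
      capacity:
        storage: 2Gi
      accessModes:
      - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      nfs:
        server: nfs.example.com              # illustrative NFS server
        path: /dockerdata-nfs/mso/mariadb    # illustrative export path
    ---
    # ... and the claim a database container mounts so its state survives restarts.
    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: mso-db-pvc
      namespace: onap-mso
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 2Gi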

As critical state is stored outside of the ONAP containers on storage media specific to the cloud environment, specific instructions on how to back up and restore such storage are outside of the scope of ONAP.

Health Monitoring

All highly available systems include at least one facility to monitor the health of components within the system.  Such health monitors are often used as inputs to distributed coordination systems (such as etcd or zookeeper) and monitoring systems (such as nagios or zabbix).  Within ONAP, Consul is the monitoring system of choice and is deployed by OOM in two parts.  A three-way, centralized Consul server cluster is deployed as a highly available monitor of all of the ONAP components.  The Consul server provides a user interface that allows a user to graphically view the current health status of all of the ONAP components for which agents have been created.  Monitoring of ONAP components is configured in the agents within JSON files stored in gerrit under the consul-agent-config.
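
A minimal sketch of what one such JSON agent configuration file might contain is shown below; the service name, endpoint and intervals are illustrative and not the actual consul-agent-config content.

    {
      "service": {
        "name": "aai-health",
        "check": {
          "http": "https://aai-service.onap-aai:8443/aai/util/echo",
          "tls_skip_verify": true,
          "interval": "15s",
          "timeout": "5s"
        }
      }
    }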

Initially the Consul agents use the same health monitoring facilities as the robot test infrastructure, which typically just validate that the end-point is reachable.  Some health checks already support more advanced checking - such as validating that a database is able to create, update and delete an entry. Consul exposes an API that allows external agents to use the results of the health checks, such as the Kubernetes "liveness" probes described below.
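
For example, a liveness probe could consume the Consul health API along the lines of the following container-spec fragment; the Consul server address and service name are assumptions, and the component image is assumed to provide curl.

    livenessProbe:
      exec:
        command:
        - sh
        - -c
        # Fail the probe unless the central Consul server reports this
        # component's health check as passing.
        - >
          curl -s http://consul-server.onap-consul:8500/v1/health/checks/sdnc-dbhost
          | grep -q '"Status": *"passing"'
      initialDelaySeconds: 300
      periodSeconds: 60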

Component Recoverability

OOM deploys ONAP with Kubernetes using the deployment specifications described earlier.  These same deployment specifications are also used to implement automatic recoverability of ONAP components when individual components fail. Once ONAP is deployed, a "liveness" probe starts checking the health of each component after a specified startup time.  These liveness probes can simply check that a port is available, that a built-in health check is reporting good health, or that the Consul health check is positive. Should a liveness probe indicate a failed container, it will be restarted as described in the deployment specification.  Should the deployment specification indicate that there are one or more dependencies for this container or component (for example a dependency on a database), those dependencies will be satisfied before the container/component is restarted. This mechanism ensures that after a failure all of the ONAP components restart successfully.  Note that during the Amsterdam release deployment specifications were created for all ONAP components, but not all of these deployment specifications are restartable (idempotent), so further work is required during the Beijing release to ensure recoverability of all the ONAP components.
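
A minimal sketch of the simplest form of such a probe - checking that a port answers once a startup delay has elapsed - is shown below; the component name, image and timings are illustrative.

    containers:
    - name: mso
      image: nexus3.onap.org:10001/openecomp/mso:1.1-STAGING-latest   # illustrative
      ports:
      - containerPort: 8080
      # Restart the container if its API port stops answering after the
      # initial startup window has elapsed.
      livenessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 600    # allow for a slow first-time initialization
        periodSeconds: 10
      readinessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 60
        periodSeconds: 10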

Centralized Logging

Intra Component Clustering

Anti-Affinity Rules

ONAP S/W Upgrades & Rollbacks

Geo-Redundant Deployments


List of Epics (to be sync'ed/merged with page content): 
