
Monitor

All highly available systems include at least one facility to monitor the health of components within the system. Such health monitors are often used as inputs to distributed coordination systems (such as etcd, ZooKeeper, or Consul) and monitoring systems (such as Nagios or Zabbix). OOM provides two mechanisms to monitor the real-time health of an ONAP deployment:

  • a Consul GUI for a human operator or downstream monitoring systems, and
  • a set of liveness probes which feed into the Kubernetes manager and enable automatic healing of failed containers, as described in the Heal section.

OOM deploys a three-instance Consul server cluster that provides real-time health monitoring for all of the ONAP components. A Consul health check has been created for each of the ONAP components.

Within ONAP, Consul is the monitoring system of choice and is deployed by OOM in two parts:

      • a three-way, centralized Consul server cluster, deployed as a highly available monitor of all of the ONAP components, and
      • a number of Consul agents.

The Consul server provides a user interface that allows a user to graphically view the current health status of all of the ONAP components for which agents have been created; a sample from the ONAP Integration labs follows. Monitoring of ONAP components is configured in the agents through JSON files stored in Gerrit under consul-agent-config. Here is an example from the AAI model loader:

aai-model-loader-health.json
{
  "service": {
    "name": "A&AI Model Loader",
    "checks": [
      {
        "id": "model-loader-process",
        "name": "Model Loader Presence",
        "script": "/consul/config/scripts/model-loader-script.sh",
        "interval": "15s",
        "timeout": "1s"
      }
    ]
  }
}

...


Image Added

Heal

OOM deploys ONAP with Kubernetes, defined by deployment specifications as mentioned earlier. These same deployment specifications are also used to implement automatic recoverability of ONAP components when individual components fail. Once ONAP is deployed, a "liveness" probe starts checking the health of the components after a specified startup time. These liveness probes can simply check that a port is available, that a built-in health check is reporting good health, or that the Consul health check is positive. Should a liveness probe indicate a failed container, that container will be terminated and a replacement will be started in its place, since containers are ephemeral. Should the deployment specification indicate that this container or component has one or more dependencies (for example, a dependency on a database), the dependencies will be satisfied before the replacement container/component is started. This mechanism ensures that, after a failure, all of the ONAP components restart successfully.

Liveness probes are used by the Kubernetes manager to monitor the real-time health of the containers in an ONAP deployment. If a liveness probe fails, Kubernetes kills the failed container and starts a new container to replace it. Liveness probes take the same parameters as the readiness probes described in the Deploy section. For example, the following liveness probe for the SDNC component can be found in the SDNC DB deployment specification:

SDNC DB liveness probe
livenessProbe:
  exec:
    command: ["mysqladmin", "ping"]
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5

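'initialDelaySeconds' controls how long Kubernetes waits after the container has started before running the first liveness probe. 'periodSeconds' and 'timeoutSeconds' control the actual operation of the probe: how often it runs, and how long Kubernetes waits for a response before counting the attempt as a failure.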
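The Heal text also mentions probes that check a port or a built-in health check. As a rough sketch of what those variants might look like, the fragment below uses the standard Kubernetes `tcpSocket` and `httpGet` probe handlers. The container names, images, port 8080 and the `/healthz` path are hypothetical placeholders and are not taken from any ONAP deployment specification.

```yaml
# Illustrative pod spec fragment only; names, images, port and path are assumptions.
containers:
  - name: example-port-checked
    image: example/component:latest
    livenessProbe:
      tcpSocket:              # succeeds if a TCP connection to the port can be opened
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
  - name: example-http-checked
    image: example/other-component:latest
    livenessProbe:
      httpGet:                # succeeds on an HTTP status of 200-399
        path: /healthz
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3     # consecutive failures before a restart (Kubernetes default)
```

With values like these, a hung container would typically be restarted after about `failureThreshold` × `periodSeconds`, roughly 30 seconds, once probing has begun.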

Note that containers are inherently ephemeral, so the healing action destroys failed containers and any state information within them. To avoid a loss of state, a persistent volume should be used to store all data that needs to survive the re-creation of a container. Persistent volumes have been created for the database components of each of the projects, and the same technique can be used for all persistent state information.
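As a minimal sketch of the persistent volume technique described above, the following shows a claim and a pod that mounts it. All names, the image, the mount path and the storage size are hypothetical; the actual ONAP specifications may be organised differently.

```yaml
# Illustrative only: a claim for storage that outlives any single container.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-db-data        # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi             # assumed size
---
# A pod that keeps its state on the claimed volume instead of the container filesystem.
apiVersion: v1
kind: Pod
metadata:
  name: example-db
spec:
  containers:
    - name: example-db
      image: example/db:latest # hypothetical image
      volumeMounts:
        - name: db-data
          mountPath: /var/lib/data   # assumed data directory
  volumes:
    - name: db-data
      persistentVolumeClaim:
        claimName: example-db-data
```

When a liveness probe failure causes the container to be replaced, the new container mounts the same claim and finds the data written by its predecessor.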

Note that, during the Amsterdam release, deployment specifications were created for all ONAP components but not all of these deployment specifications are restartable (idempotent).  Further work is required during the Beijing release to ensure recoverability of all the ONAP components.

Clustering and Scaling

In order to avoid the loss of an entire ONAP component through the failure of a single container, OOM enables the creation of clusters of pods hidden behind a load balancer in the form of a service. As an example of what can be done, please refer to the SDN-C Clustering on Kubernetes wiki page.
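The general pattern of several pod replicas behind a single service can be sketched as follows. This is a generic Kubernetes illustration, not the SDN-C configuration: the names, image, port and replica count are assumptions, and `apps/v1` is the current API group, which may differ from the one used in the Amsterdam-era specifications.

```yaml
# Illustrative only: three identical pods managed by one Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-component
spec:
  replicas: 3                  # assumed cluster size
  selector:
    matchLabels:
      app: example-component
  template:
    metadata:
      labels:
        app: example-component
    spec:
      containers:
        - name: example-component
          image: example/component:latest   # hypothetical image
          ports:
            - containerPort: 8080
---
# The Service spreads traffic across whichever replicas are currently ready.
apiVersion: v1
kind: Service
metadata:
  name: example-component
spec:
  selector:
    app: example-component
  ports:
    - port: 8080
      targetPort: 8080
```

If one replica fails and is being replaced, the service continues to route requests to the remaining pods. Stateful components usually need additional clustering support of their own.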

...