
...

All highly available systems include at least one facility to monitor the health of components within the system.  Such health monitors are often used as inputs to distributed coordination systems (such as etcd or ZooKeeper) and monitoring systems (such as Nagios or Zabbix).  Within ONAP, Consul is the monitoring system of choice and is deployed by OOM in two parts.  A three-way, centralized Consul server cluster is deployed as a highly available monitor of all of the ONAP components.  The Consul server provides a user interface that allows a user to graphically view the current health status of all of the ONAP components for which agents have been created.  Monitoring of ONAP components is configured in the agents within JSON files and stored in gerrit under the consul-agent-config.
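As a rough illustration of the style of these JSON definitions (the service name and endpoint below are hypothetical, not taken from the actual consul-agent-config repository), a Consul agent health check for an HTTP end-point might look like:

```json
{
  "service": {
    "name": "example-onap-component",
    "check": {
      "http": "http://example-onap-component:8080/healthcheck",
      "interval": "15s",
      "timeout": "1s"
    }
  }
}
```

The `http`, `interval` and `timeout` fields follow Consul's check definition format: the agent polls the URL on the given interval and marks the service critical if the request fails or times out.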

Initially the Consul agents use the same health monitoring facilities as the Robot test infrastructure, which typically just validate that the end-point is reachable.  Some health checks already support more advanced checking, such as validating that a database is able to create, update and delete an entry. Consul exposes an API that allows external agents to use the results of the health checks, such as the Kubernetes "liveness" probes described below.
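To sketch how an external agent might consume those results, the snippet below parses the kind of JSON array returned by Consul's `/v1/health/checks/<service>` endpoint and decides whether the service is healthy. The sample data is fabricated for illustration; only the `Status` values (`passing`, `warning`, `critical`) and the overall response shape come from the Consul HTTP API.

```python
import json

# Hypothetical sample of the JSON array returned by Consul's
# /v1/health/checks/<service> endpoint; the check IDs are invented,
# but the "Status" field and its values follow the Consul HTTP API.
sample_response = json.dumps([
    {"CheckID": "service:example-api-handler", "Status": "passing"},
    {"CheckID": "service:example-db", "Status": "critical"},
])

def service_is_healthy(raw: str) -> bool:
    """Return True only if every health check reports 'passing'."""
    checks = json.loads(raw)
    return all(check["Status"] == "passing" for check in checks)

print(service_is_healthy(sample_response))  # one check is critical -> False
```

A liveness probe built this way would treat any non-passing check as a failure, triggering the restart behaviour described in the next section.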

Component Recoverability

OOM deploys ONAP with Kubernetes as described by the deployment specifications introduced earlier.  These same deployment specifications are also used to implement automatic recovery of ONAP components when individual components fail. Once ONAP is deployed, a "liveness" probe starts checking the health of the components after a specified startup time.  These liveness probes can simply check that a port is available, that a built-in health check is reporting good health, or that the Consul health check is positive. Should a liveness probe indicate a failed container, it will be restarted as described in the deployment specification.  Should the deployment specification indicate that there are one or more dependencies for this container or component (for example, a dependency on a database), the dependency will be satisfied before the container/component is restarted. This mechanism ensures that after a failure all of the ONAP components restart successfully.  Note that during the Amsterdam release deployment specifications were created for all ONAP components, but not all of these deployment specifications are restartable (idempotent), so further work is required during the Beijing release to ensure recoverability of all the ONAP components.
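A minimal sketch of how such a probe appears in a Kubernetes deployment specification follows; the container name, image and port are illustrative only, not taken from an actual OOM chart:

```yaml
# Fragment of the container section of a Kubernetes deployment
# specification; names, image and port are hypothetical.
containers:
  - name: example-onap-component
    image: example/onap-component:latest
    livenessProbe:
      tcpSocket:
        port: 8080            # "check that a port is available"
      initialDelaySeconds: 120  # the specified startup time
      periodSeconds: 10         # how often the probe runs
```

Replacing `tcpSocket` with an `httpGet` stanza, or with an `exec` command that queries Consul, yields the other two probe styles mentioned above.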

Centralized Logging

Intra Component Clustering

...