DONE

We use the vFW use case as the baseline for this test:

Prerequisite: Instantiate a vFW with the closed loop running.

  • Error detection is very fast: less than 1 second.
  • Recovery:
    • Killing a docker container: the system normally returns to a normal state in less than 1 minute (SDNC and APPC take up to 5 minutes).
    • Deleting the pod: recovery normally takes much longer, especially for SDNC and APPC (up to 15 minutes).
  • Note: A Helm upgrade sometimes broke the whole system and left it unusable. However, we think this may not be a normal use case for a production environment.
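
The recovery times above can be measured by polling a health check until it succeeds. The following is a minimal sketch, not the script used for these tests: the function name `time_to_recover`, its arguments, and the polling approach are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: poll a check command until it succeeds and print the elapsed seconds.
# The function name and argument order are hypothetical.
# usage: time_to_recover LIMIT_SECONDS INTERVAL_SECONDS CHECK_COMMAND [ARGS...]
time_to_recover() {
    limit=$1
    interval=$2
    shift 2
    start=$(date +%s)
    while :; do
        now=$(date +%s)
        elapsed=$((now - start))
        # Run the check; its own output is discarded.
        if "$@" >/dev/null 2>&1; then
            echo "$elapsed"
            return 0
        fi
        # Give up once the limit is reached and report failure.
        if [ "$elapsed" -ge "$limit" ]; then
            echo "$elapsed"
            return 1
        fi
        sleep "$interval"
    done
}

# Example (not run here): delete a pod, then wait up to 15 minutes for a pod
# with the same name to report Ready. This fits pods such as dev-appc-0,
# which keep their name across restarts.
#   kubectl delete pod dev-appc-0 -n onap
#   time_to_recover 900 5 kubectl wait --for=condition=Ready pod/dev-appc-0 -n onap --timeout=1s
```

The printed value has a resolution of one second plus the polling interval, so it is only suitable for recovery times in the range of seconds to minutes, not for the sub-second error detection noted above.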
Crash Time (EDT) | Categories | Sub-Categories (In Error Mode Component) | Time to Detect Failure and Repair | Pass? | Notes

VNF | vFW-VNF | Maybe leave the first three to project team? | vSink | OpenStack error (out of memory??)

ONAP Component

Check:

  1. Is Memory Data Kept?
  2. Is File System/db Data Kept?
A&AI, APPC, DCAE, DMaaP, MultiVIM, Portal, Policy (Policy documentation: Policy on OOM), SDNC, SO, VFC, VID

VNF Onboarding and Distribution

SDC | < 5 minutes | Pass

Timing?? 30 minutes. Use a script to kill those components randomly while continuing to onboard VNFs.

ete-k8s.sh onap healthdist

After kicking off the command, we waited 1 minute and then killed SDC.

The first distribution failed; we then redistributed, and it succeeded.
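
The random-kill approach mentioned above could be sketched as follows. This is an illustration under assumptions, not the script that was used: `pick_random` is a hypothetical helper, and the component name fragments in the usage comments are guesses at how pod names are matched.

```shell
#!/bin/sh
# Sketch: print one of the arguments, chosen pseudo-randomly.
# pick_random is a hypothetical helper name.
pick_random() {
    [ "$#" -gt 0 ] || return 1
    # awk's srand() seeds from the time of day, so two calls within the
    # same second return the same choice.
    n=$(awk -v max="$#" 'BEGIN { srand(); print int(rand() * max) + 1 }')
    shift $((n - 1))
    echo "$1"
}

# Example (not run here): delete one pod of a randomly chosen component
# while "ete-k8s.sh onap healthdist" is running. The name fragments below
# are assumptions about the pod naming in the onap namespace.
#   target=$(pick_random sdc so aai sdnc)
#   pod=$(kubectl get pods -n onap -o name | grep -- "-$target" | head -n 1)
#   kubectl delete "$pod" -n onap
```

Because the seed changes only once per second, a loop that kills components repeatedly should sleep between iterations, which the test procedure does anyway while waiting for recovery.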


SO | < 5 minutes | Pass

After kicking off the command, we waited 1 minute and then killed SO.

The first distribution failed; we then redistributed, and it succeeded.


A&AI | < 5 minutes | Pass
  1. Killed aai-modelloader; the task finished in 3:04 minutes.
  2. Killed two aai-cassandra pods; the task finished in ~1 minute.

SDNC | < 8 minutes | Pass
  1. Ran preload using scripts.

Deleted the SDNC pod. It took a very long time to come back, possibly because of network issues. We also ended up with a very "weird" system, in which SDC gave the following error:

[Image: SDC error screenshot]

SDNC | < 5 minutes | Pass
  1. Deleted one of the SDNC containers, e.g. sdnc-0.
  2. Ran health and preload.

Check with Brain.

DCAE | Check with Ron for blueprint distribution



VNF Instantiation: SDC, VID, SO, A&AI, SDNC, MultiVIM
  • SDC: ete-k8s.sh onap instantiate (could we decouple this from the onboarding part??)

Closed Loop: DCAE, DMaaP, Policy (Policy documentation: Policy on OOM), A&AI, APPC
  • DCAE: Pre-define this closed loop manually.
  • APPC | < 2 seconds | Pass | Tested by manually killing the docker container.

VID | < 1 minute | Pass
  1. kubectl delete pod dev-vid-6d66f9b8c-9vdlt -n onap # back in 1 minute
  2. kubectl delete pod dev-vid-mariadb-fc95657d9-wqn9s -n onap   # back in 1 minute
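
Pods such as the two above come back under a new name after deletion, because only the last two dash-separated segments (ReplicaSet hash and per-pod suffix) change. A minimal sketch of deriving the stable Deployment name from a pod name follows; `deployment_of` is a hypothetical helper, and it assumes the standard Kubernetes naming for Deployment-managed pods.

```shell
#!/bin/sh
# Sketch: strip the per-pod suffix and the ReplicaSet hash from a pod name.
# deployment_of is a hypothetical helper; it does not apply to pods whose
# names end in an ordinal, such as dev-appc-0.
deployment_of() {
    name=${1%-*}        # drop the per-pod suffix, e.g. "-9vdlt"
    echo "${name%-*}"   # drop the ReplicaSet hash, e.g. "-6d66f9b8c"
}

deployment_of dev-vid-6d66f9b8c-9vdlt
deployment_of dev-vid-mariadb-fc95657d9-wqn9s

# Example (not run here): wait for the replacement pod of a Deployment.
#   kubectl rollout status "deployment/$(deployment_of dev-vid-6d66f9b8c-9vdlt)" -n onap
```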

SO | 5 minutes | Pass | The SO pod restarted as part of hard rebooting 2 of the 9 k8s VMs.

A&AI | 20 minutes | Pass

Restarted aai-model-loader, aai-hbase, and aai-sparky-be due to hard rebooting 2 more k8s VMs.

Recovery probably took extra time because many other pods were restarting at the same time and needed time to converge.


SDNC | 5 minutes | Pass | The sdnc pods restarted as part of hard rebooting 2 of the 9 k8s VMs.

MultiVIM | < 5 minutes | Pass | Deleted the multicloud pods and verified that the new pods that came up can orchestrate VNFs as usual.

Closed Loop (Pre-installed manually)

DCAE | < 5 minutes | Pass

Deleted dep-dcae-ves-collector-767d745fd4-wk4ht. No discernible interruption to closed loop. Pod restarted in 1 minute.

Deleted dep-dcae-tca-analytics-d7fb6cffb-6ccpm. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-dcae-db-0. Closed loop failed after about 1 minute. Pod restarted in 2 minutes. Closed loop then suffered from intermittent packet gaps and only recovered after rebooting the packet generator. The most likely cause is an intermittent network problem or an issue within the packet generator.

Deleted dev-dcae-redis-0. No discernible interruption to closed loop. Pod restarted in 2 minutes.


DMaaP | 10 seconds | Pass

Deleted dev-dmaap-bus-controller-657845b569-q7fr2. No discernible interruption to closed loop. Pod restarted in 10 seconds.

Policy (Policy documentation: Policy on OOM) | 15 minutes | Pass

Deleted dev-pdp-0. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-drools-0. Closed loop failed immediately. Pod restarted in 2 minutes. Closed loop recovered in 15 minutes.

Deleted dev-pap-5c7995667f-wvrgr. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-policydb-5cddbc96cf-hr4jr. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-nexus-7cb59bcfb7-prb5v. No discernible interruption to closed loop. Pod restarted in 2 minutes.


A&AI | Never | Fail

Deleted aai-modelloader. Closed loop failed immediately. Pod restarted in < 5 minutes. Closed loop never recovered.

--- the rest done on a different instance ---

Deleted dev-aai-55b4c4f4d6-c6hcj. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-aai-babel-6f54f4957d-h2ngd. No discernible interruption to closed loop. Pod restarted in < 5 minutes.

Deleted dev-aai-cassandra-0. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-aai-data-router-69b8d8ff64-7qvjl. After two minutes all packets were shut off, recovered in 5 minutes (maybe intermittent network or packet generator issue). Pod restarted in 2 minutes.

Deleted dev-aai-hbase-5d9f9b4595-m72pf. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-aai-resources-5f658d4b64-66p7b. Closed loop failed immediately. Pod restarted in 2 minutes. Closed loop never recovered.


APPC (3-node cluster) | 20 minutes | Pass

Deleted dev-appc-0. Closed loop failed immediately. dev-appc-0 pod restarted in 15 minutes. Closed loop recovered in 20 minutes.

Deleted dev-appc-cdt-57548cf886-8z468. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-appc-db-0. No discernible interruption to closed loop. Pod restarted in 3 minutes.

Requirement

Area: Resiliency

Priority: High

Min. Level:
  • Level 2 – run-time projects
  • Level 1 – remaining projects

Stretch Goal:
  • Level 3 – run-time projects
  • Level 2 – remaining projects

Level Descriptions (Abbreviated):
  • 1 – manual failure and recovery (< 30 minutes)
  • 2 – automated detection and recovery (single site) (< 30 minutes)
  • 3 – automated detection and recovery (geo redundancy)
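
As a worked illustration of the 30-minute limit in the level descriptions, the sketch below checks a measured recovery time against it. The helper name `meets_recovery_target` is hypothetical, and treating the limit as the only criterion is an assumption; the Pass column in the results above may have been judged on more than timing.

```shell
#!/bin/sh
# Sketch: succeed if a recovery time in seconds is under the 30-minute limit.
# meets_recovery_target is a hypothetical helper; non-numeric input such as
# "Never" counts as a miss.
RECOVERY_LIMIT_SECONDS=$((30 * 60))

meets_recovery_target() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a plain number
    esac
    [ "$1" -lt "$RECOVERY_LIMIT_SECONDS" ]
}

# Closed-loop recovery times reported above, converted to seconds.
for entry in "DMaaP:10" "Policy:900" "APPC:1200" "A&AI:Never"; do
    component=${entry%%:*}
    seconds=${entry#*:}
    if meets_recovery_target "$seconds"; then
        echo "$component: within limit"
    else
        echo "$component: limit missed"
    fi
done
```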