DONE
We will use the vFW use case as the baseline for this test.
Prerequisite: instantiate a vFW with the closed loop running.
- Error detection is very fast: less than 1 second.
- Recovery:
  - Killing a Docker container: the system normally returns to a normal state in less than 1 minute (SDNC and APPC take up to 5 minutes).
  - Deleting the pod: it normally takes much longer to get back, especially for SDNC and APPC (up to 15 minutes).
- Note: `helm upgrade` sometimes corrupted the whole system, leaving it in an unusable state. However, we think this may not be a normal use case for a production environment.
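The recovery times above were observed by waiting for the affected pod to report ready again. A minimal sketch of such a polling timer, assuming `kubectl` is on the PATH and the `onap` namespace; the helper names and the example pod name are illustrative, not part of the test scripts used here:

```python
import subprocess
import time

def wait_until_ready(check, timeout_s=1800, poll_s=5):
    """Poll a health check until it passes; return the elapsed seconds."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if check():
            return time.monotonic() - start
        time.sleep(poll_s)
    raise TimeoutError(f"component did not recover within {timeout_s}s")

def pod_ready(pod, namespace="onap"):
    """True when every container in the pod reports ready (assumes kubectl on PATH)."""
    out = subprocess.run(
        ["kubectl", "get", "pod", pod, "-n", namespace,
         "-o", "jsonpath={.status.containerStatuses[*].ready}"],
        capture_output=True, text=True,
    ).stdout
    # jsonpath prints e.g. "true true false" -- ready only if no container is false
    return bool(out) and "false" not in out.split()
```

For example, `wait_until_ready(lambda: pod_ready("dev-sdnc-0"))` would time how long SDNC takes to come back after the pod is deleted.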
Test plan notes (recovered from the flattened table):

- Failure categories: VNF-level failures (crash vFW-VNF, vSink, OpenStack error such as out of memory) -- maybe leave these first three to the project team? -- plus failures injected into individual ONAP components.
- ONAP component check covers: A&AI, APPC, DCAE, DMaaP, MultiVIM, Portal, Policy, SDNC, SDC, SO, VFC, VID.
- VNF onboarding and distribution (timing?? ~30 minutes): a script kills the components randomly while VNF onboarding continues, using `ete-k8s.sh onap distribute` and `healthdist`.

Results -- Onboarding and Distribution:

Component in Error Mode | Time to Repair | Pass? | Notes
---|---|---|---
SDC | < 5 minutes | Pass | After kicking off the command, we waited 1 minute and killed SDC; the first distribution failed, and the redistribution succeeded.
SO | < 5 minutes | Pass | After kicking off the command, we waited 1 minute and killed SO; the first distribution failed, and the redistribution succeeded.
A&AI | < 5 minutes | Pass | |
SDNC | < 8 minutes | Pass | Deleting the SDNC pod took a very long time to come back, possibly because of network issues, and left the system in a "weird" state in which SDC returned an error.
(component not recorded) | < 5 minutes | Pass | |
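The "random kill" onboarding test described above can be scripted. A sketch, assuming the components are reachable via kubectl label selectors; the `app=<component>` labels and the dry-run switch are assumptions, not the actual script used in these runs:

```python
import random
import subprocess

# Components exercised in the onboarding and distribution test.
COMPONENTS = ["sdc", "so", "aai", "sdnc"]

def kill_random_component(namespace="onap", dry_run=True):
    """Pick one component at random and delete its pods so Kubernetes
    restarts them; with dry_run the command is only returned, not executed."""
    app = random.choice(COMPONENTS)
    cmd = ["kubectl", "delete", "pod", "-n", namespace, "-l", f"app={app}"]
    if not dry_run:
        subprocess.run(cmd, check=True)  # actually delete the pods
    return " ".join(cmd)
```

While such a loop runs, `ete-k8s.sh onap distribute` continues onboarding; a failed distribution is then retried, as in the SDC and SO rows above.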
Remaining test-plan fragments:

- Run health and preload: APPC, SDNC, DCAE, Policy.
- VNF instantiation: `ete-k8s.sh onap instantiate` (could we decouple this from the onboarding part??); exercises SDC, VID, SO, A&AI, SDNC, MultiVIM.
- Closed loop (predefined manually): DMaaP, Policy, A&AI, APPC, DCAE. Failure detection took < 2 seconds (Pass), tested by manually killing the Docker container.

Results -- VNF Instantiation:

Component in Error Mode | Time to Repair | Pass? | Notes
---|---|---|---
VID | < 1 minute | Pass | |
SO | 5 minutes | Pass | The SO pod restarted as part of hard-rebooting 2 of the 9 k8s VMs.
A&AI | 20 minutes | Pass | Restarted aai-model-loader, aai-hbase, and aai-sparky-be after hard-rebooting 2 more k8s VMs; this probably took extra time because many other pods were restarting at the same time and taking time to converge.
SDNC | 5 minutes | Pass | The SDNC pods restarted as part of hard-rebooting 2 of the 9 k8s VMs.
MultiVIM | < 5 minutes | Pass | Deleted the multicloud pods and verified that the new pods that come up can orchestrate VNFs as usual.

Results -- Closed Loop (pre-installed manually):

Component in Error Mode | Time to Repair | Pass? | Notes
---|---|---|---
DCAE | < 5 minutes | Pass | Deleted dep-dcae-ves-collector-767d745fd4-wk4ht: no discernible interruption to the closed loop; pod restarted in 1 minute. Deleted dep-dcae-tca-analytics-d7fb6cffb-6ccpm: no discernible interruption; pod restarted in 2 minutes. Deleted dev-dcae-db-0: the closed loop failed after about 1 minute; the pod restarted in 2 minutes; the loop then suffered intermittent packet gaps and only recovered after rebooting the packet generator (most likely an intermittent network or packet-generator issue). Deleted dev-dcae-redis-0: no discernible interruption; pod restarted in 2 minutes.
DMaaP | 10 seconds | Pass | Deleted dev-dmaap-bus-controller-657845b569-q7fr2: no discernible interruption to the closed loop; pod restarted in 10 seconds.
Policy (see the Policy on OOM documentation) | 15 minutes | Pass | Deleted dev-pdp-0: no discernible interruption to the closed loop; pod restarted in 2 minutes. Deleted dev-drools-0: the closed loop failed immediately; the pod restarted in 2 minutes; the loop recovered in 15 minutes. Deleted dev-pap-5c7995667f-wvrgr: no discernible interruption; pod restarted in 2 minutes. Deleted dev-policydb-5cddbc96cf-hr4jr: no discernible interruption; pod restarted in 2 minutes. Deleted dev-nexus-7cb59bcfb7-prb5v: no discernible interruption; pod restarted in 2 minutes.
A&AI | Never | Fail | Deleted aai-modelloader: the closed loop failed immediately; the pod restarted in < 5 minutes; the loop never recovered. (The remaining tests were run on a different instance.) Deleted dev-aai-55b4c4f4d6-c6hcj: no discernible interruption to the closed loop; pod restarted in 2 minutes. Deleted dev-aai-babel-6f54f4957d-h2ngd: no discernible interruption; pod restarted in < 5 minutes. Deleted dev-aai-cassandra-0: no discernible interruption; pod restarted in 2 minutes. Deleted dev-aai-data-router-69b8d8ff64-7qvjl: after two minutes all packets were shut off; the loop recovered in 5 minutes (possibly an intermittent network or packet-generator issue); the pod restarted in 2 minutes. Deleted dev-aai-hbase-5d9f9b4595-m72pf: no discernible interruption; pod restarted in 2 minutes. Deleted dev-aai-resources-5f658d4b64-66p7b: the closed loop failed immediately; the pod restarted in 2 minutes; the loop never recovered.
APPC (3-node cluster) | 20 minutes | Pass | Deleted dev-appc-0: the closed loop failed immediately; the pod restarted in 15 minutes; the loop recovered in 20 minutes. Deleted dev-appc-cdt-57548cf886-8z468: no discernible interruption; pod restarted in 2 minutes. Deleted dev-appc-db-0: no discernible interruption; pod restarted in 3 minutes.
Requirement

Area | Priority | Min. Level | Stretch Goal | Level Descriptions (Abbreviated)
---|---|---|---|---
Resiliency | High | Level 2 – run-time projects | Level 3 – run-time projects | • 1 – manual failure and recovery (< 30 minutes)