
In progress

We will use the vFW use case as the baseline for this test.

Prerequisite: instantiate a vFW with the closed loop running.

  • Error detection is very fast: less than 1 second.
  • Recovery (example commands for both cases are sketched below):
    • Killing the docker container: the system normally returns to a normal state in less than 1 minute (SDNC and APPC can take up to 5 minutes).
    • Deleting the pod: recovery normally takes much longer, especially for SDNC and APPC (up to 15 minutes).
  • Note: a Helm upgrade sometimes corrupted the whole system and left it unusable. However, we think this is not a normal use case for a production environment.
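
A rough sketch of the two recovery tests above (the pod and container names are examples only; check kubectl get pods -n onap for the real ones):

  # Case 1: kill the docker container in place (run on the k8s worker node hosting it;
  # the grep pattern is illustrative)
  docker ps | grep sdnc            # find the container id
  docker kill <container-id>       # kubelet restarts the container in the same pod

  # Case 2: delete the pod and let Kubernetes reschedule it
  kubectl delete pod dev-sdnc-0 -n onap    # example pod name
  kubectl get pods -n onap -w              # watch until the replacement is Running and Ready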
Results are grouped by category below. Each entry lists the component placed in error mode, the time to detect the failure and repair it, whether the test passed, and notes from the run.

VNF Onboarding and Distribution

  • SDC: < 5 minutes. Pass.
    Timing: roughly 30 minutes overall. A script kills the targeted components at random while VNF onboarding continues (a sketch of this kill loop follows this section). The distribution test is driven by:
      ete-k8s.sh onap healthdist
    After kicking off the command we waited 1 minute and then killed SDC. The first distribution failed; we then redistributed and it succeeded.

  • SO: < 5 minutes. Pass.
    After kicking off the command we waited 1 minute and then killed SO. The first distribution failed; we then redistributed and it succeeded.

  • A&AI: < 5 minutes. Pass.
    1. Killed aai-modelloader; it finished the task in 3:04 minutes.
    2. Killed two aai-cassandra pods; it finished the task in ~1 minute.

  • SDNC: < 8 minutes. Pass.
    1. Ran preload using scripts.
    Deleting the SDNC pod took a very long time to come back, possibly because of network issues, and it left us with a very "weird" system in which SDC gave us an error.

  • SDNC: < 5 minutes. Pass.
    1. Deleted one of the SDNC containers, e.g. sdnc-0.
    2. Ran health and preload.

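A rough sketch of the random-kill loop described above, assuming the OOM robot script ete-k8s.sh is run from its usual directory; the grep pattern used to pick a victim pod is only an example:

  # start the health/distribution robot test in the background
  ./ete-k8s.sh onap healthdist &

  # while the distribution runs, pick one of the target pods at random and delete it
  # (pod name patterns are illustrative; check them with kubectl get pods -n onap)
  sleep 60                                            # let the test get going first
  VICTIM=$(kubectl get pods -n onap -o name | grep -E 'sdc-be|so-|aai-modelloader|sdnc-0' | shuf -n 1)
  kubectl delete -n onap "$VICTIM"
  kubectl get pods -n onap -w                         # watch the replacement pod come back
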

VNF Instantiation

  • SDC: < 2 seconds. Pass.
    Tested by manually killing the docker container.

  • VID: < 1 minute. Pass.
    1. kubectl delete pod dev-vid-6d66f9b8c-9vdlt -n onap   # back in 1 minute
    2. kubectl delete pod dev-vid-mariadb-fc95657d9-wqn9s -n onap   # back in 1 minute

  • SO: 5 minutes. Pass.
    The SO pod restarted as part of hard-rebooting 2 of the 9 k8s VMs.

  • A&AI: 20 minutes. Pass.
    Restarted aai-model-loader, aai-hbase, and aai-sparky-be by hard-rebooting 2 more k8s VMs. Recovery probably took extra time because many other pods were restarting at the same time and took a while to converge (a sketch for timing this convergence follows this section).

  • SDNC: 5 minutes. Pass.
    SDNC pods restarted as part of hard-rebooting 2 of the 9 k8s VMs.

  • MultiVIM: < 5 minutes. Pass.
    Deleted the multicloud pods and verified that the replacement pods can orchestrate VNFs as usual.
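
A rough sketch for measuring convergence time, i.e. how long until every pod in the onap namespace is Running and fully Ready again after a reboot or deletion:

  # crude convergence timer: poll until no pod is left in a non-Running or
  # not-fully-Ready state (Completed job pods are ignored), then print elapsed time
  START=$(date +%s)
  while kubectl get pods -n onap --no-headers | \
        awk '{split($2, r, "/");
              if ($3 != "Completed" && (r[1] != r[2] || $3 != "Running")) n++}
             END {exit !n}'; do
      sleep 10
  done
  echo "converged after $(( $(date +%s) - START )) seconds"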

Closed Loop (pre-installed manually)

  • DCAE: (no results recorded)

  • DMaaP: (no results recorded)

  • Policy (Policy documentation: Policy on OOM): 15 minutes. Pass.
    Deleted dev-pdp-0: no discernible interruption to the closed loop; the pod restarted in 2 minutes.
    Deleted dev-drools-0: the closed loop failed immediately; the pod restarted in 2 minutes and the closed loop recovered in 15 minutes.
    Deleted dev-pap-5c7995667f-wvrgr: no discernible interruption to the closed loop; the pod restarted in 2 minutes.
    Deleted dev-policydb-5cddbc96cf-hr4jr: no discernible interruption to the closed loop; the pod restarted in 2 minutes.
    (A sketch for timing a single pod's restart follows this section.)

  • A&AI: never recovered (observed for > 1 hour). Fail.
    Deleted aai-modelloader: the closed loop failed immediately. Even though the aai-modelloader container restarted within a couple of minutes (when restarted on a VM that already had the image), the closed loop never recovered.

  • APPC (3-node cluster): 20 minutes. Pass.
    Deleted dev-appc-0: the closed loop failed immediately; the dev-appc-0 pod restarted in 15 minutes and the closed loop recovered in 20 minutes.

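A rough sketch of how a single pod's restart can be timed (dev-drools-0 is just one of the pod names from the notes above and can be swapped for any other):

  # delete the pod, give the StatefulSet controller a moment to recreate it,
  # then time how long the replacement takes to become Ready
  kubectl delete pod dev-drools-0 -n onap
  sleep 30
  time kubectl wait --for=condition=Ready pod/dev-drools-0 -n onap --timeout=30m
  # this only measures pod readiness; closed-loop recovery can lag behind it
  # (e.g. 15 minutes for drools vs. 2 minutes for the pod, as recorded above)
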

Requirement

  • Area: Resiliency
  • Priority: High
  • Min. Level: Level 2 – run-time projects; Level 1 – remaining projects
  • Stretch Goal: Level 3 – run-time projects; Level 2 – remaining projects
  • Level Descriptions (Abbreviated):
    • 1 – manual failure and recovery (< 30 minutes)
    • 2 – automated detection and recovery, single site (< 30 minutes)
    • 3 – automated detection and recovery (geo redundancy)
