
The intent of the 72 hour stability test is not to exhaustively test all functions but to run a steady load against the system and look for issues like memory leaks that aren't found in the short duration install and functional testing during the development cycle.

This page will collect notes on the 72 hour stability test run for Frankfurt.

See El Alto Stability Run Notes for comparison to previous runs.


Summary of Results


WORK IN PROGRESS



Setup


The integration-longevity tenant in the Intel/Windriver environment was used for the 72 hour tests.

The onap-ci job "Project windriver-longevity-release-manual" was used for the deployment, with the OOM branch set to frankfurt and the Integration branch set to master. Integration master was used so we could catch the latest updates to the integration scripts and VNF heat templates.

The jenkins job needs a couple of updates for each release:

  1. Set the integration branch to 'origin/master'
  2. Modify the parameters to deploy.sh to specify "-i master" and "-o frankfurt" so that integration master and oom frankfurt are cloned onto the nfs server.

The path for robot logs on dockerdata-nfs changed in Frankfurt, so /dev-robot/ becomes /dev/robot.

The stability tests used robot container image 1.6.1-STAGING-20200519T201214Z.


robot container updates:

API_TYPE was set to GRA_API since we have deprecated VNF_API.



Shakedown consists of creating temporary tags for stability72hrvLB, stability72hrvVG and stability72hrVFWCL to make sure each sub test ran successfully (including cleanup) in the environment before the jenkins job started with the higher level testsuite tag stability72hr, which covers all three test types.


Clean out the old build jobs using a jenkins console script (manage jenkins):

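// Run from the script console under "manage jenkins".
def jobName = "windriver-longevity-stability72hr"
def job = Jenkins.instance.getItem(jobName)
// Delete every recorded build of the job.
job.getBuilds().each { it.delete() }
// Restart build numbering at 1 and persist the change.
job.nextBuildNumber = 1
job.save()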


appc.properties updated to apply the fix for DMaaP message processing to call http://localhost:8181 for the streams update.


VNF Orchestration Tests

This test uses the onap-ci job "Project windriver-longevity-stability72hr" to automatically onboard, distribute and instantiate the ONAP opensource test VNFs vLB, vVG and vFWCL.

The scripts run validation tests after the install.

The scripts then delete the VNFs and clean up the environment for the next run.

The script tests AAF, DMaaP, SDC, VID, AAI, SO, SDNC, APPC with the open source VNFs.


There was a problem with the robot scripts for vLB: they could not find the base_lb.yaml file in the artifacts because the artifact structure had changed. A two-line change to the vnf orchestration script to look for the 'heat3' key resolved the issue. A Jira was created to track the changes to the robot scripts: INT-1598.


These tests started at jenkins job #1


Each test run generates over 500 MB of robot framework data.


Each test run also runs the kubectl top nodes command to see cpu and memory utilization across the k8 cluster.

We also periodically run the top pods command to check which pods are using the most memory and cpu.
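The ranking done by hand with `sort` and `head` in the snapshots on this page can also be scripted, which makes it easier to keep snapshots for later comparison. The following is a minimal sketch, assuming the output has been captured as text with `NAME CPU MEMORY` columns using `m` and `Mi` suffixes as in the snapshots here; the function names are illustrative and not part of the test tooling.

```python
def parse_top_pods(text):
    """Parse `kubectl top pods` rows into (pod, cpu_millicores, memory_mib) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and anything not in NAME CPU MEMORY form.
        if len(parts) != 3 or not parts[1].endswith("m") or not parts[2].endswith("Mi"):
            continue
        rows.append((parts[0], int(parts[1][:-1]), int(parts[2][:-2])))
    return rows


def top_by_memory(rows, count):
    """Return the `count` rows with the highest memory usage, largest first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:count]


# Three rows copied from the Wed May 20 snapshot on this page.
sample = """\
NAME CPU(cores) MEMORY(bytes)
dev-sdnc-2 71m 1868Mi
dev-appc-0 7m 2901Mi
dev-cassandra-0 73m 2417Mi
"""
ranked = top_by_memory(parse_top_pods(sample), 2)
print(ranked)
```

Rows that report memory in another unit would be skipped by this sketch, so it would need extending before being used on clusters with much larger pods.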

http://10.12.6.182:8080/jenkins/job/windriver-longevity-stability72hr/

Test #   Comment   Message

k8 utilization

Wed May 20 18:45:15 UTC 2020

Memory:
root@long-nfs:~/oom/kubernetes/robot# kubectl -n onap top pods | sort -rn -k 3 | head -25
dev-appc-0 7m 2901Mi
dev-portal-cassandra-59f5cb4cf5-9phmg 159m 2777Mi
dev-appc-2 10m 2705Mi
dev-appc-1 19m 2681Mi
dev-cassandra-0 73m 2417Mi
dev-cassandra-2 48m 2394Mi
dev-cassandra-1 70m 2391Mi
dev-sdnc-2 71m 1868Mi
dev-policy-59f48bd84b-q2fp8 7m 1820Mi
dev-sdnc-0 139m 1627Mi
dev-sdnc-1 26m 1574Mi
dev-vid-5b7558dcdc-rx2d7 9m 1510Mi
dev-clamp-dash-es-6cb85979b5-cvrcs 32m 1480Mi
dev-awx-0 244m 1434Mi
dev-aai-elasticsearch-55b56f855c-f5pp5 2m 1422Mi
dev-sdc-be-77d55774f5-zkfrt 6m 1381Mi
dev-dcae-cloudify-manager-6f854859f9-ctdcv 90m 1312Mi
dep-dcae-tca-analytics-55dbd5cd9d-fsm89 511m 1262Mi
dev-aaf-cass-7d55bfc874-sqcdq 6m 1244Mi
dev-aai-traversal-847c4c6994-qbpst 3m 956Mi
dev-so-bpmn-infra-7b58b75b76-n59sf 5m 953Mi
dev-message-router-zookeeper-2 2m 946Mi
dev-aai-resources-74dd6994d4-nh24m 5m 869Mi
dev-aai-graphadmin-65db8cfc67-svvkd 2m 836Mi
dev-music-cassandra-2 147m 801Mi

#1

TOOLING

Startup issues - modified customer uuid to shorten the string in the tooling since it looked like robot selenium was having trouble "seeing" the string in the drop down.

vDNS: NoSuchElementException: Message: Could not locate element with visible text: ETE_Customer_aaaf3926-d765-4c47-93b9-857e674d2d01

vVG: NoSuchElementException: Message: Could not locate element with visible text: ETE_Customer_08f8a099-3e2b-480f-8153-5b4173d9394a

vFW: Succeeded


#4

ENV

${vnf} = vFWCLvPKG

The robot heat bridge run after the deployment failed while trying to find the stack in openstack. This usually means that openstack was slow in deploying the VNF. Heatbridge had succeeded for the vFWCLvSNK inside the same service instantiate.

Keyword 'Get Deployed Stack' failed after retrying for 10 minutes. The last error was: KeyError: 'stack'
#13

ENV

${vnf} = vFWCLvPKG

The robot heat bridge run after the deployment failed while trying to find the stack in openstack. This usually means that openstack was slow in deploying the VNF. Heatbridge had succeeded for the vFWCLvSNK inside the same service instantiate.

Keyword 'Get Deployed Stack' failed after retrying for 10 minutes. The last error was: KeyError: 'stack'
#14

TOOLING or ENV

The vDNS and vVG robot scripts couldn't find elements in the GUI drop downs. Likely transient networking issues. vFW succeeded, and all three are in the same test run (vDNS, vVG, vFW in that order).

vDNS : Keyword 'Wait For Model' failed after retrying for 3 minutes. The last error was: Element 'xpath=//tr[td/span/text() = 'vLB 2020-05-20 13-06-03']/td/button[contains(text(),'Deploy')]' not visible after 1 minute.


vVG: NoSuchElementException: Message: Could not locate element with visible text: ETE_Customer_9f739343-cbc7-4ee4-8697-ea52f06e7796


vFW Succeeded

#15

TOOLING

Virtual Volume Group - Failure in robot selenium to find customer in search window. Timing issue.

NoSuchElementException: Message: Could not locate element with visible text: ETE_Customer_26e85655-1f44-4e7e-8cd2-e9fab290af01
#17

ENV   or TOOLING

Failure in robot selenium at the second VNF in the service package. Robot likely needs tuning to wait longer for the module name to appear in the drop down under transient conditions.

Element 'xpath=//div[contains(.,'Ete_vFWCLvPKG_f716b1bd_1')]/div/button[contains(.,'Add VF-Module')]' did not appear in 1 minute.
#18

ENV

K8 worker node problem. kubectl top nodes listed k8s-04 as unknown.

k8s-04 is on 10.12.6.0, which could be a contributing factor: .0 and .32 addresses in windriver have suspect behavior.

The worker going down caused a set of containers to be restarted, which is the right behavior from a k8 standpoint. The test could not run while the robot container was down.

12:00:25 Instantiate Virtual DNS GRA command terminated with exit code 137
12:22:22 + retval=137
12:22:22 ++ echo 'kubectl exec -n onap dev-robot-56c5b65dd-dkks4 -- ls -1t /share/logs | grep stability72hr | head -1'
12:22:22 ++ ssh -i /var/lib/jenkins/.ssh/onap_key ubuntu@10.12.5.205 sudo su
12:22:25 error: unable to upgrade connection: container not found ("robot")
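The exit code 137 in the log above follows the common shell convention of 128 plus a signal number, which gives signal 9 (SIGKILL). That is consistent with the robot container being killed when the worker node went down. A small sketch for decoding such codes, assuming a Linux host and the 128-plus-signal convention; the helper name is illustrative.

```python
import signal


def describe_exit_code(code):
    """Explain a shell-style exit status; values above 128 conventionally mean 'killed by signal'."""
    if code > 128:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {number} ({name})"
    return f"exited with status {code}"


print(describe_exit_code(137))
```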

#19

#20

TOOLING

k8 restarted the robot pod, which removed the manual fixes to vnf_orchestration_test_template for the heat3 parsing issues.

Reapplied the manual fixes so that parsing the sdc artifacts to find the base_vlb resource succeeded again.

Unable to find catalog resource for vLB base_vlb'


#32

TOOLING 

Robot script did not find the subscriber name in the search results.

Likely a timing issue: robot looks for the json data in the drop down before it is fully loaded.

Create Service Instance → vid_interface . Click On Element When Visible //select[@prompt='Select Subscriber Name'
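Several of the TOOLING failures in these notes are attributed to robot looking for an element before the page has finished loading. The usual remedy is to poll with a timeout instead of checking once. Below is a minimal sketch of that pattern in Python. `wait_until` and `FakeDropDown` are hypothetical names standing in for the real selenium calls, and this is not the code used by the robot testsuite.

```python
import time


def wait_until(condition, timeout_s, interval_s=0.5):
    """Call `condition` repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} s")
        time.sleep(interval_s)


class FakeDropDown:
    """Hypothetical stand-in for a drop down whose options arrive after a few polls."""

    def __init__(self, polls_before_loaded, options):
        self._remaining = polls_before_loaded
        self._options = options

    def visible_options(self):
        if self._remaining > 0:
            self._remaining -= 1
            return []
        return self._options


drop_down = FakeDropDown(polls_before_loaded=3, options=["ETE_Customer_example"])
options = wait_until(drop_down.visible_options, timeout_s=5, interval_s=0.01)
print(options)
```

Polling on a monotonic clock keeps the timeout independent of wall-clock adjustments on the test host.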
#35

ENV

vDNS instantiate failed at the openstack stage. A slow openstack potentially caused SO to resubmit a request that then became a duplicate from the openstack perspective.

This looks like a functional bug in the SO to Openstack interaction that was triggered by the environment; it is not stability related.

CREATE failed: Conflict: resources.vlb_0_onap_private_port_0: IP address 10.0.211.24 already allocated in subnet be057760-1ffa-4827-a6df-75d355c4d45a\nNeutron server returns request_ids: ['req-ca6e5f39-7462-47c6-aaa8-9653783828cb']

#37

ENV

vVG and vFW failed on VID screen errors while looking for data items. Investigation shows that the aai-traversal pod restarted. It looks like slow networking caused the pod to be redeployed, but this is not conclusive. Initially SO and VID failed the health check until aai-traversal was up; then both passed.



Thu May 21 12:33:45 UTC 2020

Memory:

root@long-nfs:/home/ubuntu# kubectl -n onap top pod | sort -rn -k3 | head -20
dev-appc-0 7m 2834Mi
dev-portal-cassandra-59f5cb4cf5-9phmg 152m 2780Mi
dev-appc-1 19m 2700Mi
dev-appc-2 10m 2694Mi
dev-cassandra-2 15m 2449Mi
dev-cassandra-1 21m 2434Mi
dev-vid-5b7558dcdc-rx2d7 16m 1786Mi
dev-sdnc-2 64m 1664Mi
dev-sdnc-0 131m 1631Mi
dev-sdc-be-77d55774f5-zkfrt 9m 1578Mi
dev-sdnc-1 29m 1566Mi
dev-awx-0 291m 1524Mi
dev-clamp-dash-es-6cb85979b5-cvrcs 37m 1496Mi
dep-dcae-tca-analytics-55dbd5cd9d-fsm89 664m 1318Mi
dev-dcae-cloudify-manager-6f854859f9-ctdcv 76m 1302Mi
dev-aaf-cass-7d55bfc874-sqcdq 5m 1250Mi
dev-cds-blueprints-processor-7fd988d584-mvdkz 40m 1228Mi
dev-message-router-zookeeper-1 5m 1127Mi
dev-message-router-zookeeper-0 6m 1023Mi
dev-so-bpmn-infra-7b58b75b76-n59sf 8m 941Mi




#38

ENV

vDNS - Timeout waiting for model to be visible via Deploy button in VID

vVG and vFW Succeeded

Transient Slowness since the 2nd and 3rd VNF succeeded.

Keyword 'Wait For Model' failed after retrying for 3 minutes. The last error was: TypeError: object of type 'NoneType' has no len()
#47

TOOLING

vDNS - Selenium error seeing the Subscriber Name

vVG and vFW worked.

Transient

vid_interface . Click On Element When Visible //select[@prompt='Select Subscriber Name']

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document


Fri May 22 03:41:11 UTC 2020

root@long-nfs:/home/ubuntu# kubectl -n onap top pod | sort -nr -k 3 | head -20
dev-appc-0 7m 2839Mi
dev-portal-cassandra-59f5cb4cf5-9phmg 127m 2781Mi
dev-appc-2 11m 2702Mi
dev-appc-1 39m 2576Mi
dev-cassandra-2 62m 2517Mi
dev-cassandra-1 69m 2502Mi
dev-cassandra-0 64m 2433Mi
dev-vid-5b7558dcdc-rx2d7 10m 2050Mi
dev-policy-59f48bd84b-6h4xt 23m 1892Mi
dev-sdnc-0 154m 1622Mi
dev-sdnc-2 89m 1586Mi
dev-sdnc-1 25m 1566Mi
dev-awx-0 351m 1525Mi
dev-clamp-dash-es-6cb85979b5-cvrcs 52m 1504Mi
dev-pdp-0 4m 1434Mi
dev-aai-elasticsearch-55b56f855c-qbzfl 11m 1428Mi
dep-dcae-tca-analytics-55dbd5cd9d-fsm89 452m 1380Mi
dev-dcae-cloudify-manager-6f854859f9-ctdcv 88m 1345Mi
dev-cds-blueprints-processor-7fd988d584-mvdkz 38m 1286Mi
dev-sdc-be-77d55774f5-zkfrt 7m 1253Mi



#53

ENV

vDNS instantiate failed at the openstack stage. A slow openstack potentially caused SO to resubmit a request that then became a duplicate from the openstack perspective.

This looks like a functional bug in the SO to Openstack interaction that was triggered by the environment; it is not stability related.

vVG and vFW Succeeded in same test.

STATUS: Received vfModuleException from VnfAdapter: category='INTERNAL' message='Exception during create VF org.onap.so.openstack.utils.StackCreationException: Stack Creation Failed Openstack Status:

CREATE_FAILED Status Reason: Resource CREATE failed: Conflict: resources.vlb_0_onap_private_port_0: IP address 10.0.250.24 already allocated in subnet be057760-1ffa-4827-a6df-75d355c4d45a\nNeutron server returns request_ids


Fri May 22 09:35:28 UTC 2020

root@long-nfs:/home/ubuntu# kubectl -n onap top pod | sort -nr -k 3 | head -20
dev-appc-0 6m 2837Mi
dev-portal-cassandra-59f5cb4cf5-9phmg 125m 2792Mi
dev-appc-2 10m 2704Mi
dev-appc-1 26m 2568Mi
dev-cassandra-1 71m 2501Mi
dev-cassandra-2 62m 2499Mi
dev-cassandra-0 54m 2448Mi
dev-vid-5b7558dcdc-rx2d7 9m 2074Mi
dev-policy-59f48bd84b-6h4xt 19m 1880Mi
dev-sdnc-0 108m 1620Mi
dev-sdnc-2 71m 1586Mi
dev-sdnc-1 29m 1568Mi
dev-awx-0 239m 1523Mi
dev-clamp-dash-es-6cb85979b5-cvrcs 44m 1513Mi
dev-sdc-be-77d55774f5-zkfrt 39m 1439Mi
dev-pdp-0 3m 1436Mi
dev-aai-elasticsearch-55b56f855c-qbzfl 6m 1423Mi
dep-dcae-tca-analytics-55dbd5cd9d-fsm89 425m 1391Mi
dev-cds-blueprints-processor-7fd988d584-mvdkz 27m 1375Mi
dev-dcae-cloudify-manager-6f854859f9-ctdcv 86m 1311Mi


root@long-nfs:/home/ubuntu# kubectl -n onap top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
long-k8s-01 699m 8% 15077Mi 94%
long-k8s-02 1688m 21% 13367Mi 83%
long-k8s-03 166m 2% 6085Mi 38%
long-k8s-04 919m 11% 14554Mi 91%
long-k8s-05 636m 7% 12823Mi 80%
long-k8s-06 905m 11% 14291Mi 89%
long-k8s-07 480m 6% 8883Mi 55%
long-k8s-08 842m 10% 13220Mi 82%
long-k8s-09 1692m 21% 5594Mi 35%
long-orch-1 228m 11% 1454Mi 37%
long-orch-2 212m 10% 1350Mi 35%
long-orch-3 129m 6% 1260Mi 32%
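Since the stated purpose of the run is to surface problems such as memory leaks, it can help to compare per-pod memory across the snapshots rather than reading each one in isolation. The sketch below uses values copied from the four top pod snapshots on this page for three pods. The function names and the 200 Mi threshold are illustrative assumptions, and a flagged pod is only a candidate for a closer look, not evidence of a leak.

```python
# Memory (Mi) per pod, copied from the four `kubectl top pod` snapshots on this page,
# in chronological order (May 20 18:45, May 21 12:33, May 22 03:41, May 22 09:35 UTC).
snapshots = {
    "dev-appc-0": [2901, 2834, 2839, 2837],
    "dev-vid-5b7558dcdc-rx2d7": [1510, 1786, 2050, 2074],
    "dev-awx-0": [1434, 1524, 1525, 1523],
}


def memory_growth(series):
    """Return (absolute growth in Mi, True if usage never decreased between snapshots)."""
    monotonic = all(later >= earlier for earlier, later in zip(series, series[1:]))
    return series[-1] - series[0], monotonic


def flag_candidates(data, min_growth_mib):
    """List pods whose memory never decreased and grew by at least `min_growth_mib`."""
    flagged = []
    for pod, series in data.items():
        growth, monotonic = memory_growth(series)
        if monotonic and growth >= min_growth_mib:
            flagged.append((pod, growth))
    return flagged


# The 200 Mi threshold is an arbitrary illustration value, not a project guideline.
print(flag_candidates(snapshots, 200))
```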

#58

ENV

ODL cluster communication error on vFW preload. This type of error usually is associated with network latency issues between nodes. Akka configuration should be evaluated to loosen up the timeout settings for public cloud or other slow environments. Discuss with Dan

O Get Request using : alias=sdnc, uri=/restconf/config/VNF-API:preload-vnfs/vnf-preload-list/Vfmodule_Ete_vFWCLvFWSNK_e401f06d_0/VfwclVfwsnkA143de8bE20f..base_vfw..module-0, headers={'X-FromAppId': 'robot-ete', 'X-TransactionId': '922f999d-2444-4bcd-b5ad-60fbf553735d', 'Content-Type': 'application/json', 'Accept': 'application/json'} json=None

04:36:17.031 INFO Received response from [sdnc]: {"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Error executeRead ReadData for path /(org:onap:sdnctl:vnf?revision=2015-07-20)preload-vnfs/vnf-preload-list/vnf-preload-list[{(org:onap:sdnctl:vnf?revision=2015-07-20)vnf-type=VfwclVfwsnkA143de8bE20f..base_vfw..module-0, (org:onap:sdnctl:vnf?revision=2015-07-20)vnf-name=Vfmodule_Ete_vFWCLvFWSNK_e401f06d_0}]","error-info":"Shard member-2-shard-default-config currently has no leader. Try again later."}]}}

https://{{sdnc_ssl_port}}/jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=ShardManager,name=shard-manager-operational


cluster health
{
    "request": {
        "mbean": "org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore",
        "type": "read"
    },
    "value": {
        "LocalShards": [
            "member-3-shard-default-operational",
            "member-3-shard-prefix-configuration-shard-operational",
            "member-3-shard-topology-operational",
            "member-3-shard-entity-ownership-operational",
            "member-3-shard-inventory-operational",
            "member-3-shard-toaster-operational"
        ],
        "SyncStatus": true,
        "MemberName": "member-3"
    },
    "timestamp": 1590141147,
    "status": 200
}
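A response like the one above can be checked in a script instead of by eye. The following is a minimal sketch that parses an abbreviated copy of that response (two of the six shards kept) and reports the member name, the sync flag and the number of local shards. The function name is illustrative, and the check covers only what this one response contains; it says nothing about which member currently leads a given shard.

```python
import json

# Abbreviated copy of the shard-manager response shown above (two of the six shards kept).
response_text = """
{
    "request": {
        "mbean": "org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore",
        "type": "read"
    },
    "value": {
        "LocalShards": [
            "member-3-shard-default-operational",
            "member-3-shard-topology-operational"
        ],
        "SyncStatus": true,
        "MemberName": "member-3"
    },
    "timestamp": 1590141147,
    "status": 200
}
"""


def summarize_shard_manager(text):
    """Return (member name, in-sync flag, local shard count) from a Jolokia read response."""
    body = json.loads(text)
    if body.get("status") != 200:
        raise ValueError(f"Jolokia read failed with status {body.get('status')}")
    value = body["value"]
    return value["MemberName"], bool(value["SyncStatus"]), len(value["LocalShards"])


print(summarize_shard_manager(response_text))
```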




Interim Status on VNF Orchestration


Notice the improved test duration after the K8 node automated reconfiguration to move loads off k8s-04.

We will run final numbers at the end of the test but most of the problems appear to be environment and tooling issues.
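As an interim illustration of that split, the classifications recorded in the notes above can be tallied with a few lines of Python. The dictionary below is transcribed from the failed runs listed on this page; run #19 is listed without a classification and is left out. The `MIXED` bucket is an assumption of this sketch for entries marked with both labels.

```python
from collections import Counter

# Classification of each failed run as recorded in the notes above.
# Run #19 is listed without a classification and is left out.
failures = {
    1: "TOOLING", 4: "ENV", 13: "ENV", 14: "TOOLING or ENV", 15: "TOOLING",
    17: "ENV or TOOLING", 18: "ENV", 20: "TOOLING", 32: "TOOLING", 35: "ENV",
    37: "ENV", 38: "ENV", 47: "TOOLING", 53: "ENV", 58: "ENV",
}


def tally(classified):
    """Count failures per category, folding either-order mixed labels into 'MIXED'."""
    counts = Counter()
    for label in classified.values():
        counts["MIXED" if " or " in label else label] += 1
    return counts


print(dict(tally(failures)))
```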



Closed Loop Tests

This test uses the onap-ci job "Project windriver-longevity-vfwclosedloop".

The test uses the robot test script "demo-k8s.sh vfwclosedloop". The script sets the number of streams on the vPacket Generator to 10, waits for the control loop to change the 10 streams to 5, then sets the number of streams to 1 and again waits for 5 streams.

Success tests the loop from the VNF through DCAE, DMaaP, Policy, AAI, AAF and APPC.
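The high-then-low sequence described here can be sketched as follows. This is a simulation only: `SimulatedPacketGenerator` and `simulated_control_loop` are hypothetical stand-ins for the vPacket Generator and the DCAE, Policy and APPC chain, and the real script waits for the control loop to act rather than calling it directly.

```python
class SimulatedPacketGenerator:
    """Hypothetical stand-in for the vPacket Generator; not the real VNF interface."""

    def __init__(self):
        self.streams = 5

    def set_streams(self, count):
        self.streams = count


def simulated_control_loop(pkg, target=5):
    """Stand-in for the control loop, which drives the stream count back to 5."""
    pkg.set_streams(target)


def run_closed_loop_check(pkg, control_loop, target=5):
    """Push the stream count high, then low, and confirm the loop restores the target each time."""
    results = []
    for forced in (10, 1):
        pkg.set_streams(forced)
        control_loop(pkg)  # the real test waits here for the control loop to act
        results.append((forced, pkg.streams == target))
    return results


print(run_closed_loop_check(SimulatedPacketGenerator(), simulated_control_loop))
```

Testing both directions matters because the high and low thresholds are handled by separate policy entries, so one can work while the other is misconfigured.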

In the jenkins job:

Modify NFS_IP and PKG_IP in the jenkins job to point to the current nfs server and packet generator in the tenant.

NFS_IP=10.12.5.205

PKG_IP=10.12.5.247


Initially the policy in the TCA Key Value store was not in sync with Policy because of the Demo VNF instantiation issue.

Since consul-server-ui is not enabled by default, we had to edit the service to expose consul-server-ui as a NodePort and then go to the ui page to edit the ControlLoop vFW policy to use the same model-invariant-id that was used with the instantiate, so that the A&AI query would succeed.

http://10.12.5.185:32512/ui/#/dc1/kv/dcae-tca-analytics/edit  (note: the nodeport was ephemeral)

closedLoopControlName was edited in two places (for Hi and Low) to specify "ControlLoop-vFirewall-cdf42e53-b49b-4d9f-a621-fa9521111615". "cdf42e53-b49b-4d9f-a621-fa9521111615" was the new, matching model-invariant-id.
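Editing the same value in two places by hand is easy to get half right. The sketch below shows one way to update every closedLoopControlName in a nested structure in one pass. The `policy` dictionary is a simplified, hypothetical shape used only for illustration and is not the actual TCA policy document; only the key name and the new value come from the notes above.

```python
def set_control_loop_name(node, new_name):
    """Recursively replace every 'closedLoopControlName' value; return how many were changed."""
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "closedLoopControlName":
                node[key] = new_name
                changed += 1
            else:
                changed += set_control_loop_name(value, new_name)
    elif isinstance(node, list):
        for item in node:
            changed += set_control_loop_name(item, new_name)
    return changed


# Hypothetical, simplified policy with one entry each for the low and high thresholds.
policy = {
    "thresholds": [
        {"direction": "LESS_OR_EQUAL", "closedLoopControlName": "ControlLoop-vFirewall-old"},
        {"direction": "GREATER_OR_EQUAL", "closedLoopControlName": "ControlLoop-vFirewall-old"},
    ]
}
new_name = "ControlLoop-vFirewall-" + "cdf42e53-b49b-4d9f-a621-fa9521111615"
count = set_control_loop_name(policy, new_name)
print(count)
```

Returning the number of replacements gives a quick check that both the Hi and Low entries were found.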


The tests start with #1

http://10.12.6.182:8080/jenkins/job/windriver-longevity-vfwclosedloop/

Test #   Comment     Message
0-20     No errors
21-40    No errors


Interim Status on closed loop testing ~30% through stability run


