...

References


# | Jira | Notes | Decision | Status
1 | CPS-2146 | | | Managed by Daniel Hanrahan. See short-term solution below.
2 | CPS-2156 | Very likely related to #1. | Won't investigate separately; apply the short-term solutions mentioned below to #1 and test again. | Blocked by #1. Separate fix, but it will probably contribute to #1 too. CPS Team can close this once the deployment documentation has been updated to reflect this.
3 | CPS-2122 | Very likely related to #1. | Won't investigate separately; apply the short-term solutions mentioned below to #1 and test again. | Blocked by #1. Not reproducible. Does not seem to be an NCMP Server issue; possibly just a one-off general (networking) resource issue. Will be closed.
4 | CPS-2150 | Indirectly related to #1. | Ticket relates to an incorrect timeout limit, not the timeout itself. A separate solution was already proposed and is being tested. | Managed by Priyank Maheshwari; solution currently being tested.
5 | CPS-2139 | Not related. | Investigated by Levente Csanyi.

...

Issues & Decisions

Long term solutions

# | Issue | Notes | Decision
1 | Remove Hazelcast from NCMP Module Sync | Implementation proposal TBA | CPS-2161: Remove Hazelcast from NCMP Module Sync
2 | Java Streams API for CPS and NCMP | CPS-2146 Using Java Streams to reduce memory consumption in CPS and NCMP
3 | Investigate whether the ODL Yang Parser has a memory leak (if so, likely only a minor issue, as CPS has its own cache wrapping the ODL Yang Parser) | CPS-2000
4 | Remove Hazelcast for Trust Level
5 | Remove use of Postgres arrays in Repository methods | CPS-1574: Remove 32K limit from DB operations (See Proposal 1)
6 | Replace Hibernate with JDBC (via Spring Data JDBC)
7 | Review memory use during UPDATE operations | Study TBA
8 | Investigate memory usage of Yang Resource repository | Study TBA
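
To illustrate the general idea behind the Java Streams item in the list above, here is a minimal sketch that contrasts collecting every element into a list with processing elements lazily. It is not CPS or NCMP code: the class `StreamingSketch`, its methods, and the `ch-<n>` ID format are hypothetical and exist only for this illustration.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Hypothetical illustration only: none of these names exist in CPS or NCMP.
final class StreamingSketch {

    // Lazily produces CM-handle-like IDs; no ID is created until a terminal operation runs.
    static Stream<String> cmHandleIds(int count) {
        return IntStream.range(0, count).mapToObj(i -> "ch-" + i);
    }

    // Eager variant: every ID is held in a list at the same time before filtering.
    static long countMatchingEager(int count, String suffix) {
        List<String> all = cmHandleIds(count).collect(Collectors.toList());
        return all.stream().filter(id -> id.endsWith(suffix)).count();
    }

    // Streaming variant: each ID is eligible for garbage collection once it has been tested.
    static long countMatchingStreaming(int count, String suffix) {
        return cmHandleIds(count).filter(id -> id.endsWith(suffix)).count();
    }
}
```

Both variants return the same count; the difference is that the eager variant keeps all elements reachable at once, whereas the streaming variant needs memory for only one element at a time. Whether the same saving applies to a given CPS code path depends on where that path currently materialises collections.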

...

Decisions for Short Term

#4 application.yml / Environment variable | Csaba Kocsis ETH will test and report back to CPS

# | Issue | Notes | Decision
1

Increase memory resources of NCMP (helm chart)

The memory resources of the CPS/NCMP pod should be increased stepwise (4 GB, 5 GB, etc.) to determine whether this fixes the OOME (out-of-memory error) reported in CPS-2146.

Csaba Kocsis ETH will test and report back to CPS

2

Increase shared_buffers allocation in Postgres config

Csaba Kocsis ETH will test and report back to CPS

Preliminary results indicated that this change alone fixes CPS-2156; further testing has since confirmed it.

ETH has already implemented the fix.
EST has updated CPS documentation: https://gerrit.onap.org/r/c/cps/+/137534

3

NCMP will implement throttling / rate limiting for Rest API (e.g. 503 HTTP response)

Requires determining the maximum sustainable request rate, e.g. by comparing previously passing and failing tests (3.4.2 vs 3.4.6), in order to set the throttling limit.

A PoC of rate limiting has been created: WIP Rate limit NCMP Rest requests | https://gerrit.nordix.org/c/onap/cps/+/20747

  1. Daniel Hanrahan will report statistics of the previous passing test (requests per second for 20K registration)
    and compare them with the request rate after the performance improvements and in the failing use cases
  2. Daniel Hanrahan will investigate how the server can report/reject requests when the request load is too high
  3. Need to agree acceptable limits of request load with stakeholders (per interface?)
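
As an illustration of the server-side throttling described in this item, the following is a minimal token-bucket sketch that admits a request or answers with HTTP 503. It is not the PoC in change 20747, and NCMP may use a different technique: the class `TokenBucketSketch`, its methods, and the rate and burst values are hypothetical placeholders until the acceptable limits have been agreed.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch: names and limits are illustrative, not NCMP code.
final class TokenBucketSketch {
    private final double capacity;       // maximum burst size, in requests
    private final double refillPerNano;  // tokens added per elapsed nanosecond
    private final LongSupplier clock;    // nanosecond clock, injectable for tests
    private double tokens;
    private long lastRefillNanos;

    TokenBucketSketch(double requestsPerSecond, double capacity, LongSupplier clock) {
        this.capacity = capacity;
        this.refillPerNano = requestsPerSecond / 1_000_000_000.0;
        this.clock = clock;
        this.tokens = capacity;
        this.lastRefillNanos = clock.getAsLong();
    }

    // Refills the bucket for the elapsed time, then takes one token if available.
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    // Maps the decision to an HTTP status: 200 when admitted, 503 when over the limit.
    int statusForNextRequest() {
        return tryAcquire() ? 200 : 503;
    }
}
```

In production the clock would typically be `System::nanoTime`. Injecting the clock keeps the sketch deterministic under test, and a separate bucket per interface would be one way to support per-interface limits if stakeholders ask for them.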
4

Rest client (for load tests) will throttle


Depends on the outcome of #3 above.
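
One way a load-test client could throttle itself is to pace requests at a fixed rate. The sketch below assumes a Java client purely for illustration; the actual load-test tooling may differ. The class `PacedClientSketch`, its methods, and the `sendRequest` callback are all hypothetical, and the target rate would come from the limits agreed under #3.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a self-throttling load-test client.
final class PacedClientSketch {

    // Converts a target rate into the delay between consecutive requests.
    static long intervalNanos(double requestsPerSecond) {
        if (requestsPerSecond <= 0) {
            throw new IllegalArgumentException("requestsPerSecond must be positive");
        }
        return Math.max(1L, Math.round(1_000_000_000.0 / requestsPerSecond));
    }

    // Issues totalRequests calls to sendRequest at the given rate and returns the number attempted.
    static int sendPaced(int totalRequests, double requestsPerSecond, Runnable sendRequest) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempted = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(totalRequests);
        try {
            scheduler.scheduleAtFixedRate(() -> {
                if (attempted.get() >= totalRequests) {
                    return; // all requests already issued; ignore late ticks
                }
                attempted.incrementAndGet();
                try {
                    sendRequest.run();
                } catch (RuntimeException e) {
                    // A failed request still counts as attempted so that the run terminates.
                } finally {
                    done.countDown();
                }
            }, 0L, intervalNanos(requestsPerSecond), TimeUnit.NANOSECONDS);
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return attempted.get();
    }
}
```

Because a single scheduler thread is used, a slow request delays the next one instead of overlapping with it, so the effective rate never exceeds the target.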
5

Lower thread count for Module Sync

This can be done using the variable NCMP_MODULES_SYNC_WATCHDOG_ASYNC_EXECUTOR_PARALLELISM_LEVEL (default: 10).

Csaba Kocsis ETH will test and report back to CPS
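
For readers unfamiliar with this kind of setting, the sketch below shows how a parallelism level can be read from the variable named above, with the documented default of 10, and used to size a thread pool. It is not how NCMP itself wires the value: the class `ParallelismSketch`, the fallback to the default on invalid input, and the use of a work-stealing pool are assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: only the variable name and its default of 10 come from this page.
final class ParallelismSketch {
    static final String VARIABLE = "NCMP_MODULES_SYNC_WATCHDOG_ASYNC_EXECUTOR_PARALLELISM_LEVEL";
    static final int DEFAULT_LEVEL = 10;

    // Reads the level from the given environment; falls back to the default if unset or invalid.
    static int parallelismLevel(Map<String, String> env) {
        String raw = env.get(VARIABLE);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_LEVEL;
        }
        try {
            int level = Integer.parseInt(raw.trim());
            return level > 0 ? level : DEFAULT_LEVEL;
        } catch (NumberFormatException e) {
            return DEFAULT_LEVEL;
        }
    }

    // Creates a pool whose target parallelism matches the configured level.
    static ExecutorService newExecutor(Map<String, String> env) {
        return Executors.newWorkStealingPool(parallelismLevel(env));
    }
}
```

A caller would pass `System.getenv()` in a real deployment. A lower level means fewer module sync tasks run at once, so fewer CM handles are processed in memory at the same time.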

6

Review Hazelcast configuration

Hazelcast is configured to keep multiple backups, which are not needed in a deployment with only 2 NCMP instances (2 instances require only 2 copies across the cluster). Testing has shown that setting the number of backups to suit the cluster size reduces heap usage by around 100 MB during 20K CM handle registration.

Daniel Hanrahan has provided a patch to reduce memory consumption: https://gerrit.onap.org/r/c/cps/+/137517

Background

CPS and NCMP have much higher memory consumption than required. Regarding NCMP specifically, it has some in-memory data structures that grow linearly with the number of CM-handles.

...

One avenue worth further investigation is a series of recent performance improvements to CPS and NCMP introduced around 3.4.2:

Version | Jira | Comment | Example performance test
3.4.2 | CPS-1795 | Improved time performance of CPS store operations (2x or more). | org.onap.cps.integration.performance.cps.WritePerfTest#Writing openroadm data has linear time.
3.4.3 | CPS-2018 | Improved time performance of CPS update operations (2x in some cases, stacks with CPS-1795). | org.onap.cps.integration.performance.cps.UpdatePerfTest#Replace single data node and descendants: #scenario.
3.4.3 | CPS-2019 | Improved time performance of saving CM handles (over 4x faster, stacks with CPS-1795). | The code was changed to remove the slower API, and production code uses the 4x faster API; see https://gerrit.onap.org/r/c/cps/+/136932
3.4.3 | CPS-2087 | Improved time performance of CPS queries (5-10x). | org.onap.cps.integration.performance.ncmp.CmHandleQueryPerfTest#CM-handle is looked up by alternate-id.
3.4.6 | CPS-2126 | Removed Spring Security, which greatly reduced overhead on Rest requests (over 10x). | K6 test will be added as part of CPS-1975

Cumulatively, both read and write speeds are up to 10x faster than previous versions, and overhead on Rest requests is over 10x lower. It is very possible that these improvements are adversely affecting memory usage during load tests.

...

Component | Hazelcast Structure | Type | Purpose | Recommendation | Implementation Proposal | Notes
CPS | anchorDataCache | Map<String, AnchorDataCacheEntry> | | Needs further analysis | |
NCMP | moduleSyncWorkQueue | BlockingQueue<DataNode> | | Remove | TBA (CPS-2161: Remove Hazelcast from NCMP Module Sync) | Entire CM handles are stored in the work queue for module sync. This creates very high memory usage during CM handle registration. The use of this blocking queue likely also causes issues with load balancing during module sync. A PoC was constructed: WIP Remove hazelcast map for module sync (https://gerrit.nordix.org/c/onap/cps/+/20724)
NCMP | moduleSyncStartedOnCmHandles | Map<String, Object> | | Remove | TBA (CPS-2161: Remove Hazelcast from NCMP Module Sync) | One entry is stored in memory per CM handle in ADVISED state.
NCMP | dataSyncSemaphores | Map<String, Boolean> | | No immediate action, see notes | | Low priority: this map is only populated if data sync is enabled for a CM handle. If the feature is used, it will store one entry per CM handle with data sync enabled.
NCMP | trustLevelPerCmHandle | Map<String, TrustLevel> | | Remove | TBA | One entry is stored in memory per CM handle. This is directly implicated in logs supplied in the investigation of out-of-memory errors in CPS-2146.
NCMP | trustLevelPerDmiPlugin | Map<String, TrustLevel> | | Low risk, see notes | | Low priority: there are only a small number of DMIs, so this structure will not grow large. However, if trustLevelPerCmHandle is removed, this structure may be removed as part of the same solution.
NCMP | cmNotificationSubscriptionCache | Map<String, Map<String, DmiCmNotificationSubscriptionDetails>> | | Will need further analysis in future; see notes | | Low priority, as the CM subscription feature is not fully implemented and thus not in use. It is unclear how much data will be stored in the structure; it is presumed to be low, as the structure will only hold pending subscriptions.

...