References

  1. CPS-2146: Managed by Daniel Hanrahan. See short term solutions below.
  2. CPS-2156: Very likely related to #1; will not be investigated separately - apply the short term solutions mentioned below to #1 and test again. A separate fix, but it will probably contribute to #1 too. The CPS team can close this once the deployment documentation has been updated to reflect this.
  3. CPS-2122: Very likely related to #1; will not be investigated separately - apply the short term solutions mentioned below to #1 and test again. Not reproducible; does not appear to be an NCMP server issue, possibly just a once-off general (networking) resource issue. Will be closed.
  4. CPS-2150: Indirectly related to #1; the ticket relates to an incorrect timeout limit, not the timeout itself. Managed by Priyank Maheshwari; a solution is currently being tested.
  5. CPS-2139: Not related. Investigated by Levente Csanyi.

Assumptions

  1. The proposed solution is not a quick fix, but allows for future scaling of NCMP.

Issues & Decisions

Long term solutions

  1. Remove Hazelcast from NCMP Module Sync. Decision: CPS-2161: Remove Hazelcast from NCMP Module Sync.
  2. Java Streams API for CPS and NCMP. Decision: CPS-2146 Using Java Streams to reduce memory consumption in CPS and NCMP.
  3. Investigate if the ODL Yang Parser has a memory leak (if so, likely only a minor issue, as CPS has its own cache wrapping the ODL Yang Parser). See CPS-2000.
  4. Remove Hazelcast for Trust Level.
  5. Remove use of Postgres arrays in Repository methods. Decision: CPS-1574: Remove 32K limit from DB operations (see Proposal 1).
  6. Replace Hibernate with JDBC (via Spring Data JDBC).
  7. Review memory use during UPDATE operations. Study TBA.
  8. Investigate memory usage of the Yang Resource repository. Study TBA.


Decisions for Short Term

  1. Increase memory resources of NCMP (Helm chart). The memory resources of the CPS/NCMP pod should be increased to 4GB, 5GB, etc. to determine whether the OOME for CPS-2146 is fixed. Csaba Kocsis (ETH) will test and report back to CPS.

  2. Increase shared_buffers allocation in the Postgres config. Preliminary testing indicated that this alone fixes CPS-2156, and this has since been confirmed. ETH has already implemented the fix. EST has updated the CPS documentation: https://gerrit.onap.org/r/c/cps/+/137534

  3. NCMP will implement throttling / rate limiting for the REST API (e.g. a 503 HTTP response). This requires determining the maximum request rate, e.g. comparing previous successful versus failing tests (e.g. 3.4.2 vs 3.4.6) to determine the throttling level. A PoC of rate limiting has been created: WIP Rate limit NCMP Rest requests | https://gerrit.nordix.org/c/onap/cps/+/20747
     1. Daniel Hanrahan will report statistics of a previous passing test (requests per second for 20K registration) and compare with the request rate after performance improvements and in failing use cases.
     2. Daniel Hanrahan will investigate how the server can report/reject when the request load is too high.
     3. Need to agree with stakeholders on acceptable limits of request load (per interface?).
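For illustration only (this is a sketch, not the PoC linked above), rejecting requests with HTTP 503 when the server is over capacity can be expressed as a semaphore-based throttle; the class name, limit, and structure here are hypothetical:

```java
import java.util.concurrent.Semaphore;

// Illustrative sketch of request throttling: allow at most a fixed number of
// concurrent requests and reject the rest immediately with HTTP 503.
// All names and values are hypothetical, not taken from the CPS PoC.
public class RequestThrottle {

    static final int HTTP_OK = 200;
    static final int HTTP_SERVICE_UNAVAILABLE = 503;

    private final Semaphore permits;

    public RequestThrottle(final int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    // Returns the HTTP status the server would respond with.
    public int handleRequest(final Runnable requestHandler) {
        if (!permits.tryAcquire()) {
            return HTTP_SERVICE_UNAVAILABLE; // over capacity: reject without queueing
        }
        try {
            requestHandler.run();
            return HTTP_OK;
        } finally {
            permits.release();
        }
    }
}
```

In practice this logic would live in a servlet filter or interceptor so that overload is reported before any expensive processing starts.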
  4. The REST client (for load tests) will throttle. Depends on the outcome of #3 above.
  5. Lower the thread count for Module Sync. This can be done using the variable NCMP_MODULES_SYNC_WATCHDOG_ASYNC_EXECUTOR_PARALLELISM_LEVEL (default 10). Csaba Kocsis (ETH) will test and report back to CPS.

  6. Review Hazelcast configuration. Hazelcast is configured to have multiple backups, which are not needed in a deployment with only 2 NCMP instances (2 instances require only 2 copies across the cluster). Testing has shown that setting an appropriate number of backups for the cluster size reduces heap usage by around 100MB during 20K CM handle registration. Daniel Hanrahan has provided a patch to reduce memory consumption: https://gerrit.onap.org/r/c/cps/+/137517
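For illustration only, the number of backups for a Hazelcast map can be set in its declarative configuration; the values below are a sketch for a 2-instance cluster (one primary plus one synchronous backup gives 2 copies), not the contents of the actual patch:

```
<!-- hazelcast.xml sketch: one synchronous backup is enough when only
     2 NCMP instances are deployed; values are illustrative only -->
<map name="moduleSyncStartedOnCmHandles">
    <backup-count>1</backup-count>
    <async-backup-count>0</async-backup-count>
</map>
```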

Background

CPS and NCMP have much higher memory consumption than required. Regarding NCMP specifically, it has some in-memory data structures that grow linearly with the number of CM-handles.

Regarding CPS-core, there is a more fundamental problem in that CPS path queries could return any amount of data - it will be unknown to the application until a query is executed. Some solutions will be proposed for CPS path queries to reduce memory use.

This study and its implementation proposals target concrete steps to reduce memory consumption.

Analysis

A number of issues leading to high memory usage have been identified.

NCMP CM Handle Queries

NCMP CM Handle Queries are directly implicated in CPS-2146, as the Out Of Memory errors occur during the NCMP Search and ID Search functions.

See CPS-2146 Using Java Streams to reduce memory consumption in CPS and NCMP for analysis & solution to reduce memory consumption during these operations.

Recent performance improvements

One avenue worth further investigation is a series of recent performance improvements to CPS and NCMP introduced around 3.4.2:

  • 3.4.2 - CPS-1795: Improved time performance of CPS store operations (2x or more). Example performance test: org.onap.cps.integration.performance.cps.WritePerfTest#Writing openroadm data has linear time.
  • 3.4.3 - CPS-2018: Improved time performance of CPS update operations (2x in some cases, stacks with CPS-1795). Example performance test: org.onap.cps.integration.performance.cps.UpdatePerfTest#Replace single data node and descendants: #scenario.
  • 3.4.3 - CPS-2019: Improved time performance of saving CM handles (over 4x faster, stacks with CPS-1795). See https://gerrit.onap.org/r/c/cps/+/136932 - the code was changed to remove the slower API, and production code uses the 4x faster API.
  • 3.4.3 - CPS-2087: Improved time performance of CPS queries (5-10x). Example performance test: org.onap.cps.integration.performance.ncmp.CmHandleQueryPerfTest#CM-handle is looked up by alternate-id.
  • 3.4.6 - CPS-2126: Removed Spring Security, which greatly reduced overhead on REST requests (over 10x). A K6 test will be added as part of CPS-1975.

Cumulatively, both read and write speeds are up to 10x faster than previous versions, and overhead on Rest requests is over 10x lower. It is very possible that these improvements are adversely affecting memory usage during load tests.

PostgreSQL configuration does not have appropriate values for memory allocation

CPS-2156

Investigation indicates the root cause of the PSQL error is most probably an incorrect configuration of shared_buffers (https://www.postgresql.org/docs/13/runtime-config-resource.html#GUC-SHARED-BUFFERS): according to the documentation, the optimal value should be between 25% and 40% of the memory available to the DB (the request is 1GB, the limit is 3GB), whereas the default value is only 128MB.

Due to recent performance improvements increasing system throughput, the database is also under increased load. The DB configuration will need to be updated to reflect this (note: the current and previous DB configurations were incorrect).
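As a sketch only (exact values must be tuned per deployment), the 25-40% guidance above would translate to something like the following in postgresql.conf, assuming roughly 3GB is available to the database container:

```
# postgresql.conf - illustrative value only, assuming ~3GB available to the DB
# (25-40% of available memory per the PostgreSQL documentation; default is 128MB)
shared_buffers = 1GB
```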

Hazelcast

The use of Hazelcast (an In-Memory Data Grid) has been identified as a particular source of high memory usage. Some points of interest:

  • In NCMP, Hazelcast is not used as a cache, so idle eviction is not used, and the structures are configured to have 3 backups. It follows that scaling up the deployment (e.g. Kubernetes auto-scaling) would not help in a low-memory situation, as the new instances would also be storing the whole structure.
  • Given Hazelcast is configured for synchronous operation, it is likely to have worse performance than a database solution.
  • There are additional reasons to avoid Hazelcast, since as a distributed asynchronous system, it cannot give strong consistency guarantees like an ACID database - it is prone to split brain among other issues.
  • I advise against the use of Hazelcast for future development in NCMP - CPS API should be used.

The following is an overview of Hazelcast structures in CPS and NCMP, along with recommendations.

  • CPS: anchorDataCache (Map<String, AnchorDataCacheEntry>). Recommendation: needs further analysis.
  • NCMP: moduleSyncWorkQueue (BlockingQueue<DataNode>). Recommendation: remove. Implementation proposal: CPS-2161: Remove Hazelcast from NCMP Module Sync. Entire CM handles are stored in the work queue for module sync, which creates very high memory usage during CM handle registration. The use of this blocking queue likely also causes issues with load balancing during module sync. A PoC was constructed: WIP Remove hazelcast map for module sync | https://gerrit.nordix.org/c/onap/cps/+/20724
  • NCMP: moduleSyncStartedOnCmHandles (Map<String, Object>). Recommendation: remove. Implementation proposal: CPS-2161: Remove Hazelcast from NCMP Module Sync. One entry is stored in memory per CM handle in ADVISED state.
  • NCMP: dataSyncSemaphores (Map<String, Boolean>). Recommendation: no immediate action, see notes. Low priority - this map is only populated if data sync is enabled for a CM handle. If the feature is used, it will store one entry per CM handle with data sync enabled.
  • NCMP: trustLevelPerCmHandle (Map<String, TrustLevel>). Recommendation: remove. Implementation proposal: TBA. One entry is stored in memory per CM handle. This is directly implicated in the logs supplied in the investigation of out-of-memory errors in CPS-2146.
  • NCMP: trustLevelPerDmiPlugin (Map<String, TrustLevel>). Recommendation: low risk, see notes. Low priority - there is only a small number of DMIs, so this structure will not grow large. However, if trustLevelPerCmHandle is being removed, this structure may be removed as part of the same solution.
  • NCMP: cmNotificationSubscriptionCache (Map<String, Map<String, DmiCmNotificationSubscriptionDetails>>). Recommendation: will need further analysis in future; see notes. Low priority, as the CM subscription feature is not fully implemented and thus not in use. It is unclear how much data will be stored in this structure; it is presumed to be low, as it will only hold pending subscriptions.

Use of Postgres Arrays in Repository methods

Use of Postgres arrays in JpaRepository methods may be using too much memory. Though it is currently unclear how much of a contributor this is to Out Of Memory errors, it appears in the logs from CPS-2146.

See CPS-1574: Remove 32K limit from DB operations for history of this implementation choice - an alternate solution using batching was proposed.

For example, from the logs of CPS-2146, see this stack trace:

2024-02-28T05:18:25.049Z@eric-oss-ncmp-04@ncmp@Connection leak detection triggered for org.postgresql.jdbc.PgConnection@b358fc9 on thread qtp1699794502-7604, stack trace follows, logger: com.zaxxer.hikari.pool.ProxyLeakTask, thread_name: CpsDatabasePool housekeeper, stack_trace: java.lang.Exception: Apparent connection leak detected
 org.onap.cps.spi.repository.YangResourceRepository.findAllModuleReferencesByDataspaceAndModuleNames(YangResourceRepository.java:111)
 org.onap.cps.spi.impl.CpsAdminPersistenceServiceImpl.validateDataspaceAndModuleNames(CpsAdminPersistenceServiceImpl.java:206)
 org.onap.cps.spi.impl.CpsAdminPersistenceServiceImpl.queryAnchors(CpsAdminPersistenceServiceImpl.java:143)
 org.onap.cps.api.impl.CpsAnchorServiceImpl.queryAnchorNames(CpsAnchorServiceImpl.java:90)
 org.onap.cps.ncmp.api.impl.inventory.InventoryPersistenceImpl.getCmHandleIdsWithGivenModules(InventoryPersistenceImpl.java:174)
 org.onap.cps.ncmp.api.impl.NetworkCmProxyCmHandleQueryServiceImpl.executeModuleNameQuery(NetworkCmProxyCmHandleQueryServiceImpl.java:167)
 org.onap.cps.ncmp.api.impl.NetworkCmProxyCmHandleQueryServiceImpl.executeQueries(NetworkCmProxyCmHandleQueryServiceImpl.java:256)
 org.onap.cps.ncmp.api.impl.NetworkCmProxyCmHandleQueryServiceImpl.queryCmHandleIds(NetworkCmProxyCmHandleQueryServiceImpl.java:71)
 org.onap.cps.ncmp.api.impl.NetworkCmProxyCmHandleQueryServiceImpl.queryCmHandles(NetworkCmProxyCmHandleQueryServiceImpl.java:95)
 org.onap.cps.ncmp.api.impl.NetworkCmProxyDataServiceImpl.executeCmHandleSearch(NetworkCmProxyDataServiceImpl.java:215)
 org.onap.cps.ncmp.rest.controller.NetworkCmProxyController.searchCmHandles(NetworkCmProxyController.java:253)

Note: the connection-leak message from Postgres may also indicate a memory issue in the DB.

The code causing the exception in YangResourceRepository is:

    default Set<YangResourceModuleReference> findAllModuleReferencesByDataspaceAndModuleNames(
        final String dataspaceName, final Collection<String> moduleNames) {
        return findAllModuleReferencesByDataspaceAndModuleNames(dataspaceName, moduleNames.toArray(new String[0]));
    }
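The batching alternative proposed in CPS-1574 (Proposal 1) could be sketched as follows; the helper and method names are hypothetical, not the actual CPS code:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Sketch of the batching alternative: instead of passing the whole collection
// as one Postgres array parameter, split it into fixed-size batches and run
// one smaller query per batch. Names and batch size are illustrative only.
public class BatchedQueryExample {

    static final int BATCH_SIZE = 1000;

    // Partition a collection into sub-lists of at most batchSize elements.
    static <T> List<List<T>> partition(final Collection<T> items, final int batchSize) {
        final List<T> asList = new ArrayList<>(items);
        final List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < asList.size(); i += batchSize) {
            batches.add(new ArrayList<>(asList.subList(i, Math.min(i + batchSize, asList.size()))));
        }
        return batches;
    }

    // Run the given query function once per batch and merge the results,
    // keeping each individual statement's parameter count bounded.
    static <T, R> Set<R> queryInBatches(final Collection<T> parameters,
                                        final Function<List<T>, Set<R>> queryFunction) {
        final Set<R> results = new HashSet<>();
        for (final List<T> batch : partition(parameters, BATCH_SIZE)) {
            results.addAll(queryFunction.apply(batch));
        }
        return results;
    }
}
```

This keeps peak memory proportional to the batch size rather than the full input collection, at the cost of multiple round trips to the database.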

OpenDaylight Yang Parser & YangTextSchemaSourceSetCache

It was previously suspected that the 3rd party OpenDaylight Yang Parser may have a memory leak; see the comments on CPS-2000.

While CPS has a cache called YangTextSchemaSourceSetCache to avoid invoking the OpenDaylight Yang Parser, the Yang Parser's internal cache may be causing a memory leak. This requires immediate investigation.

Hibernate Entity Cache

Hibernate has an Entity Cache, which can grow large during transactions. While most of CPS-core's Spring JpaRepository methods are using Native SQL, the Entity Manager is still caching in some cases. I propose the removal of Hibernate be investigated as part of a long term solution. (This is not as much work as it sounds: CPS is not directly reliant on Hibernate/JPA - rather Spring Data JPA is used. This could be replaced with Spring Data JDBC with relatively small code changes.)

Note this change is blocked by CPS-1673. The use of OneToMany mapping in FragmentEntity appears to be the only place where CPS currently relies on functionality provided by JPA.

Memory usage during UPDATE operations

CPS's current implementation of update operations involves reading the existing data from the DB, applying the changes, and storing the result again. As update operations involve reading data, the current implementations should be reviewed to ensure they internally break large requests into batches to restrict memory use.

Memory usage of Yang Repository SQL

The SQL statement to get Yang Resources from Module References (org.onap.cps.spi.repository.YangResourceNativeRepositoryImpl#getResourceIdsByModuleReferences) is dynamically generated by UNIONing many SELECT statements together into one large statement, e.g.

SELECT id FROM yang_resource WHERE module_name='ietf-netconf' and revision='2011-06-01' UNION ALL 
SELECT id FROM yang_resource WHERE module_name='ietf-inet-types' and revision='2019-11-04' UNION ALL
SELECT id FROM yang_resource WHERE module_name='ietf-netconf-acm' and revision='2018-02-14' UNION ALL
SELECT id FROM yang_resource WHERE module_name='ietf-yang-types' and revision='2019-11-04' UNION ALL
...

In the logs during an Out Of Memory situation, it was observed that up to 200 such statements will be chained into a single statement, with such statements being executed per CM-handle (e.g. 20,000 times) during Module Sync.
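One possible way to bound the statement size (a sketch only, not the current CPS implementation) is to generate a single parameterised query with a composite IN clause, which PostgreSQL supports for row values, instead of chaining hundreds of UNION ALL selects:

```java
import java.util.Collections;

// Illustrative alternative only: build one parameterised statement with a
// composite IN clause instead of chaining many UNION ALL selects together.
public class CompositeInQueryExample {

    // Generates e.g. "... IN ((?, ?), (?, ?))" for 2 module references;
    // the (module_name, revision) pairs would be bound as query parameters.
    static String buildQuery(final int numberOfModuleReferences) {
        final String placeholders =
                String.join(", ", Collections.nCopies(numberOfModuleReferences, "(?, ?)"));
        return "SELECT id FROM yang_resource WHERE (module_name, revision) IN (" + placeholders + ")";
    }
}
```

Combined with batching, this would keep both the statement length and the parameter count bounded regardless of the number of module references.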
