
References

CPS-2146

Assumptions

| # | Assumption | Notes |
| --- | --- | --- |
| 1 | Proposed solution is not a quick fix, but allows for future scaling of NCMP. |  |

Issues & Decisions

| # | Issue | Notes | Decision |
| --- | --- | --- | --- |
| 1 | This is an open issue |  |  |
| 2 | Do we need an analysis template? | It is a convention for (new) developers, to guide them. | Luke Gleeson and Toine Siebelink agreed we do, to have consistent study pages. |
| 3 | This is a very important (blocking) issue |  |  |

<Note. use green for closed issues, yellow for important ones if needed>

Background

CPS and NCMP have much higher memory consumption than required. Regarding NCMP specifically, it has some in-memory data structures that grow linearly with the number of CM-handles.

Regarding CPS-core, there is a more fundamental problem: a CPS path query can return any amount of data, and the amount is unknown to the application until the query is executed (even if pagination of DataNodes were used, it is not known how deep a tree is, since CPS DataNodes represent tree structures). Further analysis is needed here; however, simple mitigations will be recommended (e.g. extending public APIs to allow limiting the maximum number of results).
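A minimal sketch of the suggested mitigation, assuming a simplified stand-in service (the class and method names here are illustrative, not the actual CPS API): an overload takes an explicit maximum-results parameter, so memory use is bounded regardless of how much data matches.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: names (QueryServiceSketch, DataNode) are
// hypothetical stand-ins, not the real CPS query API.
class QueryServiceSketch {

    // Stand-in for a CPS DataNode query result.
    record DataNode(String xpath) {}

    private final List<DataNode> storedNodes;

    QueryServiceSketch(final List<DataNode> storedNodes) {
        this.storedNodes = storedNodes;
    }

    // Existing-style method: returns every match, however many there are.
    List<DataNode> query(final String cpsPathPrefix) {
        return storedNodes.stream()
                .filter(node -> node.xpath().startsWith(cpsPathPrefix))
                .collect(Collectors.toList());
    }

    // Extended method: the caller supplies an upper bound on the result size,
    // so the result set cannot grow without limit.
    List<DataNode> query(final String cpsPathPrefix, final int maxResults) {
        return storedNodes.stream()
                .filter(node -> node.xpath().startsWith(cpsPathPrefix))
                .limit(maxResults)
                .collect(Collectors.toList());
    }
}
```

The same bound would need to be threaded through the public REST and Java APIs so that clients opt in to (or are forced into) a maximum result count.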

This study and implementation proposal will target simple and concrete steps to reduce memory consumption.

Analysis

A number of issues leading to high memory usage have been identified.

NCMP CM Handle Queries

NCMP CM Handle Queries are directly implicated in CPS-2146, as the Out Of Memory errors occur during the NCMP Search and ID Search functions.

Use of Postgres Arrays in Repository methods

Use of Postgres arrays in JpaRepository methods may be using too much memory.

For example, see this partial stack trace:

2024-02-28T05:18:25.049Z@eric-oss-ncmp-04@ncmp@Connection leak detection triggered for org.postgresql.jdbc.PgConnection@b358fc9 on thread qtp1699794502-7604, stack trace follows, logger: com.zaxxer.hikari.pool.ProxyLeakTask, thread_name: CpsDatabasePool housekeeper, stack_trace: java.lang.Exception: Apparent connection leak detected
 org.onap.cps.spi.repository.YangResourceRepository.findAllModuleReferencesByDataspaceAndModuleNames(YangResourceRepository.java:111)
 org.onap.cps.spi.impl.CpsAdminPersistenceServiceImpl.validateDataspaceAndModuleNames(CpsAdminPersistenceServiceImpl.java:206)
 org.onap.cps.spi.impl.CpsAdminPersistenceServiceImpl.queryAnchors(CpsAdminPersistenceServiceImpl.java:143)
 org.onap.cps.api.impl.CpsAnchorServiceImpl.queryAnchorNames(CpsAnchorServiceImpl.java:90)
 org.onap.cps.ncmp.api.impl.inventory.InventoryPersistenceImpl.getCmHandleIdsWithGivenModules(InventoryPersistenceImpl.java:174)
 org.onap.cps.ncmp.api.impl.NetworkCmProxyCmHandleQueryServiceImpl.executeModuleNameQuery(NetworkCmProxyCmHandleQueryServiceImpl.java:167)
 org.onap.cps.ncmp.api.impl.NetworkCmProxyCmHandleQueryServiceImpl.executeQueries(NetworkCmProxyCmHandleQueryServiceImpl.java:256)
 org.onap.cps.ncmp.api.impl.NetworkCmProxyCmHandleQueryServiceImpl.queryCmHandleIds(NetworkCmProxyCmHandleQueryServiceImpl.java:71)
 org.onap.cps.ncmp.api.impl.NetworkCmProxyCmHandleQueryServiceImpl.queryCmHandles(NetworkCmProxyCmHandleQueryServiceImpl.java:95)
 org.onap.cps.ncmp.api.impl.NetworkCmProxyDataServiceImpl.executeCmHandleSearch(NetworkCmProxyDataServiceImpl.java:215)
 org.onap.cps.ncmp.rest.controller.NetworkCmProxyController.searchCmHandles(NetworkCmProxyController.java:253)

The code causing the exception in YangResourceRepository is:

    default Set<YangResourceModuleReference> findAllModuleReferencesByDataspaceAndModuleNames(
        final String dataspaceName, final Collection<String> moduleNames) {
        // Converts the entire collection into a single String[], which is then
        // bound as one native Postgres array parameter.
        return findAllModuleReferencesByDataspaceAndModuleNames(dataspaceName, moduleNames.toArray(new String[0]));
    }
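One possible mitigation (a sketch only, not the agreed fix) is to split the module-name collection into fixed-size batches and query per batch, so that each native array parameter stays small. A generic partition helper could look like:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical mitigation sketch: partition a collection into fixed-size
// batches, so each repository call binds a bounded-size Postgres array.
class BatchUtils {

    static <T> List<List<T>> partition(final Collection<T> items, final int batchSize) {
        final List<T> source = new ArrayList<>(items);
        final List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < source.size(); i += batchSize) {
            batches.add(source.subList(i, Math.min(i + batchSize, source.size())));
        }
        return batches;
    }
}
```

Each batch would then be passed to findAllModuleReferencesByDataspaceAndModuleNames in turn and the partial results merged, keeping every array parameter at or below the batch size. Whether batching actually reduces peak memory here would need to be confirmed by profiling.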

Hazelcast

The use of Hazelcast (an In-Memory Data Grid) has been identified as a particular source of high memory usage. Some points of interest:

  • In NCMP, Hazelcast is not used as a cache, so idle eviction is not used, and the structures are configured to have 3 backups. It follows that scaling up the deployment (e.g. Kubernetes auto-scaling) would not help in a low-memory situation, as the new instances would also store the whole structure.
  • Given Hazelcast is configured for synchronous operation, it is likely to have worse performance than a database solution.
  • There are additional reasons to avoid Hazelcast, since as a distributed asynchronous system, it cannot give strong consistency guarantees like an ACID database - it is prone to split brain among other issues.
  • I strongly advise against the use of Hazelcast for future development.
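If Hazelcast were retained in the short term, one interim option could be to reduce the backup count so each member holds less redundant data. A declarative sketch, assuming Hazelcast 5.x YAML configuration (the map name is taken from the structures discussed on this page; actual configuration in NCMP is done programmatically and may differ):

```yaml
# Sketch only: lower per-member memory by reducing synchronous backups.
hazelcast:
  map:
    trustLevelPerCmHandle:
      backup-count: 1        # structures are currently configured with 3 backups
      async-backup-count: 0
```

This trades durability for memory and does not address the underlying growth per CM handle, so it is at best a stopgap.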

Side note - this was seen in logs of CPS-2146:

2024-02-28T05:23:53.961Z@eric-oss-ncmp-04@ncmp@[192.168.89.193]:5701 ["cps-and-ncmp-common-cache-cluster"] [5.2.4] A split-brain merge validation request was received, but the current member is not a master. The master address will be sent to the request source ([192.168.124.37]:5705), logger: com.hazelcast.internal.cluster.impl.operations.SplitBrainMergeValidationOp, thread_name: hz.hazelCastInstanceCpsCore.priority-generic-operation.thread-0

The following is an overview of Hazelcast structures in CPS and NCMP, along with recommendations.

| Component | Hazelcast Structure | Type | Purpose | Recommendation | Implementation Proposal | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| CPS | anchorDataCache | Map<String, AnchorDataCacheEntry> |  | Needs further analysis |  |  |
| NCMP | moduleSyncWorkQueue | BlockingQueue<DataNode> |  | Remove | TBC | Entire CM handles are stored in the work queue for module sync. This creates very high memory usage during CM handle registration. The use of this blocking queue likely also causes issues with load balancing during module sync. |
| NCMP | moduleSyncStartedOnCmHandles | Map<String, Object> |  | Remove | TBC | One entry is stored in memory per CM handle in ADVISED state. |
| NCMP | dataSyncSemaphores | Map<String, Boolean> |  | No immediate action, see notes |  | Low priority - this map is only populated if data sync is enabled for a CM handle. If the feature is used, it stores one entry per CM handle with data sync enabled. |
| NCMP | trustLevelPerCmHandle | Map<String, TrustLevel> |  | Remove | TBC | One entry is stored in memory per CM handle. This structure is directly implicated in the logs supplied during investigation of the out-of-memory errors in CPS-2146. |
| NCMP | trustLevelPerDmiPlugin | Map<String, TrustLevel> |  | Low risk, see notes |  | Low priority - there are only a small number of DMIs, so this structure will not grow large. However, if trustLevelPerCmHandle is removed, this structure may be removed as part of the same solution. |
| NCMP | cmNotificationSubscriptionCache | Map<String, Map<String, DmiCmNotificationSubscriptionDetails>> |  | Will need further analysis in future; see notes |  | Low priority, as the CM subscription feature is not yet implemented and thus not in use. It is unclear how much data will be stored in this structure; it is presumed to be little, as it will only hold pending subscriptions. |
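For the moduleSyncWorkQueue recommendation, one direction (a sketch under simplified types, not the agreed design) is to queue only CM-handle IDs and resolve the full CM handle from persistence when a worker dequeues it, so the queue itself stays small:

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch only (types simplified): the queue holds small String
// ids instead of entire DataNode tree structures.
class ModuleSyncQueueSketch {

    // Stand-in for the persistence layer: id -> full CM-handle data.
    private final Map<String, String> cmHandleStore;

    private final BlockingQueue<String> workQueue = new ArrayBlockingQueue<>(1000);

    ModuleSyncQueueSketch(final Map<String, String> cmHandleStore) {
        this.cmHandleStore = cmHandleStore;
    }

    void enqueue(final String cmHandleId) {
        workQueue.add(cmHandleId);
    }

    // Worker side: fetch the full CM handle only at processing time.
    String processNext() {
        final String cmHandleId = workQueue.poll();
        return cmHandleId == null ? null : cmHandleStore.get(cmHandleId);
    }
}
```

Memory then scales with the number of queued ids rather than with the size of each CM handle, at the cost of one extra persistence read per processed handle.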

ModuleSetTag lookup

As ModuleSetTag is a relatively new feature, and performance testing of that feature is not yet complete (in progress, see CPS-1805), there is some risk of this causing high memory consumption.

In particular, the method ModuleSyncService::getAnyReadyCmHandleByModuleSetTag uses a CPS Path Query that returns all CM handles with a given moduleSetTag, even though only one is needed. An alternative solution is recommended.
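The shape of the alternative can be sketched with plain Java streams (simplified types, not the actual implementation): stop at the first CM handle in READY state with the given tag, instead of materialising every match. In the real repository layer this would correspond to a query bounded to a single row (e.g. LIMIT 1).

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch only: CmHandle is a simplified stand-in type.
class ModuleSetTagLookupSketch {

    record CmHandle(String id, String moduleSetTag, String state) {}

    static Optional<CmHandle> findAnyReadyCmHandleByModuleSetTag(
            final List<CmHandle> cmHandles, final String moduleSetTag) {
        return cmHandles.stream()
                .filter(cmHandle -> moduleSetTag.equals(cmHandle.moduleSetTag()))
                .filter(cmHandle -> "READY".equals(cmHandle.state()))
                .findFirst();   // short-circuits: remaining matches are never materialised
    }
}
```

With a database-side limit the memory cost becomes constant in the number of CM handles sharing a moduleSetTag, rather than linear.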

