Table of Contents

References

<insert Jira Ref, Using confluence menu options +, Jira Issue/Filter>
<optional other relevant references, Jiras, Confluence pages, external links>

Assumptions <optional>

<optional; assumptions are decisions made up front, i.e. everyone agrees on the answer, but they are important to mention>

CPS-2146: Analysis of Out of Memory and related Errors in NCMP

Jira macro: ONAP Jira, issue CPS-2146
(columns: key, summary, type, created, updated, due, assignee, reporter, priority, status, resolution)

...

Issues & Decisions

This is a very important (blocking) issue.

# | Issue | Notes | Decision
1 | This is an open issue | |
2 | Do we need an analysis template? | It is a convention for (new) developers, to guide them | Luke Gleeson and Toine Siebelink agreed we do, to have consistent study pages
3 | Placeholder for issue | |

<Note: use green for closed issues, yellow for important ones if needed>

Any Other Header

<we do not want to dictate the remainder of an analysis; it will depend on the type of user story at hand>

Any Other Header

Background

The use of Hazelcast during NCMP's CM-handle Module Sync is leading to:

  1. High memory usage during CM-handle registration
  2. Consistency problems
  3. Poor load balancing between NCMP instances for module sync

Summary of Hazelcast structures for Module/Data Sync

Structure | Type | Notes
moduleSyncWorkQueue | BlockingQueue<DataNode> | Entire CM handles are stored in the work queue for module sync, causing very high memory usage during CM-handle registration. The use of this blocking queue also likely causes load-balancing issues during module sync.
moduleSyncStartedOnCmHandles | Map<String, Object> | One entry is stored in memory per CM handle in ADVISED state.
dataSyncSemaphores | Map<String, Boolean> | This map is only populated if data sync is enabled for a CM handle. If the feature is used, it stores one entry per CM handle with data sync enabled.
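As an illustrative sketch (not the actual NCMP code), the memory pressure from queueing whole DataNode objects can be contrasted with queueing only CM-handle IDs and re-reading the node when the entry is processed; the DataNode class below is a simplified stand-in:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Simplified stand-in for a CPS DataNode: a real one carries an xpath,
// attributes and child nodes, so each queued entry is far larger than an ID.
class DataNode {
    final String xpath;
    final java.util.Map<String, Object> attributes;
    DataNode(String xpath, java.util.Map<String, Object> attributes) {
        this.xpath = xpath;
        this.attributes = attributes;
    }
}

public class WorkQueueSketch {
    // Current approach: whole DataNode objects are queued for module sync.
    static final BlockingQueue<DataNode> moduleSyncWorkQueue = new ArrayBlockingQueue<>(1000);

    // Leaner alternative (illustrative): queue only the CM-handle ID and
    // look the node up again when the entry is actually worked on.
    static final BlockingQueue<String> idOnlyWorkQueue = new ArrayBlockingQueue<>(1000);

    // Returns how many IDs the bounded queue accepted.
    static int enqueueIds(java.util.List<String> cmHandleIds) {
        int accepted = 0;
        for (String id : cmHandleIds) {
            if (idOnlyWorkQueue.offer(id)) {
                accepted++;
            }
        }
        return accepted;
    }
}
```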

Consistency problems

Consistency problems are evidenced by log entries showing duplicate CM-handles being created:

STATEMENT:  insert into fragment (anchor_id,attributes,parent_id,xpath) values ($1,$2,$3,$4) RETURNING *
DETAIL:  Key (anchor_id, xpath)=(2, /dmi-registry/cm-handles[@id='C9B31349E93B850D52EFD2F632BAE598']) already exists.
ERROR:  duplicate key value violates unique constraint "fragment_anchor_id_xpath_key"
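The duplicate-key error is consistent with two NCMP instances both deciding to create the same CM handle. A minimal sketch (assumed names, not NCMP code) of how an atomic claim on a shared map avoids the double insert:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class DuplicateInsertSketch {
    // Stand-in for a shared "started" map such as moduleSyncStartedOnCmHandles.
    static final ConcurrentMap<String, String> startedOnCmHandles = new ConcurrentHashMap<>();

    // Returns true only for the single instance that wins the claim; the
    // loser must skip the insert instead of violating the unique
    // (anchor_id, xpath) constraint on the fragment table.
    static boolean tryClaim(String cmHandleId, String instanceName) {
        return startedOnCmHandles.putIfAbsent(cmHandleId, instanceName) == null;
    }
}
```

This only holds if the map itself is strongly consistent across instances, which is exactly what the cluster issues below put in doubt.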

Additionally, in CPS-2146 it was reported that:

moduleSync was quite chaotic between the two NCMP pods, both of them logged that the other one is working on the given cmHandle which reached the READY state minutes ago.

The consistency issues are likely a result of Hazelcast requiring an odd number of cluster members to resolve conflicts via quorum: with only two members, neither side of a network partition holds a strict majority.
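The majority arithmetic can be sketched as follows (illustrative only, not Hazelcast's actual split-brain protection API): a partition can safely make progress only if it holds a strict majority of the configured cluster size, which a two-node cluster cannot guarantee after a split.

```java
public class QuorumSketch {
    // A partition has quorum only with a strict majority of the cluster.
    static boolean hasQuorum(int clusterSize, int reachableMembers) {
        return reachableMembers > clusterSize / 2;
    }
}
```

With three members, a 2-member partition retains quorum; with two members, a 1-member partition does not, so both halves either stall or risk diverging.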

Proposed Changes

It is proposed that the LCM (Lifecycle Management) state machine be changed to include an explicit state for syncing modules (or data).

The previous LCM State Machine is outlined here:

[draw.io diagram: Existing LCM State Machine]

The proposed LCM State Machine is:

[draw.io diagram: Proposed LCM State Machine]

Aside: For Module Upgrade, the state transition from READY to LOCKED to ADVISED could be simplified to READY to ADVISED.
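A hedged sketch of such a machine, assuming the new explicit state is called SYNCING and assuming a plausible transition set (the actual names and transitions are defined by the proposal's diagram, not by this sketch):

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class LcmStateMachineSketch {
    // CM-handle lifecycle states; SYNCING is the assumed new explicit state.
    enum LcmState { ADVISED, SYNCING, READY, LOCKED, DELETING, DELETED }

    // Assumed allowed transitions, including the simplification mentioned
    // for module upgrade (READY -> ADVISED directly, instead of via LOCKED).
    static final Map<LcmState, Set<LcmState>> TRANSITIONS = new EnumMap<>(Map.of(
        LcmState.ADVISED,  EnumSet.of(LcmState.SYNCING, LcmState.DELETING),
        LcmState.SYNCING,  EnumSet.of(LcmState.READY, LcmState.LOCKED),
        LcmState.READY,    EnumSet.of(LcmState.ADVISED, LcmState.LOCKED, LcmState.DELETING),
        LcmState.LOCKED,   EnumSet.of(LcmState.ADVISED, LcmState.DELETING),
        LcmState.DELETING, EnumSet.of(LcmState.DELETED),
        LcmState.DELETED,  EnumSet.noneOf(LcmState.class)
    ));

    static boolean canTransition(LcmState from, LcmState to) {
        return TRANSITIONS.getOrDefault(from, Set.of()).contains(to);
    }
}
```

With an explicit SYNCING state persisted per CM handle, ownership of in-flight work lives in the database rather than in a distributed in-memory map.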

Proof of Concept

A PoC was constructed: "WIP Remove hazelcast map for module sync" (https://gerrit.nordix.org/c/onap/cps/+/20724)

...