Summary

This is an assessment of current performance of CPS operations, as of August 2023.

Since January 2023, performance of many CPS and NCMP operations have been increased by orders of magnitude - some highlights:

NCMP CM-handle de-registration moved from average quadratic to linear time (constant rate). De-registering 20,000 CM-handles is thousands of times faster now.
CPS Path Queries have moved from worst-case quadratic to linear time complexity. Queries that took days previously take minutes or seconds now.
Memory usage during read, query, and update operations has been decreased by more than 90% (10 times less memory).
Not only has performance increased, but reliability and robustness. It is now possible for a single batch operation to handle tens of thousands of elements, where it would have failed before.

CPS Core Performance

Test Environment

Hardware	Specifications
CPU	13th Gen Intel© Core™ i9-13900H (24 MB cache, 14 cores, up to 5.40 GHz turbo mode)
Memory	32 GB RAM (16 GB x 2, DDR5, 4800 MHz)
Storage	1 TB M.2 PCIe NVMe SSD
Operating System	Linux Mint 21.2 Cinnamon (based on Ubuntu 22.04)

Test setup

The performance tests are written in Groovy (a JVM language). The tests were executed from Intelli-J using the Amazon Corretto 17 JDK.

As all CPS Core operations are synchronous, the results here are to be considered as single-threaded performance only.

Test data

Test data uses the Open ROADM YANG model - a real-world model for optical devices. Specifically, openroadm-device nodes, each consisting of 86 fragments, are created. For example, a test that creates 1,000 device nodes will result in 86,000 fragments in the database. Some tests use up to 3,000 device nodes (258,000 fragments - equivalent to around 20,000 CM-handles in NCMP), with four anchors replicating the data, meaning that the system has been tested up to 1 million fragments.

Storing data nodes

A varying number of Open ROADM device nodes will be stored using CpsDataService::saveData.

Created device nodes	1	100	200	300	400	500	600	700	800	900	1,000
Fragments created	86	8,600	17,200	25,800	34,400	43,000	51,600	60,200	68,800	77,400	86,000
Time (seconds)	0.30	2.36	4.36	7.15	9.76	11.50	14.77	18.43	19.79	22.16	26.54
Fragments per second	287	3,644	3,945	3,608	3,525	3,739	3,494	3,266	3,477	3,493	3,240

Graph of time taken to store device nodes

Observations

Storing data nodes has linear time complexity (as expected).
Raw performance is roughly 3,500 fragments per second for the given test setup.
Performance can be improved by enabling write batching (CPS-1795)
There are edge cases with exponential complexity (adding books to the bookstore model).

Commentary

The current database schema does not allow for Hibernate to use JDBC write batching. There is work in progress to address this.

JPA/Hibernate is not well-suited to tree-structured data. As an example, writing data nodes / fragments happens in two low-level steps:

SQL INSERT to create the fragments
SQL UPDATE to set the parent IDs of those created fragments
meaning each created fragment requires two DB operations

There is work in progress to address this behavior using a custom algorithm using JDBC. The use of a graph database would implicitly fix such issues.

Updating data nodes

(NOTE: I have removed the previous test results here, as the testing was flawed).

Analysis is still ongoing, but here is an overview of some of the findings:

For making CM-handle state updates, NCMP uses CpsDataService::updateDataNodesAndDescendants (plural - taking a Map<xpath, json>). The performance of this function is poor. Because it takes a map of JSON, it needs to instantiate new YangParser for piece of JSON. Better performance would be achieved by using CpsDataService::updateDataNodeAndDescendants (singular - taking a single piece of JSON). Note this API could be removed after, since it is only used by NCMP, and was never exposed as part of the public REST API.
Related: CpsDataService::saveListElementsBatch likewise has very poor performance compared with CpsDataService::saveListsElements, which can achieve the same effect by correctly preparing the input JSON. The poor performance stems from needing a new YangParser instance for each chunk of JSON instead of a single YangParser for one JSON string. This API is not publicly exposed via the REST API and is only used in one place in NCMP, so it could be removed.
The main public API for updating data nodes is CpsDataService::updateDataNodeAndDescendants (singular). This API has very high variations in performance, depending on how the JSON is prepared, for example, given an update that effects exactly the same data nodes:
- updating a top-level node with parent path of '/' is much slower than updating the descendants. See table below.
- As such, it is now advised to update data trees at the deepest possible level.

Comparison of updating using different approaches

There are multiple was of updating data nodes, each with different performance.

In these scenario, 1,000 Open ROADM device nodes are already defined. A number of these existing data nodes will be updated using CpsDataService::updateDataNodeAndDescendants (singular).

Scenario	Time (seconds)	Parent Xpath	Sample Json
Update 1 device node under top-level container	0.6	/openroadm-devices	{ "openroadm-device": [ ... ] }
Update a list of 100 device nodes under top-level container	3	/openroadm-devices	{ "openroadm-device": [ ... ] }
Update top-level container having a list of 100 device nodes	22	/	{ "openroadm-devices": { "openroadm-device": [ ... ] } }

Updating data leaves

In this scenario, 1,000 Open ROADM device nodes are already defined. The data leaves of a number of these existing data nodes will be updated using CpsDataService::updateNodeLeaves.

Example JSON payload for updating data leaves for one device:

{
    'openroadm-device': [
        {'device-id':'C201-7-1A-1', 'status':'fail', 'ne-state':'jeopardy'}
    ]
}

Test Results

Updated device nodes	1	100	200	300	400	500	600	700	800	900	1,000
Fragments updated	1	100	200	300	400	500	600	700	800	900	1,000
Time (seconds)	0.20	0.27	0.28	0.28	0.32	0.38	0.39	0.47	0.49	0.52	0.56
Fragments per second	5	370	714	1,071	1,250	1,316	1,539	1,489	1,633	1,731	1,786

Graph of updating data leaves of device nodes

Observations

Updating data leaves has linear time complexity.
Raw performance is around 1,500 fragments per second.
This is very fast compared to updating whole data nodes, and should be preferred where possible.

Commentary

The performance of this function is excellent, and yet it may be improved by write batching.

I recommend that NCMP use this API for updating CM-handle state.

Deleting data nodes

In this scenario, 300 Open ROADM device nodes are already defined. A number of these data nodes will be deleted using CpsDataService::deleteDataNodes. The types of nodes will be varied, for example, deleting container nodes, list elements, or whole lists.

Test results

N =	50	100	150	200	250	300	Example xpath
Delete top-level container node	-	-	-	-	-	0.63	/openroadm-devices
Batch delete N/300 container nodes	0.15	0.26	0.38	0.45	0.55	0.69	/openroadm-devices/openroadm-device[@device-id='C201-7-1A-10']/org-openroadm-device
Batch delete N/300 lists elements	0.13	0.25	0.34	0.45	0.55	0.67	/openroadm-devices/openroadm-device[@device-id='C201-7-1A-49']
Batch delete N/300 whole lists	0.51	1.05	1.40	1.85	2.13	2.56	/openroadm-devices/openroadm-device[@device-id='C201-7-1A-293']/org-openroadm-device/degree
Try batch delete N/300 non-existing	0.25	0.54	0.67	0.95	1.15	1.32	/path/to/non-existing/node[@id='27']

Observations

Delete performance is linear on the amount of data being deleted (as expected).
Raw performance of deleting container nodes is around 35,000 fragments per second. (So we can delete data nodes more than 10x faster than creating them.)
Deleting lists is much slower than deleting the parent container of the list (this can be improved).
Of note, attempting to delete non-existing data nodes takes longer than actually deleting the equivalent amount of nodes with descendants - it is a slow operation.

Suggested improvement: For whole list deletion, add a condition to the WHERE clause in the SQL for deleting lists, to narrow the search space to children of the parent. For example:

DELETE FROM fragment WHERE (existing conditions)

  AND parent_id = (SELECT id FROM fragment WHERE xpath = '/parent-xpath')

This should narrow the performance gap in this case.

Reading data nodes

In these tests, a varying number of Open ROADM devices are created and retrieved.

Reading top-level container node

In this test, CpsDataService::getDataNodes is used to retrieve the top-level container node.

Test results

Reading the top-level container node with no descendants:

Total device nodes	500	1,000	1,500	2,000	2,500	3,000
Fragments read	1	1	1	1	1	1
Time (milliseconds)	47	52	48	56	48	47

The above data clearly indicates constant time.

Reading the top-level container node with all descendants:

Total device nodes	500	1,000	1,500	2,000	2,500	3,000
Fragments read	43,000	86,000	129,000	172,000	215,000	258,000
Time (seconds)	0.42	1.19	1.54	2.16	2.53	2.67
Fragments per second	102,381	72,269	83,766	79,630	85,657	96,629

Graph of time taken to read top-level container node with all descendants

Observations

Reading a single top-level container node with no descendants has constant time (as expected).
Reading a single top-level container node with all descendants has linear time (as expected).
Raw performance of reading with all descendants is roughly 100,000 fragments per second.

Commentary

This is the fastest operation in CPS, in terms of fragments per second. This is not surprising, as a simple read should be the fastest operation.

Reading data nodes for multiple xpaths

This test uses CpsDataService::getDataNodesForMultipleXpaths with all descendants to retrieve a varying number of Open ROADM device nodes.

Test results

Total device nodes	500	1,000	1,500	2,000	2,500	3,000
Fragments read	43,000	86,000	129,000	172,000	215,000	258,000
Time (seconds)	0.61	1.15	1.52	2.14	2.96	3.97

Special case: attempting to read multiple non-existing data nodes

In this case, we attempt to read many non-existing data nodes:

Total devices nodes	500	1,000	1,500	2,000	2,500	3,000
Fragments read	0	0	0	0	0	0
Time (milliseconds)	10	10	9	9	7	8

The above case appears to be constant time, but theoretically must be linear - we'll call it negligible time.

Observations

Reading many data nodes with all descendants has linear time (as expected).
Attempting to read many non-existing data nodes takes negligible time.
Raw performance of reading many with all descendants is roughly 80,000 fragments per second.

Additional test cases: Reading container node versus list

Recently, functionality was added to enable reading whole lists (CPS-1696). Here we compare performance of reading a container node containing a list, versus reading the list (with all descendants).

Total device nodes	500	1,000	1,500	2,000	2,500	3,000	xpath
Reading container	0.386	0.712	1.529	2.667	1.759	3.112	/openroadm-devices
Reading list	0.585	1.335	2.036	2.860	2.769	3.949	/openroadm-devices/openroadm-device

As can be seen, it is slower reading the list than reading the parent node containing the list.

Suggested improvement: Add a condition to the WHERE clause in the SQL for reading lists, to narrow the search space to children of the parent. For example:

SELECT * FROM fragment WHERE (existing conditions)

  AND parent_id = (SELECT id FROM fragment WHERE xpath = '/parent-xpath')

This should narrow the performance gap in this case.

Cps Path Queries

There have been massive orders-of-magnitude improvements to query performance. Formerly, queries had quadratic complexity, which is now linear. In the last month alone, performance has improved by 100x.

With a simple improvement, some cases can be made constant time instead of linear.

In the graphs below, time is measured in milliseconds.

Query one device out of many using absolute cps path

In this case, a query that matches a single device node using an absolute cps path is executed.

Fetch descendants	Omit Descendants	Direct Descendants	All Descendants
Query	//openroadm-device[@device-id="C201-7-1A-14"]
Graph
Time Complexity	Linear on total number of fragments in database	Linear on total number of fragments in database	Linear on total number of fragments in database
Comments	These operations are as fast as reading with getDataNodes API.

Query one device out of many using descendant cps path

In this case, a query that matches a single device node using a descendant cps path is executed.

Fetch descendants	Omit Descendants	Direct Descendants	All Descendants
Query	//openroadm-device[@device-id="C201-7-1A-14"]
Graph
Time Complexity	Linear on total number of fragments in database	Linear on total number of fragments in database	Linear on total number of fragments in database
Comments	These operations are as fast as reading with getDataNodes API.

Query all devices using absolute cps path

In this case, a query that matches all device nodes using an absolute cps path is executed.

Fetch descendants	Omit Descendants	Direct Descendants	All Descendants
Query	/openroadm-devices/openroadm-device[@status="success"]
Graph
Time Complexity	Linear on the amount of fragments returned from query	Linear on the amount of fragments returned from query	Linear on the amount of fragments returned from query
Comments	This operation is around half the speed as getDataNodes.	This operation is around half the speed as getDataNodes.	This has same speed as reading with getDataNodes.

Query all devices using descendant cps path

In this case, a query that matches all device nodes using a descendant cps path is executed.

Fetch descendants	Omit Descendants	Direct Descendants	All Descendants
Query	//openroadm-device[@ne-state="inservice"]
Graph
Time Complexity	Linear on the amount of fragments returned from query	Linear on the amount of fragments returned from query	Linear on the amount of fragments returned from query
Comments	This operation is around half the speed as getDataNodes.	This operation is around half the speed as getDataNodes.	This has same speed as reading with getDataNodes.

Commentary

Suggested improvement: For absolute cps path queries, the complexity can be reduced to best-case constant time (so it is no longer linear on the total number of fragments in the database).

Add a condition to the WHERE clause in the SQL to narrow the search space to children of the parent. For example:

SELECT * FROM fragment WHERE (existing conditions)

  AND parent_id IN (SELECT id FROM fragment WHERE xpath = '/parent-xpath')

(Note we used "parent_id IN" as there may be multiple parents with the same xpath during query across anchors.

Comparison of NCMP Performance across versions

CM-handle registration

Release	Date	CM-Handles removed	100	500	1,000	2,000	5,000	10,000	20,000	40,000	Comments
Kohn	Oct 2022	Time taken	8 sec	16 sec	19 sec	33 sec	1 min	3 min	ERROR	ERROR	Error due to DB running out shared memory
London	May 2023	Time taken	6 sec	7 sec	12 sec	22 sec	1 min	3 min	10 min	32 min
current	Aug 2023	Time taken	6 sec	7 sec	10 sec	16 sec	31 sec	57 sec	71 sec	108 sec

Commentary

CM-handle registration is multi-threaded, so performance may appear to scale better than linear, until the CPU cores are maxed out.

As can be seen below, CPU usage never reached 100% during the tests of the current version.

CM-handle de-registration

Release	Date	CM-Handles removed	100	500	1,000	2,000	5,000	10,000	20,000	40,000	Comments
Kohn	Oct 2022	Time taken	7 sec	2 min	7 min	25 min	2.5 hour	est: 10 hour	est: 2 days	est: 7 days	Some values estimated due to time constraints
London	May 2023	Time taken	< 1 sec	2 sec	3 sec	5 sec	17 sec	37 sec	85 sec	ERROR	Error due to 32K limit
current	Aug 2023	Time taken	< 1 sec	1 sec	3 sec	4 sec	14 sec	23 sec	65 sec	2 min

Current release has exactly linear performance for CM-handle de-registration (on a single thread):

CM-handles Removed	Time (sec)	CM-handles/sec
500	1.5	327
1000	2.6	377
5000	13.2	377
10000	25.9	385
20000	56.1	356

Commentary

CM-handle de-registration is a synchronous operation. Since the above tests send a single large batch, the operation is performed in a single thread. The linear performance reflects this. If the operation was broken into many smaller requests, the expectation is that it would complete many times faster - but this has not been tested.

Space shortcuts

Page tree

Summary

CPS Core Performance

Test Environment

Storing data nodes

Updating data nodes

Comparison of updating using different approaches

Updating data leaves

Deleting data nodes

Reading data nodes

Reading top-level container node

Reading data nodes for multiple xpaths

Additional test cases: Reading container node versus list

Cps Path Queries

Query one device out of many using absolute cps path

Query one device out of many using descendant cps path

Query all devices using absolute cps path

Query all devices using descendant cps path

Commentary

Comparison of NCMP Performance across versions

CM-handle registration

CM-handle de-registration

Space shortcuts

Page tree

Montreal Read/Write Performance

Summary

CPS Core Performance

Test Environment

Storing data nodes

Updating data nodes

Comparison of updating using different approaches

Updating data leaves

Deleting data nodes

Reading data nodes

Reading top-level container node

Reading data nodes for multiple xpaths

Additional test cases: Reading container node versus list

Cps Path Queries

Query one device out of many using absolute cps path

Query one device out of many using descendant cps path

Query all devices using absolute cps path

Query all devices using descendant cps path

Commentary

Comparison of NCMP Performance across versions

CM-handle registration

CM-handle de-registration