Summary

A new Cps Path query algorithm is being proposed. The time complexity was measured for existing and proposed solutions, and is found to be:

Operation	Existing solution	Proposed solution
Query returning 1 out of N nodes	O(N)	O(1)
Query returning all N out of N nodes	O(N²)	O(N)

For a database comprising over 1,000,000 data nodes in a single anchor, the following results were recorded:

Operation	Existing solution	Proposed solution
Query returning 0.01% of all nodes	20 seconds	< 50 ms
Query returning all nodes	45 minutes	45 seconds

Background

The current implementation of Cps Path queries relies on regular expressions in the generated SQL queries.

For example, given this Cps Path:

/bookstore/categories[@code="1"]/books[@title="Matilda"]

the following SQL query is generated:

SELECT
    * 
FROM
    FRAGMENT 
WHERE
    anchor_id = 3
    AND xpath ~ '/bookstore/categories\[@code=''1'']/books(\[@(?!.*\[).*?])?$' 
    AND (
        attributes @> '{"title":"Matilda"}'
    )

The use of regex potentially results in full table scans, severely impacting performance, and causing query time duration to grow linearly with the fragment table size (i.e. queries get slower as the database gets bigger).

Path queries using path-component lookup

A new solution is being proposed where each path component of the Cps Path is looked up in constant time, thus addressing performance issues.

Objective of the proposed solution

The objective is to achieve the maximum performance for queries.

For a query that will return 1 out of N items, the best theoretical time complexity is constant, O(1). The existing solution is linear, O(N).
For a query that will return all N items, the best theoretical time complexity is linear, O(N). The existing solution is quadratic, O(N²).

Implementation Proposal

The new approach involves adding a column to the Fragment table, storing the last path component (called xpath_component here). The new column is indexed, to allow constant-time lookups.

id	parent_id	anchor_id	xpath	xpath_component
1	NULL	3	/bookstore	bookstore
2	1	3	/bookstore/categories[@code='1']	categories[@code='1']
3	2	3	/bookstore/categories[@code='1']/books[@title='Matilda']	books[@title='Matilda']
4	2	3	/bookstore/categories[@code='1']/books[@title='The Gruffalo']	books[@title='The Gruffalo']

For example, given this CPS Path:

/bookstore/categories[@code="1"]/books[@title="Matilda"]

The new approach will first look for "bookstore", and using that as the parent node, look for ''categories[@code='1']", and using that as parent, look for "books" or an xpath component starting with "books[", before finally applying leaf condition checks.

Given the above CPS Path, the following SQL query is generated - note the use of sub-queries for each path component:

SELECT
    * 
FROM
    fragment 
WHERE
    parent_id IN (
        SELECT
            id 
        FROM
            fragment 
        WHERE
            xpath_component = 'categories[@code=''1'']'
            AND parent_id = (
                SELECT
                    id 
                FROM
                    fragment 
                WHERE
                    xpath_component = 'bookstore'
                    AND anchor_id = 3
                    AND parent_id IS NULL
            )
        ) 
        AND (
            xpath_component = 'books'
            OR xpath_component LIKE 'books[%'
        ) 
        AND (
            attributes @> '{"title":"Matilda"}'
        )

Comparison with existing solution

Given the following (incomplete) fragment table:

id	parent_id	anchor_id	xpath
1	NULL	3	/bookstore
2	1	3	/bookstore/categories[@code='1']
3	2	3	/bookstore/categories[@code='1']/books[@title='Matilda']
4	2	3	/bookstore/categories[@code='1']/books[@title='The Gruffalo']
5	1	3	/bookstore/categories[@code='2']
6	5	3	/bookstore/categories[@code='2']/books[@title='Good Omens']
7	5	3	/bookstore/categories[@code='2']/books[@title='The Colour of Magic']
8	NULL	4	/dmi-registry
9	8	4	/dmi-registry/cm-handles[@id='d99ef4cc8d744a749490cceefd4c469e']

Here is a visualization of both algorithms:

Existing solution	Proposed solution
Because the existing solution uses a regex, all fragments in anchor 3 need to be examined to see if they match the regex.	Because the proposed solution uses sub-queries to look up each path component, only relevant fragments are examined.
Green and yellow nodes have been checked against the regex. Yellow nodes have had JSON attributes examined.	Green nodes have looked up using an index-only lookup. Yellow nodes have had JSON attributes examined.

A note on fetching descendant nodes

The above sections only partly describe the current and proposed solutions. In both cases, a CPS path query is actually sent as two database queries:

The first SQL query performs the CPS path query.
The second SQL query fetches the descendant nodes, if required.
- Alternatively, if an ancestor CPS path was provided, the second SQL query would fetch the ancestors (and their descendants if required).

The existing implementation uses a SQL query using a LIKE operator to fetch descendants, while the proposed solution will use recursive SQL to fetch descendants of fragment returned from the CPS path query:

WITH RECURSIVE descendant_search AS (
  SELECT id, 0 AS depth
    FROM fragment
   WHERE id IN (:fragmentIds)
   UNION
  SELECT child.id, depth + 1
    FROM fragment child INNER JOIN descendant_search ds ON child.parent_id = ds.id
   WHERE depth <= :maxDepth
) 
SELECT f.id, anchor_id AS anchorId, xpath, f.parent_id AS parentId, CAST(attributes AS TEXT) AS attributes 
FROM fragment f INNER JOIN descendant_search ds ON f.id = ds.id

Note: using the proposed query solution without the recursive descendant search results in no performance improvement; the converse is also true. Only when the two new queries are combined that optimal performance is reached (the two queries being path-component lookup and recursive SQL descendant search).

Proof of Concept

A proof of concept (poc) implementation was developed so that performance could be compared against existing Cps Path query algorithms.

Test data

These tests below were performed using Groovy/Spock tests, directly calling the CPS-Service Java API (thus these tests do not include any HTTP overhead as when REST APIs are used).

The test data is using the openroadm model, present in the integration-test module of CPS source code repository. In this case, each 'device node' has 86 data nodes. In these tests, four anchors were populated using the same data. Thus, for 3000 device nodes, there are 3000 devices x 86 data nodes x 4 anchors = 1,032,000 total data nodes.

Performance Comparison

Query one device out of many (1 out of N)

In this case, a query that matches a single device node is executed.

Case	Query one out of many using descendant cps path	Query one out of many using absolute cps path
Query	//openroadm-device[@device-id="C201-7-1A-14"]	/openroadm-devices/openroadm-device[@device-id="C201-7-1A-19"]
Comparison graph
Graph detail		Note: constant to within ±1 ms
Time complexity of existing solution	O(N)	O(N)
Time complexity of proposed solution	O(N)	O(1)
Comments	This may be O(1) or O(N) depending on how many nodes are matched by the descendant prefix. For example, it is O(N) with the above query, but O(1) when using this query: //openroadm-devices/openroadm-device[@device-id="C201-7-1A-14"]

As seen in the graphs, query performance for current master branch is linear on the size of the database, while the PoC implementation is constant time (independent of database size).

(Note, I have not illustrated all different fetchDescendantOptions, as it has only minor impact in this case of fetching 1 device node.)

Query all devices (N out of N) using descendant cps path

In this case, a query that matches all device nodes using a descendant cps path is executed.

Fetch descendants	Omit Descendants	Direct Descendants	All Descendants
Query	//openroadm-device[@ne-state="inservice"]	//openroadm-device[@ne-state="inservice"]	//openroadm-device[@ne-state="inservice"]
Comparison graph
Graph detail
Time complexity of existing solution	Unclear, as there is very high variance. Likely O(N).	O(N²)	O(N²)
Time complexity of proposed solution	O(N)	O(N)	O(N)

In above cases, it is observed that the existing algorithm grows quadratically - O(n²), e.g. doubling from 1000 to 2000 nodes takes 4 times longer (going from 40 to 160 seconds).

The PoC algorithm grows linearly - O(n), e.g. doubling from 1000 to 2000 nodes takes 2 times longer (going from 3.5 to 7 seconds).

Query all devices (N out of N) using absolute cps path

In this case, a query that matches many device nodes using an absolute cps path is executed.

Fetch descendants	Omit Descendants	Direct Descendants	All Descendants
Query	/openroadm-devices/openroadm-device[@status="success"]	/openroadm-devices/openroadm-device[@status="success"]	/openroadm-devices/openroadm-device[@status="success"]
Comparison graph
Graph detail
Time complexity of existing solution	Unclear, as there is very high variance. Likely O(N).	O(N²)	O(N²)
Time complexity of proposed solution	O(N)	O(N)	O(N)

In above cases, it is observed that the existing algorithm grows quadratically - O(n²), while the PoC algorithm grows linearly - O(n).

Possible improvements of proposed solution

Cps Path Query capabilities can be extended

Presently, Cps Path queries limit leaf-condition, text-condition, etc. to the final path component. The proposed solution allows for future improvement where conditions could be applied to any or all path components.

For example, given these queries:

/bookstore/categories[@code='1']/books[@title='Matilda']
/bookstore/categories[@name='SciFi']/books[@title='Matilda']
/bookstore/categories/books[@title='Matilda']

Presently, only case 1 will return data, and cases 2 and 3 return nothing. Given the proposed solution uses an SQL sub-query for each path-component, with a little work, cases 2 and 3 could be supported.

Other operations can be accelerated

The same algorithm to improve query performance could also be used to improve performance of other operations, such as GET, UPDATE, and DELETE data nodes.

However, some of these existing operations have "plural" implementations taking Collections as parameters, such as getDataNodesForMultipleXpaths. Additional investigation is needed for these cases.

Index-only lookup where leaf-condition is in the xpath

Given the following CPS path query:

/bookstore/categories[@code='1']/books[@title='Matilda']

In this case, the bookstore model specifies that the 'title' leaf is the key for 'books', and thus the xpath is encoded in the database as: /bookstore/categories[@code='1']/books[@title='Matilda']

We can optimize for this case, by checking if xpath_component = "books[@title='Matilda']", thus avoiding the need to examine the attributes field in the fragment table.

In the PoC implementation, the following is a snippet from the SQL generated from the CPS path query:

        AND (
            xpath_component = 'books' OR xpath_component LIKE 'books[%'
        ) 
        AND (
            attributes @> '{"title":"Matilda"}'
        )

it may be replaced with:

        AND (
            xpath_component = 'books[@title=''Matilda'']'
            OR (
                (
                xpath_component = 'books' OR xpath_component LIKE 'books[%'
                )
                AND (
                    attributes @> '{"title":"Matilda"}'
                ) 
            )
        )

This would allow for an index-only lookup of the node, without needing the LIKE comparison or the attribute check.

Work Breakdown for Implementation

In addition to the changes outlined above, there is additional work remaining for this change to be production-ready.

The main algorithm was mostly done during the PoC (all integration tests are passing for the PoC). The existing PoC code can thus be refactored to make it production ready.

DB upgrade

Because a new column is being added to the Fragment table, this column needs to be populated. An SQL script will be needed to provide a value for of the new xpath_component field based on existing xpath field.

Cps Path Parser changes

CpsPathBuilder and CpsPathQuery classes from cps-path-parser module will need to be updated to provide the individual path components.

Space shortcuts

Page tree

Summary

Background

Path queries using path-component lookup

Objective of the proposed solution

Implementation Proposal

Comparison with existing solution

A note on fetching descendant nodes

Proof of Concept

Test data

Performance Comparison

Query one device out of many (1 out of N)

Query all devices (N out of N) using descendant cps path

Direct Descendants

All Descendants

Query all devices (N out of N) using absolute cps path

Possible improvements of proposed solution

Cps Path Query capabilities can be extended

Other operations can be accelerated

Index-only lookup where leaf-condition is in the xpath

Work Breakdown for Implementation

DB upgrade

Cps Path Parser changes

Space shortcuts

Page tree

CPS-1646 CpsPath queries using path-component lookup

Summary

Background

Path queries using path-component lookup

Objective of the proposed solution

Implementation Proposal

Comparison with existing solution

A note on fetching descendant nodes

Proof of Concept

Test data

Performance Comparison

Query one device out of many (1 out of N)

Query all devices (N out of N) using descendant cps path

Direct Descendants

All Descendants

Query all devices (N out of N) using absolute cps path

Possible improvements of proposed solution

Cps Path Query capabilities can be extended

Other operations can be accelerated

Index-only lookup where leaf-condition is in the xpath

Work Breakdown for Implementation

DB upgrade

Cps Path Parser changes