Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Summary

A new Cps Path query algorithm is being proposed. The time complexity was assessed through code analysis and performance testing, for existing and proposed solutions.

Time complexityExisting solutionProposed solution
Best caseO(N)O(1)
Worst caseO(N2)

O(N)

For a database comprising 1,000,000 data nodes in a single anchor, the following results were recorded:

OperationExisting solutionProposed solution
Query returning 0.01% of all nodes20 seconds< 1 second
Query returning all nodes45 minutes45 seconds

Of course, the performance gap widens as the database grows larger, owing to the different time complexity.

Description of existing Cps Path query algorithms

The current implementation of Cps Path queries relies on regular expressions in the generated SQL queries.

SQL query using regular expression on the full path

For example, given this Cps Path:

/bookstore/categories[@code="1"]/books[@price=15]

the following SQL query is generated:

Code Block
languagesql
SELECT
    * 
FROM
    FRAGMENT 
WHERE
    anchor_id = 3
    AND xpath ~ '/bookstore/categories\[@code=''1'']/books(\[@(?!.*\[).*?])?$' 
    AND (
        attributes @> '{"price":15}'
    )

The use of regex potentially results in full table scans (as each node's xpath needs to be compared against the regex), severely impacting performance, and causing query time duration to grow linearly with the fragment table size (i.e. queries get slower as the database gets bigger).

In addition, the mechanism for fetching descendant nodes causes quadratic complexity in the worst case.

Why does the existing solution have worst case quadratic complexity?

The reason the existing solution is quadratic time when fetching descendants is because the path query translates into a number of SQL queries:

  1. An SQL query that returns nodes matching query criteria - O(N)
  2. For each node returned from query 1, another SQL query is issued to fetch that node with descendants, using a LIKE operator - O(N).

In the case of a query that matches all N nodes, we are issuing 1+N queries, each with complexity O(N). Thus the overall complexity is O(N2).

Path queries using path-component lookup

A new solution is being proposed where each path component of the Cps Path is looked up in constant time, thus addressing performance issues.

Objective

The objective is to achieve the best possible performance for queries.

  • For a query that will return 1 out of N items, the best possible time complexity is constant, O(1).
  • For a query that will return all N items, the best possible time complexity is linear, O(N).

Implementation Proposal

An additional fragment table field is needed to store path components

The new approach involves adding a column to the Fragment table, storing the last path component (called xpath_component here). The new column is indexed, to allow constant-time lookups.

idparent_idanchor_idxpathxpath_component
1NULL3/bookstorebookstore
213/bookstore/categories[@code='1']categories[@code='1']
323/bookstore/categories[@code='1']/books[@title='Matilda']books[@title='Matilda']
423/bookstore/categories[@code='1']/books[@title='The Gruffalo']books[@title='The Gruffalo']

SQL query for path-component lookup

For example, given this CPS Path:

/bookstore/categories[@code="1"]/books[@price=15]

The new approach will used nested sub-queries to search for each path component, first looking for "bookstore", and using that as the parent node, look for ''categories[@code='1']", and using that as parent, look for "books" before finally applying leaf condition checks.

Given the above CPS Path, the following SQL query is generated - note the use of sub-queries for each path component:

Code Block
languagesql
SELECT
    * 
FROM
    fragment 
WHERE
    parent_id IN (

Background

The current implementation of Cps Path queries relies on regular expressions in the generated SQL queries.

For example, given this Cps Path:

/bookstore/categories[@code="1"]/books[@title="Matilda"]

the following SQL query is generated:

Code Block
languagesql
SELECT
    * 
FROM
    FRAGMENT 
WHERE
    anchor_id = 3
    AND xpath ~ '/bookstore/categories\[@code=''1'']/books(\[@(?!.*\[).*?])?$' 
    AND (
        attributes @> '{"title":"Matilda"}'
    )

The use of regex potentially results in full table scans, severely limiting performance, and causing query time duration to grow linearly with the fragment table size (i.e. queries get slower as the DB gets bigger).

A new algorithm for queries is being proposed, avoiding regex, so that query duration is independent of database size.

Objective

The objective is to achieve the maximum theoretical performance for queries.

  • For a query that will return 1 out of N items, the best theoretical time complexity is constant, O(1).
  • For a query that will return all N items, the best theoretical time complexity is linear, O(N).

Complexity of existing solution

As will be demonstrated by performance tests later in this document, the existing solution has the following performance characteristics:

  • For a query that will return 1 out of N items, the current time complexity is linear, O(N).
  • For a query that will return all N items, the current time complexity is quadratic, O(N2).

...

Proposal

The new approach involves adding a column to the Fragment table, storing the last path component (called xpath_component here). The new column is indexed, to allow constant-time lookups.

...

The new approach will first look for "bookstore", and using that as the parent ID, look for ''categories[@code='1']", and using that as parent ID, look for "books" or xpath component starting with "books[", before finally applying leaf condition checks.

For example, given this Cps Path:

/bookstore/categories[@code="1"]/books[@title="Matilda"]

the following SQL query is generated:

Code Block
languagesql
SELECT
    * 
FROM
    fragment 
WHERE
    parent_id IN (
        SELECT
            id 
        FROM
            fragment 
        WHERE
            xpath_component = 'categories[@code=''1'']'
            AND parent_id = (
                SELECT
                    id 
                FROM
                    fragment 
                WHERE
                    xpath_component = 'bookstore'
                    AND anchor_id = 3
                    AND parent_id IS NULL
        SELECT
    )
        )id 
        AND (
FROM
            fragment 
      xpath_component = 'books'WHERE
            OR xpath_component LIKE= 'books[%categories[@code=''1'']'
        ) 
   AND parent_id IN (
        AND (
        SELECT
                   attributes @> '{"title":"Matilda"}'
id 
          )

A note on fetching descendant nodes

The above sections only partly describe the current and proposed solutions. In both cases, a CPS path query is actually send as two database queries:

  • the first query performs the CPS path query
  • the second query fetches the descendants, if required
    • alternatively, if an ancestor CPS path was provided, the second query would fetch the ancestors (with descendants if required)

The current implementation uses a regex-based query to fetch descendants, while the proposed solution will use recursive SQL to fetch descendants of fragment returned from the CPS path query:

Code Block
languagesql
WITH RECURSIVE descendant_search AS (
  SELECT id, 0 AS depth
    FROM fragment
   WHERE id IN (:fragmentIds)
   UNION
  SELECT child.id, depth + 1
    FROM fragment child INNER JOIN descendant_search ds ON child.parent_id = ds.id
   WHERE depth <= :maxDepth
) 
SELECT f.id, anchor_id AS anchorId, xpath, f.parent_id AS parentId, CAST(attributes AS TEXT) AS attributes 
FROM fragment f INNER JOIN descendant_search ds ON f.id = ds.id

Proof of Concept

A PoC was developed so that performance could be compared against existing Cps Path Query algorithms.

Test data

These tests were performed using the openroadm model, present in the integration-test module of CPS source code repo.

In this case, each 'device' node is comprised of 86 cps data nodes. In these tests, 4 anchors were populated using the same data.

Performance Improvement

Query one device from many, using descendant cps path

In this case, a query that matches a single device node is executed, such as:

//openroadm-device[@device-id="C201-7-1A-14"]

...

Image Removed

...

Image Removed

...

Time complexity of
existing solution

...

As seen in the graphs, query performance for current master branch is linear on the size of the database, while the PoC implementation is constant time (independent of DB size).

(Note, I have not illustrated all different fetchDescendantOptions, as it has only minor impact in this case of fetching 1 device node.

Query all devices using descendant cps path

In this case, a query that matches many device nodes using a descendant cps path is executed:

//openroadm-device[@ne-state="inservice"]

...

Direct Descendants

...

All Descendants

...

Image Removed

...

Image Removed

...

Image Removed

...

Image Removed

...

Time complexity of
existing solution

...

In above cases, it is observed that the existing algorithm grows quadratically - O(n2), e.g. doubling from 1000 to 2000 nodes takes 4 times longer (going from 40 to 160 seconds).

The PoC algorithm grows linearly - O(n), e.g. doubling from 1000 to 2000 nodes takes 2 times longer (going from 3.5 to 7 seconds).

Query all devices using absolute cps path

In this case, a query that matches many device nodes using an absolute cps path is executed:

/openroadm-devices/openroadm-device[@status="success"]

...

Image Removed

...

Image Removed

...

Image Removed

...

Image Removed

...

Image Removed

...

Time complexity of
existing solution

...

In above cases, it is observed that the existing algorithm grows quadratically - O(n2), while the PoC algorithm grows linearly - O(n).

      FROM
                    fragment 
                WHERE
                    xpath_component = 'bookstore'
                    AND anchor_id = 3
                    AND parent_id IS NULL
            )
        ) 
        AND (
            xpath_component = 'books'
            OR xpath_component LIKE 'books[%'
        ) 
        AND (
            attributes @> '{"price":15}'
        )

In addition, a new algorithm for fetching descendants is implemented.

Fetching descendants using Recursive SQL query

The proposed solution will use recursive SQL to fetch descendants of fragment returned from the CPS path query:

Code Block
languagesql
WITH RECURSIVE descendant_search AS (
  SELECT id, 0 AS depth
    FROM fragment
   WHERE id IN (:fragmentIds)
   UNION
  SELECT child.id, depth + 1
    FROM fragment child INNER JOIN descendant_search ds ON child.parent_id = ds.id
   WHERE depth <= :maxDepth
) 
SELECT f.id, anchor_id AS anchorId, xpath, f.parent_id AS parentId, CAST(attributes AS TEXT) AS attributes 
FROM fragment f INNER JOIN descendant_search ds ON f.id = ds.id

There is existing code in CpsDataPersistenceService similar to this, so minimal code changes are required to implement this.

Comparison with existing solution

Given the following (incomplete) fragment table:

idparent_idanchor_idxpath
1NULL3/bookstore
213/bookstore/categories[@code='1']
323/bookstore/categories[@code='1']/books[@title='Matilda']
423/bookstore/categories[@code='1']/books[@title='The Gruffalo']
513/bookstore/categories[@code='2']
653/bookstore/categories[@code='2']/books[@title='Good Omens']
753/bookstore/categories[@code='2']/books[@title='The Colour of Magic']
8NULL4/dmi-registry
984/dmi-registry/cm-handles[@id='d99ef4cc8d744a749490cceefd4c469e']

Here is a visualization of both algorithms, executing a query for /bookstore/categories[@code='1']/books[@price='15']:

Existing solutionProposed solution

Because the existing solution uses a regex, all nodes in anchor 3 need to be examined to see if they match the regex.

The proposed solution uses sub-queries to do indexed look-up of each path component, so only relevant nodes are examined.

Image Added

  • Green and yellow nodes have been checked against the regex.
  • Yellow nodes matched the regex, and thus have JSON attributes examined.

Image Added

  • Green nodes have been looked up using an index-only lookup.
  • Yellow nodes have had JSON attributes examined.

If the number of nodes is doubled, the number of xpaths checked against the regex is also doubled (linear complexity).

Image Added

If the number of nodes is doubled, the same number of nodes is still scanned in this case (constant complexity).

Image Added

Detailed Performance Comparison

Proof of Concept

A proof of concept (poc) implementation was developed so that performance could be compared against existing Cps Path query algorithms.

Test setup

These tests below were performed using Groovy/Spock tests, directly calling the CPS-Service Java API (thus these tests do not include any service overhead as when REST APIs are used).

The test data is using the openroadm model, present in the integration-test module of CPS source code repository. In this case, each 'device node' has 86 data nodes. In these tests, four anchors were populated using the same data. Thus, for 3000 device nodes, there are 3000 devices x 86 data nodes x 4 anchors + parent node = 1,032,001 total data nodes.

The test timings used to generate the charts in this document is available as a spreadsheet: CPS-1646 CpsPath queries using path-component lookup.xlsx

Query one device out of many

In this case, a query that matches a single device node is executed.

CaseQuery one out of many using descendant cps pathQuery one out of many using absolute cps path
Query//openroadm-device[@device-id="C201-7-1A-14"]/openroadm-devices/openroadm-device[@device-id="C201-7-1A-19"]
Comparison graph

Image Added

Image Added

Graph detail

Image Added

Image Added

Time complexity of
existing solution

O(N)O(N)
Time complexity of
proposed solution

O(N)

Near-constant, O(1)
Comments

The use of LIKE 'openroadm-device[%' in the query to match candidates
means that all fragments need to be checked, e.g. for 3000 devices,
3000 x 86 ~= 250k fragments need to be scanned.

The use of the absolute path narrows the search space to only direct
children of /openroadm-devices - for 3000 devices, we only need to scan
3000 fragments instead of 250k. (Technically this is still O(N), but linear on
the number of items in the list, not the number of fragments in the DB.)

As seen in the graphs, query performance for current master branch is linear on the size of the database, while the PoC implementation is constant time (independent of database size).

Note: I have not illustrated all different fetchDescendantOptions, as it has only minor impact in the case of fetching 1 device node.

Query all devices using descendant cps path

In this case, a query that matches all device nodes using a descendant cps path is executed.

Fetch descendantsOmit Descendants

Direct Descendants

All Descendants

Query//openroadm-device[@ne-state="inservice"]//openroadm-device[@ne-state="inservice"]//openroadm-device[@ne-state="inservice"]
Comparison graph

Image Added

Image Added

Image Added

Graph detail

Image Added

Image Added

Image Added

Time complexity of
existing solution

Unclear, as there is very high variance. Probably O(N).O(N2)O(N2)
Time complexity of
proposed solution
O(N)O(N)O(N)

Query all devices using absolute cps path

In this case, a query that matches many device nodes using an absolute cps path is executed.

Fetch descendantsOmit DescendantsDirect DescendantsAll Descendants
Query/openroadm-devices/openroadm-device[@status="success"]/openroadm-devices/openroadm-device[@status="success"]/openroadm-devices/openroadm-device[@status="success"]
Comparison graph

Image Added

Image Added

Image Added

Graph detail

Image Added

Image Added

Image Added

Time complexity of
existing solution

Unclear, as there is very high variance. Likely O(N).O(N2)O(N2)
Time complexity of
proposed solution
O(N)O(N)O(N)

Possible improvements

Optimization: Index-only lookup when list key is used in leaf condition

Given the following CPS path query:

/bookstore/categories[@code='1']/books[@title='Matilda']

In this case, the bookstore model specifies that the 'title' leaf is the key for 'books', and thus the xpath is encoded in the database as: /bookstore/categories[@code='1']/books[@title='Matilda']

We can optimize for this case, by checking if xpath_component = "books[@title='Matilda']", thus avoiding the need to examine the attributes field in the fragment table (thus the entire operation involves only database indexes).

In the PoC implementation, the following is a snippet from the SQL generated from the CPS path query:

Code Block
languagebash
        AND (xpath_component = 'books' OR xpath_component LIKE 'books[%') 
        AND (attributes @> '{"title":"Matilda"}')

It may be replaced with this to add the new functionality:

Code Block
languagesql
        AND (
            xpath_component = 'books[@title=''Matilda'']'
            OR (
                (xpath_component = 'books' OR xpath_component LIKE 'books[%') 
                AND (attributes @> '{"title":"Matilda"}')
            )
        ) 

This would allow for an index-only lookup of the node, without needing the LIKE comparison or the attribute check.

Optimization: Faster look-ups for lists and list elements

Given the bookstore model where books are stored in a list, and given a path query //books[@price=15], using proposed solution, this would use a LIKE 'books[' to find potential matches.

AND (xpath_component = 'books' OR xpath_component LIKE 'books[%')

The use of the LIKE operator in this case would result in O(N) time, as every fragment may need to be compared against the LIKE pattern. There is a solution which may push this case toward constant O(1) performance, namely adding the list name as an indexed column to the fragments table - note only list elements need this field populated, container nodes can set this field to NULL:

idparent_idanchor_idxpathxpath_componentlist_name
1NULL3/bookstorebookstoreNULL
213/bookstore/categories[@code='1']categories[@code='1']categories
323/bookstore/categories[@code='1']/books[@title='Matilda']books[@title='Matilda']books
423/bookstore/categories[@code='1']/books[@title='The Gruffalo']books[@title='The Gruffalo']books

Our query could be modified to replace the LIKE with an indexed look up:

AND (xpath_component = 'books' OR list_name = 'books')

This would give rise to constant time look up in such cases. If this solution is not implemented, users can work around this problem wrapping lists in container nodes, with the container being included in the descendant query, e.g. instead of //books[@title='Matilda'], change the model and use //books/book[@title='Matilda'] as a query. This would then have O(1) performance.

Note: if we wish to avoid creating additional database fields, using list/container name field would offer the best balance of performance, e.g.

idparent_idanchor_idxpathcontainer
1NULL3/bookstorebookstore
213/bookstore/categories[@code='1']categories
323/bookstore/categories[@code='1']/books[@title='Matilda']books
423/bookstore/categories[@code='1']/books[@title='The Gruffalo']books

In this case, attributes for any list element path component would need to be looked up (so it would no longer be index-only). My suspicion is that this would have better general performance than using xpath_component alone, but testing is needed to verify.

Cps Path Query capabilities can be extended

Presently, Cps Path queries limit leaf-condition, text-condition, etc. to the final path component. The proposed solution allows for future improvement where leaf conditions could be applied to any or all path components.

For example, given these queries:

  1. /bookstore/categories[@code='1']/books[@title='Matilda']
  2. /bookstore/categories[@name='SciFi']/books[@title='Matilda']
  3. /bookstore/categories/books[@title='Matilda']

Presently, only case 1 will return data (because 'code' is the list key for 'categories'), and cases 2 and 3 return nothing. Given the proposed solution uses an SQL sub-query for each path-component, the query generator can be adapted to support cases 2 and 3, with little expected performance impact. Note this feature would require storing the container/list name as an indexed field, as described in the previous section.

Other operations can be accelerated

The same algorithm to improve query performance could also be used to improve performance of other operations, such as GET, UPDATE, and DELETE data nodes. However, some of these existing operations have "plural" implementations taking Collections as parameters, such as getDataNodesForMultipleXpaths. Additional investigation is needed for these cases.

Work Breakdown for Implementation

Beside the work done in the PoC implementation, there is additional work for this change to be production-ready.  The main algorithm is mostly complete in the PoC (all integration tests are passing for the PoC). The existing PoC code can thus be refactored to make it production ready.

Fetch descendants

The proposed solution can broadly be broken into two major changes:

  1. Cps Path query using path-component look-up
  2. Using recursive SQL to fetch descendants of nodes returned from 1.

I proposed that the recursive SQL to fetch descendants be implemented first as an independent change, as it move worst-case complexity from O(N2) to O(N). By contrast, the query using path-component look-up will improve best-case performance from O(N) to O(1).

Cps Path Parser changes

The PoC uses String.split for parsing path components, which means paths containing '/' in the leaf condition are not parsed correctly, such as //books[@title='GNU/Linux']. CpsPathBuilder and CpsPathQuery classes from cps-path-parser module will need to be updated to provide the individual path components (in normalized form).

Work Breakdown

In addition to the changes outlined above, there is additional work remaining for this change to be production-ready. 

The main algorithm was mostly done during the PoC (all integration tests are passing for the PoC). The existing PoC code can thus be refactored to make it production ready.

DB upgrade

Because a new column is being added to the Fragment table, this the new column needs to be populated during DB upgrade (to maintain backwards compatibility with existing databases). An SQL script will be needed to provide a value for of the new xpath_component field based on existing xpath field.

Cps Path Parser changes

CpsPathBuilder and CpsPathQuery classes from cps-path-parser module will need to be updated to provide the individual path components.