Background
The current implementation of Cps Path queries relies on regular expressions in the generated SQL queries.
For example, given this Cps Path:
/bookstore/categories[@code="1"]/books[@title="Matilda"]
the following SQL query is generated:
SELECT * FROM FRAGMENT WHERE anchor_id = 3 AND xpath ~ '/bookstore/categories\[@code=''1'']/books(\[@(?!.*\[).*?])?$' AND ( attributes @> '{"title":"Matilda"}' )
The use of regex potentially results in full table scans, severely limiting performance, and causing query time duration to grow linearly with the fragment table size (i.e. queries get slower as the DB gets bigger).
A new algorithm for queries is being proposed, avoiding regex, so that query duration is independent of database size.
Objective
The objective is to achieve the maximum theoretical performance for queries.
- For a query that will return 1 out of N items, the best theoretical time complexity is constant, O(1).
- For a query that will return all N items, the best theoretical time complexity is linear, O(N).
Complexity of existing solution
As will be demonstrated by performance tests later in this document, the existing solution has the following performance characteristics:
- For a query that will return 1 out of N items, the current time complexity is linear, O(N).
- For a query that will return all N items, the current time complexity is quadratic, O(N2).
Operation | Existing solution | Proposed solution |
---|---|---|
Query 1 out of N | O(N) | O(1) |
Query all out of N | O(N2) | O(N) |
Proposal
The new approach involves adding a column to the Fragment table, storing the last path component (called xpath_component here). The new column is indexed, to allow constant-time lookups.
id | parent_id | anchor_id | xpath | xpath_component |
---|---|---|---|---|
1 | NULL | 3 | /bookstore | bookstore |
2 | 1 | 3 | /bookstore/categories[@code='1'] | categories[@code='1'] |
3 | 2 | 3 | /bookstore/categories[@code='1']/books[@title='Matilda'] | books[@title='Matilda'] |
4 | 2 | 3 | /bookstore/categories[@code='1']/books[@title='The Gruffalo'] | books[@title='The Gruffalo'] |
The new approach will first look for "bookstore", and using that as the parent ID, look for ''categories[@code='1']", and using that as parent ID, look for "books" or xpath component starting with "books[", before finally applying leaf condition checks.
For example, given this Cps Path:
/bookstore/categories[@code="1"]/books[@title="Matilda"]
the following SQL query is generated:
SELECT * FROM fragment WHERE parent_id IN ( SELECT id FROM fragment WHERE xpath_component = 'categories[@code=''1'']' AND parent_id = ( SELECT id FROM fragment WHERE xpath_component = 'bookstore' AND anchor_id = 3 AND parent_id IS NULL ) ) AND ( xpath_component = 'books' OR xpath_component LIKE 'books[%' ) AND ( attributes @> '{"title":"Matilda"}' )
A note on fetching descendant nodes
The above sections only partly describe the current and proposed solutions. In both cases, a CPS path query is actually send as two database queries:
- the first query performs the CPS path query
- the second query fetches the descendants, if required
- alternatively, if an ancestor CPS path was provided, the second query would fetch the ancestors (with descendants if required)
The current implementation uses a regex-based query to fetch descendants, while the proposed solution will use recursive SQL to fetch descendants of fragment returned from the CPS path query:
WITH RECURSIVE descendant_search AS ( SELECT id, 0 AS depth FROM fragment WHERE id IN (:fragmentIds) UNION SELECT child.id, depth + 1 FROM fragment child INNER JOIN descendant_search ds ON child.parent_id = ds.id WHERE depth <= :maxDepth ) SELECT f.id, anchor_id AS anchorId, xpath, f.parent_id AS parentId, CAST(attributes AS TEXT) AS attributes FROM fragment f INNER JOIN descendant_search ds ON f.id = ds.id
Proof of Concept
A PoC was developed so that performance could be compared against existing Cps Path Query algorithms.
Test data
These tests were performed using the openroadm model, present in the integration-test module of CPS source code repo.
In this case, each 'device' node is comprised of 86 cps data nodes. In these tests, 4 anchors were populated using the same data.
Performance Improvement
Query one device from many, using descendant cps path
In this case, a query that matches a single device node is executed, such as:
//openroadm-device[@device-id="C201-7-1A-14"]
Case | Query one out of many using descendant cps path | Query one out of many using absolute cps path |
---|---|---|
Query | //openroadm-device[@device-id="C201-7-1A-14"] | /openroadm-devices/openroadm-device[@device-id="C201-7-1A-19"] |
Comparison graph | ||
Time complexity of | O(N) | O(N) |
Time complexity of proposed solution | O(1) | O(1) |
As seen in the graphs, query performance for current master branch is linear on the size of the database, while the PoC implementation is constant time (independent of DB size).
(Note, I have not illustrated all different fetchDescendantOptions, as it has only minor impact in this case of fetching 1 device node.
Query all devices using descendant cps path
In this case, a query that matches many device nodes using a descendant cps path is executed:
//openroadm-device[@ne-state="inservice"]
Fetch descendants | Omit Descendants | Direct Descendants | All Descendants |
---|---|---|---|
Comparison graph | |||
Graph detail | |||
Time complexity of | O(N) | O(N2) | O(N2) |
Time complexity of proposed solution | O(N) | O(N) | O(N) |
In above cases, it is observed that the existing algorithm grows quadratically - O(n2), e.g. doubling from 1000 to 2000 nodes takes 4 times longer (going from 40 to 160 seconds).
The PoC algorithm grows linearly - O(n), e.g. doubling from 1000 to 2000 nodes takes 2 times longer (going from 3.5 to 7 seconds).
Query all devices using absolute cps path
In this case, a query that matches many device nodes using an absolute cps path is executed:
/openroadm-devices/openroadm-device[@status="success"]
Fetch descendants | Omit Descendants | Direct Descendants | All Descendants |
---|---|---|---|
Comparison graph | |||
Graph detail | |||
Time complexity of | O(N) | O(N2) | O(N2) |
Time complexity of proposed solution | O(N) | O(N) | O(N) |
In above cases, it is observed that the existing algorithm grows quadratically - O(n2), while the PoC algorithm grows linearly - O(n).
Possible improvements of proposed solution
Index-only lookup where leaf-condition is in the xpath
Note in the following CPS path query:
/bookstore/categories[@code='1']/books[@title='Matilda']
In this case, the bookstore model specifies that the 'title' leaf is the key for 'books', and thus the xpath is encoded in the database as: /bookstore/categories[@code='1']/books[@title='Matilda']
We can optimize for this case, using an index-only lookup.
Given the following SQL snippet from the SQL generated from the CPS path query:
AND ( xpath_component = 'books' OR xpath_component LIKE 'books[%' ) AND ( attributes @> '{"title":"Matilda"}' )
it may be replaced with:
AND ( xpath_component = 'books[@title=''Matilda'']' OR ( ( xpath_component = 'books' OR xpath_component LIKE 'books[%' ) AND ( attributes @> '{"title":"Matilda"}' ) ) )
This would allow for an index-only lookup of the item, without needing the LIKE comparison or the attribute check.
Work Breakdown
In addition to the changes outlined above, there is additional work remaining for this change to be production-ready.
The main algorithm was mostly done during the PoC (all integration tests are passing for the PoC). The existing PoC code can thus be refactored to make it production ready.
DB upgrade
Because a new column is being added to the Fragment table, this column needs to be populated. An SQL script will be needed to provide a value for of the new xpath_component field based on existing xpath field.
Cps Path Parser changes
CpsPathBuilder and CpsPathQuery classes from cps-path-parser module will need to be updated to provide the individual path components.