...
The use of regex potentially results in full table scans (as each node's xpath needs to be compared against the regex), severely impacting performance, and causing query time duration to grow linearly with the fragment table size (i.e. queries get slower as the database gets bigger).
Path queries using path-component lookup
A new solution is being proposed where each path component of the Cps Path is looked up in constant time, thus addressing performance issues.
Objective of the proposed solution
The objective is to achieve the maximum performance for queries.
- For a query that will return 1 out of N items, the best theoretical time complexity is constant, O(1). The existing solution is linear, O(N).
- For a query that will return all N items, the best theoretical time complexity is linear, O(N). The existing solution is quadratic, O(N2).
Implementation Proposal
The new approach involves adding a column to the Fragment table, storing the last path component (called xpath_component here). The new column is indexed, to allow constant-time lookups.
In addition, the mechanism for fetching descendant nodes returned from the query causes worst case quadratic performance.
Why does the existing solution have worst case quadratic complexity?
The reason the existing solution is quadratic time when fetching descendants is because the path query translates into a number of SQL queries:
- An SQL query that returns nodes matching query criteria - O(N)
- For each node returned from query 1, another SQL query is issued to fetch that node with descendants, using a LIKE operator - O(N).
The LIKE operator has linear time complexity, as all nodes' xpaths in a given anchor need to be examined.
In the case of a query that matches all N nodes, we are issuing 1+N queries, each with complexity O(N). Thus the overall complexity is O(N2).
Path queries using path-component lookup
A new solution is being proposed where each path component of the Cps Path is looked up in constant time, thus addressing performance issues.
Objective of the proposed solution
The objective is to achieve the maximum performance for queries.
- For a query that will return 1 out of N items, the best possible time complexity is constant, O(1).
- For a query that will return all N items, the best possible time complexity is linear, O(N).
Implementation Proposal
The new approach involves adding a column to the Fragment table, storing the last path component (called xpath_component here). The new column is indexed, to allow constant-time lookups.
id | parent_id | anchor_id | xpath | xpath_component |
---|---|---|---|---|
1 | NULL | 3 | /bookstore | bookstore |
id | parent_id | anchor_id | xpath | xpath_component |
1 | NULL | 3 | /bookstore | bookstore |
2 | 1 | 3 | /bookstore/categories[@code='1'] | categories[@code='1'] |
3 | 2 | 3 | /bookstore/categories[@code='1']/books[@title='Matilda'] | books[@title='Matilda'] |
4 | 2 | 3 | /bookstore/categories[@code='1']/books[@title='The Gruffalo'] | books[@title='The Gruffalo'] |
...
Code Block | ||
---|---|---|
| ||
SELECT * FROM fragment WHERE parent_id IN ( SELECT id FROM fragment WHERE xpath_component = 'categories[@code=''1'']' AND parent_id = ( SELECT id FROM fragment WHERE xpath_component = 'bookstore' AND anchor_id = 3 AND parent_id IS NULL ) ) AND ( xpath_component = 'books' OR xpath_component LIKE 'books[%' ) AND ( attributes @> '{"title":"Matilda"}' ) |
Comparison with existing solution
Given the following (incomplete) fragment table:
In addition, a new algorithm for fetching descendants is implemented.
Recursive SQL query to fetch descendants
The proposed solution will use recursive SQL to fetch descendants of fragment returned from the CPS path query:
Code Block | ||
---|---|---|
| ||
WITH RECURSIVE descendant_search AS (
SELECT id, 0 AS depth
FROM fragment
WHERE id IN (:fragmentIds)
UNION
SELECT child.id, depth + 1
FROM fragment child INNER JOIN descendant_search ds ON child.parent_id = ds.id
WHERE depth <= :maxDepth
)
SELECT f.id, anchor_id AS anchorId, xpath, f.parent_id AS parentId, CAST(attributes AS TEXT) AS attributes
FROM fragment f INNER JOIN descendant_search ds ON f.id = ds.id |
Comparison with existing solution
Given the following (incomplete) fragment table:
id | parent_id | anchor_id | xpath |
---|---|---|---|
1 | NULL | 3 | /bookstore |
2 | |||
id | parent_id | anchor_id | xpath |
1 | NULL | 3 | /bookstore |
2 | 1 | 3 | /bookstore/categories[@code='1'] |
3 | 2 | 3 | /bookstore/categories[@code='1']/books[@title='Matilda'] |
4 | 2 | 3 | /bookstore/categories[@code='1']/books[@title='The Gruffalo'] |
5 | 1 | 3 | /bookstore/categories[@code='21'] |
63 | 52 | 3 | /bookstore/categories[@code='21']/books[@title='Good OmensMatilda'] |
74 | 52 | 3 | /bookstore/categories[@code='21']/books[@title='The Colour of MagicGruffalo'] |
5 | NULL1 | 43 | / | dmi-registry
9 | 8 | 4 | /dmi-registry/cm-handles[@id='d99ef4cc8d744a749490cceefd4c469e'] |
...
bookstore/categories[@code='2'] | |||
6 | 5 | 3 | /bookstore/categories[@code=' |
...
2']/books[@title=' |
...
Good Omens'] |
...
Existing solution | |
7 | Proposed solution |
---|---|
Because the existing solution uses a regex, all nodes in anchor 3 need to be examined to see if they match the regex. | Because the proposed solution uses sub-queries to look up each path component, only relevant nodes are examined. |
|
|
If the number of nodes is doubled, the number of xpaths checked against the regex is also doubled (linear complexity). | If the number of nodes is doubled, the same number of nodes is still scanned in this case (constant complexity). |
A note on fetching descendant nodes
The above sections only partly describe the current and proposed solutions. In the existing solution, a CPS path query is actually sent as multiple database queries:
- The first SQL query performs the CPS path query.
- Additional SQL queries fetch the descendant nodes, if required.
- Alternatively, if an ancestor CPS path was provided, the second SQL query would fetch the ancestors (and their descendants if required).
Recursive SQL query to fetch descendants
The proposed solution will use recursive SQL to fetch descendants of fragment returned from the CPS path query:
...
language | sql |
---|
...
5 | 3 | /bookstore/categories[@code='2']/books[@title='The Colour of Magic'] | |
8 | NULL | 4 | /dmi-registry |
9 | 8 | 4 | /dmi-registry/cm-handles[@id='d99ef4cc8d744a749490cceefd4c469e'] |
Here is a visualization of both algorithms, executing a query for /bookstore/categories[@code='1']/books[@title='Matilda']:
Existing solution | Proposed solution |
---|---|
Because the existing solution uses a regex, all nodes in anchor 3 need to be examined to see if they match the regex. | Because the proposed solution uses sub-queries to look up each path component, only relevant nodes are examined. |
|
|
If the number of nodes is doubled, the number of xpaths checked against the regex is also doubled (linear complexity). | If the number of nodes is doubled, the same number of nodes is still scanned in this case (constant complexity). |
Proof of Concept
A proof of concept (poc) implementation was developed so that performance could be compared against existing Cps Path query algorithms.
...
Fetch descendants | Omit Descendants | Direct Descendants | All Descendants |
---|---|---|---|
Query | //openroadm-device[@ne-state="inservice"] | //openroadm-device[@ne-state="inservice"] | //openroadm-device[@ne-state="inservice"] |
Comparison graph | |||
Graph detail | |||
Time complexity of | Unclear, as there is very high variance. Likely Probably O(N). | O(N2) | O(N2) |
Time complexity of proposed solution | O(N) | O(N) | O(N) |
...
Fetch descendants | Omit Descendants | Direct Descendants | All Descendants |
---|---|---|---|
Query | /openroadm-devices/openroadm-device[@status="success"] | /openroadm-devices/openroadm-device[@status="success"] | /openroadm-devices/openroadm-device[@status="success"] |
Comparison graph | |||
Graph detail | |||
Time complexity of | Unclear, as there is very high variance. Likely O(N). | O(N2) | O(N2) |
Time complexity of proposed solution | O(N) | O(N) | O(N) |
Why does the existing solution have worst case quadratic complexity?
The reason the existing solution is quadratic time when fetching descendants is because the path query translates into a number of SQL queries:
- An SQL query that returns nodes matching query criteria - O(N)
- For each node returned from query 1, another SQL query is issued to fetch that node with descendants, using a LIKE operator - O(N).
The LIKE operator has linear time complexity, as all nodes' xpaths in a given anchor need to be examined.
...
"success"] | |||
Comparison graph | |||
Graph detail | |||
Time complexity of | Unclear, as there is very high variance. Likely O(N). | O(N2) | O(N2) |
Time complexity of proposed solution | O(N) | O(N) | O(N) |
Possible improvements of proposed solution
Cps Path Query capabilities can be
...
extended
Presently, Cps Path queries limit leaf-condition, text-condition, etc. to the final path component. The proposed solution allows for future improvement where conditions could be applied to any or all path components.
...
The main algorithm was mostly done during the PoC (all integration tests are passing for the PoC). The existing PoC code can thus be refactored to make it production ready.
It may be possible to incorporate the recursive SQL descendant search into the existing solution as a starting point.
DB upgrade
Because a new column is being added to the Fragment table, this column needs to be populated. An SQL script will be needed to provide a value for of the new xpath_component field based on existing xpath field.
...