Background

Repository queries taking collection parameters are limited to 32K collection items.
For example, given a FragmentRepository method:

List<FragmentEntity> findAllByAnchorIdAndXpathIn(int anchorId, Collection<String> xpaths);

This would throw an exception if supplied with a collection of 32,767 xpaths.

The reason is that JPA/Hibernate queries would translate the command to the following SQL:

SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (?,?,?,?,?,?...,?)

Each collection item (xpath) is supplied as an SQL query parameter. Hibernate with Postgres has a limit of 32,767 query parameters per SQL statement, so an exception will be thrown with too many.

CPS-1574 is to remove those limits (and write tests to verify).

Tests

As part of CPS-1574, tests have been added to confirm the presence of the limit:

    def 'Delete datanodes from multiple anchors limit exceeded: 32,766 (~ 2^15) anchors.'() {
        given: 'more than 32,766 anchor names'
            def anchorNames = (0..32_766).collect { "size-of-this-name-does-not-matter-for-limit-" + it }
        when: 'single operation is executed to delete all datanodes in given anchors'
            objectUnderTest.deleteDataNodes(CPS_PERFORMANCE_TEST_DATASPACE, anchorNames, OffsetDateTime.now())
        then: 'a database exception is thrown'
            thrown(DataAccessResourceFailureException.class)
    }

Once the solution is implemented, these tests will be updated to show that no exception is thrown, with even more items e.g. 40,000.

Proposal #1: Batching at 32,000

This solution adds a wrapper method that partitions the input collection into lists of 32,000 and calls the original repository method on each batch:

@Query(value = "SELECT * FROM fragment WHERE anchor_id = :anchorId AND xpath IN :xpaths", nativeQuery = true)
List<FragmentEntity> findAllByAnchorIdAndXpathIn(int anchorId, Collection<String> xpaths);

default List<FragmentEntity> findAllByAnchorIdAndXpathInBatch(final int anchorId, final List<String> xpaths) {
    List<FragmentEntity> results = new ArrayList<>(xpaths.size());
    for (final Collection<String> xpathsBatch : Lists.partition(xpaths, 32000)) {
        results.addAll(findAllByAnchorIdAndXpathIn(anchorId, xpathsBatch));
    }
    return results;
}

This solution is largely self-documenting, as the 32K limit is clearly seen in the code.

This solution still involves having a separate query parameter for each collection item - we are avoiding the limit by simply splitting into multiple SQL queries:

SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (?,?,?,?,?,?...,?)
SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (?,?,?,?,?,?...,?)
SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (?,?,?,?)

Observations

Disadvantage: public API signatures may need to be changed

If a public method takes in a Collection (and not a List), then either an ArrayList copy of the Collection is needed, or the public API method needs to be changes to List, as Lists.partition requires a List to partition.

Batching on repository methods versus API methods

This solution could also work equally well batching at the public API methods instead of each repository method (probably).

Proposal #2: Postgres arrays

This solution uses a third-party library to add Postgres array support to queries.

Adding the dependency to cps-ri/pom.xml:

<dependency>
    <groupId>io.hypersistence</groupId>
    <artifactId>hypersistence-utils-hibernate-52</artifactId>
</dependency>

This solution involves creating two versions of each repository method taking a collection parameter - one taking a Java array, and one using a Collection:

@Query(value = "SELECT * FROM fragment WHERE anchor_id = :anchorId AND xpath = ANY(:xpaths)", nativeQuery = true)
List<FragmentEntity> findAllByAnchorIdAndXpathIn(@Param("anchorId") int anchorId, @Param("xpaths") String[] xpaths);

default List<FragmentEntity> findAllByAnchorIdAndXpathIn(final int anchorId, final Collection<String> xpaths) {
    return findAllByAnchorIdAndXpathIn(anchorId, xpaths.toArray(new String[0]));
}

This would result in a single SQL query, given any number of xpaths:

SELECT * FROM fragment WHERE anchor_id = ? AND xpath = ANY(?)

As seen, only two SQL query parameters will be ever used. This is because the array is binary-encoded and sent as a single parameter.

It is unclear if this has better performance in practice (limited testing shows both solutions have similar performance, but the test scenarios may be unrealistic).

Observations

Disadvantage of arrays: Limited array type support

Postgres array support is limited to primitive data types and String, that is int[], long[], float[], double[], boolean[], String[]

This means given a method such as:

void deleteByAnchorIn(Collection<AnchorEntity> anchorEntities);

Changing to an array of AnchorEntity will not work. The following will not compile:

void deleteByAnchorIn(AnchorEntity[] anchorEntities);

Rather, the method would have to be changed to use anchor ID's:

void deleteByAnchorIn(int[] anchorIds);

Advantage of arrays: Postgres arrays can simplify existing code

The current FragmentRepository has code already using Postgres arrays to support "LIKE ANY" logic (using LIKE with a collection). However, it uses query parameters, and is thus limited to 32K parameters.

    @Override
    public void deleteByAnchorIdAndXpathLikeAny(final int anchorId, final Collection<String> xpathPatterns) {
        final String queryString = "DELETE FROM fragment f WHERE f.anchor_id = ? AND xpath LIKE ANY (array[:parameterPlaceholders])";
        executeUpdateWithAnchorIdAndCollection(queryString, anchorId, xpathPatterns);
    }

    // Accept security hotspot as placeholders in SQL query are created internally, not from user input.
    @SuppressWarnings("squid:S2077")
    private void executeUpdateWithAnchorIdAndCollection(final String sqlTemplate, final int anchorId, final Collection<String> collection) {
        if (!collection.isEmpty()) {
            final String parameterPlaceholders = String.join(",", Collections.nCopies(collection.size(), "?"));
            final String queryStringWithParameterPlaceholders =
                sqlTemplate.replaceFirst(":parameterPlaceholders\\b", parameterPlaceholders);

            final Query query = entityManager.createNativeQuery(queryStringWithParameterPlaceholders);
            query.setParameter(1, anchorId);
            int parameterIndex = 2;
            for (final String parameterValue : collection) {
                query.setParameter(parameterIndex++, parameterValue);
            }
            query.executeUpdate();
        }
    }

Using native Postgres array support, this code can be written directly in the FragmentRepository interface using Query annotation, without the 32K limitation:

    @Modifying
    @Query(value = "DELETE FROM fragment f WHERE anchor_id = :anchorId AND xpath LIKE ANY (:xpathPatterns)", nativeQuery = true)
    void deleteByAnchorIdAndXpathLikeAny(@Param("anchorId") int anchorId, @Param("xpathPatterns") String[] xpathPatterns);

    default void deleteByAnchorIdAndXpathLikeAny(int anchorId, Collection<String> xpathPatterns) {
        deleteByAnchorIdAndXpathLikeAny(anchorId, xpathPatterns.toArray(new String[0]));
    }

Comparison

Limited testing shows both solutions have similar performance (as determined using Groovy performance tests modified to take 100,000 items)
Solution #1 is explicit using 32K batches, so easy to understand what it does
Solution #2 has shorter code, but not very obvious to the reader what it does
Solution #2 simplifies code in the cases when we need Postgres array support.
It's possible to use both solutions, where most appropriate.

Space shortcuts

Page tree

Background

Tests

Proposal #1: Batching at 32,000

Observations

Disadvantage: public API signatures may need to be changed

Batching on repository methods versus API methods

Proposal #2: Postgres arrays

Observations

Disadvantage of arrays: Limited array type support

Advantage of arrays: Postgres arrays can simplify existing code

Comparison

Space shortcuts

Page tree

CPS-1574: Remove 32K limit from DB operations

Background

Tests

Proposal #1: Batching at 32,000

Observations

Disadvantage: public API signatures may need to be changed

Batching on repository methods versus API methods

Proposal #2: Postgres arrays

Observations

Disadvantage of arrays: Limited array type support

Advantage of arrays: Postgres arrays can simplify existing code

Comparison