Code Block

language	sql

SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (?,?,?,?,?,?...,?)
SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (?,?,?,?,?,?...,?)
SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (?,?,?,?)

Observations

Disadvantage: public API signatures may need to be changed

Performance testing is inconclusive (so far)

Testing with a batch size of 1000 (not 32000) showed a performance drop of 30% at 20000 items. More testing is needed to see if this applies with batch size of 32000, or if the test involved bad luck.

Advantage: Solution is easy to understand

This code is largely self-documenting, as the 32K limit is clearly seen in the code.If a public method takes in a Collection (and not a List), then either an ArrayList copy of the Collection is needed, or the public API method needs to be changes to List, as Lists.partition requires a List to partition.

Batching on repository methods versus API methods

This solution could also should work equally well if batching at the public API methods instead of each repository method (probably).

Proposal #2: Postgres arrays

...

As seen, only two SQL query parameters will be ever used. This is because the array is binary-encoded and sent as a single parameter.

It is unclear if this has better performance in practice (limited testing shows both solutions have similar performance, but the test scenarios may be unrealistic).

Observations

...

Testing so far has shown no noticeable performance difference with master branch.

Observations

Performance is unchanged

Testing shows there was no increase or decrease in performance compared to master branch after arrays were used.

Disadvantage: extra dependency

~~Extra dependencies could add possible risks of dependency issues, security issues, etc. in future.~~

The new dependency is an updated version of an existing dependency (hibernate-types-52) - the developer renamed the library at some point. More info here: https://vladmihalcea.com/hibernate-types-hypersistence-utils/

Disadvantage: Limited array type support

Postgres array support is limited to primitive data types and String, that is int[], long[], float[], double[], boolean[], String[]

This means For example, given a FragmentRepository method such as:

Code Block

language	java

void deleteByAnchorIn(Collection<AnchorEntity> anchorEntities);

Changing to an array of AnchorEntity will not work. The following will not compilethrow an exception:

Code Block

language	java

void deleteByAnchorIn(AnchorEntity[] anchorEntities);

Rather, the method would have to be changed to use anchor ID'sIDs:

Code Block

language	java

void deleteByAnchorIndeleteByAnchorIdIn(int[] anchorIds);

...

Side note: in the above case, if using auto-generated SQL from the method name, the SQL generated using the anchor ID will be simpler than when using AnchorEntity objects, as Hibernate will generate a JOIN between FragmentEntity and AnchorEntity in the first case, but will only use FragmentEntity in the last case. So while the types are more restrictive, it may result in more optimal SQL code.

Advantage: Code already using Postgres arrays can be simplified

The current FragmentRepository has code already using Postgres arrays to support "LIKE ANY" logic (using LIKE with a collection). However, it currently uses query parameters, and is thus limited to 32K parameters.

...

Code Block

language	java

    @Modifying
    @Query(value = "DELETE FROM fragment f WHERE anchor_id = :anchorId AND xpath LIKE ANY (:xpathPatterns)", nativeQuery = true)
    void deleteByAnchorIdAndXpathLikeAny(@Param("anchorId") int anchorId, @Param("xpathPatterns") String[] xpathPatterns);

    default void deleteByAnchorIdAndXpathLikeAny(int anchorId, Collection<String> xpathPatterns) {
        deleteByAnchorIdAndXpathLikeAny(anchorId, xpathPatterns.toArray(new String[0]));
    }

Comparison

Limited testing shows both solutions have similar performance (as determined using Groovy performance tests modified to take 100,000 items)More testing is needed to confirm if list partitioning drops performance with large datasets.
Solution #1 is explicit using 32K batches, so easy to understand what it does
Solution #2 has shorter code, but not very obvious to the reader what it does
- We could add a comment linking to this ticket to explain the 32K limit and why arrays are used
Solution #2 simplifies code in the cases when we need Postgres array support.
It's possible to use both solutions, where most appropriate.

Space shortcuts

Page tree

Versions Compared

Old Version 8

New Version Current

Key

Observations

Disadvantage: public API signatures may need to be changed

Performance testing is inconclusive (so far)

Advantage: Solution is easy to understand

Batching on repository methods versus API methods

Proposal #2: Postgres arrays

Observations

Observations

Performance is unchanged

Disadvantage: extra dependency

Disadvantage: Limited array type support

Advantage: Code already using Postgres arrays can be simplified

Comparison

Space shortcuts

Page tree

Page History

Versions Compared

Old Version 8

New Version Current

Key

Observations

Disadvantage: public API signatures may need to be changed

Performance testing is inconclusive (so far)

Advantage: Solution is easy to understand

Batching on repository methods versus API methods

Proposal #2: Postgres arrays

Observations

Observations

Performance is unchanged

Disadvantage: extra dependency

Disadvantage: Limited array type support

Advantage: Code already using Postgres arrays can be simplified

Comparison