Optimizes reducelanes in diversityCalculation of PQVectors, for Euclidean function by shyla226 · Pull Request #514 · datastax/jvector

shyla226 · 2025-08-29T00:02:37Z

This pull request improves the Euclidean similarity function in PQVectors.diversityFunctionFor. A flamegraph shows that a considerable amount of time is spent in jdk/incubator/vector/FloatVector.reduceLanesTemplate. This is mainly because FloatVector.reduceLanes() is expensive and is called inside a for loop (via VectorUtil.squareL2Distance). This pull request moves the call to reduceLanes() outside the for loop.

The proposed change was tested with the benchmark PQDistanceCalculationBenchmark.diversityCalculation.
With this benchmark, a ~18% reduction in time was observed when M=64 and ~22% when M=192.

Code modifications:

Added a new function pqDiversityEuclidean in VectorUtilSupport and its corresponding implementations
Removed for loop in PQVectors.diversityFunctionFor and moved it into pqDiversityEuclidean
Moved FloatVector.reduceLanes() outside the for loop
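
The effect of hoisting the reduction can be illustrated with a minimal sketch. To stay runnable without `--add-modules jdk.incubator.vector`, it emulates SIMD lanes with a plain `float[]` accumulator instead of calling `FloatVector`. The class name, method names, lane count, and flat array layout are all hypothetical and are not the code in this pull request. `reducePerSubspace` mirrors the original pattern (one horizontal reduction per subspace), while `reduceOnce` keeps a single lane-wise accumulator across all subspaces and reduces it once at the end.

```java
final class ReduceOnceSketch {
    // Hypothetical lane count standing in for a vector species length.
    static final int LANES = 8;

    // Horizontal sum of the emulated lanes; stands in for
    // FloatVector.reduceLanes(VectorOperators.ADD).
    static float reduceLanes(float[] acc) {
        float sum = 0f;
        for (float v : acc) {
            sum += v;
        }
        return sum;
    }

    // Adds the squared differences of one subvector pair into acc, lane by lane.
    static void accumulate(float[] acc, float[] a, float[] b, int offset, int length) {
        for (int i = 0; i < length; i++) {
            float d = a[offset + i] - b[offset + i];
            acc[i % LANES] += d * d;
        }
    }

    // Original pattern: a fresh accumulator and one horizontal reduction per subspace.
    static float reducePerSubspace(float[] a, float[] b, int subspaceCount, int subLen) {
        float total = 0f;
        for (int m = 0; m < subspaceCount; m++) {
            float[] acc = new float[LANES];
            accumulate(acc, a, b, m * subLen, subLen);
            total += reduceLanes(acc);
        }
        return total;
    }

    // Proposed pattern: one accumulator shared by all subspaces, reduced once at the end.
    static float reduceOnce(float[] a, float[] b, int subspaceCount, int subLen) {
        float[] acc = new float[LANES];
        for (int m = 0; m < subspaceCount; m++) {
            accumulate(acc, a, b, m * subLen, subLen);
        }
        return reduceLanes(acc);
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        float[] b = new float[8];
        System.out.println(reducePerSubspace(a, b, 2, 4));
        System.out.println(reduceOnce(a, b, 2, 4));
    }
}
```

The two methods add the same terms in a different order, so for general inputs their results may differ in the last few bits of the float; for the small integer inputs in `main` they agree exactly.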

Test setup:
JVector version : main branch (as of 2025-08-28)
JDK version : openjdk version "24.0.2" 2025-07-15
Platform : INTEL(R) XEON(R) PLATINUM 8592+
Benchmark : PQDistanceCalculationBenchmark.diversityCalculation

New changes
I have modified the code to include dot product & cosine functions and implemented similar changes for scoreFunctionFor.
With the changes applied to scoreFunctionFor, when M=64: dot product shows ~30% reduction in latency & cosine shows ~43% reduction.
I can add data points for other subspace counts, if required.

Code modifications:

Added new functions pqScoreEuclidean, pqScoreDotProduct, pqScoreCosine in VectorUtilSupport and their corresponding implementations for diversityFunctionFor
Added overloaded versions of the above functions for scoreFunctionFor
Removed the for loops in PQVectors.diversityFunctionFor and PQVectors.scoreFunctionFor and moved them into the respective functions in PanamaVectorUtilSupport
Moved FloatVector.reduceLanes() outside the for loop
Added a new benchmark which uses MutablePQVectors to test this
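
For the cosine case the same idea needs three running accumulators (dot product and the two squared norms), each reduced once after the subspace loop. The following is a minimal sketch under stated assumptions: lanes are emulated with `float[]` arrays, the codebook layout (centroids of one subspace stored back to back in a flat array) is assumed for illustration, and the class, method, and parameter names are hypothetical. It returns the raw cosine and does not apply any score mapping the library may use.

```java
final class PqCosineSketch {
    // Hypothetical lane count standing in for a vector species length.
    static final int LANES = 4;

    // Horizontal sum of the emulated lanes.
    static float sum(float[] acc) {
        float s = 0f;
        for (float v : acc) {
            s += v;
        }
        return s;
    }

    // codebooks[m] holds the centroids of subspace m back to back, each of length subLen.
    // codes1 and codes2 hold one centroid index per subspace for the two encoded vectors.
    static float cosine(float[][] codebooks, int subLen, byte[] codes1, byte[] codes2) {
        float[] dot = new float[LANES];
        float[] norm1 = new float[LANES];
        float[] norm2 = new float[LANES];
        for (int m = 0; m < codes1.length; m++) {
            float[] centroids = codebooks[m];
            int off1 = Byte.toUnsignedInt(codes1[m]) * subLen;
            int off2 = Byte.toUnsignedInt(codes2[m]) * subLen;
            for (int i = 0; i < subLen; i++) {
                float x = centroids[off1 + i];
                float y = centroids[off2 + i];
                int lane = i % LANES;
                dot[lane] += x * y;
                norm1[lane] += x * x;
                norm2[lane] += y * y;
            }
        }
        // Three horizontal reductions in total, independent of the subspace count.
        return (float) (sum(dot) / Math.sqrt((double) sum(norm1) * sum(norm2)));
    }

    public static void main(String[] args) {
        // Two subspaces, each with two centroids of length 2: (1,0) and (0,1).
        float[][] codebooks = {{1, 0, 0, 1}, {1, 0, 0, 1}};
        System.out.println(cosine(codebooks, 2, new byte[] {0, 0}, new byte[] {0, 1}));
    }
}
```

With a per-subspace reduction the cosine path would pay three reductions per subspace, which may be part of why it shows the largest improvement in the numbers above; that reading is a guess and not something measured here.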

Test setup:
JVector version : main branch (as of 2025-10-22)
JDK version : openjdk version "25.0.1" 2025-10-21
Platform : Intel(R) Xeon(R) 6979P
Benchmark : PQDistanceCalculationMutableVectorBenchmark (Added this by replicating PQDistanceCalculationBenchmark to measure performance for MutablePQVectors)

MarkWolters · 2025-10-03T16:49:12Z

This PR does successfully remove the calls to reduceLanes from inside the loop iterating over the subspaces in the Euclidean case. I do have a concern about applying this pattern to one specific case while leaving the handling of the other cases as is, which leaves the code with 2 different implementations for the same problem. Some questions:

Why is the optimization only implemented for the Euclidean case? In PanamaVectorUtilSupport, reduceLanes is called from the squareDistance methods, which are invoked in a loop over the subspace count; reduceLanes is likewise called from the dotProduct methods. Why only fix one?
Why is the optimization only applied to diversityFunctionFor(…) and not scoreFunctionFor(…) which uses the same pattern of calling squareL2Distance(…) inside of a loop for each subspace?

shyla226 · 2025-10-06T23:12:31Z

@MarkWolters, thank you very much for the feedback. I can make the changes to dotProduct methods and to function calls in scoreFunctionFor()

MarkWolters · 2025-10-10T18:31:32Z

@shyla226 please be aware that in order to pass the automated GitHub action regression test any pull request must come from a branch inside the datastax/jvector repository and not a fork of the repository. So while I am willing to review the PR from a fork and provide commentary, in order for it to pass the requirements for merging it will have to be a branch and not a fork.

shyla226 · 2025-10-10T19:04:18Z

@MarkWolters, Sounds good! I will create a branch

MarkWolters · 2025-10-10T19:52:56Z

@shyla226 I don't think you'll have sufficient permissions to push a new branch to the repository. For now you can keep the changes on a fork and they can be reviewed there while we come up with a solution to this issue.

shyla226 · 2025-10-27T21:16:18Z

@MarkWolters, I have added new functions to include dot product & cosine for diversityFunction.
I have also added overloaded functions for scoreFunction, for all 3 cases.

MarkWolters · 2025-10-28T12:23:59Z

Thanks @shyla226, I will review the updated code asap

shyla226 · 2025-10-28T22:53:49Z

Thank you very much @MarkWolters. With the changes, when M=64, dot product shows ~30% reduction in latency & cosine shows ~43% reduction.

shyla226 · 2025-10-28T23:02:34Z

@MarkWolters sorry, accidentally closed it! I will update the test setup used in the description above.

shyla226 · 2025-10-30T17:45:28Z

@MarkWolters, I have addressed the issue with the 2 failed checks and tested the code changes on Windows server

MarkWolters · 2026-02-02T15:24:10Z

@shyla226 apologies for not addressing this sooner. I have created a branch named reduce_lanes based on the latest main. Please rebase your change on current main and open a pull request from your fork to the reduce_lanes branch. From the GitHub UI you should be able to select Contribute --> Open Pull Request and, on the resulting screen, select the branch you wish to merge into. From there we can run the internal GitHub Actions and regression tests and get your changes merged.

shyla226 · 2026-02-03T01:21:36Z

@MarkWolters Thank you very much. I will rebase the changes.

r-devulap

First‑pass review: there’s quite a bit of duplicated code, and I think we can consolidate it significantly.

r-devulap · 2026-03-05T12:14:02Z

jvector-twenty/src/main/java/io/github/jbellis/jvector/vector/PanamaVectorUtilSupport.java

+        return pqScoreDotProduct_64(codebooks, subvectorSizesAndOffsets, node1Chunk, node1Offset, node2Chunk, node2Offset, subspaceCount);
+    }
+
+    float pqScoreCosine_512(VectorFloat<?>[] codebooks, int[][] subvectorSizesAndOffsets, ByteSequence<?> node1Chunk, int node1Offset, ByteSequence<?> node2Chunk, int node2Offset, int subspaceCount) {


This name is misleading. Would it be better to rename this to pqScoreCosine_preferred?

r-devulap · 2026-03-05T12:15:49Z

jvector-twenty/src/main/java/io/github/jbellis/jvector/vector/PanamaVectorUtilSupport.java

+            int length1 = centroidIndex1 * centroidLength;
+            int length2 = centroidIndex2 * centroidLength;
+
+            if (centroidLength == FloatVector.SPECIES_PREFERRED.length()) {


Why do we need this special case? It introduces a lot of duplicated code across all the functions. Wouldn’t the loop below already handle this scenario correctly?

r-devulap · 2026-03-05T12:23:57Z

jvector-twenty/src/main/java/io/github/jbellis/jvector/vector/PanamaVectorUtilSupport.java

+        else if (subvectorSizesAndOffsets[0][0]  >=  FloatVector.SPECIES_128.length()) {
+            return pqScoreEuclidean_128( codebooks,  subvectorSizesAndOffsets, node1Chunk, node1Offset,  node2Chunk, node2Offset, subspaceCount);
+        }
+        return pqScoreEuclidean_64( codebooks,  subvectorSizesAndOffsets, node1Chunk, node1Offset,  node2Chunk, node2Offset, subspaceCount);


We currently have 4 separate implementations (pqScoreEuclidean_64/_128/_256/_512) that are structurally identical aside from the FloatVector.SPECIES_* used. This is a lot of duplicated hot-path code and increases the risk of fixes/optimizations diverging across variants.

Could we fold these into a single generic-ish implementation that takes the species as a parameter, e.g.:

pqScoreEuclideanImpl(VectorSpecies<Float> species, ...)

pqScoreEuclidean(...) selects species (SPECIES_PREFERRED, SPECIES_256, SPECIES_128, SPECIES_64) once and dispatches to the shared impl.

This keeps the vectorized logic in one place and avoids the confusing _512 naming given it actually uses SPECIES_PREFERRED on some paths. It should also make it much easier to keep Euclidean/Dot/Cosine implementations consistent.
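
A rough sketch of what that shape could look like is below. It is not the real implementation: to stay runnable without the incubator module, an `int` lane count stands in for the `VectorSpecies<Float>` parameter, the inputs are plain `float[]` arrays instead of codebooks and `ByteSequence` chunks, and all signatures are hypothetical. The lane counts 16/8/4/2 are the float lengths of 512-, 256-, 128- and 64-bit vectors, and the method returns the raw squared distance only.

```java
final class SpeciesDispatchSketch {
    // Float lane counts of 512-, 256-, 128- and 64-bit vectors, widest first.
    static final int[] LANE_COUNTS = {16, 8, 4, 2};

    // Picks the widest lane count that is supported (<= preferredLanes) and that
    // the subvector length fills at least once; falls back to the narrowest.
    static int selectLanes(int subLen, int preferredLanes) {
        for (int lanes : LANE_COUNTS) {
            if (lanes <= preferredLanes && subLen >= lanes) {
                return lanes;
            }
        }
        return LANE_COUNTS[LANE_COUNTS.length - 1];
    }

    // Single shared implementation; "lanes" is the stand-in for the species parameter.
    static float pqScoreEuclideanImpl(int lanes, float[] a, float[] b, int subspaceCount, int subLen) {
        float[] acc = new float[lanes];
        float tail = 0f;
        // Largest multiple of lanes <= subLen, the role species.loopBound(subLen) plays.
        int bound = subLen - (subLen % lanes);
        for (int m = 0; m < subspaceCount; m++) {
            int base = m * subLen;
            // Full-width steps accumulate lane-wise.
            for (int i = 0; i < bound; i += lanes) {
                for (int l = 0; l < lanes; l++) {
                    float d = a[base + i + l] - b[base + i + l];
                    acc[l] += d * d;
                }
            }
            // Leftover elements are handled in scalar form.
            for (int i = bound; i < subLen; i++) {
                float d = a[base + i] - b[base + i];
                tail += d * d;
            }
        }
        // One horizontal reduction after all subspaces.
        float sum = tail;
        for (float v : acc) {
            sum += v;
        }
        return sum;
    }

    // Entry point: select the lane count once, then dispatch to the shared implementation.
    static float pqScoreEuclidean(float[] a, float[] b, int subspaceCount, int subLen, int preferredLanes) {
        return pqScoreEuclideanImpl(selectLanes(subLen, preferredLanes), a, b, subspaceCount, subLen);
    }
}
```

In this sketch, a subvector whose length equals the lane count needs no special case: the full-width loop runs exactly once per subspace and the scalar tail is empty.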

r-devulap · 2026-03-05T12:42:05Z

Probably would have been better to comment in #623, apologies for the duplicate review. Ignore the comments here.

shyla226 force-pushed the reduce_lanes branch from a991b33 to dd1c332 Compare September 2, 2025 17:24

tlwillke requested a review from MarkWolters September 30, 2025 17:50

shyla226 requested review from jshook, marianotepper and tlwillke as code owners October 10, 2025 17:05

shyla226 closed this Oct 28, 2025

shyla226 reopened this Oct 28, 2025

added space to force branch creation

8e85edb

shyla226 force-pushed the reduce_lanes branch from c2a8068 to 8e85edb Compare February 4, 2026 19:31

Rebased reduce_lanes changes with latest main

1c2a692

shyla226 mentioned this pull request Feb 4, 2026

Rebased reduce_lanes changes with latest main #611

Merged

r-devulap reviewed Mar 5, 2026
