Optimizes reducelanes in diversityCalculation of PQVectors, for Euclidean function#514
Optimizes reducelanes in diversityCalculation of PQVectors, for Euclidean function#514shyla226 wants to merge 2 commits intodatastax:mainfrom
Conversation
a991b33 to
dd1c332
Compare
|
This PR does successfully remove the calls to reduceLanes from the inside of the loop iterating over the subspaces the Euclidean case. I do have a concern about applying this pattern to one specific case but leaving the handling of other cases as is, leaving the code with 2 different implementations for the same problem. Some questions:
|
|
@MarkWolters, thank you very much for the feedback. I can make the changes to dotProduct methods and to function calls in scoreFunctionFor() |
|
@shyla226 please be aware that in order to pass the automated GitHub action regression test any pull request must come from a branch inside the datastax/jvector repository and not a fork of the repository. So while I am willing to review the PR from a fork and provide commentary, in order for it to pass the requirements for merging it will have to be a branch and not a fork. |
|
@MarkWolters, Sounds good! I will create a branch |
|
@shyla226 I don't think you'll have sufficient permissions to push a new branch to the repository. For now you can keep the changes on a fork and they can be reviewed there while we come up with a solution to this issue. |
|
@MarkWolters , I have added new functions to include dotproduct & cosine for diversityFunction. |
|
Thanks @shyla226, I will review the updated code asap |
|
Thank you very much @MarkWolters . With the changes, when M=64, dot product shows ~30% reduction in latency & COSINE shows ~43% reduction |
|
@MarkWolters sorry, accidentally closed it! I will update the test setup used in the description above. |
|
@MarkWolters, I have addressed the issue with the 2 failed checks and tested the code changes on Windows server |
|
@shyla226 apologies for not addressing this sooner. I have created a branch named |
|
@MarkWolters Thank you very much. I will rebase the changes. |
c2a8068 to
8e85edb
Compare
| return pqScoreDotProduct_64(codebooks, subvectorSizesAndOffsets, node1Chunk, node1Offset, node2Chunk, node2Offset, subspaceCount); | ||
| } | ||
|
|
||
| float pqScoreCosine_512(VectorFloat<?>[] codebooks, int[][] subvectorSizesAndOffsets, ByteSequence<?> node1Chunk, int node1Offset, ByteSequence<?> node2Chunk, int node2Offset, int subspaceCount) { |
There was a problem hiding this comment.
This name is a mis-leading, probably better to rename this to pqScoreCosine_preferred?
| int length1 = centroidIndex1 * centroidLength; | ||
| int length2 = centroidIndex2 * centroidLength; | ||
|
|
||
| if (centroidLength == FloatVector.SPECIES_PREFERRED.length()) { |
There was a problem hiding this comment.
Why do we need this special case? It introduces a lot of duplicated code across all the functions. Wouldn’t the loop below already handle this scenario correctly?
| else if (subvectorSizesAndOffsets[0][0] >= FloatVector.SPECIES_128.length()) { | ||
| return pqScoreEuclidean_128( codebooks, subvectorSizesAndOffsets, node1Chunk, node1Offset, node2Chunk, node2Offset, subspaceCount); | ||
| } | ||
| return pqScoreEuclidean_64( codebooks, subvectorSizesAndOffsets, node1Chunk, node1Offset, node2Chunk, node2Offset, subspaceCount); |
There was a problem hiding this comment.
We currently have 4 separate implementations (pqScoreEuclidean_64/_128/_256/_512) that are structurally identical aside from the FloatVector.SPECIES_* used. This is a lot of duplicated hot-path code and increases the risk of fixes/optimizations diverging across variants.
Could we fold these into a single generic-ish implementation that takes the species as a parameter, e.g.:
pqScoreEuclideanImpl(VectorSpecies<Float> species, ...)pqScoreEuclidean(...)selectsspecies(SPECIES_PREFERRED,SPECIES_256,SPECIES_128,SPECIES_64) once and dispatches to the shared impl.
This keeps the vectorized logic in one place and avoids the confusing _512 naming given it actually uses SPECIES_PREFERRED on some paths. It should also make it much easier to keep Euclidean/Dot/Cosine implementations consistent.
|
Probably would have been better to comment in #623, apologies for the duplicate review. Ignore the comments here. |
This pull request introduces improvement to Euclidean similarity function in
PQVectors.diversityFunctionFor. From flamegraph, it is observed that considerable amount of time is spent injdk/incubator/vector/FloatVector.reducelanesTemplate. This is mainly becauseFloatVector.reducelanes()is expensive and it is being called inside aforloop (viaVectorUtil.squareL2Distance). Modification in this pull request moves call toreduceLanes()outside theforloop.Change proposed here was tested with the benchmark,
PQDistanceCalculationBenchmark.diversityCalculationWith this benchmark, ~18% reduction in time was observed when M=64 and ~22% when M=192.
Code modifications:
pqDiversityEuclideaninVectorUtilSupportand its corresponding implementationsforloop inPQVectors.diversityFunctionForand moved it intopqDiversityEuclideanFloatVector.reducelanes()outside theforloopTest setup:
Jvector version : main branch (as of 2025-08-28)
JDK version : openjdk version "24.0.2" 2025-07-15
Platform : INTEL(R) XEON(R) PLATINUM 8592+
Benchmark : PQDistanceCalculationBenchmark.diversityCalculation
New changes
I have modified the code to include dot product & cosine functions and implemented similar changes for scoreFunctionFor.
With the changes applied to scoreFunctionFor, when M=64: dot product shows ~30% reduction in latency & cosine shows ~43% reduction.
I can add data points for other subspace counts, if required.
Code modifications:
pqScoreEuclidean, pqScoreDotProduct, pqScoreCosineinVectorUtilSupportand its corresponding implementations fordiversityFunctionForscoreFunctionForforloop inPQVectors.diversityFunctionForandPQVectors.scoreFunctionForand moved them into respective functions inPanamaVectorUtilSupportFloatVector.reducelanes()outside theforloopMutablePQVectorsto test thisTest setup:
Jvector version : main branch (as of 2025-10-22)
JDK version : openjdk version "25.0.1" 2025-10-21
Platform : Intel(R) Xeon(R) 6979P
Benchmark :
PQDistanceCalculationMutableVectorBenchmark(Added this by replicatingPQDistanceCalculationBenchmarkto measure performance for MutablePQVectors)