-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CNDB-12980: Clone the vectors issued by the postingsMap in CompactionGraph #1586
Conversation
…he same object gets reused by the underlying implementation.
Checklist before you submit for review
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Fix looks correct to me. Test failures posted by Butler are acceptable; that's a known flaky test in an unrelated area of the codebase. The other failures flagged by Butler as known failures are indeed known failures.
I think we should add some test coverage that exposes the bug, as we had no tests that surfaced the issue. I did a quick sketch of enhancing the existing PQ refinement test to test recall, confirming that it failed without the patch. This might be sufficient. WDYT?
diff --git a/test/unit/org/apache/cassandra/index/sai/cql/VectorCompactionTest.java b/test/unit/org/apache/cassandra/index/sai/cql/VectorCompactionTest.java
index f6634b86ee..9047bb8b1b 100644
--- a/test/unit/org/apache/cassandra/index/sai/cql/VectorCompactionTest.java
+++ b/test/unit/org/apache/cassandra/index/sai/cql/VectorCompactionTest.java
@@ -19,9 +19,13 @@
package org.apache.cassandra.index.sai.cql;
import java.util.ArrayList;
+import java.util.stream.Collectors;
import org.junit.Test;
+import io.github.jbellis.jvector.vector.VectorSimilarityFunction;
+import org.apache.cassandra.db.marshal.FloatType;
+import org.apache.cassandra.index.sai.disk.v3.V3OnDiskFormat;
import org.apache.cassandra.index.sai.disk.v5.V5VectorPostingsWriter;
import org.apache.cassandra.index.sai.disk.vector.CompactionGraph;
@@ -59,13 +63,16 @@ public class VectorCompactionTest extends VectorTester.Versioned
createTable();
disableCompaction();
+ var vectors = new ArrayList<float[]>();
// 3 sstables
for (int j = 0; j < 3; j++)
{
for (int i = 0; i <= MIN_PQ_ROWS; i++)
{
var pk = j * MIN_PQ_ROWS + i;
- execute("INSERT INTO %s (pk, v) VALUES (?, ?)", pk, vector(pk, pk + 1));
+ var v = create2DVector();
+ vectors.add(v);
+ execute("INSERT INTO %s (pk, v) VALUES (?, ?)", pk, vector(v));
}
flush();
}
@@ -73,8 +80,21 @@ public class VectorCompactionTest extends VectorTester.Versioned
CompactionGraph.PQ_TRAINING_SIZE = 2 * MIN_PQ_ROWS;
compact();
- // Confirm we can query the data
- assertRowCount(execute("SELECT * FROM %s ORDER BY v ANN OF [1,2] LIMIT 1"), 1);
+ // Confirm we can query the data with reasonable recall
+ double recall = 0;
+ int ITERS = 10;
+ for (int i = 0; i < ITERS; i++)
+ {
+ var q = create2DVector();
+ var result = execute("SELECT pk, v FROM %s ORDER BY v ANN OF ? LIMIT 20", vector(q));
+ var ann = result.stream().map(row -> {
+ var vList = row.getVector("v", FloatType.instance, 2);
+ return new float[]{ vList.get(0), vList.get(1) };
+ }).collect(Collectors.toList());
+ recall += computeRecall(vectors, q, ann, VectorSimilarityFunction.COSINE);
+ }
+ recall /= ITERS;
+ assert recall >= 0.9 : recall;
}
@Test
@@ -179,4 +199,9 @@ public class VectorCompactionTest extends VectorTester.Versioned
}
}
}
+
+ private static float[] create2DVector() {
+ var R = getRandom();
+ return new float[] { R.nextFloatBetween(-100, 100), R.nextFloatBetween(-100, 100) };
+ }
}
The test looks right to me. It may help discover other issues in the future. |
Thanks! Pushed the test to this branch. |
|
❌ Build ds-cassandra-pr-gate/PR-1586 rejected by Butler1 new test failure(s) in 2 builds Found 1 new test failures
Found 7 known test failures |
Addresses CNDB-12980
The postingsMap in CompactionGraph reuses the VectorFloat object when reading from storage. We need to copy this vector before storing it in another data structure if we want it to hold the right information.