
[BUG] BWC Rolling upgrade test fails for SparseEncoder Processor during Batch Ingestion #1142

Closed
vibrantvarun opened this issue Jan 24, 2025 · 11 comments
Assignees
Labels
bug Something isn't working

Comments

@vibrantvarun
Member

What is the bug?

The BatchIngestionIT.testBatchIngestion_SparseEncodingProcessor_E2EFlow test is failing with the following error:

> Task :qa:rolling-upgrade:testAgainstOneThirdUpgradedCluster
REPRODUCE WITH: ./gradlew ':qa:rolling-upgrade:testAgainstOneThirdUpgradedCluster' --tests "org.opensearch.neuralsearch.bwc.rolling.BatchIngestionIT.testBatchIngestion_SparseEncodingProcessor_E2EFlow" -Dtests.seed=801B4B74838557A -Dtests.security.manager=false -Dtests.bwc.version=2.19.0-SNAPSHOT -Dtests.locale=sw-TZ -Dtests.timezone=America/Bahia_Banderas -Druntime.java=21
Suite: Test class org.opensearch.neuralsearch.bwc.rolling.BatchIngestionIT
  2> Jan 24, 2025 6:12:44 PM org.apache.lucene.internal.vectorization.VectorizationProvider lookup
  2> WARNING: Java vector incubator module is not readable. For optimal vector performance, pass '--add-modules jdk.incubator.vector' to enable Vector API.
  2> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
  2> SLF4J: Defaulting to no-operation (NOP) logger implementation
  2> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
  2> REPRODUCE WITH: ./gradlew ':qa:rolling-upgrade:testAgainstOneThirdUpgradedCluster' --tests "org.opensearch.neuralsearch.bwc.rolling.BatchIngestionIT.testBatchIngestion_SparseEncodingProcessor_E2EFlow" -Dtests.seed=801B4B74838557A -Dtests.security.manager=false -Dtests.bwc.version=2.19.0-SNAPSHOT -Dtests.locale=sw-TZ -Dtests.timezone=America/Bahia_Banderas -Druntime.java=21
  2> java.lang.AssertionError: expected:<10> but was:<8>
        at __randomizedtesting.SeedInfo.seed([801B4B74838557A:8487AE9ACE1A1513]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.opensearch.neuralsearch.BaseNeuralSearchIT.validateDocCountAndInfo(BaseNeuralSearchIT.java:1535)
        at org.opensearch.neuralsearch.bwc.rolling.BatchIngestionIT.testBatchIngestion_SparseEncodingProcessor_E2EFlow(BatchIngestionIT.java:46)
  2> NOTE: leaving temporary files on disk at: /home/runner/work/neural-search/neural-search/qa/rolling-upgrade/build/testrun/testAgainstOneThirdUpgradedCluster/temp/org.opensearch.neuralsearch.bwc.rolling.BatchIngestionIT_801B4B74838557A-001
  2> NOTE: test params are: codec=Asserting(Lucene912): {}, docValues:{}, maxPointsInLeafNode=22, maxMBSortInHeap=5.7175746038185356, sim=Asserting(RandomSimilarity(queryNorm=true): {}), locale=sw-TZ, timezone=America/Bahia_Banderas
  2> NOTE: Linux 6.8.0-1020-azure amd64/Azul Systems, Inc. 21.0.6 (64-bit)/cpus=4,threads=3,free=453753024,total=536870912
  2> NOTE: All tests run in this JVM: [BatchIngestionIT]
BatchIngestionIT > testBatchIngestion_SparseEncodingProcessor_E2EFlow FAILED
    java.lang.AssertionError: expected:<10> but was:<8>
        at __randomizedtesting.SeedInfo.seed([801B4B74838557A:8487AE9ACE1A1513]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.opensearch.neuralsearch.BaseNeuralSearchIT.validateDocCountAndInfo(BaseNeuralSearchIT.java:1535)
        at org.opensearch.neuralsearch.bwc.rolling.BatchIngestionIT.testBatchIngestion_SparseEncodingProcessor_E2EFlow(BatchIngestionIT.java:46)
  1> [2025-01-24T12:12:44,811][INFO ][o.o.n.b.r.BatchIngestionIT] [testBatchIngestion_SparseEncodingProcessor_E2EFlow] before test
  1> [2025-01-24T12:12:45,122][INFO ][o.o.n.b.r.BatchIngestionIT] [testBatchIngestion_SparseEncodingProcessor_E2EFlow] initializing REST clients against [http://[::1]:41085, http://127.0.0.1:40513, http://[::1]:35229, http://127.0.0.1:41063, http://[::1]:41591, http://127.0.0.1:34731]
  1> [2025-01-24T12:12:49,676][INFO ][o.o.n.b.r.BatchIngestionIT] [testBatchIngestion_SparseEncodingProcessor_E2EFlow] There are still tasks running after this test that might break subsequent tests [cluster:admin/opensearch/ml/undeploy_model, cluster:admin/opensearch/mlinternal/syncup, indices:admin/seq_no/global_checkpoint_sync, indices:admin/seq_no/global_checkpoint_sync[p], indices:data/write/bulk, indices:data/write/bulk[s]].
  1> [2025-01-24T12:12:49,722][INFO ][o.o.n.b.r.BatchIngestionIT] [testBatchIngestion_SparseEncodingProcessor_E2EFlow] after test

How can one reproduce the bug?

Run the test locally, or raise a PR against neural-search to see it fail in the GitHub CI check.

What is the expected behavior?

The test should pass successfully.

Do you have any additional context?

https://github.com/opensearch-project/neural-search/actions/runs/12942873461/job/36138496383?pr=1140

@heemin32
Collaborator

Is it a flaky test?

@zhichao-aws
Member

zhichao-aws commented Jan 26, 2025

It's a serialization/deserialization issue from OpenSearch core: all query builders fail to serialize/deserialize between nodes.

To reproduce:

  1. Set up a two-node cluster with one 3.0.0 node and one 2.19-SNAPSHOT node.
  2. Create the index:
PUT test
{
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 0
    },
    "mappings": {
      "properties": {
        "passage_embedding": {
          "type": "rank_features"
        },
        "passage_text": {
          "type": "text"
        }
      }
    }
}
  3. Ingest documents:
POST _bulk
{"index":{"_index":"test"}}
{"passage_embedding":{"hello":1.1,"world":1.2}, "passage_text": "hello world"}
{"index":{"_index":"test"}}
{"passage_embedding":{"hello":1.1,"world":1.2}, "passage_text": "hello world"}
{"index":{"_index":"test"}}
{"passage_embedding":{"hello":1.1,"world":1.2}, "passage_text": "hello world"}
{"index":{"_index":"test"}}
{"passage_embedding":{"hello":1.1,"world":1.2}, "passage_text": "hello world"}
{"index":{"_index":"test"}}
{"passage_embedding":{"hello":1.1,"world":1.2}, "passage_text": "hello world"}
  4. Run a search:
GET test/_search

Here we can use any query (an empty body defaults to match_all) and send the request to any node. We get a shard failure with a serialization/deserialization error in the response:

{
  "took": 63,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 1,
    "skipped": 0,
    "failed": 1,
    "failures": [
      {
        "shard": 1,
        "index": "test2",
        "node": "AEmZn6uASnGjAU0SCF-EXg",
        "reason": {
          "type": "illegal_state_exception",
          "reason": "unexpected byte [0x3f]"
        }
      }
    ]
  },
  "hits": {
    "total": { "value": 3, "relation": "eq" },
    "max_score": 1.0,
    "hits": [ ... ]
  }
}
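The shard failure above can be pulled out of the response body programmatically, which is handy when scanning BWC test output. A minimal sketch (the `extract_shard_failures` helper is illustrative, not part of the test suite; the dict below is the response body pasted above, transcribed by hand):

```python
def extract_shard_failures(search_response):
    """Return (index, shard, reason) tuples for each per-shard failure in a _search response."""
    shards = search_response.get("_shards", {})
    return [
        (f.get("index"), f.get("shard"), f.get("reason", {}).get("reason"))
        for f in shards.get("failures", [])
    ]

# The response body pasted above, as a Python dict:
response = {
    "took": 63,
    "timed_out": False,
    "_shards": {
        "total": 2,
        "successful": 1,
        "skipped": 0,
        "failed": 1,
        "failures": [{
            "shard": 1,
            "index": "test2",
            "node": "AEmZn6uASnGjAU0SCF-EXg",
            "reason": {"type": "illegal_state_exception",
                       "reason": "unexpected byte [0x3f]"},
        }],
    },
    "hits": {"total": {"value": 3, "relation": "eq"}, "max_score": 1.0, "hits": []},
}

print(extract_shard_failures(response))
# -> [('test2', 1, 'unexpected byte [0x3f]')]
```

An empty `failures` list yields an empty result, so the helper can be called unconditionally on every search response.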

@vibrantvarun
Member Author

Hey @zhichao-aws It could be due to the recent breaking changes in 3.0. Can you try the same with 2.18 and 2.19?

@vibrantvarun
Member Author

vibrantvarun commented Jan 27, 2025

@martin-gaievski You did some deep-dive on the breaking changes coming from 3.0; can you share some insights here, as we do not see this issue in BWC tests running on 2.x?

@martin-gaievski
Member

I checked the Lucene 10 changes and drafted PR #1141 to address them.

If this were related to Lucene 10, the code wouldn't even compile, since there are some incompatible API changes. At first glance it looks related to the internal logic of the SparseEncoder, or maybe some resource that we set up in this particular test.

@zhichao-aws
Member

> Hey @zhichao-aws It could be due to the recent breaking changes in 3.0. Can you try the same with 2.18 and 2.19?

Do you mean the BWC test between 2.18 and 2.19, or the BWC test between main and 2.18?

@zhichao-aws
Member

> I checked the Lucene 10 changes and drafted PR #1141 to address them.
>
> If this were related to Lucene 10, the code wouldn't even compile, since there are some incompatible API changes. At first glance it looks related to the internal logic of the SparseEncoder, or maybe some resource that we set up in this particular test.

It's not due to the Lucene 10 changes. It's caused by a serialization/deserialization error from core: opensearch-project/OpenSearch#17125

@zhichao-aws
Member

Based on the comment there, the issue should be fixed after we bump to 3.0.0-alpha1.

@martin-gaievski
Member

I've drafted PR #1141 for 3.0-alpha; we can monitor progress over there. I see BWC tests are still failing for now, so it doesn't look like simply adopting the latest core takes care of it; maybe some additional steps are needed on the neural-search side.

@zhichao-aws
Member

> I've drafted PR #1141 for 3.0-alpha; we can monitor progress over there. I see BWC tests are still failing for now, so it doesn't look like simply adopting the latest core takes care of it; maybe some additional steps are needed on the neural-search side.

Based on the error log, it seems to fail for a different reason (the ML model failing to load), while the previous failure was related to query serialization/deserialization between nodes.

REPRODUCE WITH: ./gradlew ':qa:rolling-upgrade:testAgainstOldCluster' --tests "org.opensearch.neuralsearch.bwc.rolling.BatchIngestionIT.testBatchIngestion_SparseEncodingProcessor_E2EFlow" -Dtests.seed=34D952AC3611A812 -Dtests.security.manager=false -Dtests.bwc.version=2.20.0-SNAPSHOT -Dtests.locale=bg-Cyrl-BG -Dtests.timezone=Pacific/Gambier -Druntime.java=23

BatchIngestionIT > testBatchIngestion_SparseEncodingProcessor_E2EFlow FAILED
    java.lang.RuntimeException: Model t0NmApUBfHInJ7oBTwbr failed to load after 30 attempts
        at __randomizedtesting.SeedInfo.seed([34D952AC3611A812:B85F4881B033E87B]:0)
        at org.opensearch.neuralsearch.bwc.rolling.BatchIngestionIT.waitForModelToLoad(BatchIngestionIT.java:108)
        at org.opensearch.neuralsearch.bwc.rolling.BatchIngestionIT.testBatchIngestion_SparseEncodingProcessor_E2EFlow(BatchIngestionIT.java:36)
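For context, the `waitForModelToLoad` helper in the test retries up to 30 times before giving up with the error above. The polling pattern it follows can be sketched as below (a minimal sketch, not the actual test code; the function name `wait_for_model`, the injected `fetch_state` callable, and the model state strings are illustrative assumptions):

```python
import time

def wait_for_model(fetch_state, model_id, max_attempts=30, delay_sec=1.0):
    """Poll fetch_state(model_id) until it reports a loaded model, or raise after max_attempts."""
    for attempt in range(max_attempts):
        # State names are assumptions here; real deployments report strings such as DEPLOYED.
        if fetch_state(model_id) in ("DEPLOYED", "LOADED"):
            return attempt + 1  # number of polls it took
        time.sleep(delay_sec)
    raise RuntimeError(f"Model {model_id} failed to load after {max_attempts} attempts")

# Example with a fake state fetcher that reports DEPLOYED on the third poll:
calls = {"n": 0}
def fake_fetch(_model_id):
    calls["n"] += 1
    return "DEPLOYED" if calls["n"] >= 3 else "DEPLOYING"

print(wait_for_model(fake_fetch, "t0NmApUBfHInJ7oBTwbr", delay_sec=0.0))
# -> 3
```

If the model never reaches a loaded state within `max_attempts` polls, the loop raises the same style of "failed to load after N attempts" error seen in the log above.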

@zhichao-aws
Member

The bug is fixed after switching to 3.0.0-alpha1.
