jaeger-v2 e22 tests are unstable #5418

Open
yurishkuro opened this issue May 5, 2024 · 1 comment

@yurishkuro
Member

@akagami-harsh since merging #5398 I noticed this test failing sporadically. Here's one failed run: https://github.com/jaegertracing/jaeger/actions/runs/8955823698/job/24596965281?pr=5416

=== RUN   TestCassandraStorage/GetLargeSpans
    integration.go:205: Testing Large Trace over 10K ...
    integration.go:375: Dropped binary type attributes from trace ID: 0000000000000011
2024/05/05 05:05:57 traces export: context deadline exceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded
    integration.go:145: Waiting for storage backend to update documents, iteration 1 out of 100
    integration.go:145: Waiting for storage backend to update documents, iteration 2 out of 100
...
    integration.go:145: Waiting for storage backend to update documents, iteration 99 out of 100
    integration.go:145: Waiting for storage backend to update documents, iteration 100 out of 100
    integration.go:215: 
        	Error Trace:	/home/runner/work/jaeger/jaeger/plugin/storage/integration/integration.go:215
        	Error:      	Should be true
        	Test:       	TestCassandraStorage/GetLargeSpans
    trace_compare.go:74: 
        	Error Trace:	/home/runner/work/jaeger/jaeger/plugin/storage/integration/trace_compare.go:74
        	            				/home/runner/work/jaeger/jaeger/plugin/storage/integration/trace_compare.go:60
        	            				/home/runner/work/jaeger/jaeger/plugin/storage/integration/integration.go:216
        	Error:      	Not equal: 
        	            	expected: 10008
        	            	actual  : 10012

There are a couple of notable things here:

  1. There was a DeadlineExceeded gRPC error earlier, probably while the spans were being written.
  2. Oddly, more spans were returned (10012) than were written (10008).

I don't know if we have any retries in the writing pipeline, but even if we did, the writes are supposed to be idempotent, since we derive the primary key from the trace/span ID and the content hash. It may be useful to debug this further, perhaps by logging the difference between the written and loaded trace/span IDs (the trace ID is supposed to be the same). The test may also need to be more resilient to duplicates.
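
A minimal sketch of the debugging aid suggested above, assuming spans can be identified by a (trace ID, span ID) pair. The `spanKey` type and the `diffSpans` and `dedupe` helpers are hypothetical names for illustration and are not part of the Jaeger integration test code. Note that deduplicating by IDs alone ignores the content hash, so it would also merge spans that share IDs but differ in content.

```go
package main

import (
	"fmt"
	"sort"
)

// spanKey is a hypothetical identifier for a span: trace ID plus span ID.
type spanKey struct {
	TraceID string
	SpanID  string
}

// diffSpans compares written and loaded spans as multisets.
// missing holds keys written but not loaded; extra holds keys loaded
// more often than they were written (for example, duplicates).
func diffSpans(written, loaded []spanKey) (missing, extra []spanKey) {
	counts := make(map[spanKey]int, len(written))
	for _, k := range written {
		counts[k]++
	}
	for _, k := range loaded {
		counts[k]--
	}
	for k, n := range counts {
		for ; n > 0; n-- {
			missing = append(missing, k)
		}
		for ; n < 0; n++ {
			extra = append(extra, k)
		}
	}
	sortKeys(missing)
	sortKeys(extra)
	return missing, extra
}

// sortKeys orders keys by trace ID, then span ID, for stable log output.
func sortKeys(keys []spanKey) {
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].TraceID != keys[j].TraceID {
			return keys[i].TraceID < keys[j].TraceID
		}
		return keys[i].SpanID < keys[j].SpanID
	})
}

// dedupe keeps the first occurrence of each key, preserving order.
func dedupe(spans []spanKey) []spanKey {
	seen := make(map[spanKey]struct{}, len(spans))
	out := make([]spanKey, 0, len(spans))
	for _, k := range spans {
		if _, ok := seen[k]; ok {
			continue
		}
		seen[k] = struct{}{}
		out = append(out, k)
	}
	return out
}

func main() {
	written := []spanKey{{"0000000000000011", "01"}, {"0000000000000011", "02"}}
	loaded := []spanKey{
		{"0000000000000011", "01"},
		{"0000000000000011", "02"},
		{"0000000000000011", "02"}, // duplicate, as a repeated write might produce
	}
	missing, extra := diffSpans(written, loaded)
	fmt.Printf("missing=%d extra=%d\n", len(missing), len(extra))
	for _, k := range extra {
		fmt.Printf("extra span: trace=%s span=%s\n", k.TraceID, k.SpanID)
	}
	fmt.Printf("loaded after dedupe: %d\n", len(dedupe(loaded)))
}
```

Logging `extra` on failure would show whether the four additional spans are exact repeats of written spans or something else entirely.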

yurishkuro added the bug label May 5, 2024
yurishkuro changed the title from "cassandra 4.x v2 test unstable" to "jaeger-v2 e22 tests are unstable" May 5, 2024
@yurishkuro
Member Author

Another example, this time from ES:

=== RUN   TestESStorage/GetLargeSpans
    integration.go:205: Testing Large Trace over 10K ...
    integration.go:375: Dropped binary type attributes from trace ID: 0000000000000011
2024/05/05 16:47:30 traces export: context deadline exceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded
    integration.go:145: Waiting for storage backend to update documents, iteration 1 out of 100

I wonder if this happens because our v1 storage implementations save only one span at a time: when a large batch is sent to the exporter, storage may still be looping through all 10k spans when the exporter times out on its context deadline. It's not completely clear why 'rpc' would be mentioned in this case, however.

yurishkuro pushed a commit that referenced this issue May 8, 2024
## Which problem is this PR solving?
- Helps debugging unstable e2e tests in #5418

## Description of the changes
- Add more visibility to CI workflows by dumping related storage docker
logs if the tests failed.
- Made changes to Cassandra, Elasticsearch, and Opensearch.

## How was this change tested?
- Not tested, unable to test locally.

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [ ] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

---------

Signed-off-by: James Ryans <james.ryans2012@gmail.com>
yurishkuro pushed a commit that referenced this issue May 16, 2024
## Which problem is this PR solving?
- Helps to debug unstable e2e tests in #5418
- Current logs are still hard to follow, even though they already dump
the storage's docker and OTEL col binary logs, because we can't see which
part of the test relates to which storage docker / OTEL col binary logs.

## Description of the changes
- Add more logging to e2e tests that capture the timestamp of each
read/write run.

## How was this change tested?
- Run `STORAGE=grpc SPAN_STORAGE_TYPE=memory make jaeger-v2-storage-integration-test`

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [ ] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

Signed-off-by: James Ryans <james.ryans2012@gmail.com>