Spike: is it possible to reconstitute the current set of evaluations by ingesting the current real world #3234

Closed
giorgiosironi opened this issue May 9, 2024 · 11 comments

@giorgiosironi
Collaborator

No description provided.

@giorgiosironi giorgiosironi created this issue from a note in Sciety (In progress) May 9, 2024
@giorgiosironi
Collaborator Author

We ran this for all 7 PCI groups:

make ingest-evaluations INGEST_ONLY=pci INGEST_DAYS=6000 INGEST_DEBUG=true

This resulted in ~327 evaluations being correctly recorded.

@giorgiosironi
Collaborator Author

To look into a local database:

$ make dev-sql
docker compose --file docker-compose.yml --file docker-compose.dev.yml exec -e PGUSER=user -e PGPASSWORD=secret -e PGDATABASE=sciety db psql
psql (12.3)
Type "help" for help.

sciety=# SELECT COUNT(*) FROM events;
 count 
-------
   327
(1 row)

sciety=# SELECT COUNT(*) FROM events WHERE payload->>'groupId'='32025f28-0506-480e-84a0-b47ef1e92ec5';
 count 
-------
   114
(1 row)

@giorgiosironi
Collaborator Author

To count unique evaluations for a group on a copy of the production database:

grep -r 32025f28-0506-480e-84a0-b47ef1e92ec5 data/exploratory-test-from-prod.csv | grep -o '""evaluationLocator"": ""doi:.*' | sort | uniq | wc -l
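
For repeatability, the one-liner above could be wrapped in a small function that works for any locator prefix (doi:, hypothesis:, ...). This is only a sketch: the name count_unique_locators is made up here, and it assumes the CSV export keeps the doubled-quote JSON escaping seen in this thread.

```shell
# count_unique_locators GROUP_ID CSV_FILE
# Prints the number of distinct evaluationLocator values found on the
# lines of CSV_FILE that mention GROUP_ID.
# Hypothetical helper; assumes payloads are CSV-escaped JSON ("" for ").
count_unique_locators() {
  local group_id="$1"
  local csv_file="$2"
  grep -F -- "$group_id" "$csv_file" \
    | grep -o '""evaluationLocator"": ""[^"]*' \
    | sort -u \
    | wc -l \
    | tr -d '[:space:]'   # some wc implementations pad the count with spaces
}
```

Usage would look like: count_unique_locators 32025f28-0506-480e-84a0-b47ef1e92ec5 data/exploratory-test-from-prod.csv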

@davidcmoulton
Collaborator

davidcmoulton commented May 9, 2024

The Arcadia group has 2642 evaluations specified on their group card in prod. Local, fresh ingestion yielded 2083. We believe that this is correlated with switching off the old Arcadia Hypothesis group, see 6a6137a.

In another experiment we tried ingesting from both Hypothesis groups with a cutoff date between the two. We saw 1475 evaluations ingested from the current group (after 2023-04-15), and 556 ingested from the old group.

We discovered we did a backfill around 2023-08-01 that accidentally recorded all the content from the new group, creating duplicate evaluations:

$ grep -r SR9Keto7Ee2DzL80cY8qDA data/exploratory-test-from-prod.csv
bee9739c-fba6-4c8b-85f2-68799cfcec9d,EvaluationRecorded,2023-08-01 13:47:03.413,"{""authors"": [], ""groupId"": ""bc1f956b-12e8-4f5c-aadc-70f91347bd18"", ""articleId"": ""doi:10.1101/2023.02.06.527367"", ""publishedAt"": ""2023-04-13T20:39:28.282Z"", ""evaluationType"": ""not-provided"", ""evaluationLocator"": ""hypothesis:SR9Keto7Ee2DzL80cY8qDA""}"
$ grep -r DGndiNsdEe2sBGfmCuQGFQ data/exploratory-test-from-prod.csv
4b65d6fc-d704-46da-aa21-b15b2f22692f,EvaluationRecorded,2023-08-01 13:48:26.013,"{""authors"": [], ""groupId"": ""bc1f956b-12e8-4f5c-aadc-70f91347bd18"", ""articleId"": ""doi:10.1101/2023.03.19.532758"", ""publishedAt"": ""2023-04-14T23:35:32.479Z"", ""evaluationType"": ""not-provided"", ""evaluationLocator"": ""hypothesis:DGndiNsdEe2sBGfmCuQGFQ""}"
$ grep -r bc1f956b-12e8-4f5c-aadc-70f91347bd18 data/exploratory-test-from-prod.csv | grep "EvaluationRecorded,2023-08-01" | wc -l
613
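
To spot backfills like this one, it can help to count EvaluationRecorded events per recording date for a group. A minimal sketch, assuming the CSV columns are id, event type, timestamp, payload as in the lines above; recorded_per_day is a hypothetical helper name, not something in the repository.

```shell
# recorded_per_day GROUP_ID CSV_FILE
# Prints "DATE COUNT" lines for EvaluationRecorded events mentioning
# GROUP_ID, busiest recording dates first.
# Hypothetical helper; assumes the timestamp is the third CSV column
# and that the first two columns contain no commas.
recorded_per_day() {
  local group_id="$1"
  local csv_file="$2"
  grep -F -- "$group_id" "$csv_file" \
    | grep -F ',EvaluationRecorded,' \
    | cut -d, -f3 \
    | cut -d' ' -f1 \
    | sort \
    | uniq -c \
    | awk '{ print $2, $1 }' \
    | sort -k2,2nr -k1,1
}
```

Run against bc1f956b-12e8-4f5c-aadc-70f91347bd18, a spike on a single date, such as the 613 events on 2023-08-01 found above, should stand out at the top of the output.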

@giorgiosironi
Collaborator Author

The preLights group has 1175 evaluations specified on their group card in prod. Local, fresh ingestion yielded 1542.
We had to edit hours=120000 in the ingestion code to look further into the past than the hardcoded 120 hours.

@giorgiosironi
Collaborator Author

The prereview group has 415 evaluations specified on their group card in prod. Local, fresh ingestion yielded 405.

@giorgiosironi
Collaborator Author

NCRC is a dormant group and is known not to have an ingestion set up currently; we would need to bring that back.

@kevinrutherford
Collaborator

The eLife group has 26996 evaluations specified on their group card in prod. Local, fresh ingestion yielded 24750 (but with 1666 lefts).

@giorgiosironi
Collaborator Author

To compare, the current Hypothesis group we ingest from for eLife:

$ curl -s "https://api.hypothes.is/api/search?group=q5X6RWJ6&limit=200&sort=created&order=asc" | jq .total
26477

which is about 500 fewer than the group card in prod.

The number of unique evaluation locators recorded in prod is:

$ grep -r b560187e-f2fb-4ff9-a861-a204f3fc0fb0 data/exploratory-test-from-prod.csv | grep -o '""evaluationLocator"": ""hypothesis:.*' | wc -l
28865

which suggests we erased/removed almost 2000 of them over time?

@giorgiosironi
Collaborator Author

More information on eLife suggests that the Hypothes.is group is the source of truth and will cover our use case:

https://hypothes.is/users/Public_Reviews is the user and https://hypothes.is/groups/q5X6RWJ6/elife is the group. Using the group rather than the user makes sense. We post reviews/responses using Kotahi, which ensures they end up on hypothes.is in the group specified (as well as bioRxiv, medRxiv, EPP, and hopefully Sciety!) Paul, let us know if any of that is mistaken!

Thanks, that fits our understanding, and it goes back to 2019 so it doesn't miss any historical content.

@giorgiosironi
Collaborator Author

giorgiosironi commented May 10, 2024

The Rapid Reviews Infectious Diseases group has 995 evaluations specified on their group card in prod. Local, fresh ingestion yielded only 114 even after unhardcoding daysAgo(1).

Edit: various problems in the ingestion:

After hardcoding offset=0, offset=1000, and offset=2000 in the ingestion, I was able to get 1118 events. We might be missing content from this group in production.
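
The manual offsets suggest the ingestion should keep requesting pages until one comes back short or empty. A rough sketch of that loop, not the actual ingestion code: fetch_all and the fetch command it calls are hypothetical, and a page size of 1000 is only inferred from the offsets tried above.

```shell
# fetch_all FETCH_CMD PAGE_SIZE
# Calls "FETCH_CMD OFFSET PAGE_SIZE" with increasing offsets and prints
# every item returned. FETCH_CMD is any command or function that prints
# one item per line. Stops on an empty page or a page shorter than
# PAGE_SIZE. Hypothetical helper for illustration only.
fetch_all() {
  local fetch_cmd="$1"
  local page_size="$2"
  local offset=0
  local page
  local count
  while :; do
    page="$("$fetch_cmd" "$offset" "$page_size")"
    if [ -z "$page" ]; then
      break
    fi
    printf '%s\n' "$page"
    count="$(printf '%s\n' "$page" | wc -l | tr -d '[:space:]')"
    if [ "$count" -lt "$page_size" ]; then
      break
    fi
    offset=$((offset + page_size))
  done
}
```

With a page size of 1000 this would request offset=0, 1000, 2000, and so on without anyone having to hardcode them, which is the behaviour the three manual runs approximated.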
