
Duplicates returned when listing nodes files #10086

Open
aaronwolen opened this issue Oct 22, 2022 · 5 comments

@aaronwolen

This was originally reported in ropensci/osfr#150 by @doomlab.

The reprex here shows that the same file, 421_Lu.pdf, is returned twice when listing files in the Local IRB directory within this project.

I've confirmed that the duplicate entries are coming from the OSF API, across different pages of results:

#!/usr/bin/env bash

set -e

TOKEN="$OSF_PAT"
NODE="ycn7z"
ID="6113d75ae3801305b39612a8"
LIMIT=2

# Retrieve name and path attributes from JSON response
JQ_FILTER='.data[].attributes | "\(.name) \(.path)"'

for i in $(seq 1 $LIMIT); do
  echo "Retrieving page $i"
  curl --silent \
    "https://api.osf.io/v2/nodes/$NODE/files/osfstorage/$ID/?page=$i" \
    -H "Authorization: Bearer $TOKEN" \
    -H 'Accept: application/vnd.api+json' \
    -H 'Content-Type: application/json; charset=utf-8' \
    | jq "$JQ_FILTER"
done

## Retrieving page 1
## "97_Pfuhl.pdf /6163f0e5fd5b230191983824"
## "1897_Parker.pdf /616440dfc5565801d34b71bf"
## "1698_Butt.pdf /616513bbc5565802014b9ae6"
## "1970_Pavlović.pdf /617436dae572ea00b13a7285"
## "1560_Irrazabal.pdf /618281a0a30f8100cdaa071d"
## "1867_Oner.pdf /6184db04bfb47d00a3ef50dd"
## "169_Montefinese.pdf /6186148c25f90a004a0f6aa6"
## "87_Vaughn.docx /619548800b0c1e01a27fdae5"
## "35_Stewart.pdf /6197ca37ef62980009f5c789"
## "421_Lu.pdf /6161fcd9fd5b2301429849b3"                  <-- copy 1
##
## Retrieving page 2
## "423_Arriaga.pdf /619d017da83c2001650e8e53"
## "761_Papadatou-Pastou.pdf /619df2886977cd010f496498"
## "712_Davis.pdf /61a7d30d4d4ce5018476e569"
## "1574_Al-Hoorie.pdf /61b89ac6da0b1b0488d05546"
## "206_Ergiyen.pdf /61cc42f3da632006e1fe6f4a"
## "437_Peker.pdf /61fc2630370e6c002bf3d6cc"
## "104_Stieger.pdf /620e3a2511da1c05cdf57647"
## "238_Martínez.pdf /620f7666d9b6cf0144b90449"
## "1052_Parzuchowski.pdf /6220fbccc064270378d90ce5"
## "421_Lu.pdf /6161fcd9fd5b2301429849b3"                  <-- copy 2

The WaterButler IDs are identical, so both entries refer to the same file, which does seem like a possible bug.
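
A rough way to check for this without reading the output by eye is to collect the "name path" lines from every page and print the ones that occur more than once. This is only a sketch: `find_duplicates` is a name made up for illustration, and it assumes one unquoted "name path" line per file (for example from `jq -r` with the filter above).

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of osfr or the OSF API).
# Reads one "name path" line per file on stdin and prints
# each line that occurs more than once, a single time.
find_duplicates() {
  sort | uniq -d
}
```

Piping the combined output of all pages into `find_duplicates` should print nothing when every file is listed exactly once.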

Let me know if you need any more information.

@doomlab

doomlab commented Oct 22, 2022

Thanks @aaronwolen - I will note that when I run the same code, a different file is duplicated, and the duplicated file sometimes changes, usually when I update or upload a new file. You can see my reprex here.

@Johnetordoff
Contributor

I was able to reproduce the error, and it is indeed coming from the API: the default sorting logic for this particular endpoint appears to be broken. As a temporary workaround, you can supply your own sort criterion to prevent the error. For example, https://api.osf.io/v2/nodes/ycn7z/files/osfstorage/6113d75ae3801305b39612a8/?sort=id sorts correctly by id, removing all duplicates, and the same goes for name, etc. We will resolve the issue eventually, but I recommend using the workaround for now. Thanks for your interest. If you have no further questions or comments, I'll close this issue once we've resolved the bug.
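
For reference, here is a minimal sketch of the paginated request from the original report with an explicit `sort` added. The helper names `files_url` and `list_files_sorted` are hypothetical, and the sketch assumes `curl` and `jq` are installed and that `OSF_PAT` holds a personal access token. Only the endpoint and the `sort` and `page` query parameters come from this thread.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the URL for one page of a folder listing
# with an explicit sort key (e.g. "id" or "name").
files_url() {
  local node="$1" folder_id="$2" sort_key="$3" page="$4"
  printf 'https://api.osf.io/v2/nodes/%s/files/osfstorage/%s/?sort=%s&page=%s\n' \
    "$node" "$folder_id" "$sort_key" "$page"
}

# Hypothetical helper: fetch the first N pages with that sort key and
# print one unquoted "name path" line per file.
# Assumes OSF_PAT is set in the environment.
list_files_sorted() {
  local node="$1" folder_id="$2" sort_key="$3" pages="$4" page
  for page in $(seq 1 "$pages"); do
    curl --silent --fail \
      -H "Authorization: Bearer $OSF_PAT" \
      "$(files_url "$node" "$folder_id" "$sort_key" "$page")" \
      | jq -r '.data[].attributes | "\(.name) \(.path)"'
  done
}
```

With the values from the report, this would be called as `list_files_sorted ycn7z 6113d75ae3801305b39612a8 id 2`.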

@aaronwolen
Author

Thanks for the quick response, @Johnetordoff! A couple follow-up questions for you:

  • I didn't even know about the sort param. Is it documented somewhere and I missed it?
  • What other attributes can we sort on?
  • Will sorting on any attribute solve the issue?

@doomlab

doomlab commented Oct 23, 2022

Thanks @Johnetordoff - I have updated my code and got the appropriate output. Appreciate the workaround.

@Johnetordoff
Contributor

@aaronwolen

I didn't even know about the sort param. Is it documented somewhere and I missed it?

It is not documented. Unfortunately, this param is not implemented consistently across all the endpoints it applies to. Some queries, legacy endpoints, and attributes haven't been QA'ed for accurate sorting, so they remain undocumented.

What other attributes can we sort on?

The default sorting behavior for list views is to allow the user to sort on any of the attributes returned in the JSON payload. For example, https://api.osf.io/v2/users/ allows you to sort on full_name, given_name, middle_names, family_name, suffix, date_registered, active, timezone, locale, social, employment, and education.

Will sorting on any attribute solve the issue?

I did not check. As I wrote above, this behavior is not guaranteed to be accurate or consistent.
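
Since server-side sorting is not guaranteed, a client may also want to deduplicate defensively. The sketch below keeps the first line seen for each path. `dedupe_by_path` is a hypothetical name, and it assumes one "name path" line per file, where the path is the last whitespace-separated field and contains no spaces (as in the output earlier in this thread).

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of osfr or the OSF API).
# Reads "name path" lines on stdin and keeps only the first line
# for each path, using the last field ($NF) as the key so that
# file names containing spaces are still handled.
dedupe_by_path() {
  awk '!seen[$NF]++'
}
```

This only hides repeated entries. If the unstable ordering also causes a file to be skipped, deduplication will not bring it back, so it may be worth checking the final count against the number of files you expect.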


3 participants