MissingInputException for Valid Downloadable URL #21

Open
FabianHofmann opened this issue Apr 22, 2024 · 2 comments · May be fixed by #23

Comments

@FabianHofmann

I am encountering an unexpected error when using the storage plugin. I have the following link, which downloads an xlsx file from the Destatis database (https://www.destatis.de/DE/Home/_inhalt.html):

"https://www.destatis.de/EN/Themes/Economy/Prices/Publications/Downloads-Energy-Price-Trends/energy-price-trends-xlsx-5619002.xlsx?__blob=publicationFile"

The link has no redirects and works properly both in the browser and with requests.get. However, when I use it within the storage function, as in

rule retrieve_irena:
    input:
        storage(
            "https://www.destatis.de/EN/Themes/Economy/Prices/Publications/Downloads-Energy-Price-Trends/energy-price-trends-xlsx-5619002.xlsx?__blob=publicationFile",
        ),

the workflow throws the following error:

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
MissingInputException in rule retrieve_irena in file /home/fabian/playground/snakemake-storage/Snakefile, line 1:
Missing input files for rule retrieve_irena:
    affected files:
        https://www.destatis.de/EN/Themes/Economy/Prices/Publications/Downloads-Energy-Price-Trends/energy-price-trends-xlsx-5619002.xlsx (storage)

I tried to understand what is going on, but could not resolve it. It seems to me like a bug, but perhaps I am missing a required setting.

@Hugovdberg

It appears that several different causes can all result in the same MissingInputException. It could be an authentication issue (that happened to me today), but I suspect that in this case the ?__blob=publicationFile at the end of the URL causes the issue. This URL, for example, seems to work just fine: http://wettelijkerente.net/wettelijkerente2.csv

@Hugovdberg

Ah no, I found the issue for your URL. Snakemake uses requests.head to get some initial data about the file without downloading it in its entirety, but that request returns an HTTP 303 status, which tells the client to redirect elsewhere, and following that redirect returns an HTTP 400 'Bad Request'. So Snakemake assumes that every HTTP server supports both the HEAD and GET HTTP verbs, but that is not the case on this server.
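
The HEAD/GET difference can be checked outside Snakemake with a short script. This is only a sketch: it uses the standard library instead of requests, the helper names status_for and compare_head_and_get and the injectable opener parameter are made up for illustration, and urllib may handle redirects differently from requests, so the exact codes can differ from the ones quoted above.

```python
import urllib.error
import urllib.request


def status_for(url, method, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code the server gives for `method` on `url`.

    `opener` is a hypothetical injection point so the function can be
    exercised without touching the network.
    """
    request = urllib.request.Request(url, method=method)
    try:
        # Leaving the `with` block closes the response without reading the body.
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the code is still what we want to compare.
        return error.code


def compare_head_and_get(url, opener=urllib.request.urlopen):
    """Return (HEAD status, GET status) so a mismatch is easy to spot."""
    return status_for(url, "HEAD", opener), status_for(url, "GET", opener)
```

Calling compare_head_and_get with the Destatis URL from the original post and printing the result shows whether the two verbs are treated differently by the server.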

I think the best way to fix this would be to add a configuration flag supports_http_head on the storage provider, which defaults to True but can be set to False so that GET is also used to query the metadata.
Alternatively, an allow_http_get_fallback flag could be created instead, which defaults to False but, when set to True, would fall back to GET on certain HTTP status codes. However, it might be quite tricky to get the set of status codes right, because I think error 400 is actually a code on which you would normally not retry with GET. Therefore the supports_http_head flag seems to me the better approach. I will create a pull request to implement this shortly.
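
To make the trade-off between the two flags concrete, here is a rough sketch of the decision logic only. The flag names come from the proposal above; the function names and the contents of GET_FALLBACK_STATUSES are assumptions for illustration, not what the plugin does.

```python
# Hypothetical set of HEAD statuses on which a GET retry seems defensible:
# 405 Method Not Allowed and 501 Not Implemented. 400 is deliberately absent,
# which is exactly why the fallback design would miss the server in this issue.
GET_FALLBACK_STATUSES = frozenset({405, 501})


def metadata_method(supports_http_head=True):
    """Option 1: the user states up front whether HEAD can be trusted."""
    return "HEAD" if supports_http_head else "GET"


def should_retry_with_get(head_status, allow_http_get_fallback=False):
    """Option 2: retry with GET only for selected HEAD status codes."""
    return allow_http_get_fallback and head_status in GET_FALLBACK_STATUSES
```

With option 1 the behaviour depends only on the flag, whereas option 2 depends on a status list that somebody has to maintain.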

@johanneskoester is there a way to make the MissingInputException give more feedback for remote files? Once a network is involved, there are many reasons for a file to be (temporarily) not found, even for a valid resource.
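
As a rough idea of what such feedback could look like, the sketch below turns a status code or a connection error into a one-line explanation. The function describe_remote_failure and its hint texts are hypothetical and not part of Snakemake; only http.HTTPStatus is a real standard-library API.

```python
from http import HTTPStatus


def describe_remote_failure(url, status=None, error=None):
    """Build a human-readable reason why a remote input was not found."""
    if error is not None:
        return f"{url}: request failed before a response arrived ({error})"
    if status is None:
        return f"{url}: no HTTP status available"
    try:
        phrase = HTTPStatus(status).phrase
    except ValueError:
        phrase = "unknown status"
    if status in (401, 403):
        hint = "check credentials or permissions"
    elif status == 404:
        hint = "the resource does not exist at this URL"
    elif status >= 500:
        hint = "server-side problem, possibly temporary"
    elif status >= 400:
        hint = "the server rejected the request; it may not support this HTTP method"
    else:
        hint = "unexpected status for an existence check"
    return f"{url}: HTTP {status} {phrase}; {hint}"
```

A message like this could be appended to the list of affected files so that authentication problems, missing files and unsupported methods no longer look identical.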

@Hugovdberg Hugovdberg linked a pull request Apr 25, 2024 that will close this issue