
Action fails, but no error? Appears incomplete - no summary. #104

Open
kubu4 opened this issue Dec 12, 2023 · 13 comments

Comments

@kubu4

kubu4 commented Dec 12, 2023

Whenever the urlchecker-action runs, it fails. The end of the log file appears to be incomplete, as it doesn't provide a summary or any error messages.

[screenshot of the end of the log file]

This is what my workflow file looks like:

name: URLChecker
on: [push]

jobs:
  check-urls:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: urlchecker-action
      uses: urlstechie/urlchecker-action@0.0.34
      with:
        # A subfolder or path to navigate to in the present or cloned repository
        subfolder: posts
        file_types: .qmd
        # Choose whether to print a more verbose end summary with files and broken URLs.
        verbose: true
        # The timeout seconds to provide to requests, defaults to 5 seconds
        timeout: 5
        # How many times to retry a failed request (each is logged, defaults to 1)
        retry_count: 3
        # Choose whether to force a passing exit status even when broken URLs are found
        force_pass: true

I'm not sure how to troubleshoot. Are there log files that get generated somewhere that I can look through?

@SuperKogito
Member

I am not sure what the issue is exactly, but you seem to have had it for a while. The way I would debug this is to run the same check locally using the Python module and see if that works. If it does, I would lower the number of workers, or set it to 1, and see if that solves the issue.

@kubu4
Author

kubu4 commented Dec 12, 2023

Thanks so much for the quick response and suggestions.

Admittedly, we're not sure how to run actions locally (we're a group of biologists, so not well-versed in software development stuff), but we'll poke around the web and report back.

@vsoch
Collaborator

vsoch commented Dec 12, 2023

Hey @kubu4! You shouldn't need to poke around the web - urlchecker is a command-line Python tool, and there are installation and usage instructions here:

https://github.com/urlstechie/urlchecker-python

The action is simply running that under the hood. Let us know if you have any questions! I work with a lot of biologists. :)

@kubu4
Author

kubu4 commented Dec 12, 2023

Ha! Thanks!

@kubu4
Author

kubu4 commented Dec 12, 2023

Brief update:

I think it's a memory issue. I ran urlchecker check --file-types ".qmd" . in my repo on a high-memory computer (256 GB RAM) and the memory was pegged all the way to the top!

I didn't let it finish because other people were trying to use the computer for other tasks and I had essentially locked it up.

Possible solution? Reducing the number of workers, per @SuperKogito's suggestion?

@vsoch
Collaborator

vsoch commented Dec 12, 2023

That seems strange - how many files are you checking (and what is a .qmd extension)? Try adding --serial.

@kubu4
Author

kubu4 commented Dec 12, 2023

Thousands of files. .qmd is Quarto markdown (it's still just markdown, but with a YAML header that Quarto can parse).

Many links are to large files (multi-GB in size). Would that have an impact on how this action runs?

@vsoch
Collaborator

vsoch commented Dec 12, 2023

Yes, likely - maybe try testing a smaller subset of the files first and see at what size it stops working?

@SuperKogito
Member

This is actually consistent with a memory overflow; the files are loaded and scanned for URLs, which requires a lot of RAM if your files are very large or very numerous. Using multiple workers only makes this worse, hence my suggestion to use one worker. Using --serial, per @vsoch's recommendation, is also a possible solution, but if your files are very big it will be hard to escape this, especially if the memory is not flushed as soon as the links are extracted.
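
For a sense of why this blows up, here is a rough sketch of that whole-file pattern (illustrative only - not the actual urlchecker code; the path and regex are made up):

import re

URL_RE = re.compile(r"https?://\S+")  # simplified URL pattern, for illustration

def collect_urls_whole_file(path):
    """Read an entire file into memory, then scan it for URLs."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        content = f.read()          # the whole file is held in RAM here
    return URL_RE.findall(content)  # plus a list containing every match

# Peak memory grows with the size of each file, and with several workers
# many files are held in memory at the same time.
print(collect_urls_whole_file("posts/example.qmd"))  # hypothetical path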

@vsoch
Collaborator

vsoch commented Dec 12, 2023

Likely we need a fix that processes them in batches (and doesn't try to load everything into memory at once).
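
Roughly, such a fix could look like the sketch below (a hypothetical outline, not a patch; check_one_file stands in for whatever per-file check is used):

from itertools import islice

def batched(paths, size):
    """Yield successive lists of at most `size` paths."""
    it = iter(paths)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def check_in_batches(paths, check_one_file, batch_size=50):
    """Check files batch by batch instead of loading everything up front."""
    results = []
    for batch in batched(paths, batch_size):
        for path in batch:
            # Only this batch's file contents and intermediate data are
            # alive at any moment; the accumulated output is just the much
            # smaller list of (url, status) results.
            results.extend(check_one_file(path))
    return results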

@vsoch
Collaborator

vsoch commented Dec 12, 2023

You could also just target runs on separate subdirectories (one at a time or in a matrix), depending on how large your repository is.
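
For example, a matrix over subfolders could look roughly like this (the subfolder names below are placeholders; the other inputs are copied from the workflow at the top of this issue):

jobs:
  check-urls:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        # placeholder directories - replace with real subfolders of posts/
        subfolder: [posts/2021, posts/2022, posts/2023]
    steps:
    - uses: actions/checkout@v2
    - name: urlchecker-action
      uses: urlstechie/urlchecker-action@0.0.34
      with:
        subfolder: ${{ matrix.subfolder }}
        file_types: .qmd
        verbose: true
        timeout: 5
        retry_count: 3
        force_pass: true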

@SuperKogito
Member

I suspect this has something to do with memory management and garbage collection.

Memory Allocation for File Reading: When you read a file in Python, the data is loaded into RAM. If you read the entire file at once (e.g., using read() or readlines()), the whole file content is held in memory at once, which can be problematic for large files.

https://github.com/urlstechie/urlchecker-python/blob/7dbd7ac171cf85788728b4cf5576c191f13c8399/urlchecker/core/fileproc.py#L135

Garbage Collection: Python uses a garbage collector to reclaim memory from objects that are no longer in use. The primary garbage collection mechanism in Python is reference counting. An object's memory is deallocated when its reference count drops to zero (i.e., when there are no more references to it in your program).

When Memory is Freed:

  • Automatic De-allocation: Memory for file data is automatically freed once the file object is no longer referenced. This can happen when the variable holding the file data goes out of scope, or if you explicitly del the variable.
  • Context Managers (with Statement): Using a with statement to handle file operations is good practice. It ensures that the file is properly closed after its suite finishes, even if an error occurs. However, closing a file does not immediately free the memory used for its content stored in a variable (see the short sketch after this list).
  • Manual Intervention: If you're dealing with very large files and want to ensure memory is freed promptly, you might need to manually delete large objects or use more granular read operations.
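
As a generic illustration of the first two points above (not urlchecker's actual code):

import sys

def read_and_report(path):
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        content = f.read()
    # The file handle is closed at this point, but the text is still
    # referenced by `content` and therefore still held in RAM.
    print(f"{path}: ~{sys.getsizeof(content) / 1e6:.1f} MB held in memory")
    del content  # reference count drops to zero; the memory can now be reclaimed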

Strategies for Large Files:

  • Read in Chunks: Instead of reading the whole file at once, you can read it in smaller chunks (e.g., line by line, or a fixed number of bytes at a time). This way, you only keep a small part of the file in memory at any given time.
  • Use Generators: Generators can be very effective for reading large files as they yield data on the fly and do not store it all in memory (see the sketch after this list).
  • External Libraries: Some Python libraries are optimized for handling large datasets and can be more efficient than standard file reading methods.
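
A minimal sketch of the chunk/generator idea (again illustrative, not the actual implementation):

import re

URL_RE = re.compile(r"https?://\S+")  # simplified pattern, for illustration

def iter_urls(path):
    """Yield URLs one line at a time instead of reading the whole file."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:                       # the file object iterates lazily
            yield from URL_RE.findall(line)  # only one line is in memory at a time

# Usage: the generator is consumed lazily, so peak memory stays flat
# regardless of file size.
for url in iter_urls("posts/example.qmd"):   # hypothetical path
    print(url)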

@vsoch generators could be a good fix here - what do you think?
@kubu4 as @vsoch mentioned, processing in batches seems to be the best option at the moment. Just make multiple workflows, each processing a different subset.

@vsoch
Collaborator

vsoch commented Dec 13, 2023

@SuperKogito my first suggestion to @kubu4 is to try processing in batches (e.g., multiple runs on different roots, which can be put into an action matrix). If that doesn't work, then I think we should add some kind of support to handle that internally.
