Cache key for saving / restoring the link check cache (e.g. for GH Actions) #671

riccardoporreca · 2021-12-19T10:16:11Z

As discussed in #664 (comment), saving / restoring the HTMLProofer cache in CI tools like GitHub Actions suffers from the difficulty of defining a proper key reflecting the the content of the cache:

Indeed, by definition, the cache is not saved for hits on the primary key, since the idea is that the primary key uniquely defines what is being cached. See also actions/cache#171, actions/cache#481, actions/cache#628

This is unfortunately not easy with HTMLProofer, unless it would expose the functionality to just return the links to be checked (and cached) without really checking them. This information could be used to compute a hash used in the primary key.

It would be great if HTMLProofer could "return" (perhaps just to standard output for the command-line usage) the links or a computed hash of the links to be checked.

Note that the approach currently mentioned in the README suffers from the issue described in #664. I have been switching to using the commit SHA to work around this.

gjtorikian · 2021-12-19T14:56:40Z

The cache is so wildly broken right now; I'm working on a new major release (#669).

gjtorikian · 2021-12-31T17:39:24Z

I don't know much about how GitHub Actions work, but as of #672 the cache logic has been completely rewritten.

Would it be enough to have a, let's say, an md5 of the cache log file?

riccardoporreca · 2021-12-31T18:53:30Z

@gjtorikian, excited to see the new caching logic in action with the new v4, thanks for the big effort!

As for the cache key, we would need to compute it before checking the actual links and use it to restore the actual cache log file from a previous run. An md5 of the links to be eventually checked and cached would do... in a sense a cache log file "skeleton" with the links but w/o the result of the check would be enough, although I am not sure this fits the way the cache is implemented.

What would probably work is to construct @internal_urls and @external_urls in the runner as currently done, and extract their keys (or the md5 thereof) w/o calling validate_internal/external_urls (or doing any other check).

html-proofer/lib/html_proofer/runner.rb

Lines 66 to 79 in 041bc94

    
           # Collects any external URLs found in a directory of files. Also collectes 
        
           # every failed test from process_files. 
        
           # Sends the external URLs to Typhoeus for batch processing. 
        
           def check_files 
        
             process_files.each do |item| 
        
               @external_urls.merge!(item[:external_urls]) 
        
               @internal_urls.merge!(item[:internal_urls]) 
        
               @failures.concat(item[:failures]) 
        
             end 
        
             validate_external_urls unless @options[:disable_external] 
        
             validate_internal_urls 
        
           end

A (perhaps stupid) idea, trying not to be too disruptive and not too tailored to the cache key problem: add a new option like --extract-links that would

only enable link checks, basically using the existing machinery to populate @internal_urls and @external_urls.
instead of calling validate_internal/external_urls, dump @internal_urls and @external_urls to a file (similar to what done for the actual cache, perhaps with the same default location tmp/.htmlproofer)

The dumped extracted links could be useful per se, and the file(s) can be easily hashed to construct the key for GitHub Actions via the built-in hashFiles(). For the sake of constructing the key, it actually does not harm much if the dumped files contain the full @internal_urls and @external_urls information instead of just their keys.

gjtorikian · 2021-12-31T21:52:25Z

Ok, thanks for walking me through that. For external links this should be easy: an external link is the same no matter what file it appears in.

Internal links are a different monster. While we can just keep track of all the links found, but I think the source file name would also need to be preserved. The current cache logic does do this; I’m not sure how it would like to hash/preserve during CI runs. (To be exploit about this: a link to /somewhere.html behaves differently if the file with the link is a sibling, versus a file that is not actually a sibling. The cache now keeps track of the source filename path for internal links to resolve their validity.)

gjtorikian · 2022-01-02T15:19:59Z

So, here's what I'm thinking. The format for the cache file is roughly like this:

{
  "version": 2,
  "internal": {
    "/somewhere.html": {
      "time": "2022-01-01 22:46:47 -0500",
      "metadata": [
        {
          "source": "spec/html-proofer/fixtures/cache/internal_and_external_example.html",
          "current_path": "spec/html-proofer/fixtures/cache/internal_and_external_example.html",
          "line": 11,
          "base_url": "",
          "found": null
        }
      ]
    }
  },
  "external": {
    "https://github.com/gjtorikian/html-proofer": {
      "time": "2022-01-01 22:46:47 -0500",
      "found": true,
      "status_code": 200,
      "message": "OK",
      "metadata": [
        {
          "filename": "spec/html-proofer/fixtures/cache/internal_and_external_example.html",
          "line": 7
        }
      ]
    }
  }
}

It seems to me that if I:

Grab all the internal urls, and their source files, I can combine them into a giant string. An array of arrays, say: [[/somewhere.html, ["spec/html-proofer/fixtures/cache/internal_and_external_example.html"]]
Grab all the external URLs: ["https://github.com/gjtorikian/html-proofer"]
md5 this information and store it in the storage directory (as fingerprint or some such)

This should be enough to preserve a cache between runs.

I feel like I'm missing something around timestamps, though. For example, if a CI run occurred on March 2021, and then ran again in March 2022, with no fundamental changes to the contents...the md5 hash would be the exact same, but the cache should be invalidated, because it's possible the external links changed in that time (maybe the website was removed or whatever). Then, I think how I'm storing the timestamps right now are wrong; it should be on the top-level, rather than per-link, for easier access; and if I do that, then the fingerprint can be the timestamp of the last run plus the md5 hash (2022-01-01 22:46:47 -0500-4e7757d55a0f3255b3e25f81ec46f929). HTMLProofer by default can then invalidate the cache after, say, 7 days. I would also probably not add a new option for this, but rather, have it happen automatically if a CI=true environment variable is set.

What do you think? Am I missing anything?

riccardoporreca · 2022-01-02T18:32:14Z

@gjtorikian, you have a fair point about the timestamps... I will have to give it a few more thoughts, but indeed the timestamps is an important aspect to ensure the cache is saved if links expire. Hope to get back to this shortly.
I am not sure giving up with per-link timestamps is fully desirable, but will come back with more concrete feedback after more thoughts.

riccardoporreca · 2022-01-03T18:01:10Z

Let me take a few steps back and try to reason about cache keys.

TL;DR: Timestamps play indeed an important role for an ideal primary key, but they cannot be used to construct the key => We are better off not defining a "smart" key just suggesting an "aggressive" key unique to each run forcing the cache to be always saved.

(Yet another) Summary of GitHub Actions cache behavior

The cache is restored for the defined primary key if one is found, otherwise using the restore-keys
- The primary/restore keys must be defined before using and updating it by running the link validation, so they can be used to to restore the cache
The cache is saved for the primary key only if no hot on the primary key hit occurred when restoring, hence a restored primary key cache is never re-saved even if updated.
The cache is not saved if the job fails (which would occur by default if HTMLProofer finds issues, unless the failed shell status is mangled)

Main target: make sure the HTMLProofer cache is saved when a relevant update of the cache content occurs (which implies defining a different primary key)

This is the most relevant aspect, to avoid issues like Github actions does not seem to use cached links #664
Given the relatively small size of HTMLProofer caches, we can afford being conservative and save more than needed rather than not saving (the cache from the "latest" run would still be restored based on the restore-keys)
Relevant updates include
- A) the set of cached links is changed (especially with new links)
- B) the cached links are re-checked as they become outdated, which depends on the current time vs. (cached timestamps + timeframe), where considering the oldest timestamp could be enough to determine whether an update would occur
(A) is kind of easy, and was the main idea when I created the issue; (B) is indeed the tricky bit: we cannot know the cached timestamps (to define the key) before the cache is restored (for which we need the key)
If the points above are correct, your suggestion of including the timestamp in the key would indeed address (B), but is not viable in practice.

I fear we might have to use a conservative approach / workaround and make sure the cache is always saved in each (successful) run

One can easily implement this by including an identifier for the current run in the primary key (should be easy using the GitHub Actions context)
We can afford this conservative approach as mentioned above (small cache size)
The approach can be still used in combination with the fingerprint based on internal / external urls, mainly serving the restore-keys
- Can elaborate more on this, but I don't see much added value for the complexity of introducing additional functionality to HTMLProofer
We can then just update the README by suggesting such conservative primary key (happy to draft something concrete).

gjtorikian added the next-gen label Dec 19, 2021

gjtorikian mentioned this issue Dec 31, 2021

Github actions does not seem to use cached links #664

Closed

riccardoporreca mentioned this issue Jan 9, 2022

Problems with internal link cache re-checking logic (4.0.0.rc3) #695

Closed

riccardoporreca mentioned this issue Mar 14, 2022

Issue with internal links re-checking logic / cache keys #702

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cache key for saving / restoring the link check cache (e.g. for GH Actions) #671

Cache key for saving / restoring the link check cache (e.g. for GH Actions) #671

riccardoporreca commented Dec 19, 2021

gjtorikian commented Dec 19, 2021

gjtorikian commented Dec 31, 2021

riccardoporreca commented Dec 31, 2021

gjtorikian commented Dec 31, 2021

gjtorikian commented Jan 2, 2022 •

edited

riccardoporreca commented Jan 2, 2022

riccardoporreca commented Jan 3, 2022

Cache key for saving / restoring the link check cache (e.g. for GH Actions) #671

Cache key for saving / restoring the link check cache (e.g. for GH Actions) #671

Comments

riccardoporreca commented Dec 19, 2021

gjtorikian commented Dec 19, 2021

gjtorikian commented Dec 31, 2021

riccardoporreca commented Dec 31, 2021

gjtorikian commented Dec 31, 2021

gjtorikian commented Jan 2, 2022 • edited

riccardoporreca commented Jan 2, 2022

riccardoporreca commented Jan 3, 2022

gjtorikian commented Jan 2, 2022 •

edited