Auto track binary files #828

Merged
9 commits merged into huggingface:main on Apr 21, 2022

Conversation

@LysandreJik (Member) commented Apr 8, 2022

Closes #825.
Closes #687.

This adds the option to automatically track binary files. It relies on the following two assumptions:

  • Binary files usually cannot be read as text with the open file object reader; attempting to do so raises a UnicodeDecodeError.
  • Some binary files can nevertheless be read even with the open keyword. In that case, we search for the null character, which should only be present in binary files and which appears to be what grep uses to identify binary files (we use grep in the backend to look for these binary files). A minimal sketch of this logic follows the list.
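
A minimal sketch of the detection logic described above (illustrative only; the PR's actual is_binary_file may differ in its details):

def is_binary_file(filename: str) -> bool:
    try:
        # Most binary files fail to decode as UTF-8 text.
        with open(filename, "r", encoding="utf-8") as f:
            content = f.read()
        # Files that decode fine may still be binary: look for the null
        # character, the same heuristic grep uses for binary detection.
        return "\x00" in content
    except UnicodeDecodeError:
        return True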

This PR adds a new method, auto_track_binary_files, which automatically tracks binary files.

One change could be seen as breaking: the auto_lfs_track keyword of the git_add method previously tracked only large files, and it now tracks both large files and binary files.
I left it as is because I believe it's more of a bugfix than a breaking change: all binary files would previously have been rejected, resulting in errors.

If this is not deemed acceptable, the deprecation cycle would be to:

  • add auto_track_large_files and auto_track_binary_files keyword arguments to git_add so each can be toggled manually, and deprecate auto_lfs_track with a mention that it will soon handle both large and binary files.

The tests have been added directly to those that test for large files, as this results in a significant speedup compared to splitting them across two tests.

@HuggingFaceDocBuilderDev commented Apr 8, 2022

The documentation is not available anymore as the PR was closed or merged.

@LysandreJik (Member, Author) commented Apr 8, 2022

This seems like a partial fix to the issue, as this code sample by @osanseviero will still fail:

from huggingface_hub import Repository, HfApi
import os

repo_url = HfApi().create_repo("test-bin-bug")
repo = Repository("local_repo", clone_from=repo_url)
# Write a 1MB file consisting entirely of null bytes: binary content,
# but below the 10MB threshold for automatic large-file tracking.
with open(os.path.join("local_repo", "file"), "wb") as out:
    out.truncate(1024 * 1024)
repo.push_to_hub("Commit #1")

Trying to find a cross-platform approach that handles both.

@LysandreJik (Member, Author) commented:

Checking for the null character seems to do the trick.

@coyotte508 (Member) left a comment:

Maybe we can try to read only a certain amount of data? Like f.read(512) or f.read(100_000)? Or maybe first filter out files that are 10+MB (if that's not already done ^^) before checking whether they're binary.

(By the way, the check on the API side (when receiving files to upload) is on the first 512 bytes of data, with this code: https://github.com/gjtorikian/isBinaryFile/blob/main/src/index.ts#L158-L256; the check in the pre-receive hook is indeed done with grep.)
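
For illustration, a capped read along these lines might look like this (a sketch, not the code the PR ends up with):

def is_binary_file(filename: str, read_limit: int = 512) -> bool:
    # Inspect only the first read_limit bytes instead of the whole file.
    with open(filename, "rb") as f:
        sample = f.read(read_limit)
    return b"\x00" in sample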

@nateraw (Contributor) commented Apr 8, 2022

Thanks for the quick fix! This solved my issue here: #825. The issue was brought up during our HugGAN event, and it's affecting other users, so I'd love to get this merged as soon as we can 😄

One of the affected users also tried it out, and it solved their problem!

@julien-c (Member) commented Apr 9, 2022

This proposed fix makes sense to me!

Yes, looking for a null byte character is a common (simple) heuristic for binary files, and AFAIK it is used in text editors etc.

I agree with @coyotte508's other points

@osanseviero (Member) left a comment:

Thanks for this PR!! 🚀 🚀

@@ -1113,35 +1129,48 @@ def test_auto_track_large_files(self):
# This content is 20MB (over 10MB)
large_file = [100] * int(4e6)

# This content is binary (contains the null character)
binary_file = "\x00\x00\x00\x00"
Member:

I would consider moving this to another test to keep this one focused on large files (or rename this test, although I think separate tests would be better).

@LysandreJik (Member, Author):

Right, and what I mentioned in the description is that this effectively doubles the time spent on these tests, which is already significant. We're currently sitting at 10+ minutes for the test suite, which isn't ideal.

I'm open to moving it, but I'd like to emphasize that the suite will likely end up taking even longer. Let me know whether this compromise is okay with you; if not, I'll move it to its own separate test.

Member:

Good point! Let's keep it as is, then 😄

LysandreJik and others added 2 commits April 12, 2022 13:32
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
@coyotte508 (Member) left a comment:

Ok for me =)

@adrinjalali (Contributor) left a comment:

Thanks for the PR @LysandreJik

@@ -226,6 +226,27 @@ def is_git_ignored(filename: Union[str, Path]) -> bool:
return is_ignored


def is_binary_file(filename: Union[str, Path]) -> bool:
Contributor:

Other than reading the first few bytes, I think this answer gives a more accurate way of checking whether a file is binary: https://stackoverflow.com/a/7392391/2536294
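
For reference, a minimal sketch of the heuristic from that answer (the helper name looks_binary is illustrative):

# Bytes commonly found in text: BEL, BS, TAB, LF, FF, CR, ESC, plus all
# printable bytes except DEL (0x7f).
textchars = bytearray({7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x100)) - {0x7F})

def looks_binary(data: bytes) -> bool:
    # If any bytes remain after stripping text-like bytes, call it binary.
    return bool(data.translate(None, textchars))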

@LysandreJik (Member, Author):

I implemented what you proposed in 9fe6bb6; let me know if that's what you had in mind. Cc @coyotte508

# This content is binary (contains the null character)
binary_file = "\x00\x00\x00\x00"

with open(f"{WORKING_REPO_DIR}/non_binary_file.txt", "w+") as f:
Contributor:

We shouldn't write to the working dir; we should work with Python's temp files and folders.

@LysandreJik (Member, Author):

The WORKING_REPO_DIR is a folder that is created at the start of the test and reset at the beginning of every test; see here:

def setUp(self):
    if os.path.exists(WORKING_REPO_DIR):
        shutil.rmtree(WORKING_REPO_DIR, onerror=set_write_permission_and_retry)
    logger.info(
        f"Does {WORKING_REPO_DIR} exist: {os.path.exists(WORKING_REPO_DIR)}"
    )
    self.REPO_NAME = repo_name()
    self._repo_url = self._api.create_repo(
        repo_id=self.REPO_NAME, token=self._token
    )
    self._api.upload_file(
        path_or_fileobj=BytesIO(b"some initial binary data: \x00\x01"),
        path_in_repo="random_file.txt",
        repo_id=f"{USER}/{self.REPO_NAME}",
        token=self._token,
    )

This isn't the working directory in which the test is run, but a folder nested in the fixtures folder:

WORKING_REPO_DIR = os.path.join(
    os.path.dirname(os.path.abspath(__file__)), "fixtures/working_repo_2"
)

binary_file = "\x00\x00\x00\x00"

# Test nested gitignores
os.makedirs(f"{WORKING_REPO_DIR}/directory")
Contributor:

same here

@LysandreJik (Member, Author):

Same answer as above :)

@adrinjalali (Contributor) left a comment:

Otherwise LGTM.

I'll open a separate issue about the fixtures folder.

"""
try:
with open(filename, "rb") as f:
content = f.read()
Contributor:

Suggested change:
-            content = f.read()
+            content = f.read(1024)  # or 512 if we want to be consistent with the backend

@LysandreJik (Member, Author):

This was discussed above: #828 (comment)

Do you disagree with the conclusion?

Contributor:

Didn't see that. Yeah, I don't think we should ever read 11GB of data into memory; this would most certainly crash most people's systems. I'd be happy if we read, say, 10MB instead of 1MB to reduce the probability of a false detection, which should address those concerns. If we really do want to read a lot more, we should read in chunks. As Python's docs state:

To read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string (in text mode) or bytes object (in binary mode). size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory.
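
To illustrate, a chunked check could look like this (a sketch under the assumptions discussed here; the names are illustrative, not the PR's final code):

def contains_null_byte(filename, max_bytes=10 * 1024 * 1024, chunk_size=1024 * 1024):
    # Scan up to max_bytes of the file in fixed-size chunks so that
    # large files are never loaded into memory all at once.
    bytes_read = 0
    with open(filename, "rb") as f:
        while bytes_read < max_bytes:
            chunk = f.read(min(chunk_size, max_bytes - bytes_read))
            if not chunk:
                break
            if b"\x00" in chunk:
                return True
            bytes_read += len(chunk)
    return False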

@LysandreJik (Member, Author):

The solution currently implemented loads a maximum of 10MB into memory when calling git_add: it tracks large files before tracking binary files.

When tracking binary files, it only looks at files that are not yet tracked with LFS, which eliminates large files.

Instead of the 1MB limit you propose here, we could cap the read at 10MB; that cap would only come into play when auto_track_binary_files is called independently of git_add (which is a possibility!).

Contributor:

Yes, but you also want this method to be public, which means people can call it before having tracked large files.

I'm happy with your suggestion of 10MB.

@LysandreJik (Member, Author):

Sounds good, thanks for your review. Addressed in cbfdce5.

@adrinjalali merged commit 45aec8e into huggingface:main on Apr 21, 2022