Auto track binary files #828

Merged
9 commits merged into huggingface:main on Apr 21, 2022

Conversation

@LysandreJik (Member) commented Apr 8, 2022

Closes #825.
Closes #687.

This adds the option to automatically track binary files. It relies on the following two assumptions:

  • Binary files usually cannot be read as text with the open file object reader; attempting to do so raises a UnicodeDecodeError.
  • Some binary files can nevertheless be read even with the open keyword. In that case, we search for the null character, which should only be present in binary files and which appears to be what grep uses to identify binary files (we use grep in the backend to look for these binary files). A minimal sketch of this logic follows the list.
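
A minimal sketch of the detection logic described above (illustrative only; the PR's actual is_binary_file may differ in its details):

def is_binary_file(filename: str) -> bool:
    try:
        # Most binary files fail to decode as UTF-8 text.
        with open(filename, "r", encoding="utf-8") as f:
            content = f.read()
        # Files that decode fine may still be binary: look for the null
        # character, the same heuristic grep uses for binary detection.
        return "\x00" in content
    except UnicodeDecodeError:
        return True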

This PR adds a new method, auto_track_binary_files, which automatically tracks binary files.

One change could be seen as breaking: the auto_lfs_track keyword of the git_add method previously tracked only large files, and it now tracks both large files and binary files.
I left it as is because I believe it's more of a bugfix than a breaking change: all binary files would previously have been rejected, resulting in errors.

If this is not deemed acceptable, the deprecation cycle would be to:

  • add auto_track_large_files and auto_track_binary_files keyword arguments to git_add so each can be toggled manually, and deprecate auto_lfs_track with a mention that it will soon handle both large and binary files.

The tests have been added directly to those that test for large files, as this results in a significant speedup compared to splitting them across two tests.

@HuggingFaceDocBuilderDev commented Apr 8, 2022

The documentation is not available anymore as the PR was closed or merged.

@LysandreJik (Member, Author) commented Apr 8, 2022

This seems like a partial fix to the issue, as this code sample by @osanseviero will still fail:

from huggingface_hub import Repository, HfApi
import os

repo_url = HfApi().create_repo("test-bin-bug")
repo = Repository("local_repo", clone_from=repo_url)
# Write a 1MB file consisting entirely of null bytes: binary content,
# but below the 10MB threshold for automatic large-file tracking.
with open(os.path.join("local_repo", "file"), "wb") as out:
    out.truncate(1024 * 1024)
repo.push_to_hub("Commit #1")

Trying to find a cross-platform approach that handles both.

@LysandreJik (Member, Author) commented:

Checking for the null character seems to do the trick.

@coyotte508 (Member) left a comment:

Maybe we can try to read only a certain amount of data? Like f.read(512) or f.read(100_000)? Or maybe first filter out files that are 10+MB (if that's not already done ^^) before checking whether they're binary.

(By the way, the check on the API side (when receiving files to upload) is on the first 512 bytes of data, with this code: https://github.com/gjtorikian/isBinaryFile/blob/main/src/index.ts#L158-L256; the check in the pre-receive hook is indeed done with grep.)
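
For illustration, a capped read along these lines might look like this (a sketch, not the code the PR ends up with):

def is_binary_file(filename: str, read_limit: int = 512) -> bool:
    # Inspect only the first read_limit bytes instead of the whole file.
    with open(filename, "rb") as f:
        sample = f.read(read_limit)
    return b"\x00" in sample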

@nateraw (Contributor) commented Apr 8, 2022

Thanks for the quick fix! This solved my issue here: #825. The issue was brought up during our HugGAN event, and it's affecting other users, so I'd love to get this merged as soon as we can 😄

One of the affected users also tried it out, and it solved their problem!

@julien-c (Member) commented Apr 9, 2022

This proposed fix makes sense to me!

Yes, looking for a null byte character is a common (simple) heuristic for binary files, and AFAIK it is used in text editors etc.

I agree with @coyotte508's other points

@osanseviero (Member) left a comment:

Thanks for this PR!! 🚀 🚀

@@ -1113,35 +1129,48 @@ def test_auto_track_large_files(self):
# This content is 20MB (over 10MB)
large_file = [100] * int(4e6)

# This content is binary (contains the null character)
binary_file = "\x00\x00\x00\x00"
Member:

I would consider moving this to another test to keep this one focused on large files (or rename this test, although I think separate tests would be better).

@LysandreJik (Member, Author):

Right, and what I mentioned in the description is that this effectively doubles the time spent on these tests, which is already significant. We're currently sitting at 10+ minutes for the test suite, which isn't ideal.

I'm open to moving it, but I'd like to emphasize that the suite will likely end up taking even longer. Let me know whether this compromise is okay with you; if not, I'll move it to its own separate test.

Member:

Good point! Let's keep it as is, then 😄

LysandreJik and others added 2 commits April 12, 2022 13:32
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
@coyotte508 (Member) left a comment:

Ok for me =)

@adrinjalali (Contributor) left a comment:

Thanks for the PR @LysandreJik

@@ -226,6 +226,27 @@ def is_git_ignored(filename: Union[str, Path]) -> bool:
return is_ignored


def is_binary_file(filename: Union[str, Path]) -> bool:
Contributor:

Other than reading the first few bytes, I think this answer gives a more accurate way of checking whether a file is binary: https://stackoverflow.com/a/7392391/2536294
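
For reference, a minimal sketch of the heuristic from that answer (the helper name looks_binary is illustrative):

# Bytes commonly found in text: BEL, BS, TAB, LF, FF, CR, ESC, plus all
# printable bytes except DEL (0x7f).
textchars = bytearray({7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x100)) - {0x7F})

def looks_binary(data: bytes) -> bool:
    # If any bytes remain after stripping text-like bytes, call it binary.
    return bool(data.translate(None, textchars))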

@LysandreJik (Member, Author):

I implemented what you proposed in 9fe6bb6; let me know if that's what you had in mind. Cc @coyotte508

# This content is binary (contains the null character)
binary_file = "\x00\x00\x00\x00"

with open(f"{WORKING_REPO_DIR}/non_binary_file.txt", "w+") as f:
Contributor:

We shouldn't write to the working dir; we should work with Python's temp files and folders.

@LysandreJik (Member, Author):

The WORKING_REPO_DIR is a folder that is created at the start of the test and reset at the beginning of every test; see here:

def setUp(self):
    if os.path.exists(WORKING_REPO_DIR):
        shutil.rmtree(WORKING_REPO_DIR, onerror=set_write_permission_and_retry)
    logger.info(
        f"Does {WORKING_REPO_DIR} exist: {os.path.exists(WORKING_REPO_DIR)}"
    )
    self.REPO_NAME = repo_name()
    self._repo_url = self._api.create_repo(
        repo_id=self.REPO_NAME, token=self._token
    )
    self._api.upload_file(
        path_or_fileobj=BytesIO(b"some initial binary data: \x00\x01"),
        path_in_repo="random_file.txt",
        repo_id=f"{USER}/{self.REPO_NAME}",
        token=self._token,
    )

This isn't the working directory in which the test is run, but a folder nested in the fixtures folder:

WORKING_REPO_DIR = os.path.join(
    os.path.dirname(os.path.abspath(__file__)), "fixtures/working_repo_2"
)

binary_file = "\x00\x00\x00\x00"

# Test nested gitignores
os.makedirs(f"{WORKING_REPO_DIR}/directory")
Contributor:

same here

@LysandreJik (Member, Author):

Same answer as above :)

@adrinjalali (Contributor) left a comment:

Otherwise LGTM.

I'll open a separate issue about the fixtures folder.

"""
try:
with open(filename, "rb") as f:
content = f.read()
Contributor:

Suggested change:
-            content = f.read()
+            content = f.read(1024)  # or 512 if we want to be consistent with the backend

@LysandreJik (Member, Author):

This was discussed above: #828 (comment)

Do you disagree with the conclusion?

Contributor:

Didn't see that. Yeah, I don't think we should ever read 11GB of data into memory; this would most certainly crash most people's systems. I'd be happy if we read, say, 10MB instead of 1MB to reduce the probability of a false detection, which should address those concerns. If we really do want to read a lot more, we should read in chunks. As Python's docs state:

To read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string (in text mode) or bytes object (in binary mode). size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory.
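
To illustrate, a chunked check could look like this (a sketch under the assumptions discussed here; the names are illustrative, not the PR's final code):

def contains_null_byte(filename, max_bytes=10 * 1024 * 1024, chunk_size=1024 * 1024):
    # Scan up to max_bytes of the file in fixed-size chunks so that
    # large files are never loaded into memory all at once.
    bytes_read = 0
    with open(filename, "rb") as f:
        while bytes_read < max_bytes:
            chunk = f.read(min(chunk_size, max_bytes - bytes_read))
            if not chunk:
                break
            if b"\x00" in chunk:
                return True
            bytes_read += len(chunk)
    return False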

@LysandreJik (Member, Author):

The solution currently implemented loads a maximum of 10MB into memory when calling git_add: it tracks large files before tracking binary files.

When tracking binary files, it only looks at files that are not yet tracked with LFS, which eliminates large files.

Instead of the 1MB limit you propose here, we could cap the read at 10MB; that cap would only come into play when auto_track_binary_files is called independently of git_add (which is a possibility!).

Contributor:

Yes, but you also want this method to be public, which means people can call it before having tracked large files.

I'm happy with your suggestion of 10MB.

@LysandreJik (Member, Author):

Sounds good, thanks for your review. Addressed in cbfdce5.

@adrinjalali merged commit 45aec8e into huggingface:main on Apr 21, 2022