Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow reading FileInfo from a dummy file instead of the file itself #62

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Myself5
Copy link

@Myself5 Myself5 commented Oct 3, 2017

On our (see #58) server setup, the mirrorbits server is not serving any files and is only responsible for spreadding the load. Effectively, this means it only needs the files for checksum and size generation, which produces a lot of wasted storage.

we use https://github.com/Myself5/mirrorbits_dummycreator for the dummy creation, its effectively just a json table containing the file information and therefore saves us a lot of storage (5.8MB vs 217GB).

Used that setup for the past Week on our production servers without issues.

Copy link
Owner

@etix etix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution!

It's a very interesting idea I would definitely pursue. I would probably go even further and make it support hybrid repositories, this way a repository can contain both files and dummy files. I can imagine many use-cases for that.

scan/scan.go Outdated
d.modTime = f.ModTime()

if dummyFile {
raw, err := ioutil.ReadFile(path)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case someone makes a configuration error, we don't want mirrorbits to load a multi-gigabyte file into memory. Making a custom reader that will do some basic sanity check would be much better. Even better, if the file is not a "dummyFile", then fallback to the standard method. This way, we can have an hybrid repository.

scan/scan.go Outdated
if err != nil {
fmt.Println(err.Error())
fmt.Println("Failed to read file, DELETING!")
os.Remove(path)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is way too dangerous and can result in a massive data loss if the redirector is also the main rsync server. Just ignore the file and display an error to the console instead.

scan/scan.go Outdated
var c []DummyFile
err = json.Unmarshal(raw, &c)
if err != nil {
log.Errorf("error decoding json: %v", err)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you're mixing space and tabs. Please go fmt your code.

scan/scan.go Outdated
if dummyFile {
raw, err := ioutil.ReadFile(path)
if err != nil {
fmt.Println(err.Error())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use log.Error instead of fmt.

scan/scan.go Outdated
d.size = dfData.Size
d.modTime, err = time.Parse("2006-01-02 15:04:05.999999999 -0700 MST", dfData.ModTime)
if err != nil {
fmt.Println(err)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use log.Error instead of fmt.

@Myself5
Copy link
Author

Myself5 commented Oct 17, 2017

OK, so I adjusted the fmt calls, and allowed fallback to "normal" mode. For reading I added a basic check if the file is > 1MB (which no JSON file should ever be considering it contains only 5 values) and I ran it through gofmt -w scan/scan.go. For a custom reader I lack the knowledge (thats my first GO project ever), however, I wouldn't see what else we can check for.

Only thing that we might want to consider is removing the log in line 259, as on a hybrid repo that would be "intended" I'd guess.

Havn't run it in production yet either, will see if I get some time to do that on the weekend.

@Myself5 Myself5 force-pushed the master branch 2 times, most recently from 482a274 to 329e4a8 Compare October 17, 2017 19:19
Copy link
Owner

@etix etix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your changes! Few more comments below.

scan/scan.go Outdated
@@ -237,11 +247,51 @@ func (s *sourcescanner) walkSource(conn redis.Conn, path string, f os.FileInfo,
return nil, nil
}

var dfData DummyFile
dummyFile := GetConfig().DummyFiles
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of re-reading the configuration within the scan loop I would do this once in ScanSource() and store the result in the sourcescanner{} struct. Therefore if the configuration is reloaded while a scan is in progress, it won't change its behavior until the next scan.

scan/scan.go Outdated

if dummyFile {
fi, _ := os.Stat(d.path)
if fi.Size() > 1048576 {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Magic numbers appearing in the middle of the code are usually bad. One way to improve this specific piece of code is to add to the util package (or even util subpackage) something like:

type ByteSize int64

const (
	_           = iota
	KB ByteSize = 1 << (10 * iota)
	MB
	GB
	TB
	PB
)

Therefore your check will end up looking like this instead: fi.Size() > 1 * utils.MB.

That being said, do we really want to enforce / hardcode a limit? What if someday we'll have much bigger (and valid) files to parse? One way I see is to use a json.Decoder instead. It takes an io.Reader and therefore will only read the necessary portion of files (in most cases, way less than 1MB). It will also have the advantage of being faster.

scan/scan.go Outdated
if dummyFile {
fi, _ := os.Stat(d.path)
if fi.Size() > 1048576 {
log.Errorf("File is bigger than 1MB, falling back to normal")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really an error in case of hybrid dummy file support?

scan/scan.go Outdated
raw, err := ioutil.ReadFile(path)
if err != nil {
log.Errorf(err.Error())
log.Errorf("Failed to read file, falling back to normal")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errorf uses a Printf-like syntax, therefore your can replace this by:

log.Errorf("Failed to read file: %s", err.Error())

But again, do we want to display an error in case of an hybrid repository? What about log.Debugf?

@etix
Copy link
Owner

etix commented Oct 19, 2017

Havn't run it in production yet either, will see if I get some time to do that on the weekend.

I have yet no way to test it in production, therefore if you can do it, it would definitely speedup the merge.

Thanks again for your work :)

@Myself5 Myself5 force-pushed the master branch 9 times, most recently from cdfbd78 to d623b7b Compare October 19, 2017 12:13
The system is using a hybrid mode. This means, that if a file turns out to be an invalid json file, it will be handled "normally"
@Myself5
Copy link
Author

Myself5 commented Oct 19, 2017

Allright, every requirement should be fullfilled now. I made sure that hybrid is working. The only close edge case I could image is, that someone uploads a json file with the exact same struct we use (lets say this: https://paste.myself5.de/d0ULubrZJL.json) in THAT case, the json load process ignores the second entry and just loads the values from the first entry. We might want to consider adding a fallback and moving to "normal" mode if a file like that gets detected, up to you.

EDIT: Moved our server to the hybrid mode now, will report back if we get any complaints.
EDIT2: 3 Days into it, and not a single Error entry in the log so far. Seems to be stable.

@etix
Copy link
Owner

etix commented Oct 31, 2017

Hello,

I don't understand the following statement:

The only close edge case I could image is, that someone uploads a json file with the exact same struct we use (lets say this: https://paste.myself5.de/d0ULubrZJL.json) in THAT case, the json load process ignores the second entry and just loads the values from the first entry.

@Myself5
Copy link
Author

Myself5 commented Oct 31, 2017

Assuming we upload a file like the one I linked, we currently threat it as a dummy file, and read the information we need from the first entry in that json array. We should decide if we want to keep that behaviour, or if we want to treat the file as a "normal" file in case of more than one existing json entry.

@etix
Copy link
Owner

etix commented Oct 31, 2017

Yes, indeed. Another option would be to name these dummy files with a specific extension. For instance, for a file named vlc-2.2.6-win32.exe that we want to distribute, mirrorbits can check if a file vlc-2.2.6-win32.exe.mdata exists. This file would of course contains the appropriate json. In addition doing hybrid repositories can be easier to manage since you can easily differentiate between complete files and metadata-only files.

Opinions?

@darix
Copy link

darix commented Jan 13, 2020

why not store the metadata for each file in the DB?

@ott
Copy link
Contributor

ott commented Sep 17, 2022

There are also file formats to store file metadata. This is commonly used in backup software.

@ott
Copy link
Contributor

ott commented Sep 17, 2022

why not store the metadata for each file in the DB?

This would also allow O(1) or O(log n) access to the metadata and avoid directory traversals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants