-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow reading FileInfo from a dummy file instead of the file itself #62
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution!
It's a very interesting idea I would definitely pursue. I would probably go even further and make it support hybrid repositories, this way a repository can contain both files and dummy files. I can imagine many use-cases for that.
scan/scan.go
Outdated
d.modTime = f.ModTime() | ||
|
||
if dummyFile { | ||
raw, err := ioutil.ReadFile(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case someone makes a configuration error, we don't want mirrorbits to load a multi-gigabyte file into memory. Making a custom reader that will do some basic sanity check would be much better. Even better, if the file is not a "dummyFile", then fallback to the standard method. This way, we can have an hybrid repository.
scan/scan.go
Outdated
if err != nil { | ||
fmt.Println(err.Error()) | ||
fmt.Println("Failed to read file, DELETING!") | ||
os.Remove(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is way too dangerous and can result in a massive data loss if the redirector is also the main rsync server. Just ignore the file and display an error to the console instead.
scan/scan.go
Outdated
var c []DummyFile | ||
err = json.Unmarshal(raw, &c) | ||
if err != nil { | ||
log.Errorf("error decoding json: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems you're mixing space and tabs. Please go fmt
your code.
scan/scan.go
Outdated
if dummyFile { | ||
raw, err := ioutil.ReadFile(path) | ||
if err != nil { | ||
fmt.Println(err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use log.Error
instead of fmt.
scan/scan.go
Outdated
d.size = dfData.Size | ||
d.modTime, err = time.Parse("2006-01-02 15:04:05.999999999 -0700 MST", dfData.ModTime) | ||
if err != nil { | ||
fmt.Println(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use log.Error
instead of fmt.
OK, so I adjusted the fmt calls, and allowed fallback to "normal" mode. For reading I added a basic check if the file is > 1MB (which no JSON file should ever be considering it contains only 5 values) and I ran it through gofmt -w scan/scan.go. For a custom reader I lack the knowledge (thats my first GO project ever), however, I wouldn't see what else we can check for. Only thing that we might want to consider is removing the log in line 259, as on a hybrid repo that would be "intended" I'd guess. Havn't run it in production yet either, will see if I get some time to do that on the weekend. |
482a274
to
329e4a8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your changes! Few more comments below.
scan/scan.go
Outdated
@@ -237,11 +247,51 @@ func (s *sourcescanner) walkSource(conn redis.Conn, path string, f os.FileInfo, | |||
return nil, nil | |||
} | |||
|
|||
var dfData DummyFile | |||
dummyFile := GetConfig().DummyFiles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of re-reading the configuration within the scan loop I would do this once in ScanSource()
and store the result in the sourcescanner{}
struct. Therefore if the configuration is reloaded while a scan is in progress, it won't change its behavior until the next scan.
scan/scan.go
Outdated
|
||
if dummyFile { | ||
fi, _ := os.Stat(d.path) | ||
if fi.Size() > 1048576 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Magic numbers appearing in the middle of the code are usually bad. One way to improve this specific piece of code is to add to the util
package (or even util
subpackage) something like:
type ByteSize int64
const (
_ = iota
KB ByteSize = 1 << (10 * iota)
MB
GB
TB
PB
)
Therefore your check will end up looking like this instead: fi.Size() > 1 * utils.MB
.
That being said, do we really want to enforce / hardcode a limit? What if someday we'll have much bigger (and valid) files to parse? One way I see is to use a json.Decoder instead. It takes an io.Reader
and therefore will only read the necessary portion of files (in most cases, way less than 1MB). It will also have the advantage of being faster.
scan/scan.go
Outdated
if dummyFile { | ||
fi, _ := os.Stat(d.path) | ||
if fi.Size() > 1048576 { | ||
log.Errorf("File is bigger than 1MB, falling back to normal") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really an error in case of hybrid dummy file support?
scan/scan.go
Outdated
raw, err := ioutil.ReadFile(path) | ||
if err != nil { | ||
log.Errorf(err.Error()) | ||
log.Errorf("Failed to read file, falling back to normal") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Errorf
uses a Printf-like syntax, therefore your can replace this by:
log.Errorf("Failed to read file: %s", err.Error())
But again, do we want to display an error in case of an hybrid repository? What about log.Debugf
?
I have yet no way to test it in production, therefore if you can do it, it would definitely speedup the merge. Thanks again for your work :) |
cdfbd78
to
d623b7b
Compare
The system is using a hybrid mode. This means, that if a file turns out to be an invalid json file, it will be handled "normally"
Allright, every requirement should be fullfilled now. I made sure that hybrid is working. The only close edge case I could image is, that someone uploads a json file with the exact same struct we use (lets say this: https://paste.myself5.de/d0ULubrZJL.json) in THAT case, the json load process ignores the second entry and just loads the values from the first entry. We might want to consider adding a fallback and moving to "normal" mode if a file like that gets detected, up to you. EDIT: Moved our server to the hybrid mode now, will report back if we get any complaints. |
Hello, I don't understand the following statement:
|
Assuming we upload a file like the one I linked, we currently threat it as a dummy file, and read the information we need from the first entry in that json array. We should decide if we want to keep that behaviour, or if we want to treat the file as a "normal" file in case of more than one existing json entry. |
Yes, indeed. Another option would be to name these dummy files with a specific extension. For instance, for a file named Opinions? |
why not store the metadata for each file in the DB? |
There are also file formats to store file metadata. This is commonly used in backup software. |
This would also allow O(1) or O(log n) access to the metadata and avoid directory traversals. |
On our (see #58) server setup, the mirrorbits server is not serving any files and is only responsible for spreadding the load. Effectively, this means it only needs the files for checksum and size generation, which produces a lot of wasted storage.
we use https://github.com/Myself5/mirrorbits_dummycreator for the dummy creation, its effectively just a json table containing the file information and therefore saves us a lot of storage (5.8MB vs 217GB).
Used that setup for the past Week on our production servers without issues.