Allow reading FileInfo from a dummy file instead of the file itself #62

Myself5 · 2017-10-03T15:06:39Z

On our server setup (see #58), the mirrorbits server does not serve any files itself and is only responsible for spreading the load. Effectively, it only needs the files to generate checksums and sizes, which wastes a lot of storage.

We use https://github.com/Myself5/mirrorbits_dummycreator to create the dummy files. Each one is effectively just a JSON table containing the file information, which saves us a lot of storage (5.8MB vs 217GB).

We have used that setup on our production servers for the past week without issues.

etix

Thanks for your contribution!

It's a very interesting idea that I would definitely pursue. I would probably go even further and make it support hybrid repositories, so that a repository can contain both real files and dummy files. I can imagine many use cases for that.

etix · 2017-10-17T13:45:50Z

scan/scan.go

-	d.modTime = f.ModTime()
+
+	if dummyFile {
+		raw, err := ioutil.ReadFile(path)


In case someone makes a configuration error, we don't want mirrorbits to load a multi-gigabyte file into memory. A custom reader that does some basic sanity checks would be much better. Even better, if the file is not a "dummyFile", fall back to the standard method. This way, we can have a hybrid repository.

etix · 2017-10-17T13:49:01Z

scan/scan.go

+		if err != nil {
+			fmt.Println(err.Error())
+			fmt.Println("Failed to read file, DELETING!")
+			os.Remove(path)


This is way too dangerous and can result in massive data loss if the redirector is also the main rsync server. Just ignore the file and log an error to the console instead.

etix · 2017-10-17T13:50:22Z

scan/scan.go

+		var c []DummyFile
+		err = json.Unmarshal(raw, &c)
+		if err != nil {
+    		log.Errorf("error decoding json: %v", err)


It seems you're mixing spaces and tabs. Please go fmt your code.

etix · 2017-10-17T13:53:30Z

scan/scan.go

+	if dummyFile {
+		raw, err := ioutil.ReadFile(path)
+		if err != nil {
+			fmt.Println(err.Error())


Use log.Error instead of fmt.

etix · 2017-10-17T13:53:48Z

scan/scan.go

+		d.size =  dfData.Size
+		d.modTime, err = time.Parse("2006-01-02 15:04:05.999999999 -0700 MST", dfData.ModTime)
+		if err != nil {
+			fmt.Println(err)


Use log.Error instead of fmt.

Myself5 · 2017-10-17T19:09:36Z

OK, so I adjusted the fmt calls and allowed falling back to "normal" mode. For reading, I added a basic check for whether the file is larger than 1MB (which no JSON file here should ever be, considering it contains only 5 values), and I ran the code through gofmt -w scan/scan.go. I lack the knowledge for a custom reader (that's my first Go project ever), and I don't see what else we could check for anyway.

The only thing we might want to consider is removing the log call in line 259, since on a hybrid repo that case would be "intended", I'd guess.

Haven't run it in production yet either; I will see if I get some time to do that on the weekend.

etix

Thanks for your changes! A few more comments below.

etix · 2017-10-19T08:44:17Z

scan/scan.go

@@ -237,11 +247,51 @@ func (s *sourcescanner) walkSource(conn redis.Conn, path string, f os.FileInfo,
 		return nil, nil
 	}

+	var dfData DummyFile
+	dummyFile := GetConfig().DummyFiles


Instead of re-reading the configuration within the scan loop, I would do this once in ScanSource() and store the result in the sourcescanner{} struct. That way, if the configuration is reloaded while a scan is in progress, the behavior won't change until the next scan.
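
Editor's sketch of the snapshot pattern suggested here, using stand-in types. `config`, `getConfig`, `setConfig`, `sourceScanner` and `newSourceScanner` are hypothetical and only mimic the shape of mirrorbits' `GetConfig()` and `sourcescanner{}`; the real definitions are not shown in this thread.

```go
package main

import "sync"

// config is a hypothetical stand-in for the mirrorbits configuration.
type config struct {
	DummyFiles bool
}

var (
	cfgMu sync.RWMutex
	cfg   config
)

// getConfig returns a copy of the current configuration.
func getConfig() config {
	cfgMu.RLock()
	defer cfgMu.RUnlock()
	return cfg
}

// setConfig simulates a configuration reload.
func setConfig(c config) {
	cfgMu.Lock()
	defer cfgMu.Unlock()
	cfg = c
}

// sourceScanner keeps the settings that were valid when the scan started.
type sourceScanner struct {
	dummyFiles bool
}

// newSourceScanner reads the configuration once, at the start of a scan.
func newSourceScanner() *sourceScanner {
	return &sourceScanner{dummyFiles: getConfig().DummyFiles}
}

// useDummy is what the per-file walk function would consult
// instead of calling getConfig() for every file.
func (s *sourceScanner) useDummy() bool {
	return s.dummyFiles
}
```

A reload that happens during a scan changes `cfg` but not the scanner that is already running; the next call to `newSourceScanner` picks up the new value.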

etix · 2017-10-19T09:03:07Z

scan/scan.go


+	if dummyFile {
+		fi, _ := os.Stat(d.path)
+		if fi.Size() > 1048576 {


Magic numbers appearing in the middle of the code are usually bad. One way to improve this specific piece of code is to add to the util package (or even util subpackage) something like:

type ByteSize int64

const (
	_ = iota
	KB ByteSize = 1 << (10 * iota)
	MB
	GB
	TB
	PB
)

Therefore your check will end up looking like this instead: fi.Size() > 1 * utils.MB.

That being said, do we really want to enforce / hardcode a limit? What if someday we'll have much bigger (and valid) files to parse? One way I see is to use a json.Decoder instead. It takes an io.Reader and therefore will only read the necessary portion of files (in most cases, way less than 1MB). It will also have the advantage of being faster.

etix · 2017-10-19T09:04:35Z

scan/scan.go

+	if dummyFile {
+		fi, _ := os.Stat(d.path)
+		if fi.Size() > 1048576 {
+			log.Errorf("File is bigger than 1MB, falling back to normal")


Is it really an error in case of hybrid dummy file support?

etix · 2017-10-19T09:26:04Z

scan/scan.go

+		raw, err := ioutil.ReadFile(path)
+		if err != nil {
+			log.Errorf(err.Error())
+			log.Errorf("Failed to read file, falling back to normal")


Errorf uses a Printf-like syntax, so you can replace this with:

log.Errorf("Failed to read file: %s", err.Error())

But again, do we want to display an error in the case of a hybrid repository? What about log.Debugf?

etix · 2017-10-19T09:36:22Z

Haven't run it in production yet either; I will see if I get some time to do that on the weekend.

I have no way to test it in production yet, so if you can do it, it would definitely speed up the merge.

Thanks again for your work :)

Myself5 · 2017-10-19T12:20:39Z

All right, every requirement should be fulfilled now. I made sure that hybrid mode is working. The only edge case I can imagine is that someone uploads a JSON file with the exact same structure we use (let's say this: https://paste.myself5.de/d0ULubrZJL.json). In that case, the JSON loading process ignores the second entry and just loads the values from the first entry. We might want to consider adding a fallback that switches to "normal" mode if a file like that is detected. Up to you.

EDIT: Moved our server to hybrid mode now; will report back if we get any complaints.
EDIT2: 3 days into it, and not a single Error entry in the log so far. Seems to be stable.

etix · 2017-10-31T10:24:54Z

Hello,

I don't understand the following statement:

The only edge case I can imagine is that someone uploads a JSON file with the exact same structure we use (let's say this: https://paste.myself5.de/d0ULubrZJL.json). In that case, the JSON loading process ignores the second entry and just loads the values from the first entry.

Myself5 · 2017-10-31T10:33:01Z

Assuming we upload a file like the one I linked, we currently treat it as a dummy file and read the information we need from the first entry in that JSON array. We should decide whether we want to keep that behaviour or treat the file as a "normal" file when it contains more than one JSON entry.
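
Editor's sketch of the stricter of the two behaviours described here: accept a file as a dummy only when its array holds exactly one entry, and otherwise report that it should be scanned as a normal file. `parseSingleEntry` is a hypothetical name, and `DummyFile` only carries the two fields visible in the diff.

```go
package main

import "encoding/json"

// DummyFile holds the two fields visible in the diff (Size, ModTime).
type DummyFile struct {
	Size    int64
	ModTime string
}

// parseSingleEntry returns the entry and true only when data is a JSON
// array with exactly one element. In every other case it returns false,
// which the caller can take as "handle this as a normal file".
func parseSingleEntry(data []byte) (DummyFile, bool) {
	var entries []DummyFile
	if err := json.Unmarshal(data, &entries); err != nil || len(entries) != 1 {
		return DummyFile{}, false
	}
	return entries[0], true
}
```

This does not catch a real single-entry JSON file that happens to share the structure; it only narrows the ambiguity.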

etix · 2017-10-31T12:42:28Z

Yes, indeed. Another option would be to name these dummy files with a specific extension. For instance, for a file named vlc-2.2.6-win32.exe that we want to distribute, mirrorbits can check whether a file named vlc-2.2.6-win32.exe.mdata exists. This file would of course contain the appropriate JSON. In addition, hybrid repositories would be easier to manage, since you can easily differentiate between complete files and metadata-only files.

Opinions?

darix · 2020-01-13T16:19:20Z

why not store the metadata for each file in the DB?

ott · 2022-09-17T02:09:06Z

There are also file formats to store file metadata. This is commonly used in backup software.

ott · 2022-09-17T02:09:47Z

why not store the metadata for each file in the DB?

This would also allow O(1) or O(log n) access to the metadata and avoid directory traversals.

etix requested changes Oct 17, 2017

Myself5 force-pushed the master branch from 3e316c2 to d4fb8af Compare October 17, 2017 19:06

Myself5 force-pushed the master branch 2 times, most recently from 482a274 to 329e4a8 Compare October 17, 2017 19:19

etix requested changes Oct 19, 2017

Myself5 force-pushed the master branch 9 times, most recently from cdfbd78 to d623b7b Compare October 19, 2017 12:13

Allow reading FileInfo from a dummy file instead of the file itself

85f6df4

The system is using a hybrid mode. This means that if a file turns out to be an invalid JSON file, it will be handled "normally".

Myself5 force-pushed the master branch from d623b7b to 85f6df4 Compare October 19, 2017 12:14
