moosefs metalogger #85

Open · linux-ops opened this issue Dec 16, 2017 · 56 comments

Labels: data safety · documentation · PR welcome · question

@linux-ops

If I run the MooseFS metalogger on the master and on a standby node for mfsmaster, and the metadata download interval is set to 24 hours, can up to 24 hours of data be lost if the primary node fails before the next download? I know you can recover using the changelogs + metadata, but does that avoid losing data? When a standby mfsmaster is started from metalogger data, does it load a single changelog file, or does it load multiple changelog files based on the metadata timestamp?

@acid-maker added the question label on Apr 17, 2018
@guestisp

+1 for this question

@zcalusic (Contributor)

On a good network, metalogger is probably less than a second behind a master. Meaning, if you suddenly completely lose a master server and all its data, everything up to the last second of metadata will already be on the metalogger, ready to be instantiated as a new master.

Actually, I tried that in practice a long time ago, and it worked really well.

The 24-hour download you mention is probably the full metadata backup, but there are always new changelogs after it to cover the period between full metadata backups. Just make sure the metaloggers are up and running at all times.

@guestisp

On a good network, metalogger is probably less than a second behind a master. Meaning, if you suddenly completely lose a master server and all its data, everything up to the last second of metadata will already be on the metalogger, ready to be instantiated as a new master.

In some environments, losing even 1 second could be huge damage.
E.g. take any DB, hard-revert it to 1 second ago and see what happens to data consistency.

@zcalusic (Contributor)

@guestisp if your database can't survive a crash, may I advise you choose another one? Classic hard disks will keep much more than 1 second of data in their caches, and it will all be lost when the power suddenly goes off. Most databases will recover just fine even in such scenarios.

Then again, I wouldn't really advise running classic databases on MooseFS for many reasons, so we're getting offtopic here...

@guestisp

If your database can't survive a crash, may I advise you choose another one?
Classic hard disks will keep much more than 1 second of data in their caches, and it will all be lost when the power suddenly goes off. Most databases will recover just fine even in such scenarios.

That's why if you are using RAID, the disk cache is disabled.

Then again, I wouldn't really advise running classic databases on MooseFS for many reasons, so we're getting offtopic here...

So you wouldn't use MooseFS for VM image hosting? Because on many VMs there is a high chance that a database is running (for example, a basic WordPress site running on a VM, with a MySQL database on the same machine, all on top of MooseFS).

@zcalusic (Contributor)

Right now, no, I'm personally not using MooseFS for either VM images or databases.

Would I use it for VM images if need be? Probably yes; I've tested some loopback file systems on it, and it worked quite well, actually great. Would I put a mission-critical database that can't survive a crash on it? No.

@guestisp

Would I put a mission-critical database that can't survive a crash on it? No.

Neither would I.
But I think you'll agree with me that a proper HA system wouldn't have a 1-second delay between master and slave. Data should be consistent across all servers; that's the point of an HA system.
Otherwise you have a fault-tolerant system, where you can be back up and running in a couple of seconds and thus survive a disaster, but that is not a highly available system.

If you lost some data, then that data wasn't highly available.

Even MySQL replication (which is very unstable, imho) doesn't have this delay, even under a very heavy write workload. I've never seen an out-of-sync MySQL replica (except when there was an error, but that is another story). Under normal conditions, if a slave server is performing well (more or less equally to the master), it won't be out of sync. (In fact, you can point all reads to the slave server, because master and slaves are supposed to stay in sync forever.)

@zcalusic (Contributor)

Obviously we live in different worlds, 'cause I regularly see much longer delays in MySQL replication. Yes, on expensive and properly tuned hardware.

But, as you now started twisting my words, and claiming things I never said, this conversation is now over.

Good luck!

@guestisp

But, as you now started twisting my words, and claiming things I never said, this conversation is now over.

Please point out where that happened...

@4Dolio

4Dolio commented May 20, 2018

Back to the original topic. Metaloggers and slaved masters record changelogs, which should have all revisions up to the point at which the master is demoted or crashes. No losses should be expected unless you can demonstrate otherwise.

@guestisp

No losses should be expected unless you can demonstrate otherwise.

or unless the metalogger is out of sync. Being 1 second behind the master means data loss for sure on a medium-sized cluster with databases and VMs.

So, is this "less than a second behind" something we can expect, or something rare?

@pkonopelko (Member)

Guys, easy. It is definitely not 1 second, but much less. I will try to elaborate on this topic and how it is done in some time (later today).

Best,
Peter / MooseFS Team

@4Dolio

4Dolio commented May 20, 2018

In my experience, metalogging is real-time, with low or zero latency/loss.

@guestisp

In my experience, metalogging is real-time, with low or zero latency/loss.

👍 This is what I'm expecting.

@4Dolio

4Dolio commented May 20, 2018

Looking forward to @OXide94's actual low-level run-through! My layman's understanding is that all operations go through the master and are propagated to the loggers at nearly the same priority as its own metalog stream to disk. The odds of a single op not reaching the log stream are very low, and it might only happen during a master crash/fault, which itself I expect to be rare. Most transitions would be planned, and thus the metalog streams and promotions stay consistent.

With that said, other failures of disk or network etc. have their own failure modes, yet any non-faulty metalog consumer will still remain consistent up to, at worst, one operation.

To the best of my technical knowledge and practical experience, the master metadata changelog is real-time and, I believe, ACID-durable. I consider it to be, given a properly operating environment, allowing for the innumerable variations in specific deployments.

@pkonopelko (Member)

Hi all,

referring to the original question:

can up to 24 hours of data be lost if the primary node fails before the next download?

No. As @zcalusic wrote:

The 24-hour download you mention is probably the full metadata backup, but there are always new changelogs after it to cover the period between full metadata backups.

... and you can easily recover from any failure using the last saved "full" metadata file and the changelogs.

Regarding the rest of the discussion:

First of all, as @acid-maker announced here: #39 (comment), in MooseFS 4.x our full Master Server HA implementation is moved to MooseFS Community, so I will also be mentioning our HA (Leader Master Server and Follower Master Servers).

In general, what @4Dolio wrote is true. When a metadata change is performed, it is:

  1. Obviously applied to the (Leader) Master Server metadata structures in RAM,
  2. Written to the changelog on local disk and sent to the Follower Master Server(s) / Metalogger(s).

And now, regarding point 2, you may ask: "What happens first?". Metadata changes are pushed to a simple FIFO queue and simply sent / written. There is no easy answer to what happens first, as there are plenty of buffers involved - the kernel TCP buffer, the network interface buffer etc., and the disk also has its own physical buffer.

In MooseFS HA, when you lose the Leader Master Server, one of the Followers is automatically elected as the new Leader. This is quite a complicated process and has a lot of protections (e.g. split-brain prevention, so to have HA you need at least "integer half of Chunkservers + 1" up and running in order to elect a new Leader). But in this case, if you lose a Leader Master Server which returned status "OK" to the client for the last operation before the failure, the last metadata change most probably made it to the Follower(s), so it will be there. If status "OK" was not returned to the client, the operation will be reinitiated by the client.
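Just to make the buffering point concrete, here is a minimal, purely illustrative sketch (not MooseFS source; the entry format, queue and file names are invented) of a changelog line going through a FIFO queue and then both to a local changelog file and to a "follower" descriptor, with nothing on either path forcing a sync:

```c
/*
 * Hypothetical sketch (not MooseFS code): a changelog line is queued,
 * then appended to a local changelog file (stdio buffer -> page cache ->
 * disk cache) and written to a "follower" descriptor (socket send buffer
 * -> NIC -> peer). Here a pipe stands in for the TCP connection.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QSIZE 16

struct fifo {
    char *entries[QSIZE];
    int head, tail;
};

static void fifo_push(struct fifo *q, const char *line) {
    q->entries[q->tail % QSIZE] = strdup(line);
    q->tail++;
}

static char *fifo_pop(struct fifo *q) {
    if (q->head == q->tail) return NULL;
    return q->entries[q->head++ % QSIZE];
}

int main(void) {
    struct fifo q = { {0}, 0, 0 };
    int follower[2];                          /* pipe stands in for a TCP socket */
    if (pipe(follower) < 0) { perror("pipe"); return 1; }

    FILE *changelog = fopen("changelog.demo", "a");   /* local changelog file */
    if (!changelog) { perror("fopen"); return 1; }

    /* invented changelog entries - the format is illustrative only */
    fifo_push(&q, "1001: CREATE(/dir/file,f,...)\n");
    fifo_push(&q, "1002: WRITE(37,0,1)\n");

    char *line;
    while ((line = fifo_pop(&q)) != NULL) {
        fputs(line, changelog);                        /* local write, buffered */
        write(follower[1], line, strlen(line));        /* "network" write, buffered */
        free(line);
    }
    fclose(changelog);   /* flushes stdio, but still does not force a disk sync */
    return 0;
}
```

Which of the two copies physically lands first depends on stdio/page-cache flushing on one side and TCP/NIC buffering on the other, which is exactly why the question has no simple answer.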

But given that, even if something went terribly wrong (like a major power outage everywhere) and we are considering metadata loss (if any), the loss would be at the level of a single metadata change (and definitely not a second of changes). Moreover - in our HA implementation, after such a failure, you need to bring up all the Master Servers; they will "see" which one has the newest metadata version, and only that one can be elected as the new Leader.

Please also remember to distinguish between metadata operations and data operations, e.g. file creation vs. writing to a file.

I wrote a lot, so if something is unclear or you got lost somewhere, please ask - I will be happy to clarify / elaborate.

Thanks,
Peter / MooseFS Team

@4Dolio

4Dolio commented May 21, 2018

Ty for the great explanation. And for MooseFS itself.

@guestisp

@OXide94 Thank you for this clear explanation.

Some side questions: do you have any checksum on the communication between the master and the followers, just to be sure that a follower has a verbatim copy of all metadata (in other words: how can we be sure that our followers are 100% identical to the master?)? AFAIK, LizardFS has a checksum even between master and shadow. If the checksum from the shadow doesn't match, a full metadata dump is made from the master to the shadow, just to be sure replication is exact.

What about the case of multiple followers where some of them are out of sync? Does the automatic leader election pick the most up-to-date one based on the metadata version? For example, follower1, follower2 and follower3 are at revision 1245 and follower4, for some reason, is at 1244. The election should pick one of follower1, follower2 and follower3; follower4 must not be elected until it catches up.

@4Dolio

4Dolio commented May 21, 2018

I think TCP takes care of some of the metalog transmission consistency concerns. Still worth an official response, though, I suppose.

@linux-ops (Author)

The core of this issue is:

1. How is consistency between master and slave data ensured? (Like ZooKeeper's design/architecture.)
2. After a power failure, if the slave's log is used to recover, is there any data loss?

@acid-maker (Member)

@linux-ops

  1. In the master-metalogger model we can't check consistency, because the metalogger doesn't know anything - it can only write the changelog to its HDD. In the HA (Leader-Follower) model in version 3.x (PRO) we send some extra data together with the changes (like the number of files removed from trash, the inode number of a created inode, the number of the new chunk, etc.). This is usually enough to make sure that servers are in sync. In version 4.x (HA will be in the public code) we introduced extra CRC checks (once a day). Those CRCs mainly check that there are no bugs in the code that could lead to differences between servers - this is why we think that checking it once a day is more than enough.
  2. In the master-metalogger model you have to manually check which server has the more recent data and decide whether you want to use the metalogger or just start your master. In the HA model it is automatic. You just start all masters using the '-a' option, so they recover from changelogs as much as possible. When all are up (and only then) they connect to each other, check which one has the highest metadata version number and choose it as the new "ELECT".
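For illustration only (this is not MooseFS code, just the selection rule described in point 2, with made-up names and version numbers): among the recovered masters, only the one with the highest metadata version can become the ELECT:

```c
/*
 * Illustrative sketch: after all masters are started with '-a' and have
 * replayed their changelogs, pick the one with the highest recovered
 * metadata version as the only valid ELECT candidate.
 */
#include <stdio.h>
#include <stdint.h>

struct master {
    const char *name;
    uint64_t metaversion;   /* highest metadata version recovered from changelogs */
};

int main(void) {
    struct master cluster[] = {
        { "master-a", 1245 },   /* hypothetical versions */
        { "master-b", 1245 },
        { "master-c", 1244 },   /* behind - must not become ELECT */
    };
    int n = sizeof(cluster) / sizeof(cluster[0]);
    int best = 0;
    for (int i = 1; i < n; i++)
        if (cluster[i].metaversion > cluster[best].metaversion)
            best = i;
    printf("ELECT candidate: %s (metadata version %llu)\n",
           cluster[best].name, (unsigned long long)cluster[best].metaversion);
    return 0;
}
```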

@linux-ops (Author)

@acid-maker
Thanks!
In the non-PRO version, how do I ensure that, when the primary service fails, switching over (to the metalogger) does not lose data?

@linux-ops (Author)

@acid-maker DRBD?

@linux-ops (Author)

In a production environment, when the main server fails, I will manually promote the metalogger service, but I am very worried about whether there will be data loss - such as 1 second of data. I observed that the changelog is first applied on the master and then synchronized from the master to the metalogger service, so there can be a delay in between...

@guestisp

In version 4.x (HA will be in the public code) we introduced extra CRC checks (once a day). Those CRCs mainly check that there are no bugs in the code that could lead to differences between servers - this is why we think that checking it once a day is more than enough.

@acid-maker I'm not really sure that a CRC check once per day is enough. What if a transmission error or something else not related to MooseFS starts sending garbage to the slave? You won't notice until the daily CRC check, but in the meantime you are flooding the slave with garbage.

A CRC check on each transaction is useful; it would improve data consistency a lot, and with modern hardware it shouldn't hurt too much (you don't need SHA-256, just a simple CRC-8, which is very fast and more than enough to detect corruption).

@acid-maker (Member)

@guestisp Sending extra data (inode number etc.) is enough to make it safe. This is why they can't desync without noticing it. We have used this system successfully in production since 2005 (the same check was done by metarestore before HA). Any corruption that can get past this affects merely some unimportant file attributes. This is why we want to check the CRC (of the whole metadata) from time to time. 24h is the default - you can reconfigure it to check every hour, but it is not necessary. The only cases where we ever noticed CRC errors were software bugs (not caused by errors in the changelog itself but by wrong interpretation of the changelog by a follower). In such cases only single inodes were affected and the difference was minor - this can't lead to an avalanche of errors unnoticed by the follower.

The CRC in TCP etc. prevents flooding the slave (follower) with garbage. In case of any network issue the follower will desync.

Why should I add my own CRC to each changelog entry? That would duplicate the OS checksums. We use CRCs in I/O packets, but only because we already have them for chunks, so we can send them.

We don't want to check each change. What we would really want is to check the CRC of the whole metadata after applying a change - and that is not so simple, and not really needed.

@acid-maker (Member)

@linux-ops With a classic HDD you also lose a lot of data in case of a power outage. What do you expect from MFS? Use a better UPS or two independent power supplies.

@acid-maker (Member)

@linux-ops And next time use English please.

@guestisp

@acid-maker it's not clear to me why sending extra data like the inode number should prevent data corruption in the metadata; could you please elaborate on this?

@pkonopelko (Member)

pkonopelko commented May 30, 2018

it's not clear to me why sending extra data like the inode number should prevent data corruption in the metadata; could you please elaborate on this?

As @acid-maker wrote:

This is usually enough to make sure that servers are in sync.

How? Because this "extra" or "excess" information is used on the Followers to check whether the operation was performed correctly. So when the Leader performs an operation and sends the changelog entry to the Followers, they perform the same operation in their own metadata structures. For example - when an mkdir (or rather CREATE()) operation is performed, the Followers also get the CREATE() command. The redundant information sent with the changelog entry is the i-node number (the Followers do not need this information to perform the operation). It is then checked by the Followers, and if the created i-node number differs, the function which applies changelog entries returns MFS_ERROR_MISMATCH, the Follower becomes DESYNCED and downloads all metadata from the Leader.

It works similarly for other operations - e.g. for an EMPTYTRASH() changelog entry the extra info sent is the number of removed i-nodes; for WRITE() it is the number (ID) of the newly created chunk, etc. In each case this excess information is checked (otherwise there would be no point in sending it).
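A tiny conceptual sketch of that verification idea (not the MooseFS implementation; the names and constants are made up, with MY_MISMATCH standing in for MFS_ERROR_MISMATCH): the Follower re-executes CREATE() and compares its own result with the redundant i-node number received from the Leader:

```c
/*
 * Conceptual sketch only: the Leader sends the changelog entry together
 * with the i-node number it assigned; the Follower re-executes the
 * operation and compares its own result. A mismatch means the metadata
 * structures have diverged, so the Follower marks itself DESYNCED.
 */
#include <stdio.h>
#include <stdint.h>

#define MY_OK        0
#define MY_MISMATCH  1   /* stands in for MFS_ERROR_MISMATCH */

static uint32_t next_inode = 5000;  /* follower's own inode allocator (toy) */

/* follower applies CREATE and checks the leader's redundant inode number */
static int apply_create(const char *name, uint32_t leader_inode) {
    uint32_t my_inode = next_inode++;   /* what the follower allocated itself */
    printf("CREATE(%s): leader=%u follower=%u\n", name, leader_inode, my_inode);
    return (my_inode == leader_inode) ? MY_OK : MY_MISMATCH;
}

int main(void) {
    /* in sync: the follower allocates the same number the leader reports */
    if (apply_create("/a", 5000) == MY_MISMATCH)
        printf("-> DESYNCED, download full metadata from leader\n");
    /* simulated divergence: the leader reports a different inode number */
    if (apply_create("/b", 6001) == MY_MISMATCH)
        printf("-> DESYNCED, download full metadata from leader\n");
    return 0;
}
```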

Best regards,
Peter / MooseFS Team

PS: Checksumming everything here on the fly would really be "overkill", because it is CPU-consuming. Please keep in mind (again, as @acid-maker wrote) that we really do care about data consistency. There will always be a "competition" between speed and safety, and we do everything we can to keep a good balance between these two important aspects. And yes - sometimes maintaining this balance is not a piece of cake :)

Remember that in MooseFS, data safety and consistency always have a higher priority than speed. However, in some cases there is no justification for adding dozens of inefficient protections against scenarios where simple solutions are good and just work (and they also help keep efficiency at a good level). This is something which differentiates MooseFS from other solutions - we usually spend more time thinking something through and looking at it from many perspectives before implementation.

@acid-maker (Member)

@guestisp: I think that @OXide94 described it very well, but I can add something to it.

This is how I see it. There are at least four levels of possible errors:

  1. corrupted data on a chunkserver
  2. corrupted chunk-id chains in an inode
  3. corrupted inode numbers in a folder (corrupted inode tree)
  4. corrupted attributes (uid, gid, mode, atime, mtime, etc.)

First we should ask the question - how important are they?

  1. This is really serious, but it is covered by CRC32, which is not "just a sum of bytes" (see the sketch after this list). Why do we do that? Because there could be problems with the filesystem on the hard drives used by the CS (especially after a power outage).

  2. This is also very serious. Bad assignment of a chunk id to an inode's chain will lead to serious data corruption (you will see the content of one file as part of another). This is covered by sending the chunk id with each WRITE changelog entry as a check - this is why sending the chunk id with WRITE is really important.

  3. This is a little less serious, because even with a totally corrupted tree you will still have your files intact, but you have probably experienced a case in your life where filesystem corruption moved lots of files to 'lost+found' (I did once). Searching there for your precious files is a horrible task. But do not worry, this case is also covered in MFS by sending the inode number with all CREATE, UNLINK etc. commands (commands that alter the tree structure). Why do we do that? Because a simple mistake here can lead to chaos in your inode tree. Also, a single mistake (for example - a hypothetical bug in the code causes one more inode to be marked as free after trash removal on the LEADER than on the FOLLOWER) would, without such extra inode numbers, stay undetected for a long time. Imagine then a scenario where you create a file and write to it: on the LEADER, 'CREATE' got inode X, on the FOLLOWER inode Y; the next command is 'WRITE' - but you write to inodes, not file names, so you write to inode X - on the LEADER it is fine, but the FOLLOWER will alter a totally different file (because number X on the FOLLOWER is something else). This is why checking inode numbers here is crucial for data safety.

  4. This is not so serious. We have been using MFS for more than ten years and never noticed such corruption, but still, let's assume that something wrong happened. Even if there were an unnoticed corrupted byte in the changelog stream, in the worst case it would lead to one file in your whole system with, for example, the wrong uid. Usually it will be detected easily on the next access to the file (a different UID will cause permission errors, EPERM or EACCES, on the FOLLOWER). In MFS 3.x this point is not covered at all. In MFS 4.x the CRC of the whole metadata is checked every 24 hours (or every hour if you like) and such errors will be detected there.
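For reference, the CRC32 mentioned in point 1 is the standard bitwise CRC-32 (IEEE polynomial). The snippet below is a generic, self-contained illustration of why a single flipped bit in a chunk block gets detected, not MooseFS's own code:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* plain bitwise CRC-32 (IEEE) - unlike a simple byte sum, it catches
 * reordered bytes and multi-bit errors within a block */
static uint32_t crc32_calc(const uint8_t *buf, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (-(crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

int main(void) {
    const char *block = "chunk data block";          /* stand-in for chunk data */
    uint32_t stored = crc32_calc((const uint8_t *)block, strlen(block));

    char damaged[32];
    strcpy(damaged, block);
    damaged[3] ^= 0x01;                              /* flip one bit, as a power cut might */

    uint32_t now = crc32_calc((const uint8_t *)damaged, strlen(damaged));
    printf("stored=%08x recomputed=%08x -> %s\n", stored, now,
           stored == now ? "OK" : "CRC mismatch, chunk is corrupted");
    return 0;
}
```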

I think that you are somehow afraid that the errors mentioned in point 4 can frequently happen undetected and that this is why we should check EVERYTHING. No, we shouldn't. @OXide94 described it very well. We will concentrate our effort on things that are really important. But since you are not convinced, I will write a little more about point 4.

I see here three possibilities:

  1. You have faulty hardware that somehow (because of errors in the kernel, hardware - whatever) can lead to frequent errors passing through the TCP stack undetected.

  2. Such corruption happens once a year (cosmic radiation or whatever).

  3. Your network is fine, but there is a bug in MFS that leads to misinterpretation of a line in the changelog.

And how we deal with that:

  1. Here you will notice a lot of errors and desyncs between LEADER and FOLLOWER - even on MFS 3.x, because the changelog has its own structure and our packets have their own structure. If you do not believe me, do a simple experiment (a rough byte-flipping sketch follows this list). A few minutes before your metadata is saved, copy your metadata.mfs.back file and your changelog.0.mfs to the same location on a new machine. If you run 'mfsmaster -a' there, you should see that it is able to apply all changes from the changelog to your backup file. Now go back to your metadata.mfs.back and changelog.0.mfs, change some random bytes in the changelog and do the same - see whether the master is able to apply all your corrupted changes without errors. You can repeat that experiment to see the odds of random errors passing undetected.

  2. This is so rare that at most it can corrupt the attributes of one (yes, ONE) file. In MFS 3.x it will go undetected. In MFS 4.x it will be detected within 24 hours (in the worst case). The probabilities are minuscule and the corruption is not so serious. Should we spend CPU power to cover this? Not really, in my personal opinion.

  3. This is the most serious scenario here. It can go undetected in MFS 3.x. We even had it once: a bug in the software caused the 'mask' in ACLs to be different on the FOLLOWER than on the LEADER. We noticed the problem only after switching from LEADER to FOLLOWER. We had hundreds of affected files. Was it serious? Not so much. Could we repair it? Yes, and we fixed it with ease. But this one case showed us that an extra CRC of the whole metadata has to be checked from time to time (every 24 hours is more than enough here) to catch such bugs and fix them as soon as possible (and likely on our own instance - even before releasing code with such a bug). This is why we added the CRC check of the whole metadata.
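A rough sketch of the byte-flipping step of the experiment from point 1 (the recovery run itself is done with the real 'mfsmaster -a'; the file name is whatever copy you made, and this should only ever be run against copies):

```c
/*
 * Flip a few random bits in a COPY of changelog.0.mfs, then run
 * 'mfsmaster -a' against the copies and watch it refuse the damage.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <changelog-copy> <flips>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "r+b");
    if (!f) { perror("fopen"); return 1; }

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    if (size <= 0) { fprintf(stderr, "empty file\n"); fclose(f); return 1; }

    int flips = atoi(argv[2]);
    srand((unsigned)time(NULL));

    for (int i = 0; i < flips; i++) {
        long pos = rand() % size;              /* pick a random byte ... */
        fseek(f, pos, SEEK_SET);
        int c = fgetc(f);
        fseek(f, pos, SEEK_SET);
        fputc(c ^ (1 << (rand() % 8)), f);     /* ... and flip one of its bits */
        fflush(f);
        printf("flipped a bit at offset %ld\n", pos);
    }
    fclose(f);
    return 0;
}
```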

@acid-maker (Member)

One more thing. This is open source code, and I can assure you that if you make a correct pull request adding an optional CRC to each changelog entry (switched off by default), I will accept such a change, but I see no reason to do it myself.

@guestisp

@OXide94 @acid-maker thank you for your detailed responses, you convinced me!

@zcalusic (Contributor)

I'd also like to thank you guys for such great explanations. I already have a high opinion of MFS robustness, having never lost a single file on it, and seeing what great lengths you go to to make sure everything's OK is reassuring. 👍

Slightly offtopic, when can we test improvements in MFS 4.x? 😋

@guestisp

Slightly offtopic, when can we test improvements in MFS 4.x?

[ envy mode on ]
I'm already testing it 😄
[ /envy mode off ]

@pkonopelko (Member)

@zcalusic I have just sent you an e-mail regarding MooseFS 4.x testing

Best regards,
Peter / MooseFS Team

@linux-ops (Author)

linux-ops commented May 31, 2018

@OXide94 @acid-maker After the mfsmaster server was powered off, it did not generate a metadata.mfs file. I tried to use mfsmaster -a to repair it and was lucky enough to start successfully, but I found that some of the files created just before the power outage were lost. I cannot understand why, because mfsmaster -a loads the changelog, yet the test still shows lost data. To be clear, I powered off the mfsmaster server, not the chunkserver - a real power-off! Is it that the changelog in memory was not flushed to the hard disk in time?

@zcalusic (Contributor)

@linux-ops to properly debug this, first check your underlying infrastructure. In the event of power loss, disks will lose everything in their caches if you run them with the write cache turned on (typically the default). The same applies to RAID controllers: if you have the write cache turned on, you need to have a battery attached. Finally, the filesystem where the data resides needs to be mounted with barriers on (the default for the last few years).

Only when you have all those settings checked and in place, can potential mfsmaster issues be further debugged.

N.B. Last week Hetzner had a nasty power outage in one of their DCs, and yes, wherever we had the write cache turned on on a RAID controller with no battery backup, MySQL databases were corrupted and had to be restored from backup and/or (for slaves) reloaded from the master again.

@4Dolio

4Dolio commented May 31, 2018

Can I also get an invite for 4.0 testing? Please.

@linux-ops (Author)

@zcalusic The question I am most concerned about is: does mfsmaster force its changelog writes all the way to the hard disk, or is a write considered successful as soon as it reaches the memory buffer?

@linux-ops (Author)

Some good news: in the environment where I simulated the hardware failure, I kept creating files, and after each creation completed I executed the sync command to force the buffers to the hard disk. After the mfsmaster server was powered off and started again with mfsmaster -a, no files were lost. However, mfsmaster itself does not force a sync of the changelog to the hard disk; it only writes to the buffer. I hope this can be fixed in the next release. In a production environment, hardware failures are unpredictable and losing files is not acceptable. My current workaround is to periodically flush the buffers to the hard disk....

@guestisp

guestisp commented Jun 1, 2018

Are you talking about the changelog file? If so, yes, I think the changelog should be written in sync mode.

@guestisp

guestisp commented Jun 1, 2018

In other words, changelog.0.mfs (and all metadata-related files during dumps) should be opened with O_SYNC | O_DIRECT in addition to the other flags.
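A minimal sketch of what such a synchronous append could look like (this is the suggestion under discussion, not how mfsmaster currently writes changelog.0.mfs; O_DIRECT is left out here because of its buffer-alignment requirements):

```c
/*
 * Sketch of a synchronous changelog append: with O_SYNC, write() returns
 * only after the data has reached stable storage, so a power loss right
 * after the call cannot lose the entry (at the cost of waiting on the disk).
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("changelog.demo", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0640);
    if (fd < 0) { perror("open"); return 1; }

    const char *entry = "1003: CREATE(/dir/other,f,...)\n";   /* illustrative entry */
    if (write(fd, entry, strlen(entry)) < 0) {                /* durable when it returns */
        perror("write");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}
```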

@zcalusic (Contributor)

zcalusic commented Jun 1, 2018

But, you do understand that syncing every metadata change could be extremely expensive and kill performance a lot?

OTOH, I wouldn't mind seeing a tunable in the future, so if you really want it, you can turn it on. I.e. if you need consistency badly, but don't care about performance too much, well, then go on and turn it on.

@guestisp

guestisp commented Jun 1, 2018

But, you do understand that syncing every metadata change could be extremely expensive and kill performance a lot?

It shouldn't, as metadata operations are done in RAM and, as long as you put /var/lib/mfs on an SSD, changelog writes are sequential, so syncing on an SSD shouldn't hurt too much.

Maybe using ZFS on /var/lib/mfs could achieve the same.

@zcalusic (Contributor)

zcalusic commented Jun 1, 2018

Ha ha, but what are changelog operations if not those same metadata operations in RAM? 😄

So, if you're after full consistency, you'd have to wait for both the in-RAM and the on-disk change, right (otherwise we wouldn't be discussing this topic)? And the operation would finish only as fast as the slower of the two, right? I hope we can agree, then, that every metadata operation would be bounded by the disk sync speed.

Don't have time right now, but I might do a baseline benchmark for you later today, with a terribly simple bench app I wrote a decade ago. You will be surprised what fdatasync() does to otherwise very performant (on paper, at least) storage devices. 😄
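In the same spirit (this is not zcalusic's app, just a generic micro-benchmark sketch): append a small record N times with and without fdatasync() after every write and compare the elapsed time; on spinning disks the synced run is typically orders of magnitude slower:

```c
/*
 * Compare buffered appends vs. appends followed by fdatasync() to get a
 * feel for what per-write syncing costs on a given device.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

static double run(const char *path, int n, int sync_each) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return -1.0; }
    const char rec[] = "1004: SETATTR(42,0,0644,...)\n";   /* small fake record */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++) {
        if (write(fd, rec, sizeof(rec) - 1) < 0) { perror("write"); break; }
        if (sync_each)
            fdatasync(fd);          /* wait for the record to hit stable storage */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    int n = 1000;
    printf("buffered : %.3f s\n", run("bench.buffered", n, 0));
    printf("fdatasync: %.3f s\n", run("bench.synced", n, 1));
    return 0;
}
```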

@guestisp

guestisp commented Jun 1, 2018

I'm not talking about consistency between RAM and disk, but about the consistency of the data written to disk.

If you have a power loss during a write, you'll end up with a corrupted changelog, so you have a dead master and a corrupted changelog at the same time.

Even worse would be the metadata dump, which AFAIK is in binary form, so a partial write could lead to bigger corruption.

@zcalusic (Contributor)

zcalusic commented Jun 1, 2018

So you're completely fine if the in-RAM version of the metadata and the on-disk changelog are not consistent? Bravo! 👍

@linux-ops (Author)

I just want to say that I hope the author can add support for requiring changelog writes to be confirmed with a sync.

@guestisp

guestisp commented Jun 1, 2018

So you're completely fine if the in-RAM version of the metadata and the on-disk changelog are not consistent?

Currently it's already like this. There isn't any consistency between RAM, disk and followers.

If metadata is written to RAM and to disk simultaneously, you are still limited to disk speed or you'll have inconsistency; thus the "keep everything in RAM" advantage only applies to reads, as writes are capped by the disk speed.

If metadata is written to RAM and to disk in a way similar to writeback, you still have inconsistency, because the disk is not updated as quickly as RAM.

Standard SSDs have almost zero latency and can write at about 500 MB/s (Optanes about 2000 MB/s and 500,000 IOPS when writing). I don't think you are writing 2000 MB/s of metadata, so a sync write doesn't hurt performance as long as you don't have more than 500,000 changes/s or more than 2000 MB/s of metadata changes.

I really doubt that someone would saturate an NVMe/Optane drive with metadata operations.

@4Dolio

4Dolio commented Jun 2, 2018

This is entirely dependent on the environment. It would likely (nearly always) harm performance to force the master to wait on disks (even NVMe) to ack meta ops. You cannot count on masters having NVMe to reduce the latency this would add.

Things like battery-backed power/disks and following master servers can also mitigate the potential for metadata loss.

... so, yes, this risk may be (is) acceptable in many (if not most) well-architected environments.

It might be OK to add such an option, but it ought not be the default, because in a proper environment there are multiple masters with robust desync handling, proven over time to be very reliable in production use. I guess I can concede that such a feature might be useful in a single-master environment, on low-quality hardware, or if power is unreliable. But even my little RasPi personal home cluster does not need such an option; I would likely not use it even there.

@pkonopelko (Member)

@4Dolio:

Can I also get an invite for 4.0 testing? Please.

Sure, please write me at peter@mfs.io.

Best,
Peter / MooseFS Team

@linux-ops (Author)

More bad news: when I simulated network packet loss, I found that the metalogger service also lost some of the mfsmaster's changelog content. This seems to mean that if there is a problem with the network, the metalogger does not go back and fetch the changelog entries from the period of the network problem; it only synchronizes new ones.

That means there will be data loss - how does the MooseFS team solve this problem?

@guestisp

guestisp commented Jun 6, 2018

More bad news: when I simulated network packet loss

How did you simulate this?

@linux-ops (Author)

@guestisp hi
1. iptables drop of the mfsmaster server IP.
2. metalogger stopped for 30 min.

@borkd added the documentation and data safety labels on Nov 26, 2018
@borkd pinned this issue on Dec 19, 2018