moosefs metalogger #85
+1 for this question
On a good network, the metalogger is probably less than a second behind the master. Meaning, if you suddenly and completely lose a master server and all its data, everything up to the last second of metadata will already be on the metalogger, ready to be instantiated as a new master. Actually, I tried that in practice a long time ago, and it worked really well. The 24-hour download you mention is probably full metadata backups, but there are always new changelogs after that to cover the period between full metadata backups. Just make sure metaloggers are up & running at all times.
In some environments, even 1 second of lost data could cause huge damage.
@guestisp if your database can't survive a crash, may I advise you to choose another one? Classic hard disks will keep much more than 1 second of data in their caches, and it will all be lost when power suddenly goes off. Most databases will recover just fine even in such scenarios. Then again, I wouldn't really advise running classic databases on MooseFS for many reasons, so we're getting offtopic here...
That's why if you are using RAID, the disk cache is disabled.
So, you won't use MooseFS for VM image hosting? Because on many VMs there is a high chance that a database is running (for example, a basic WordPress site running on a VM; there is also a MySQL database on the same machine, running on top of MooseFS).
Right now, no, I'm personally not using MooseFS for either VM images or databases. Would I use it for VM images if need be? Probably yes; I've tested some loopback file systems on it, and it worked quite well, actually great. Would I put a mission-critical database that can't survive a crash on it? No.
Neither do I. If you lost some data, then that data wasn't highly available. Even MySQL replication (which is very unstable, imho) doesn't have this delay under a very heavy write workload. I've never seen an out-of-sync MySQL replica (except when there was an error, but that is another story). In normal conditions, if a slave server is performing well (more or less equally to the master), a slave won't be out of sync. (In fact, you are able to point all reads to the slave server, because master and slaves are supposed to be in sync forever.)
Obviously we live in different worlds, 'cause I regularly see much longer delays in MySQL replication. Yes, on expensive and properly tuned hardware. But, as you now started twisting my words, and claiming things I never said, this conversation is now over. Good luck!
Please point me to where this happened...
Back to the original topic. Metaloggers and slaved masters record changelogs which should have all revisions up to the point at which the master is demoted or crashes. No losses should be expected unless you can demonstrate otherwise.
...or unless the metalogger is out of sync. 1 second behind the master means data loss for sure on a medium-sized cluster with databases and VMs. So, is this "less than a second behind" something we can expect, or something rare?
Guys, easy. It is definitely not 1 second, but much less. I will try to elaborate on this topic and how it is done in some time (later today). Best, |
In my experience, metalogging is real-time, with low or zero latency/loss.
👍 This is what I'm expecting. |
Looking forward to oxide94's actual low-level run-through! My layman's understanding is that all operations go through the master and are propagated to loggers at nearly the same priority as its own metalog stream to disk. The odds of a single op not reaching the log stream are very low and might only occur during a master crash/fault, which itself I expect to be rare. Most transitions would be planned, and thus the metalog streams and promotions always consistent. With that said, other failures of disk or network etc. have their own failure modes, yet any non-faulty metalog consumer will still remain consistent up to, at worst, one operation. To the best of my technical knowledge and practical experience, the master metadata changelog is real-time and, I believe, ACID durable. I consider it to be, given a properly operating environment, with innumerable variations in specific deployments.
Hi all, referring to the original question:
No. As @zcalusic wrote:
... and you can easily recover in case of any failure using the last saved "full" metadata file and changelogs. Regarding the rest of the discussion: First of all, as @acid-maker announced here: #39 (comment), in MooseFS 4.x our full Master Server HA implementation is moved to MooseFS Community, so I will also be mentioning our HA (Leader Master Server and Follower Master Servers). In general, what @4Dolio wrote is true. When there is a change in metadata performed, it is:
And now, regarding point number 2, you may ask: "What happens first?". Metadata changes are pushed to a simple FIFO queue and just sent / written. There is no easy answer as to what happens first, as there are plenty of buffers - the kernel TCP buffer, the network interface buffer, etc.; the disk also has its own physical buffer.

In MooseFS HA, when you lose the Leader Master Server, one of the Followers is automatically elected as the new Leader. This is quite a complicated process and has a lot of protections (e.g. split-brain prevention, so to have HA you need at least "integer" half of the Chunkservers + 1 up and running in order to elect a new Leader). But in this case, if you lose a Leader Master Server which returned status "OK" to the client for the last operation before the failure, the last metadata change most probably made it to the Follower(s), so it will be there. If status "OK" was not returned to the client, the operation will be reinitiated by the Client.

Given that, even if something went terribly wrong (like a major power outage everywhere) and we are considering metadata loss (if any), the loss would be at the level of a single metadata change (and definitely not a second of changes). Moreover, in our HA implementation, after such a failure you need to bring up all the Master Servers and they will "see" which one has the newest metadata version, and only that one can be elected as the new Leader.

Please also do not forget to distinguish metadata operations from data operations, e.g. file creation vs. writing to a file. I wrote a lot, so if something is unclear or you got lost somewhere, please ask - I will be happy to clarify / elaborate. Thanks,
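The FIFO fan-out described above can be sketched roughly like this (the class and names are illustrative, not actual MooseFS internals; the real daemon is written in C):

```python
import queue
import threading

class ChangelogFanout:
    """Illustrative sketch: metadata changes enter one FIFO queue and are
    fanned out in order to every consumer (local changelog file, metaloggers,
    followers). Buffering below this point (TCP, disk caches) is opaque."""

    def __init__(self, consumers):
        self.q = queue.Queue()       # simple FIFO, as described above
        self.consumers = consumers   # callables: write-to-disk, send-to-follower...
        threading.Thread(target=self._drain, daemon=True).start()

    def push(self, entry: str):
        self.q.put(entry)            # producer side: the thread serving clients

    def _drain(self):
        while True:
            entry = self.q.get()
            for consume in self.consumers:
                consume(entry)       # order is preserved per consumer
```

The point of the sketch is only that ordering is guaranteed by the queue, while *when* each consumer durably has the entry depends on the buffers mentioned above.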
Ty for the great explanation. And for MooseFS itself. |
@OXide94 Thank you for this clear explanation. Some side questions: do you have any checksum during the communication between master and followers, just to be sure that a follower has a verbatim copy of all metadata (in other words: how can we be sure that our followers are 100% identical to the master)? AFAIK, Lizard has some checksums even in master/shadow. If the checksum from the shadow doesn't match, a full metadata dump is made from master to shadow, just to be sure to replicate exactly. What about the case of multiple followers where some of them are out of sync? Does the automatic leader election elect the most up-to-date one based on metadata version? For example, follower1, follower2 and follower3 are at revision 1245, while follower4, for some reason, is at 1244. The election should happen among follower1, follower2 and follower3. Follower4 must not be elected until it gets up to date.
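The version-based election asked about here can be illustrated with a toy function (hypothetical names; the actual MooseFS election protocol also involves the Chunkserver quorum mentioned earlier):

```python
def elect_leader(followers):
    """Pick a new leader only among followers holding the highest metadata
    version; stale followers are never eligible.
    `followers` maps a follower name to its metadata version number."""
    newest = max(followers.values())
    eligible = [name for name, ver in followers.items() if ver == newest]
    return sorted(eligible)[0], newest   # deterministic tie-break for the sketch

# follower4 lags one version behind and must not win the election
followers = {"follower1": 1245, "follower2": 1245,
             "follower3": 1245, "follower4": 1244}
leader, version = elect_leader(followers)
```

As the earlier answer notes, after a failure the masters compare metadata versions and only one holding the newest version can become Leader, which is exactly the filter the sketch applies.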
I think TCP takes care of some of the metalog transmission consistency concerns. Still worth an official response, though, I suppose.
The core of this issue is: 1. How to ensure consistency between master and slave data? (Like ZooKeeper's design architecture)
@acid-maker |
@acid-maker DRBD?
In the production environment, when the main server fails, I will manually use the metalogger service, but I am very worried about whether there will be data loss, such as 1 second of data. Because I observed that the changelog is synchronized to the master first, and in the synchronization from the master to the metalogger service there will be delays in between...
@acid-maker I'm not really sure that a CRC check once per day would be enough. What if a transmission error, or something else not related to MooseFS, starts to send garbage to the slave? You won't notice until the daily CRC check, but you are flooding the slave with garbage. A CRC check is useful in each transaction; it will improve data consistency (a lot), and with modern hardware it shouldn't hurt too much (you don't need SHA-256, just a simple CRC-8, which is very very fast and more than enough to detect corruption).
@guestisp Sending extra data (inode number etc.) is enough to make it safe. This is why they can't desync and not notice it. We have used this system in production successfully since 2005 (the same check was done by metarestore before HA). Any corruption that can pass this is merely some unimportant file attributes. This is why we want to check the CRC (of the whole metadata) from time to time. 24h is the default - you can reconfigure it to check every hour, but it is not necessary. The only cases where we ever noticed CRC errors were cases of software bugs (not caused by errors in the changelog itself, but by wrong interpretation of the changelog done by a follower). In such cases only single inodes were affected and the difference was minor - this can't lead to an avalanche of errors not noticed by the follower. CRC in TCP etc. prevents flooding the slave (follower) with garbage. In case of any network issue, the follower will desync. Why should I add my own CRC to each changelog? This would duplicate OS checksums. We use CRCs in I/O packets, but only because we already have them in chunks, so we can send them. We don't want to check each change. What we really want is to check the CRC of the whole metadata after applying the change - and it is not so simple, and not really needed.
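The periodic whole-metadata check described here could look roughly like the following sketch (using Python's `zlib.crc32` for illustration; the real implementation, file layout and CRC variant differ):

```python
import zlib

def metadata_crc(path: str, chunk_size: int = 1 << 20) -> int:
    """Stream a metadata file and compute a single CRC-32 over all of it,
    so that leader and follower can compare one checksum periodically
    (e.g. every 24h) instead of checksumming every changelog entry."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)  # incremental update, constant memory
    return crc & 0xFFFFFFFF
```

A leader and a follower each computing this over their own metadata dump and comparing the two integers is the cheap "from time to time" verification, as opposed to a per-change checksum.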
@linux-ops In case of a classic HDD you also lose a lot of data in case of a power outage. What do you expect from MFS? Use a better UPS or two independent power supplies.
@linux-ops And next time use English please. |
@acid-maker it is not clear to me why sending extra data like the inode number should prevent data corruption in metadata; could you please elaborate on this?
As @acid-maker wrote:
How? Because this "extra" or "excess" information is used on Followers to check if the operation was performed correctly. So when the Leader performs an operation and sends the changelog entry to Followers, they perform the same operation in their metadata structures. For example - when there is
Similarly it is done with other operations - e.g. with
Best regards,
PS: Checksumming everything here on the fly would really be "overkill", because it is CPU-consuming. Please keep in mind (again, as @acid-maker wrote) that we really do care about data consistency. There will always be a "competition" between speed and safety, and we do everything in order to keep a good balance between these two important aspects. And yes - sometimes maintaining this balance is not a piece of cake :) Remember that in MooseFS, data safety and consistency always have a higher priority than speed. However, in some cases there is no justification for making dozens of inefficient protections against scenarios where simple solutions are just good and just work (and they also help to keep efficiency at a good level). This is something that differentiates MooseFS from other solutions - we usually spend more time thinking something out, and doing everything we can to see it from many perspectives, before implementation.
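A toy illustration of this "extra data" check: the leader ships the inode number it assigned together with the operation, and the follower verifies that replaying the operation produced the same inode (all names and the changelog format here are made up for illustration; the real changelog format is different):

```python
class MetaStore:
    """Minimal metadata store: CREATE allocates the next inode number."""
    def __init__(self):
        self.next_inode = 1
        self.files = {}

    def create(self, name: str) -> int:
        inode = self.next_inode
        self.next_inode += 1
        self.files[name] = inode
        return inode

def follower_apply(store: MetaStore, entry: str) -> None:
    """Apply a 'CREATE|name|inode' changelog entry and verify that the
    locally allocated inode matches the one the leader recorded; any
    mismatch means the follower has desynced and must resynchronize."""
    op, name, leader_inode = entry.split("|")
    if op != "CREATE":
        raise ValueError("sketch only handles CREATE")
    local_inode = store.create(name)
    if local_inode != int(leader_inode):
        raise RuntimeError("follower desynced: inode mismatch")
```

Because the follower replays every operation and cross-checks the result against the leader's value, a silently corrupted or misapplied entry surfaces immediately as a mismatch rather than accumulating unnoticed.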
@guestisp: I think that oxide94 described it very well, but I can add something to it. It is how I see it. We have at least four levels of possible errors:
First we should ask the question - how important are they?
I think that you are somehow afraid that the errors mentioned in point 4 can happen frequently and undetected, and this is why we should check EVERYTHING. No, we shouldn't. oxide94 described it very well. We will concentrate our effort on things that are really important. But since you are not convinced, I'll write a little more about point 4. I see here three possibilities:
And how we deal with that:
One more thing. This is open source code, and I can assure you that if you make a correct pull request and add an optional CRC to each changelog (switched off by default), I will accept such a change into the code, but I see no reason to do it myself.
@OXide94 @acid-maker thank you for your detailed responses, you convinced me! |
I'd also like to thank you guys for such great explanations. I already have a high opinion of MFS robustness, having never lost a single file on it, and seeing to what great lengths you go to make sure everything's OK is reassuring. 👍 Slightly offtopic: when can we test the improvements in MFS 4.x? 😋
[ envy mode on ] |
@zcalusic I have just sent you an e-mail regarding MooseFS 4.x testing. Best regards,
@OXide94 @acid-maker After the mfsmaster service lost power, it did not generate metadata.mfs. I tried to use mfsmaster -a for repair and was lucky enough to start successfully, but I found that it will lose some of the file data created before the power outage. I cannot understand why, because mfsmaster -a will load the changelog, but the test result is lost data. Again, I state that I powered off the mfsmaster server, not the chunkserver. !!! Power off !!! Is it that the changelog file in memory is not flushed to the hard disk in time?
@linux-ops to properly debug this, first check your underlying infrastructure. In the event of power loss, disks will lose everything from their caches, if you run them with write cache turned on (typically, the default). It also applies to RAID controllers, if you have write cache turned on, you need to have battery attached. Finally, filesystem where data resides needs to be mounted with barriers on (default for the last few years). Only when you have all those settings checked and in place, can potential mfsmaster issues be further debugged. N.B. Last week Hetzner had a nasty power outage in some of their DC, and yes, wherever we had write cache turned on on RAID controller, and no battery backup, MySQL databases were corrupted and had to be restored from backup and/or reloaded from master again (slaves). |
Can I also get an invite for 4.0 testing? Please. |
@zcalusic The question that I am most concerned about is: does mfsmaster force the changelog write to the hard disk, or does it only write to a memory buffer and report success?
Some very good news: in the environment where I simulated the hardware failure, I kept creating files, but after each creation completed I executed the sync command to force the buffer to the hard disk. After the mfsmaster server was powered off and started again with mfsmaster -a, no files were lost. So mfsmaster does not force a sync to the hard disk when writing the changelog, but only writes to a buffer. I hope this problem can be fixed in the next release. In a production environment, hardware failure is unpredictable, and it is not acceptable to lose any files. My current workaround is to periodically flush the buffers to the hard disk...
Are you talking about the changelog file? If so, yes, I think that the changelog should be written in sync mode.
In other words, |
But, you do understand that syncing every metadata change could be extremely expensive and kill performance a lot? OTOH, I wouldn't mind seeing a tunable in the future, so if you really want it, you can turn it on. I.e. if you need consistency badly, but don't care about performance too much, well, then go on and turn it on. |
It shouldn't, as metadata operations are done in RAM and, as long as you put /var/lib/mfs on an SSD, changelog operations are sequential, thus syncing on an SSD shouldn't hurt too much. Maybe using ZFS on /var/lib/mfs could achieve the same.
Ha ha, but what are changelog operations if not those same metadata operations in RAM? 😄 So, if you're after full consistency, you'd have to wait for both the in-RAM and the on-disk change, right (otherwise we wouldn't discuss this topic)? And that operation would finish only as fast as the slower of the two, right? I hope we can agree, then, that every metadata operation would be bounded by disk sync speed. Don't have time right now, but I might do some baseline benchmark for you later today, with a terribly simple bench app I wrote a decade ago. You will be surprised what fdatasync() does to otherwise very performant storage devices, on paper, at least. 😄
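A rough way to measure the effect being argued about here — the cost of forcing every changelog-style append to stable storage — is a sketch like the one below (the helper name and workload are made up; absolute numbers depend entirely on the device and filesystem):

```python
import os
import tempfile
import time

def append_bench(n: int, sync_each: bool) -> float:
    """Append n small changelog-style lines to a temp file, optionally
    calling fsync after every write, and return the elapsed seconds."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for i in range(n):
                f.write(b"CHANGE|%d|...\n" % i)
                f.flush()                    # push to the OS page cache
                if sync_each:
                    os.fsync(f.fileno())     # force data to stable storage
        return time.perf_counter() - start
    finally:
        os.remove(path)
```

On a typical HDD the `sync_each=True` run is orders of magnitude slower than the buffered one; on a good NVMe drive the gap shrinks, which is essentially the disagreement in the comments above.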
I'm not talking about consistency between RAM and disk, but consistency of the data written to disk. If you have a power loss during a write, you'll end up with a corrupted changelog, so you have a dead master and a corrupted changelog at the same time. Even worse would be the metadata dump, which AFAIK is in binary form; a partial write there could lead to bigger corruption.
So you're completely fine if in-RAM version of metadata and on disk changelog are not consistent? Bravo! 👍 |
I just want to say that I hope the author can support an option so that changelog writes must be confirmed with sync.
Currently it's already like this. There isn't any consistency between RAM, disk and followers. If metadata is written to RAM and to disk simultaneously, you are still limited to disk speed, or you'll have inconsistency; thus the "keep everything in RAM" advantage is only for reads, as writes are capped by the disk speed. If metadata is written to RAM and to disk in a way similar to writeback, you still have inconsistency, because the disk is not updated as quickly as RAM. Standard SSDs have almost zero latency and are able to write at 500MB/s (and Optanes at about 2000MB/s, or 500,000 IOPS when writing). I don't think you are writing 2000MB/s of metadata; in that case, a sync write doesn't hurt performance as long as you don't have more than 500,000 changes/s or more than 2000MB/s of metadata changes. I really doubt that someone would saturate an NVMe/Optane with metadata operations.
This is entirely dependent on the environment. It would likely (nearly always) harm performance to force the master to wait on disks (even NVMe) to ack meta ops. You cannot count on masters having NVMe to reduce the added latency this would introduce. Things like battery-backed power/disk and following master servers can also mitigate the potential for metadata loss. ... so, yes, this risk may be (is) acceptable in many (if not most) well-architected environments. It might be OK to add such an option, but it ought not be the default, because in a proper environment there are many masters with robust desync handling, proven over time to be very reliable in production use. I guess I can concede that such a feature might be useful in a single-master environment, on low-quality hardware, or if power is unreliable. But even my little RasPi personal home cluster does not need such an option; I would likely not use it even there.
Sure, please write me at peter@mfs.io. Best, |
More bad news: when I simulated network packet loss, I found that the metalogger service also lost some of the mfsmaster's changelog content. This means that if there is a problem with the network, the metalogger will not fetch the changelog entries missed during that time; only new entries are synchronized. That means there will be data loss. How does the MooseFS team solve this problem?
How did you simulate this?
@guestisp hi |
If I use the MooseFS metalogger on the master and standby nodes of mfsmaster, and the metadata download is set to every 24 hours, will 24 hours of data be lost if the primary node fails before downloading the metadata? I know you can recover using changelog + metadata, but won't this lose data? When starting mfsmaster from the metalogger slave's data, does it load one changelog file, or multiple changelog files based on the metadata timestamp?