
[BUG] : Storage Class change from 2 replicas to 1 removes second copy before data is moved #484

Open
onlyjob opened this issue Aug 13, 2022 · 18 comments
Labels
best practices, data safety, feature, performance

Comments

@onlyjob
Contributor

onlyjob commented Aug 13, 2022

I have a server with a dozen disks in a RAID6-like ZFS configuration (raidz2), and I thought of moving some of the least important data to a ZFS-based chunkserver as a single replica. To facilitate that, I created a STRICT storage class as follows: -C C,C -K L -A L -d0 zfs. Chunkservers labelled C are SSD-based, so newly created chunks first land on the fast chunkservers and are then eventually moved to the only chunkserver labelled L on ZFS, which I was about to set up but never did.

I had selected some data to be moved first. That data was assigned to storage class 2, with two available replicas sitting on non-SSD (non-C labelled) chunkservers.
I re-assigned it to the newly created zfs storage class (mfssetsclass -r zfs {unfortunate_data_folder}) with the destination chunkserver yet to be made. At that moment there was no chunkserver labelled L, and there never had been. I was going to configure it later.
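In other words, the setup was roughly equivalent to this (the mount point and folder path below are placeholders):

    # strict class: new chunks land on two C-labelled (SSD) chunkservers,
    # then a single copy is supposed to end up on an L-labelled chunkserver
    mfsscadmin create -m s -C C,C -K L -A L -d0 zfs
    # re-assign the existing data (two replicas on non-C chunkservers) to that class
    mfssetsclass -r zfs /mnt/mfs/unfortunate_data_folder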

I walked away thinking that the data would not migrate anywhere, assuming it was still safe with two replicas. Imagine my surprise when, a few days later, I found that all the data was gone and all chunks had 0 (zero) replicas, i.e. no valid copies !!!

@chogata
Member

chogata commented Aug 17, 2022

Sounds very strange. I'm setting up our test mfs3 instance right now with a storage class like yours, and I will try to repeat your steps.

@chogata
Member

chogata commented Aug 17, 2022

I tried to repeat your steps and it works for me. To be exact:

  • I have 6 chunkservers, 2 with label C, 4 with label D
  • I have 2 storage classes: mfsscadmin create -K D,D two_non_C, mfsscadmin create -m s -C C,C -K L -A L -d0 zfs
  • I created 100 10MB files in storage class two_non_C in directory teststrict
  • I moved the directory and the files within it to storage class zfs: mfssetsclass -r zfs teststrict/

What I got: files that had 2 chunk copies now have 1 chunk copy. This happened rather fast (the instance has some other data on it, but not much, and it doesn't do anything else at the moment, no clients actively reading or writing). This is correct, as the zfs class requires only 1 copy in KEEP and ARCHIVE modes.
Then, for the last hour, absolutely nothing else happened. The files are sitting there. MooseFS is not deleting them. The replication chart shows that for the last hour only unsuccessful replications happened (obviously, MooseFS tries to copy the chunks to their proper location, but fails every time, because the proper location does not exist; it counts as a failed replication attempt even if no copying is done).
The chunk loop on this instance is 5 minutes, so it has run through many times since the class change.
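For reference, the state of a single test file can be checked like this (the file name here is just an example):

    mfsgetsclass teststrict/file001     # storage class assigned to the file
    mfscheckfile teststrict/file001     # summary of the file's chunks by number of valid copies
    mfsfileinfo teststrict/file001      # per-chunk detail: which chunkservers hold each copy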

So, obviously, MooseFS is not just deleting chunks willy-nilly. Something else must have happened.

With only 1 copy of each chunk, if that copy gets corrupted somehow (invalid, missing), you no longer have a file. It could have happened with a few files. With all files in one directory - that's a bit suspicious.

Is there no trace of what might have happened in the master or chunkserver logs? When did the missing chunk messages start to appear? Were there any restarts in your instance, any other new chunkservers (non-L labelled) added? Did you run out of space on any chunkserver at any moment? Were there I/O operations performed on chunks in the zfs class? I need more info, because as it is, I can't easily explain what happened to your data.

Side note: STRICT mode is the only mode in which MooseFS can actually, kind of deliberately, lose chunks. Scenario: if you have chunks even in many copies (let's say 3), but all on wrongly labelled chunkservers in a class that is STRICT, and correctly labelled chunkservers for that class do not exist or are otherwise unavailable (nearly 100% full or non-responsive), then if any of those copies goes missing, MooseFS will not replicate it, because the class is STRICT. In NORMAL and LOOSE it would, but not in STRICT. So if, over time, all available copies get corrupted, the chunk will be gone. This is dangerous behaviour, but the possibility of losing data is described in the manpage for mfsscadmin. We were not entirely convinced, but the community was very adamant that the STRICT mode should be ... well, strict. So it is. With all the pitfalls it implies.

@chogata
Member

chogata commented Aug 17, 2022

Forgot to ask: are your chunks really missing? Or just invalid/wrong version?

@onlyjob
Contributor Author

onlyjob commented Aug 17, 2022

I noticed the loss of data a few days later. Absolutely nothing extraordinary happened during that time - no lack of space on chunkservers or anything. All chunks in the zfs class were lost altogether. The chunks are missing according to mfscheckfile (chunks with 0 copies) and mfsfileinfo. The original storage class (non-zfs, before the class change) was also STRICT. Nothing else happened, to the best of my knowledge.
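For anyone who wants to run the same check over a whole folder, something along these lines works (the mount point and path are placeholders); it flags every file for which mfsfileinfo prints the "no valid copies !!!" marker:

    find /mnt/mfs/unfortunate_data_folder -type f | while read -r f; do
        mfsfileinfo "$f" | grep -q "no valid copies" && echo "LOST: $f"
    done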

@chogata
Member

chogata commented Aug 18, 2022

The original class makes absolutely no difference: once files are moved to the new class, there is no information about the old class left. The only "consequence" of the old class is where the chunks were stored prior to the change (because this is where the 1 copy stayed or should have stayed), but you said "non-C-labelled", so that is what I did in my test instance. But it doesn't matter whether the previous class was strict or not.

"My" test files are still there, nothing is gone. I tried today some i/o on them, I read them all several times, modified a couple. Nothing bad happened.

Since I'm not able to replicate the problem, the only way I could help would be with access to system logs, changelogs and metadata of the affected instance (assuming you still have the information from the time period).

@tokru66

tokru66 commented Aug 18, 2022

@onlyjob can you show your mfshdd.cfg containing the zfs mountpoint on the affected chunkserver? Does your zfs have compression enabled? I'm just wondering (not tested yet) whether having compression enabled on zfs and an mfshdd.cfg entry without the ~ prefix may cause MooseFS to think the "disk" is damaged and in effect cause an issue like yours.
Look at mfshdd.cfg(5) and the meaning of the ~ sign.
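For example, a chunkserver mfshdd.cfg entry for a hypothetical ZFS mount point, with and without the prefix (as I read mfshdd.cfg(5), the ~ prefix tells the chunkserver not to treat significant changes of the total block count - which happen on compressed filesystems - as a sign of a damaged drive):

    # /etc/mfs/mfshdd.cfg on the ZFS-backed chunkserver (path is hypothetical)
    # plain entry: size fluctuations of the compressed pool may flag the drive as damaged
    #/mnt/zfs-chunks
    # with ~: significant changes of the total block count do not mark the drive as damaged
    ~/mnt/zfs-chunks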

@onlyjob
Contributor Author

onlyjob commented Aug 19, 2022

@onlyjob can you show your mfshdd.cfg containing zfs mountpoint on chunkserver affected?

I don't know how I can make it any clearer: I don't have a ZFS chunkserver! That's the point. I've created a storage class that refers to a non-existent label, assigned to no chunkserver.

The data re-assigned to that new storage class was gone, all of it, with nowhere to go: there was no chunkserver where it could be sent, according to the storage class definition.

@tokru66

tokru66 commented Aug 19, 2022

Oh, understood.
Just out of curiosity, as we had a similar idea to yours (though not using strict mode and using existing labels; still, any configuration causing data loss sounds horrifying and is worth checking), I tried to replicate your scenario and got the same results as @chogata. What I tested, without a single case of data loss:
4 chunkservers:

  • 2 without labels
  • 2 with C labels

Two storage classes in addition to the default classes:

  • stricttest: -C C,C -K L -A L -d 0 -m S
  • nestricttest: -K L -A L -d 0 -m S

Scenarios tested:

  • moving data between the default 2 storage class and the stricttest class back and forth
  • writing data to a stricttest sclass folder
  • writing data to a stricttest sclass folder after moving from the 2 sclass
  • moving data between the default 2 sclass and the nestricttest class back and forth
  • moving data between the stricttest sclass and the nestricttest class back and forth
  • keeping data in a stricttest sclass folder for a while after moving from the 2 sclass
  • writing and overwriting data in all sclasses mentioned above - writes/overwrites to nestricttest froze due to the lack of matching chunkservers, but still no loss of existing data.

So pretty much the same as @chogata's tests and results.
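For reference, the two test classes described above were created roughly like this (a sketch; the exact invocation may have differed):

    # strict class: create on C-labelled servers, keep/archive on the (non-existent) L label
    mfsscadmin create -C C,C -K L -A L -d 0 -m S stricttest
    # same keep/archive targets, but no create labels
    mfsscadmin create -K L -A L -d 0 -m S nestricttest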

Some questions that came to my mind:

  • What was the state of the original (source) chunkservers (any outages, network/drive issues) before or after changing the storage class to your zfs sclass?
  • Is the 2 storage class modified in any way vs. the default? Just trying to find differences between the test environment and yours.

@onlyjob
Contributor Author

onlyjob commented Aug 20, 2022

Agata, you are right, I had missed/forgotten something important: some time around changing the storage class I stopped and removed a chunkserver that held one copy of the data. Your comment made me think that the data might have survived there, and it did -- once I started that chunkserver again, it had one (unmigrated) copy of all chunks in the zfs class.

So the problem is not as severe as I originally thought, but still bad enough: when a storage class is changed from 2 replicas to 1, one replica is removed before(!) the data is physically moved to the destination chunkserver. This is why I thought all data was lost: I expected one copy of the data to remain available on active chunkservers until the data was replicated to the destination. That was not the case, so the problem reminded me of #233.


@tokru66: 1) There were no outages, the cluster was operating normally; 2) it was not the default storage class but a custom STRICT class with two replicas pinned to different types of chunkservers.

I have modified the zfs storage class to -C C,C -K M,W -A L -d1 zfs to keep two replicas for a day before archiving the only replica to the dedicated ZFS-backed chunkserver.
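If done with mfsscadmin, that change would look roughly like this (a sketch; mfsscadmin modify takes the same label/delay options as create):

    # keep 2 copies (labels M and W) for 1 day, then archive a single copy on the L label
    mfsscadmin modify -C C,C -K M,W -A L -d1 zfs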


@chogata, could you please check/confirm whether archiving from 2 replicas to 1 removes the second copy of the data only after the data has been relocated to its destination? Thanks.

@onlyjob onlyjob changed the title from "[BUG] : lost my data due to change of Storage Class" to "[BUG] : Storage Class change from 2 replicas to 1 removes second copy before data is moved" Aug 20, 2022
@chogata
Member

chogata commented Aug 24, 2022

@onlyjob no, it doesn't. It will remove 1 copy of the data and try to replicate the other copy to the correct label - independently. MooseFS kind of tries to keep the data in the required number of copies as a first priority, and on correct labels as a second priority. The only thing is, if the class is strict, it might, in certain circumstances, give up on the first goal - number of copies - if no correct servers are available (like in the example I gave a couple of posts ago).
I see what you would like to do here: chunks replicated to a smaller number of copies, but on a machine that is, let's say, "safer"; that would indeed require both conditions - number of copies and labels - to be achieved at the same time. But that's not how the MooseFS replication algorithm works right now.

@onlyjob
Contributor Author

onlyjob commented Aug 25, 2022

When storage classes are designed to reflect the reliability of chunkservers, data placement is crucial to safety. Therefore reducing the number of replicas before the data is moved to its designated location is a significant risk factor.

One copy of data on a non-redundant HDD is endangered to a much greater extent than if it is pinned to a RAID-6-backed chunkserver. MooseFS does not have to know that, but it should be able to make a safe transition from 2 replicas to 1 when relocation is involved. And that requires moving the data first and only then deleting the redundant replica (not vice versa).

(The can't replicate label no longer applies, right?)

@chogata
Member

chogata commented Aug 25, 2022

MooseFS does not assume any levels of safety for the underlying storage; it deems it all safe or unsafe to the same degree. So it does not care that the user thinks one copy on this label is safer than one copy on that label. One copy is unsafe, period ;) Some of our team even wanted to make "goal 1" (or its equivalent in labels) impossible to use - start with 2 copies, always :)

We can discuss the change you propose internally, but I'm not sure how it would affect the efficiency/speed of the current replication process. We don't consider the current behaviour of MooseFS a bug - it's just designed like that.

@chogata chogata added the feature, data safety, performance and best practices labels and removed the can't replicate label Aug 25, 2022
@onlyjob
Contributor Author

onlyjob commented Aug 28, 2022

I've demonstrated that the goal change is done in an unsafe manner that sometimes leads to data loss, and you still say "We don't consider current behaviour of MooseFS as a bug"... Really?

Bug or not, this can be improved, probably without much difficulty. Does it make sense to you that redundant replicas should be removed only after the placement of the data is compliant with the storage class?

@chogata
Member

chogata commented Aug 31, 2022

Using one copy and strict storage classes is using MooseFS in an unsafe manner. Yes, achieving both targets of replication in the same step would be better; that's why I wrote that we will analyse it and, if it is possible without a significant performance downgrade, we will change it. But the current behaviour in itself cannot be considered unsafe. One copy is one copy, and MooseFS does not have any clue that one location might be "safer" than another. From the system's point of view, one copy is equally unsafe on any server it is kept on; that's why 2 copies (and for larger instances 3 copies) is the recommended minimum in any scenario.

@inkdot7

inkdot7 commented Oct 2, 2022

Related question:

In the storage class manual, section 2.9, it sounds like strict mode applies to the original chunk creation, and not to the migration from CREATE to KEEP or from KEEP to ARCHIVE?

Since strict tries to report issues to the user with ENOSPC, and that reporting can only happen on creation, not at migration, does that mean that using strict mode to force an archive class is not really possible anyhow?

@chogata
Member

chogata commented Oct 5, 2022

@inkdot7 this section of the manual only describes what happens at creation time. However, if you look into the manpage of the mfsscadmin command, you will find a table there:

This table sums up the modes:

                              DEFAULT     STRICT    LOOSE
       CREATE - BUSY          WAIT        WAIT      WRITE ANY
       CREATE - NO SPACE      WRITE ANY   ENOSPC    WRITE ANY
       REPLICATE - BUSY       WAIT        WAIT      WRITE ANY
       REPLICATE - NO SPACE   WRITE ANY   NO COPY   WRITE ANY

It shows both what happens at creation time of a chunk and what happens when a chunk needs to be replicated. Changing the storage stage (CREATE to KEEP or KEEP to ARCHIVE) consists of a combination of deletions and replications (or only one of those, if the other is not necessary). And any and all replications in MooseFS will follow the rules in the above table. So in STRICT mode, which is of interest in this thread: if a chunk is created in a STRICT class and there are no appropriately labelled servers available, the chunk write operation will either hang (if there are servers with space, but they are just too busy at the moment to accept another write request) or return ENOSPC (if there is no space or simply no chunkservers with the requested label(s)). If a chunk in a STRICT class needs to be replicated (for whatever reason: it is endangered, undergoal, on wrong labels, or on an unevenly balanced server), then if all servers with the appropriate label(s) are busy, the replication will wait; if there are no servers with the appropriate label(s) or no space on them, the replication will not happen at all.
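For example, the CREATE - NO SPACE row in STRICT mode can be seen with a class whose labels match no chunkserver (the class name, label and paths below are purely illustrative):

    # X is a label no chunkserver advertises
    mfsscadmin create -m s -C X,X -K X,X strictdemo
    mkdir /mnt/mfs/strictdemo
    mfssetsclass strictdemo /mnt/mfs/strictdemo
    # new files inherit the directory's class; per the table, this write should fail with
    # ENOSPC ("No space left on device") instead of being placed on wrongly labelled servers
    dd if=/dev/zero of=/mnt/mfs/strictdemo/testfile bs=1M count=1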

@inkdot7

inkdot7 commented Oct 7, 2022

@chogata Thanks for the more detailed info. What would then happen with data marked for several archive copies if all such servers are permanently full in strict mode, e.g. -C C,C,C -K C,C,C -A L,L,L -d 0 -m S? As I understand the discussion above, where MooseFS could give up on the number-of-copies goal, this might mean that it drops to just one (?) C copy, even though the user has requested redundancy?

I did notice that the manual page was rather clear about strict having the potential for losing data if labels which run full are used. In that sense, MooseFS is not breaking any promises from the manual.

However, since strict cannot be set to apply only to creation or archiving separately(?), if one wants the ability to prevent user creation of data when a certain storage class is full (by using strict), there is currently no way to avoid the dangerous NO COPY in the REPLICATE - NO SPACE situation, i.e. for data that was at some point successfully written to MooseFS?

@chogata
Member

chogata commented Oct 19, 2022

@inkdot7 You are right, strict mode is always applied to the whole class definition. Actually, you raised an interesting point and we had a short discussion about it. We made a note to research the possibility of applying different modes to different storage stages (CREATE, KEEP, ARCHIVE). Of course, in theory everything is possible ;), we just need to evaluate the performance part of this idea.
