If the registry_cache files are corrupted overlaybd fails to load. #239

Open
1 task done
simha-db opened this issue Jul 13, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@simha-db

What happened in your environment?

It looks like overlaybd fails to create the image if the registry_cache contains corrupt files.

Logs below:

2023/07/13 07:02:37|ERROR|th=00007FE4F37F9F80|/src/src/overlaybd/lsmt/file.cpp:1159|verify_ht:header magic/type don't match
2023/07/13 07:02:37|ERROR|th=00007FE4F37F9F80|/src/src/overlaybd/lsmt/file.cpp:1553|do_parallel_load_index:failed to load index from 32-th file
2023/07/13 07:02:37|ERROR|th=00007FE5D1B53B00|/src/src/overlaybd/lsmt/file.cpp:1581|load_merge_index:load index failed.
2023/07/13 07:02:37|ERROR|th=00007FE5D1B53B00|/src/src/image_file.cpp:346|open_lowers:LSMT::open_files_ro(files, 76, 1) return NULL
2023/07/13 07:02:37|ERROR|th=00007FE5D1B53B00|/src/src/image_file.cpp:470|init_image_file:open lower layer failed.
2023/07/13 07:02:37|ERROR|th=00007FE5D1B53B00|/src/src/main.cpp:302|dev_open:create image file failed
2023/07/13 07:02:37|ERROR|th=00007FE5D1B53B00|/src/build/_deps/tcmu-src/libtcmu.cpp:605|device_add:handler open failed for uio3

Would it be good to have logic that deletes the corrupt file and retries?

What did you expect to happen?

Recover by deleting the file and redownloading.

How can we reproduce it?

  1. Start a container on overlaybd.
  2. Corrupt the registry_cache files by writing junk to them.
  3. Observe that the container no longer starts.
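
Step 2 could be scripted along the following lines. This is only a sketch: `corrupt_file` is a hypothetical helper, and the path of the cache file to pass in depends on how registry_cache is configured on the host, which this issue does not state. Overwriting the start of the file is a guess at what triggers the "header magic/type don't match" error in the logs above.

```python
import os


def corrupt_file(path: str, offset: int = 0, length: int = 4096) -> int:
    """Overwrite up to `length` bytes of `path`, starting at `offset`, with random junk.

    Returns the number of bytes overwritten. The file size is left unchanged.
    """
    size = os.path.getsize(path)
    if offset >= size:
        return 0
    n = min(length, size - offset)
    # "r+b" opens for in-place update without truncating the file.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(os.urandom(n))
    return n
```

Run it against a copy of a cache file first; it destroys data in place.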

What is the version of your Overlaybd?

0.5.3-1.

What is your OS environment?

Ubuntu 20.04

Are you willing to submit PRs to fix it?

  • Yes, I am willing to fix it.

simha-db added the bug label Jul 13, 2023
@liulanzheng
Member

@simha-db So far, only the zfile level implements a mechanism to evict corrupted data. First, when a zfile is opened, if an error occurs while loading the zfile jump table, the whole file is evicted. Second, when CRC verification fails, the failed block is evicted. Of course, that doesn't cover every scenario; any enhancements and fixes are welcome.
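
The second mechanism (evict a block when its CRC check fails, then fetch it again) can be illustrated with a small read-through cache. This is a sketch under assumptions, not overlaybd's implementation: `VerifiedBlockCache`, the `fetch` callback, and the in-memory dictionary standing in for registry_cache are all hypothetical, and `zlib.crc32` is only a stand-in for whatever checksum zfile actually uses.

```python
import zlib
from typing import Callable, Dict


class VerifiedBlockCache:
    """Hypothetical read-through cache that evicts blocks failing a CRC32 check."""

    def __init__(self, fetch: Callable[[int], bytes], expected_crc: Dict[int, int]):
        self._fetch = fetch                  # stand-in for a download from the registry
        self._expected = expected_crc        # block_id -> expected CRC32
        self._blocks: Dict[int, bytes] = {}  # stand-in for data held in registry_cache

    def read(self, block_id: int) -> bytes:
        data = self._blocks.get(block_id)
        if data is not None and zlib.crc32(data) == self._expected[block_id]:
            return data
        # Missing or corrupted: evict the cached copy and fetch the block again.
        self._blocks.pop(block_id, None)
        data = self._fetch(block_id)
        if zlib.crc32(data) != self._expected[block_id]:
            raise IOError(f"block {block_id} failed its CRC check after refetch")
        self._blocks[block_id] = data
        return data
```

A corrupted cached block then costs one extra fetch instead of a failed read; a block that is still bad after the refetch surfaces as an error.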

@simha-db
Author

Ah ok, so the validation done is the same as what I would get if I run

overlaybd-zfile --verify -t -x layerfile

?

@shuaichang

@liulanzheng @BigVan if any block of the Overlaybd blob is corrupted, will overlaybd-zfile --verify -t -x layerfile be able to identify the corruption?

@shuaichang

Also @BigVan, do we know how fast the validation by overlaybd-zfile --verify is? Is it faster than CRC32?

@BigVan
Member

BigVan commented Jul 25, 2023

Also @BigVan, do we know how fast the validation by overlaybd-zfile --verify is? Is it faster than CRC32?

No, it just reads the whole file and checks the CRC32 of each compressed block.
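
As a rough illustration of what "check the CRC32 of each compressed block" amounts to, the sketch below scans a sequence of blocks and reports those whose checksum does not match. The `(payload, stored_crc)` pairing is an assumption made for the example and is not the zfile on-disk layout; `zlib.crc32` is likewise a stand-in, since the thread does not say which CRC32 variant zfile uses.

```python
import zlib
from typing import List, Sequence, Tuple


def find_corrupted_blocks(blocks: Sequence[Tuple[bytes, int]]) -> List[int]:
    """Return the indices of blocks whose CRC32 differs from the stored value.

    Each element of `blocks` is a (payload, stored_crc) pair.
    """
    corrupted = []
    for index, (payload, stored_crc) in enumerate(blocks):
        if zlib.crc32(payload) != stored_crc:
            corrupted.append(index)
    return corrupted
```

Because every block is read and checksummed once, the cost grows linearly with the size of the file, which is why this cannot be faster than computing the CRC32 itself.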

@liulanzheng @BigVan if any block of the Overlaybd blob is corrupted, will overlaybd-zfile --verify -t -x layerfile be able to identify the corruption?

'overlaybd-zfile' will print the block_id whose checksum does not match, like:

2023/07/25 15:27:09|ERROR|th=0000562F26E92DE0|/root/work/dadi/overlaybd/src/overlaybd/zfile/zfile.cpp:934|zfile_validation_check:crc check error in block 132240

and exit the program with a non-zero code.
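
For tooling that wraps the verifier, the block IDs can be pulled out of its log output. The sketch below matches only the message text shown in the sample line above; `corrupted_block_ids` is a hypothetical helper, and the message format could differ between overlaybd versions.

```python
import re
from typing import List

# Matches the message text from the sample log line quoted in this thread.
_CRC_ERROR = re.compile(r"crc check error in block (\d+)")


def corrupted_block_ids(log_text: str) -> List[int]:
    """Extract every block ID reported in `log_text` as failing its CRC check."""
    return [int(match.group(1)) for match in _CRC_ERROR.finditer(log_text)]
```

The exit code is the more dependable signal for pass or fail; parsing the log is useful mainly for reporting which block was bad.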

@shuaichang

Thanks @BigVan for the quick reply. Do we know how long the verification would take for a 10 GB layer (or some other layer size)? Also, is it possible to verify these blocks with multiple threads?

@shuaichang

shuaichang commented Jul 26, 2023

@BigVan could you clarify the following scenarios?

Some background context: we download Overlaybd image blob layers in chunks concurrently and put them into registry_cache.

  1. If some part of a downloaded layer blob is corrupted, but the corrupted data is not needed by the time the process starts, will the process still be able to start up correctly?
  2. Assuming 1. is true, if the process reads the corrupted data during application execution, will this cause unexpected application errors?
  3. If we call zfile verify after a blob is downloaded, will it be able to detect the corruption and return an error?

Our goal is to ensure that we can call zfile verify to identify any corruption in a blob and fail at container create time instead of causing unexpected application errors.

@BigVan
Member

BigVan commented Jul 27, 2023

  1. The corrupted data will not affect the container startup.
  2. overlaybd will try to evict the corrupted chunk and download this part of the data from the registry again.
  3. Yes. As I mentioned before, overlaybd-zfile --verify will print the ID of the corrupted block and exit with a non-zero code.
    If the registry_cache files are corrupted overlaybd fails to load. #239 (comment)
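
The goal stated earlier in the thread (verify a blob after download and fail at container create time) could be wired up roughly as follows. This is a sketch under assumptions: `verify_or_evict` is a hypothetical helper, the verifier command is supplied by the caller, and the only behavior relied on is the one confirmed above, namely that the verifier exits with a non-zero code on corruption.

```python
import os
import subprocess
from typing import Sequence


def verify_or_evict(blob_path: str, verify_cmd: Sequence[str]) -> bool:
    """Run `verify_cmd` with `blob_path` appended; delete the blob if it fails.

    Returns True when the verifier exits with code 0, otherwise removes the
    file so that it can be downloaded again and returns False.
    """
    result = subprocess.run([*verify_cmd, blob_path], capture_output=True, text=True)
    if result.returncode == 0:
        return True
    try:
        os.remove(blob_path)
    except FileNotFoundError:
        pass  # already gone; nothing to evict
    return False
```

With the invocation quoted in this thread, a call might look like `verify_or_evict("layerfile", ["overlaybd-zfile", "--verify", "-t", "-x"])`; the flags are copied from the comments above and should be checked against the installed version. A `False` result is the point at which container creation would be failed or the download retried.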
