
Speed up the sdr-ingest-transfer step in the accessionWF #4594

Open · ndushay opened this issue Sep 20, 2023 · 19 comments

@ndushay (Contributor) commented Sep 20, 2023

The slowness may be due to checksumming:

The accessionWF has a step, sdr-ingest-transfer, that computes checksums; Andrew has observed this step throwing errors when computing checksums.

The action(s) for this ticket:

  1. Find where the checksums are being computed.

     I think this is the code path, and Andrew and I surmise that the Moab::Bagger is ultimately triggering checksum creation. But here my spelunking stopped.

  2. Determine whether the checksum computation(s) can be optimized. For example, "Change compute checksums to digest in single pass" (common-accessioning#1100) combines the file reads for the checksum computations. Another possible improvement is the block size used for the reads; the optimal block size may depend on hardware, and we don't know what our hardware is. (See the sketch after the note below.)

Note that if you find a change that should be made in moab-versioning, this ticket might also be relevant to that work: sul-dlss/moab-versioning/issues/144
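
For reference, a minimal sketch of the single-pass idea combined with a tunable read block size (hypothetical code, not taken from common-accessioning or moab-versioning; the method name and the 64 KB default are assumptions):

    # Hypothetical sketch: compute MD5, SHA-1, and SHA-256 in one pass over the
    # file, with a configurable read block size.
    require 'digest'

    def single_pass_digests(filepath, block_size: 65_536)
      digests = { md5: Digest::MD5.new, sha1: Digest::SHA1.new, sha256: Digest::SHA256.new }
      File.open(filepath, 'rb') do |stream|
        while (buffer = stream.read(block_size))
          digests.each_value { |digest| digest.update(buffer) }
        end
      end
      digests.transform_values(&:hexdigest)
    end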

From a wiki page that has since been removed, but was pulled locally:

## Moab sizes, number of versions

As of October 2017:

* Total Druids: 1,100,585 (excluding old druids on storage root 01)
* Average Druid size: 313 MB
* Average number of files per Druid: 42
* Average Druid version: 2.41 (mean), 1 (mode), 2 (median)
* Highest Druid version: 20
* 95% of all Druids are of version 5 or less.
* Total number of druid-versions (sum of all versions of all druids): 2,660,774

As of February 2019:

* Total Druids: 1,647,020 (excluding old druids on storage root 01)
* Average Druid size: 336 MB
* Average Druid version: 2.78 (mean), 1 (mode), 2 (median)
* Highest Druid version: 21
* 95% of all Druids are of version 7 or less.
* Total number of druid-versions: 4,586,739

## Seeding the Catalog

Moved Seeding Stats to a separate wiki page:
- https://github.com/sul-dlss/preservation_catalog/wiki/Stats-Seeding

## Checksum computation

The table below holds checksumming test results from the `checksum_testing.rb` benchmarking script. This script computes MD5, SHA-1, and SHA-256 on a given file with a user-configurable file input buffer size. The test file was the 15 GB file `/services-disk12/sdr2objects/jf/215/hj/1444/jf215hj1444/v0001/data/content/jf215hj1444_1_pm.mxf`.

The header row shows the file input buffer size in bytes, and each result is the number of seconds it took to complete the checksum computation.

| 4096 | 8192 | 16384 | 32768 | 65536 | 131072 | 262144 |
|:----:|-----:|------:|------:|------:|-------:|-------:|
| 249s | 224s | 215s  | 225s  | 254s  | 210s   | 244s   |
| 214s | 226s | 206s  | 209s  | 231s  | 208s   | 213s   |
| 214s | 215s | 209s  | 229s  | 216s  | 225s   | 208s   |

## C2M Version Check and M2C Existence Check Stats

See the Google doc: https://docs.google.com/spreadsheets/d/1Xvk02asCm75lj5eCOrruhT0i9T8r8kHpUYjoEvLKbqM

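Regarding the Checksum computation section above: the checksum_testing.rb script itself isn't reproduced here, but a minimal sketch of a benchmark in that spirit (hypothetical code, not the actual script; the file path and buffer sizes are placeholders) might look like:

    # Hypothetical benchmark sketch: time MD5 + SHA-1 + SHA-256 over one file at
    # several read buffer sizes, reporting seconds per run.
    require 'benchmark'
    require 'digest'

    path = ARGV.fetch(0) # file to checksum
    [4096, 8192, 16_384, 32_768, 65_536, 131_072, 262_144].each do |buffer_size|
      seconds = Benchmark.realtime do
        digests = [Digest::MD5.new, Digest::SHA1.new, Digest::SHA256.new]
        File.open(path, 'rb') do |stream|
          while (chunk = stream.read(buffer_size))
            digests.each { |digest| digest.update(chunk) }
          end
        end
      end
      puts format('%8d bytes: %.1fs', buffer_size, seconds)
    end
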
@justinlittman (Contributor) commented:

I'd suggest a more general inquiry into why the sdr-ingest-transfer step is slow, which may or may not have something to do with fixity generation.

ndushay changed the title from "Speed up checksum computation for the sdr-ingest-transfer step in the accessionWF" to "Speed up the sdr-ingest-transfer step in the accessionWF" on Sep 20, 2023
@ndushay (Contributor Author) commented Sep 20, 2023

> I'd suggest a more general inquiry into why the sdr-ingest-transfer step is slow, which may or may not have something to do with fixity generation.

Good idea. Consider it done.

justinlittman self-assigned this on Sep 22, 2023
@justinlittman (Contributor) commented:

https://github.com/sul-dlss/dor-services-app/blob/main/app/services/preservation_ingest_service.rb#L32 forces a regeneration of fixities since preservation requires md5, sha1, and sha256 but only md5 and sha1 are present.

@ndushay (Contributor Author) commented Sep 22, 2023

Is there a place earlier in accessioning where we compute md5 and sha1 to which we could add sha256, so that preservation wouldn't be re-reading all the files? @andrewjbtw

@andrewjbtw commented:

There are multiple places, including the SDR API, plus the Cocina model would need the new hash added.

It's still possible that there is a benefit to sdr-ingest-transfer checking fixity at this step and the problem is that its method for checking is slower than it needs to be. It's also possible that checking fixity at this step is redundant because bag validation will cover that need.

@justinlittman (Contributor) commented:

The method for checking is not slower than it needs to be. I defer to your judgment on whether there is value in performing a fixity check at this step.

@justinlittman (Contributor) commented:

Also, while sha256 would need to be added in multiple places, each change should be minor.

@justinlittman (Contributor) commented:

@andrewjbtw Based on reading the code, the fixities are re-generated when the number of algorithms != 3. (That is, it does NOT check the existing fixities and re-generate only if they don't match.)
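
A minimal sketch of the guard being described, to make the distinction concrete (hypothetical code, not the actual dor-services-app logic; the method and constant names are assumptions):

    # Hypothetical illustration of "regenerate when the number of algorithms != 3":
    # the guard is a count check, not a verification of the existing checksum values.
    require 'digest'

    REQUIRED_ALGORITHM_COUNT = 3 # md5, sha1, sha256

    # checksums: a Hash like { md5: '...', sha1: '...' }
    def fixities_for(filepath, checksums)
      return checksums if checksums.size == REQUIRED_ALGORITHM_COUNT

      # Fewer than three algorithms present: re-read the file and compute all three.
      digests = { md5: Digest::MD5.new, sha1: Digest::SHA1.new, sha256: Digest::SHA256.new }
      File.open(filepath, 'rb') do |stream|
        while (chunk = stream.read(8192))
          digests.each_value { |digest| digest.update(chunk) }
        end
      end
      digests.transform_values(&:hexdigest)
    end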

@ndushay (Contributor Author) commented Sep 25, 2023

FYI, code from moab-versioning:

https://github.com/sul-dlss/moab-versioning/blob/main/lib/moab/file_signature.rb#L77

    def self.from_file(pathname, algos_to_use = active_algos)
      raise(MoabRuntimeError, 'Unrecognized algorithm requested') unless algos_to_use.all? { |a| KNOWN_ALGOS.include?(a) }

      signatures = algos_to_use.to_h { |k| [k, KNOWN_ALGOS[k].call] }

      pathname.open('r') do |stream|
        while (buffer = stream.read(8192))
          signatures.each_value { |digest| digest.update(buffer) }
        end
      end

      new(signatures.transform_values(&:hexdigest).merge(size: pathname.size))
    end

Code from common-accessioning:

https://github.com/sul-dlss/common-accessioning/blob/main/lib/robots/dor_repo/assembly/checksum_compute.rb#L69

    def generate_checksums(filepath)
      md5 = Digest::MD5.new
      sha1 = Digest::SHA1.new
      File.open(filepath, 'r') do |stream|
        while (buffer = stream.read(8192))
          md5.update(buffer)
          sha1.update(buffer)
        end
      end
      { md5: md5.hexdigest, sha1: sha1.hexdigest }
    end

Spitballing:

Differences I see (that may have no bearing)

  1. File.open(filepath, 'r') vs pathname.open('r')
  2. moab-versioning has new(signatures.transform_values(&:hexdigest).merge(size: pathname.size))

Differences that could be hidden from us

  1. I/O hardware
  2. CPU hardware
  3. is the place where the files live somehow slower to read when creating the bag?

Also: Is there any software that re-reads the files when creating the bag?

@ndushay (Contributor Author) commented Sep 25, 2023

checksum-compute does the computation on the common-accessioning boxes.

sdr-ingest-transfer calls DSA -- so would we be comparing the DSA hardware to the common-accessioning hardware?

@ndushay (Contributor Author) commented Sep 25, 2023

Late-breaking thought -- I think that we might be re-reading files in order to compute file size:

https://github.com/sul-dlss/moab-versioning/blob/main/lib/moab/file_inventory.rb

# @return [Integer] The total size (in bytes) in all files of all files in the inventory (dynamically calculated)
attribute :byte_count, Integer, tag: 'byteCount', on_save: proc { |t| t.to_s }

def byte_count
  groups.inject(0) { |sum, group| sum + group.byte_count }
end

# @attribute
# @return [Integer] The total disk usage (in 1 kB blocks) of all data files (estimating du -k result) (dynamically calculated)
attribute :block_count, Integer, tag: 'blockCount', on_save: proc { |t| t.to_s }

def block_count
  groups.inject(0) { |sum, group| sum + group.block_count }
end
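
For what it's worth, the FileSignature.from_file snippet quoted above already captures the size via pathname.size while the checksums are being computed, and Pathname#size is a stat call, so summing stored sizes shouldn't by itself require re-reading file contents (assuming byte_count sums the sizes already recorded on the signatures). A quick illustrative check, not SDR code:

    # File.size / Pathname#size is a stat(2) call and does not read the file's
    # contents, so it returns almost instantly even for very large files.
    require 'benchmark'
    require 'pathname'

    path = Pathname.new(ARGV.fetch(0))
    puts Benchmark.realtime { path.size } # effectively instant, regardless of file size
    puts path.size                        # size in bytes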

@andrewjbtw commented:

Network bandwidth is another thing that could be limiting the speed. It's definitely possible that I'm overestimating how we've resourced the accessioning system, but I'm not ready to conclude that yet.

@ndushay (Contributor Author) commented Sep 25, 2023

I think common-accessioning ends up using the dor-services-app hardware, because that's where the actual checksum computations are happening (via the moab-versioning gem, called by DSA).

@ndushay (Contributor Author) commented Sep 25, 2023

JLitt will run a test to see if there are differences between different checksum algorithms.

Can the test be run on the dor-services-app VM vs. the common-accessioning VM?

Something something technical metadata computing checksums; techmd may parallelize multiple files for an object.

@andrewjbtw commented:

> Something something technical metadata computing checksums; techmd may parallelize multiple files for an object.

I believe this was: It's possible that technical-metadata is generating checksums using a faster method. One possibility is that it's parallelizing the reads rather than reading files more than once. technical-metadata is only doing MD5, so the algorithm could also make a difference.

For what it's worth, this was my experience with checksums on large files (100+ GB) at the Computer History Museum:

  • above a certain CPU power, the network or disk I/O affected speed more than the algorithm
  • with parallelization
    • parallelization helped speed up generation on large numbers of small files
    • parallelization slowed down generation when dealing with multiple large files
    • for small files, the parallelization benefit seemed to be in handling the time spent opening and closing files (or file handles or whatever)
    • for large files, the problem seemed to be that reading just one file could consume all available bandwidth, so reading multiple large files meant the processes competed for bandwidth and that had an overall negative effect
  • my experience comes from
    • using the python bagit.py tool
    • writing my own bash scripts for pre-ingest processes, using md5sum and sha512sum
    • local testing on a Linux desktop
    • working with internally-networked servers on a gigabit connection
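
A minimal sketch of the small-file parallelization pattern described above (hypothetical Ruby with a fixed-size thread pool; not code from techmd or any SDR component):

    # Hypothetical sketch: MD5 many small files with a fixed-size pool of threads.
    # As noted above, this can backfire for a few very large files, since the
    # readers end up competing for the same disk/network bandwidth.
    require 'digest'

    def parallel_md5s(filepaths, workers: 4)
      queue = Queue.new
      filepaths.each { |path| queue << path }
      workers.times { queue << :done } # one stop marker per worker

      results = Queue.new
      threads = Array.new(workers) do
        Thread.new do
          while (path = queue.pop) != :done
            results << [path, Digest::MD5.file(path).hexdigest]
          end
        end
      end
      threads.each(&:join)
      Array.new(results.size) { results.pop }.to_h
    end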

@andrewjbtw commented:

Last night, I ran sha256sum on the content in https://argo.stanford.edu/view/ch776mh8649 (about 880 GB), using dor-services-app-prod-a with the content in a folder on the /dor mount. It took about 6 hours, which is around what I'd expect. The sdr-ingest-transfer step took 19 hours when this was accessioned.

$ time find . -type f -exec sha256sum {} \;
b4ecf6f5cee3b997cdb9f001ad1db311ae9a62570953ac4241fd7a23e7157e2c  ./ch776mh8649/ch776mh8649_em_sh.mov.md5
d65165279105ca6773180500688df4bdc69a2c7b771752f0a46ef120b7fd8ec3  ./ch776mh8649/.DS_Store
77a2511522a62945093e23eb0c9d89bcc9dea8c9304092f768ccd165fcc8d4c8  ./ch776mh8649/._.DS_Store
a04b70fe8bc1336e5f816a5479acc5b9899d552e3394810be719907bf95113af  ./ch776mh8649/ch776mh8649_em_sh.mov
c182e18aa773adec7ced87a102f3f5c1ad69afe97a4bc752ca26ce0ea042af65  ./ch776mh8649/ch776mh8649_em_sl.mp4.md5
c96a8be954ab6d2a3cfddf6e341f1d2e891db068ebaf1a0d8415edc1577cc295  ./ch776mh8649/ch776mh8649_em_sl.mp4
2b663825455d29cf83c00bf8bbeef06603e4eb148f9500ad2ceb9fdb6dc82f3f  ./ch776mh8649/ch776mh8649_thumb.jp2.md5
f4114abce3e147dc10f5e156f5162d1e9245c8592ce7cc8a9e8495fd66d7fe26  ./ch776mh8649/ch776mh8649_md.csv
9fac75cd068d6db516b3ba9884c404cffcc8c7738e82257d0ce97ba04a231f58  ./ch776mh8649/ch776mh8649_md.csv.md5
415d066052675a87c31042ed0d39319e7f7d4f14977e5229a9d4f6c20b9d67c8  ./ch776mh8649/ch776mh8649_pm.tar
79259dba1f41a146f7f483b2d272b38fb20a2240a5d15d118f5231b62511724b  ./ch776mh8649/ch776mh8649_pm.tar.md5
dae00215cff24a47766417530c2866727a198fd630c415b6f6d473b0156078a8  ./ch776mh8649/ch776mh8649_thumb.jp2

real    365m2.198s
user    74m12.377s
sys     17m24.700s

@justinlittman (Contributor) commented:

I ran an experiment generating fixities in production on a 386 GB file, using Ruby code that resembles the current code:

    require 'benchmark'

    time = Benchmark.measure {
      # digest = Digest::MD5.new
      digest = Digest::SHA2.new(256)
      File.open('/dor/workspace/jlit-20230925.cdxj', 'r') do |stream|
        while (buffer = stream.read(8192))
          digest.update(buffer)
        end
      end
      puts digest.hexdigest
    }
    puts time

The results were:

  • MD5 on DSA prod: 339 minutes
  • SHA256 on DSA prod: 395 minutes
  • MD5 on common accessioning prod: 295 minutes
  • SHA256 on common accessioning prod: 342 minutes

@andrewjbtw commented:

Thanks for the further analysis. It looks to me like the Ruby approach may be slow for some reason. I couldn't find the file you used in the workspace, but I did use a 280 GB file. Since it's smaller, I would expect it to take less time to checksum. But the results I got on dor-services-app are all less than 1/3 of the times from the Ruby test:

dor_services@dor-services-app-prod-a:/dor/staging/checksumtest/ch776mh8649$ for checksum_type in {md5sum,sha1sum,sha256sum} ; do date ; time "$checksum_type" ch776mh8649_em_sh.mov ; echo "-----" ;  done
Fri 29 Sep 2023 10:06:15 AM PDT
57d6d1a20a7c509aea417e406280e624  ch776mh8649_em_sh.mov

real    81m45.471s
user    7m33.848s
sys     4m56.467s
-----
Fri 29 Sep 2023 11:28:00 AM PDT
62ea871e2f7d7526a0e5686658a4dbdd33e30e34  ch776mh8649_em_sh.mov

real    94m30.975s
user    9m15.775s
sys     5m15.006s
-----
Fri 29 Sep 2023 01:02:31 PM PDT
a04b70fe8bc1336e5f816a5479acc5b9899d552e3394810be719907bf95113af  ch776mh8649_em_sh.mov

real    107m20.550s
user    23m9.484s
sys     5m22.512s
-----

It does look like I was wrong about sdr-ingest-transfer: it is not reading files multiple times. I checked when we updated checksum-compute to stop doing multiple reads, and I think that was deployed before the latest set of large content was accessioned. So if checksum-compute and sdr-ingest-transfer take close to the same amount of time, both must be doing the same number of reads.
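
If the gap between the Ruby loop and the coreutils tools is worth chasing, two low-risk things to try in the Ruby code are a larger read size with a reusable buffer (to cut per-iteration string allocation) and OpenSSL's digest implementation, which may be faster depending on how OpenSSL was built. A hedged sketch, untested here; the 1 MB read size and the file path are placeholders:

    # Hypothetical variant of the earlier benchmark loop: larger reads, a reusable
    # read buffer (IO#read's second argument), and OpenSSL's SHA-256.
    require 'benchmark'
    require 'openssl'

    digest = OpenSSL::Digest.new('SHA256')
    buffer = String.new(capacity: 1_048_576)

    time = Benchmark.measure do
      File.open('/dor/workspace/some-large-file', 'rb') do |stream|
        digest.update(buffer) while stream.read(1_048_576, buffer)
      end
      puts digest.hexdigest
    end
    puts time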

@edsu (Contributor) commented Oct 2, 2023

This is waiting for evaluation from @andrewjbtw about whether to move forward with this now, or put it in the backlog.
