end-of-term/eot-s3-stats

eot-s3-stats

Script to calculate file sizes and counts from an S3 file listing.

Gathering the file list from S3

Using the AWS command line tool, we gather the file listing for a specific EOT year.

The example below gathers the EOT-2012 file listing and saves it to a text file.

aws s3 ls s3://eotarchive/crawl-data/EOT-2012/ --recursive > EOT-2012-S3.txt
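Each line that `aws s3 ls --recursive` writes holds four whitespace-separated fields: date, time, size in bytes, and object key. As a rough illustration of how such a line could be read, here is a minimal sketch. It is not taken from eot-s3-stats.py; the names `S3Entry` and `parse_ls_line` and the sample key are hypothetical.

```python
from typing import NamedTuple, Optional


class S3Entry(NamedTuple):
    """One object from the listing (hypothetical helper type)."""
    date: str
    time: str
    size: int  # size in bytes
    key: str


def parse_ls_line(line: str) -> Optional[S3Entry]:
    """Parse one listing line; return None if it does not look like one."""
    # Split on whitespace at most three times so that a key
    # containing spaces stays in one piece.
    parts = line.split(None, 3)
    if len(parts) != 4 or not parts[2].isdigit():
        return None
    date, time, size, key = parts
    return S3Entry(date, time, int(size), key.rstrip("\r\n"))


# Made-up sample line, for illustration only.
sample = "2013-01-15 10:22:03 1073741824 crawl-data/EOT-2012/example.warc.gz\n"
entry = parse_ls_line(sample)
print(entry.size, entry.key)
```

Returning None for lines that do not match lets a caller skip blank or unexpected lines instead of failing partway through a large listing.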

Calculating statistics

Once we have the file listing for an EOT year, we can generate the statistics.

python3 eot-s3-stats.py EOT-2012-S3.txt

This produces output like the following.

EOT Year: {'EOT-2012'}

File Sizes
##########
warc_size: 41.42 TB
wat_size: 885.15 GB
wet_size: 217.3 GB
cdx_size: 12.27 GB

File Counts
###########
warc_count: 78509
wat_count: 78509
wet_count: 78509
cdx_count: 78509
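Totals like these could be produced by grouping listing lines by file type and summing their sizes. The sketch below shows one way to do that; it is not the code in eot-s3-stats.py. The suffix-to-category mapping, the 1024-based units, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumed mapping from filename suffix to category; the actual
# script may classify files differently.
SUFFIXES = {
    ".warc.gz": "warc",
    ".wat.gz": "wat",
    ".wet.gz": "wet",
    ".cdx.gz": "cdx",
}


def categorize(key):
    """Return the category for an object key, or None if unrecognized."""
    for suffix, category in SUFFIXES.items():
        if key.endswith(suffix):
            return category
    return None


def human_size(num_bytes):
    """Format a byte count with 1024-based units (an assumption)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1024 or unit == "TB":
            return f"{round(size, 2)} {unit}"
        size /= 1024


def summarize(lines):
    """Sum sizes and count files per category from listing lines."""
    sizes = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        # Fields: date, time, size in bytes, key.
        parts = line.split(None, 3)
        if len(parts) != 4 or not parts[2].isdigit():
            continue  # skip lines that are not object entries
        category = categorize(parts[3].strip())
        if category is None:
            continue
        sizes[category] += int(parts[2])
        counts[category] += 1
    return dict(sizes), dict(counts)


if __name__ == "__main__":
    # Made-up listing lines, for illustration only.
    demo = [
        "2013-01-15 10:22:03 1024 crawl-data/EOT-2012/a.warc.gz\n",
        "2013-01-15 10:22:04 512 crawl-data/EOT-2012/a.warc.wat.gz\n",
    ]
    sizes, counts = summarize(demo)
    for name in sizes:
        print(f"{name}_size: {human_size(sizes[name])}")
        print(f"{name}_count: {counts[name]}")
```

Reading the listing line by line keeps memory use flat, which matters when a listing covers several hundred thousand objects.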
