Skip to content

nh13/pbgzip

Repository files navigation

PBGZIP - Parallel Block GZIP

This tool and API implements parallel block gzip. For many formats, in particular Genomics Data Formats, data are compressed in fixed-length blocks such that they can be easily indexed based on a (genomic) coordinate order, since typically each block is sorted according to this order. This allows for each block to be individually compressed (deflated), or more importantly, decompressed (inflated), with the latter enabling random retrieval of data in large files (gigabytes to terabytes). pbgzip is not limited to any particular format, but certain features are tailored to Genomics Data Formats when enabled (see below). Parallel decompression is somewhat faster, but truly the speedup comes during compression.

Author: Nils Homer

Installation

  1. Compile PBGZIP:
sh autogen.sh && ./configure && make
  1. Install
make install

IGZIP Support

igzip is a high-performance compression library for gzip or deflate compression for Genome data. For BAM (or SAM) formats (see the SAM/BAM Specification), it can significantly speed up the compression, with minimal loss in compression ratio when compared to the fastest level of zlib. In pbgzip, we can utilize the BAM-specific speedup, so using igzip in this context may not yield good results for non-BAM data.

Intalling IGZIP

The BAM-specific igzip libraries must installed prior to compiling pbgzip. We have included the igzip source ready for installing the bam-specific igzip libraries. Please compile and install igzip:

  cd igzip
  make
  make install
  cd ..

To override the install path, set the INSTALL_PATH environment variable before compiling and installing igzip. You may also need to updated your LD_LIBRARY_PATH variable to include the INSTALL_PATH:

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib

If the compilation fails, you may be on a system that is not supported by igzip; please contact the original authors for igzip support.

Enable IGZIP when configuring PBGZIP

Make sure to enable igzip when running configure:

  ./configure --enable-igzip

IGZIP Acknowledgments

Please thank the original authors of igzip: Jim Guilford, Vinodh Gopal, Sean Gulley and Wajdi Feghali. Also thank Paolo Narvaez, Mishali Naik and Gil Wolrich for their contributions. Finally, thank Intel for sponsoring the igzip work.

Intel Integrated Performance Primitives (Intel IPP) Support

Intel's Integrated Performance Primitives (Intel IPP) can boost performance of compression (deflation) by 20-30%. Currently, Intel IPP 9.0 is supported.

Enable IGZIP when configuring PBGZIP

Make sure to enable ipp when running configure:

  ./configure --enable-ipp

Installing and Linking Intel IPP

  1. Install Intel's IPP (see Intel IPP]).

  2. Make sure that the path to the installed Intel IPP libraries is either (a)added to your relevant environment variables, or (b) or specified during the configure command.

(a) Add the following to your .bashrc. Please modify the path for your installation.

export LIBRARY_PATH=${LIBRARY_PATH}:/opt/intel/compilers_and_libraries_2016.0.083/mac/ipp/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/intel/compilers_and_libraries_2016.0.083/mac/ipp/lib

(b) Append the following to the arguments to configure. Please modify the path for your installation.

LDFLAGS=-L/opt/intel/compilers_and_libraries_2016.0.083/mac/ipp/lib CFLAGS=-I/opt/intel/compilers_and_libraries_2016.0.083/mac/ipp/include

Current Issues

For developer issues, see: https://github.com/nh13/pbgzip/issues

Compression Limitations

Due to the requirement that we must fit a chunk of compressed data into a maximum size block, we may fail at compression when the data has sufficiently high entropy (is too random). For example, try compressing a very random set of data:

head -c 10M </dev/urandom | pbgzip -c > myfile.gz

The compression should fail.
Fortunately, bgzip should succeed on this data, but pbgzip currently does not support this case. Developer Note: we would need to keep a buffer of data that was fed to consumers, such that we rewind back to try to compress fewer bytes in the given block that failed deflation.

undefined reference to `rpl_malloc'

In same cases, you may get: undefined reference to `rpl_malloc'. This is due to not being able to find the malloc function.

Try the following:

ac_cv_func_malloc_0_nonnull=yes ./configure <your configure options>

About

Parallel Block GZIP

Resources

License

Unknown, Unknown licenses found

Licenses found

Unknown
LICENSE
Unknown
COPYING

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published