Command line tool for detecting life science data types.

sapporo-wes/tataki

Tataki

Tataki is a command-line tool designed primarily for detecting file formats in the bio-science field. The tool comes with the following features:

  • Supports various file formats mainly used in bio-science
    • bam
    • bcf
    • bed
    • cram
    • fasta
    • fastq
    • gff3
    • gtf
    • sam
    • vcf
    • more formats will be added in the future
  • Allows for the invocation of a CWL document and enables users to define their own complex criteria for detection.
  • Can target both local files and remote URLs
  • Compatible with EDAM ontology

Installation

A single binary is available for Linux x86_64.

curl -fsSL -o ./tataki https://github.com/sapporo-wes/tataki/releases/latest/download/tataki-$(uname -m)
chmod +x ./tataki
./tataki --help

Or, you can run tataki using Docker.

docker run --rm -v $PWD:$PWD -w $PWD ghcr.io/sapporo-wes/tataki:latest --help

To execute a CWL document in external extension mode, make sure to mount docker.sock, /tmp, and any other necessary directories.

docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -v /tmp:/tmp -v $PWD:$PWD -w $PWD ghcr.io/sapporo-wes/tataki:latest --help
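Typing the full docker run line for every invocation is error-prone, so a small shell function can stand in for the binary. The following is a minimal sketch; the function name tataki_docker and the DOCKER override variable are illustrative assumptions, not part of Tataki. It quotes $PWD so that working directories containing spaces are still mounted correctly.

```shell
# Hypothetical wrapper: run Tataki through Docker as if it were a local binary.
# DOCKER is an assumed override hook; set DOCKER=echo to preview the command
# instead of running it.
tataki_docker() {
  "${DOCKER:-docker}" run --rm \
    -v "$PWD:$PWD" -w "$PWD" \
    ghcr.io/sapporo-wes/tataki:latest "$@"
}
```

With this in place, tataki_docker --help behaves like the first docker run command above. For external extension mode, the extra -v mounts for docker.sock and /tmp would still need to be added to the function.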

Quick Start

Determine the file format of a local file:

$ tataki path/to/unknown/file.txt -q
File Path,Edam ID,Label
path/to/unknown/file.txt,http://edamontology.org/format_2572,BAM

Determine the file format of a remote file and output the result in YAML format:

$ tataki https://path/to/unknown/file.txt  -q -f yaml
https://path/to/unknown/file.txt:
  label: BAM
  id: http://edamontology.org/format_2572

Read all records from the input file:

This may take a while, depending on the file size.

$ tataki https://path/to/unknown/file.txt  -q --tidy
File Path,Edam ID,Label
https://path/to/unknown/file.txt,http://edamontology.org/format_2572,BAM
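The CSV output is easy to post-process in a pipeline. The sketch below extracts the path and label from each row; the function name tataki_labels is hypothetical, and it assumes that fields are not quoted and that the last two columns are always the EDAM ID and the label, as in the samples above.

```shell
# Print "path<TAB>label" for each data row of Tataki's CSV output on stdin.
# Assumption: fields are unquoted and the last two columns are the EDAM ID
# and the label, so the path is rebuilt from everything before them
# (this keeps paths that contain commas intact).
tataki_labels() {
  awk -F, 'NR > 1 && NF >= 3 {
    path = $1
    for (i = 2; i <= NF - 2; i++) path = path "," $i
    printf "%s\t%s\n", path, $NF
  }'
}

# Sample rows copied from the Quick Start output above.
printf '%s\n' \
  'File Path,Edam ID,Label' \
  'path/to/unknown/file.txt,http://edamontology.org/format_2572,BAM' |
  tataki_labels
# prints: path/to/unknown/file.txt<TAB>BAM
```

In real use the sample printf would be replaced by a tataki invocation, for example tataki path/to/unknown/file.txt -q | tataki_labels.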

Usage

Specify the paths of the files as arguments to tataki. Both local file paths and remote URLs are supported.

tataki <FILE|URL>...

For more details:

$ tataki --help
Usage: tataki [OPTIONS] [FILE|URL]...

Arguments:
  [FILE|URL]...  Path to the file

Options:
  -o, --output <FILE>              Path to the output file [default: stdout]
  -f <OUTPUT_FORMAT>               [default: csv] [possible values: yaml, tsv, csv, json]
  -C, --cache-dir <DIR>            Specify the directory in which to create a temporary directory. If this option is not provided, a temporary directory will be created in the default system temporary directory (/tmp)
  -c, --conf <FILE>                Specify the tataki configuration file. If this option is not provided, the default configuration will be used. The option `--dry-run` shows the default configuration file
  -t, --tidy                       Attempt to read the whole lines from the input files
  -n, --num-records <NUM_RECORDS>  Number of records to read from the input file. Conflicts with `--tidy` option [default: 100000]
      --dry-run                    Output the configuration file in yaml format and exit the program. If `--conf` option is not provided, the default configuration file will be shown
  -v, --verbose                    Show verbose log messages
  -q, --quiet                      Suppress all log messages
  -h, --help                       Print help
  -V, --version                    Print version

Version: v0.3.0

Detailed Usage

Changing the number of records to read

By default, Tataki reads the first 100,000 records of the input file. You can change this number by using the -n|--num-records=<NUM_RECORDS> option.

tataki <FILE|URL> -n 1000

Avoiding misidentification of file formats of corrupted files

With the -t|--tidy option, Tataki attempts to read every line of the input file instead of stopping at the record limit. This option helps when the file is truncated or its end is corrupted, since damage beyond the record limit would otherwise go unnoticed.

tataki <FILE|URL> -t

Determining Formats in Your Preferred Order

The -c|--conf=<FILE> option lets you change the order in which file formats are tried, or limit detection to a chosen set of formats.

The configuration file is in YAML format. Please refer to the default configuration shown below for the schema.

The default configuration can be printed with the --dry-run option.

# $ tataki --dry-run
order:
  - bam
  - bcf
  - bed
  - cram
  - fasta
  - fastq
  - gff3
  - gtf
  - sam
  - vcf
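A custom configuration only needs an order list in the same shape as the default. As a minimal sketch, the helper below writes such a file from its arguments; the function name write_tataki_conf and the file name in the usage comment are illustrative assumptions.

```shell
# Hypothetical helper: write a Tataki configuration file.
# $1 is the output path; the remaining arguments are the entries of the
# "order" list, in the order they should be tried.
write_tataki_conf() {
  conf_path=$1
  shift
  {
    echo 'order:'
    for entry in "$@"; do
      printf '  - %s\n' "$entry"
    done
  } > "$conf_path"
}

# Usage (check for SAM first, then BAM, and no other formats):
#   write_tataki_conf ./tataki.conf.yaml sam bam
#   tataki -c ./tataki.conf.yaml <FILE|URL>
```

Running tataki -c ./tataki.conf.yaml --dry-run afterwards should show the configuration that will actually be used, which is a quick way to confirm the file was picked up.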

Executing a CWL Document with External Extension Mode

Tataki can also be used to execute a CWL document with external extension mode. This is useful when determining file formats that are not supported in pre-built mode or when you want to perform complex detections.

This mode is dependent on Docker, so please ensure that 'docker' is in your PATH.
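A quick check before running can save a confusing failure later. The snippet below is a small sketch of such a check; the helper name have_command is hypothetical. It only confirms that a docker executable is on the PATH, not that the Docker daemon is running.

```shell
# Return success if the named command can be found on PATH.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

if have_command docker; then
  echo "docker found on PATH"
else
  echo "docker not found on PATH; external extension mode needs it" >&2
fi
```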

Here are the steps to execute a CWL document with external extension mode.

  1. Prepare a CWL document
  2. Specify the CWL document in the configuration file

And then, execute tataki with the -c|--conf=<FILE> option.

1. Preparation of a CWL Document

Tataki accepts a CWL document in a specific format. The following is an example of a CWL document that executes samtools head.

edam_id and label are the two required fields for the CWL document. Both must be written with the tataki prefix, which is declared in the $namespaces section of the document.

cwlVersion: v1.2
class: CommandLineTool

requirements:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/samtools:1.18--h50ea8bc_1
  InlineJavascriptRequirement: {}

baseCommand: [samtools, head]

successCodes: [0, 139]

inputs:
  input_file:
    type: File
    inputBinding:
      position: 1

outputs: {}

$namespaces:
  tataki: https://github.com/sapporo-wes/tataki
  
tataki:edam_id: http://edamontology.org/format_2573
tataki:label: SAM
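Because both tataki-prefixed fields are required, it can help to verify a CWL document before adding it to the configuration. The function below is a rough sketch of such a check; the name check_tataki_fields is hypothetical, and it does a plain text match on top-level keys rather than parsing YAML or validating the CWL itself.

```shell
# Hypothetical pre-flight check: succeed only if the CWL document contains
# both tataki:edam_id and tataki:label as top-level keys.
# This is a simple line match, not a YAML or CWL parser.
check_tataki_fields() {
  grep -q '^tataki:edam_id:' "$1" && grep -q '^tataki:label:' "$1"
}

# Usage:
#   check_tataki_fields ./path/to/cwl_document.cwl ||
#     echo "missing tataki:edam_id or tataki:label" >&2
```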

2. Add Path to Configuration File

Insert the path to the CWL document in the configuration file. The example below runs the CWL document first, followed by SAM and BAM format detection.

order:
  - ./path/to/cwl_document.cwl
  - sam
  - bam

Contributing

!TODO

License

The contents of this deposit are licensed under the Apache License 2.0 unless noted otherwise. See the LICENSE. However, the following files are licensed under Creative Commons Attribution Share Alike 4.0 International (https://spdx.org/licenses/CC-BY-SA-4.0.html).