Tataki is a command-line tool designed primarily for detecting file formats in the bio-science field. The tool comes with the following features:
- Supports various file formats mainly used in bio-science
  - bam
  - bcf
  - bed
  - cram
  - fasta
  - fastq
  - gff3
  - gtf
  - sam
  - vcf
  - more formats will be added in the future
- Allows for the invocation of a CWL document and enables users to define their own complex criteria for detection.
- Can target both local files and remote URLs
- Compatible with EDAM ontology
A single binary is available for Linux x86_64.
curl -fsSL -o ./tataki https://github.com/sapporo-wes/tataki/releases/latest/download/tataki-$(uname -m)
chmod +x ./tataki
./tataki --help
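The download URL above is built from the output of uname -m, while only a Linux x86_64 binary is mentioned. A guard like the following could fail early on other platforms instead of fetching a missing asset. This is a minimal sketch: the function name tataki_release_url is hypothetical, and the only asset assumed to exist is the x86_64 one stated above.

```shell
# Hypothetical helper: print the release URL for a given OS and architecture,
# or return non-zero when no prebuilt binary is documented for that platform.
tataki_release_url() {
  os="$1"
  arch="$2"
  if [ "$os" = "Linux" ] && [ "$arch" = "x86_64" ]; then
    printf '%s\n' "https://github.com/sapporo-wes/tataki/releases/latest/download/tataki-$arch"
  else
    printf 'no prebuilt tataki binary documented for %s %s\n' "$os" "$arch" >&2
    return 1
  fi
}

# Possible usage (performs a network download, so it is left commented out):
#   url=$(tataki_release_url "$(uname -s)" "$(uname -m)") && curl -fsSL -o ./tataki "$url"
```

On platforms where the helper fails, the Docker image is the documented alternative.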
Or, you can run tataki using Docker.
docker run --rm -v $PWD:$PWD -w $PWD ghcr.io/sapporo-wes/tataki:latest --help
If you want to execute a CWL document in external extension mode, make sure to mount docker.sock, /tmp, and any other necessary directories.
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -v /tmp:/tmp -v $PWD:$PWD -w $PWD ghcr.io/sapporo-wes/tataki:latest --help
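Because the Docker invocation above is long, it may be convenient to wrap it in a shell function. The sketch below reuses exactly the mounts shown above; the function name tataki_docker is hypothetical, and any additional directories your inputs live in would still need their own -v options.

```shell
# Hypothetical wrapper: run the tataki container with the mounts used above
# (docker.sock, /tmp, and the current directory) and pass all arguments through.
tataki_docker() {
  docker run --rm \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /tmp:/tmp \
    -v "$PWD:$PWD" -w "$PWD" \
    ghcr.io/sapporo-wes/tataki:latest "$@"
}
```

With this defined, tataki_docker --help behaves like the last command above.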
Determine the file format of a local file:
$ tataki path/to/unknown/file.txt -q
File Path,Edam ID,Label
path/to/unknown/file.txt,http://edamontology.org/format_2572,BAM
Determine the file format of a remote file and output the result in YAML format:
$ tataki https://path/to/unknown/file.txt -q -f yaml
https://path/to/unknown/file.txt:
  label: BAM
  id: http://edamontology.org/format_2572
Read all records from the input file (this may take a while depending on the file size):
$ tataki https://path/to/unknown/file.txt -q --tidy
File Path,Edam ID,Label
https://path/to/unknown/file.txt,http://edamontology.org/format_2572,BAM
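The CSV output shown in these examples can be post-processed in a script. The following is a minimal sketch, assuming the output is unquoted and has the three columns shown above; the function name tataki_csv_summary is hypothetical.

```shell
# Hypothetical helper: read tataki's CSV output on stdin and print
# "<label><TAB><EDAM ID>" for each data row, skipping the header line.
# The last two fields are used so that a comma inside the file path
# does not shift the columns (assumes labels and IDs contain no commas).
tataki_csv_summary() {
  awk -F',' 'NR > 1 && NF >= 3 { printf "%s\t%s\n", $NF, $(NF - 1) }'
}

# Possible usage (requires tataki to be installed, so it is left commented out):
#   tataki path/to/unknown/file.txt -q | tataki_csv_summary
```

If paths may contain commas, the tsv output format listed in the help text is likely easier to split reliably.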
Specify the paths of the files as arguments to tataki. Both local file paths and remote URLs are supported.
tataki <FILE|URL>...
For more details:
$ tataki --help
Usage: tataki [OPTIONS] [FILE|URL]...
Arguments:
[FILE|URL]... Path to the file
Options:
-o, --output <FILE> Path to the output file [default: stdout]
-f <OUTPUT_FORMAT> [default: csv] [possible values: yaml, tsv, csv, json]
-C, --cache-dir <DIR> Specify the directory in which to create a temporary directory. If this option is not provided, a temporary directory will be created in the default system temporary directory (/tmp)
-c, --conf <FILE> Specify the tataki configuration file. If this option is not provided, the default configuration will be used. The option `--dry-run` shows the default configuration file
-t, --tidy Attempt to read the whole lines from the input files
-n, --num-records <NUM_RECORDS> Number of records to read from the input file. Conflicts with `--tidy` option [default: 100000]
--dry-run Output the configuration file in yaml format and exit the program. If `--conf` option is not provided, the default configuration file will be shown
-v, --verbose Show verbose log messages
-q, --quiet Suppress all log messages
-h, --help Print help
-V, --version Print version
Version: v0.3.0
By default, Tataki reads the first 100,000 records of the input file. You can change this number with the -n|--num-records=<NUM_RECORDS> option.
tataki <FILE|URL> -n 1000
With the -t|--tidy option, Tataki attempts to read every line of the input files. This option helps when the file is truncated or its end is corrupted. It cannot be combined with -n|--num-records.
tataki <FILE|URL> -t
The -c|--conf=<FILE> option lets you change the order in which file formats are tested, or restrict which formats are used for detection.
The configuration file is in YAML format. Please refer to the default configuration shown below for the schema.
The default configuration can be printed with the --dry-run option.
# $ tataki --dry-run
order:
- bam
- bcf
- bed
- cram
- fasta
- fastq
- gff3
- gtf
- sam
- vcf
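A custom configuration follows the same schema with a shorter or reordered list. The sketch below writes such a file from the shell; the function name write_tataki_conf and the file name fastx.yaml are hypothetical, and the entries are assumed to be format names from the default list above.

```shell
# Hypothetical helper: write a tataki configuration file whose `order`
# list contains the given entries, in the given order.
write_tataki_conf() {
  conf_path="$1"
  shift
  {
    printf 'order:\n'
    for entry in "$@"; do
      printf '%s %s\n' '-' "$entry"
    done
  } > "$conf_path"
}

# Possible usage (requires tataki to be installed, so it is left commented out):
#   write_tataki_conf ./fastx.yaml fastq fasta
#   tataki <FILE|URL> -c ./fastx.yaml
```

Here only fastq and fasta would be tried, in that order.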
Tataki can also execute a CWL document in external extension mode. This is useful for determining file formats that are not supported in pre-built mode, or when you want to perform complex detections.
This mode depends on Docker, so please ensure that 'docker' is in your PATH.
To use external extension mode, prepare a CWL document, add its path to the configuration file, and then execute tataki with the -c|--conf=<FILE> option.
Tataki accepts a CWL document in a specific format. The following is an example of a CWL document that executes samtools head.
edam_id and label are the two required fields for the CWL document. Both must be written with the tataki prefix, which is declared in the $namespaces section of the document.
cwlVersion: v1.2
class: CommandLineTool
requirements:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/samtools:1.18--h50ea8bc_1
  InlineJavascriptRequirement: {}
baseCommand: [samtools, head]
successCodes: [0, 139]
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
outputs: {}
$namespaces:
  tataki: https://github.com/sapporo-wes/tataki
tataki:edam_id: http://edamontology.org/format_2573
tataki:label: SAM
Add the path to the CWL document to the configuration file. The example below runs the CWL document first, followed by SAM and BAM format detection.
order:
- ./path/to/cwl_document.cwl
- sam
- bam
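Before adding a CWL document to the configuration, it can help to confirm that it carries the two required tataki fields. The following is a rough sketch that uses plain text matching only; it does not parse YAML or validate CWL, and the function name check_tataki_cwl is hypothetical.

```shell
# Hypothetical pre-flight check: succeed only if the CWL document declares
# the tataki namespace and sets both tataki:edam_id and tataki:label.
# This is a plain text match, not a YAML or CWL validation.
check_tataki_cwl() {
  cwl_path="$1"
  grep -Eq '^[[:space:]]*tataki:[[:space:]]*https://github.com/sapporo-wes/tataki' "$cwl_path" &&
    grep -Eq '^tataki:edam_id:[[:space:]]*[^[:space:]]' "$cwl_path" &&
    grep -Eq '^tataki:label:[[:space:]]*[^[:space:]]' "$cwl_path"
}

# Possible usage:
#   check_tataki_cwl ./path/to/cwl_document.cwl || echo "missing tataki fields" >&2
```

The check assumes the two fields sit at the top level of the document with no leading indentation, as in the example above.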
Unless noted otherwise, the contents of this deposit are licensed under the Apache License 2.0; see the LICENSE file. The following files are instead licensed under Creative Commons Attribution Share Alike 4.0 International (https://spdx.org/licenses/CC-BY-SA-4.0.html).
./src/EDAM_1.25.id_label.csv
- Source: https://github.com/edamontology/edamontology/releases/download/1.25/EDAM_1.25.csv
- Removed the lines not related to 'format' and the columns other than 'Preferred Label' and 'Class ID'