Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with TrimmomaticPE log reporting; incompatible with MultiQC #961

Open
oligomyeggo opened this issue Dec 6, 2022 · 8 comments
Open
Labels
bug Something isn't working

Comments

@oligomyeggo
Copy link

oligomyeggo commented Dec 6, 2022

Snakemake version
snakemake v7.18.2
snakemake-wrapper-utils v0.5.0

Describe the bug
When using the TrimmomaticPE wrapper (v1.19.2; v1.19.2/bio/trimmomatic/pe), I noticed an issue with the log reporting. Instead of listing the actual file names in the input arguments, it lists them as /dev/fd/ with a numeric. This is causing an issue when trying to report the Trimmomatic logs with MultiQC in a snakemake workflow, as MultiQC is searching the Trimmomatic logs for the sample name.

Logs

Here is an example log that the TrimmomaticPE wrapper is outputting, which is incompatible with MultiQC:

TrimmomaticPE: Started with arguments:
 -threads 8 /dev/fd/63 /dev/fd/62 /dev/fd/61 /dev/fd/60 /dev/fd/59 /dev/fd/58 ILLUMINACLIP:/reference/adapters/RNAseq/TruSeq2-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:30 MINLEN:36
Using PrefixPair: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACAC'
Using Long Clipping Sequence: 'TTTTTTTTTTCAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 6 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 100000 Both Surviving: 76326 (76.33%) Forward Only Surviving: 8103 (8.10%) Reverse Only Surviving: 5775 (5.78%) Dropped: 9796 (9.80%)
TrimmomaticPE: Completed successfully

And here is what a MultiQC-compatible log should look like (note that the difference is just in the arguments section):

TrimmomaticPE: Started with arguments:
 -threads 8 /reads/SampleA.1.fastq.gz /reads/SampleA.2.fastq.gz /trimmed/SampleA.1.fastq.gz /trimmed/SampleA.1.unpaired.fastq.gz /trimmed/SampleA.2.fastq.gz /trimmed/SampleA.2.unpaired.fastq.gz ILLUMINACLIP:/reference/adapters/RNAseq/TruSeq2-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:30 MINLEN:36
Using PrefixPair: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACAC'
Using Long Clipping Sequence: 'TTTTTTTTTTCAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 6 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 100000 Both Surviving: 76326 (76.33%) Forward Only Surviving: 8103 (8.10%) Reverse Only Surviving: 5775 (5.78%) Dropped: 9796 (9.80%)
TrimmomaticPE: Completed successfully

Is there anyway to update the Trimmomatic wrapper such that the actual input file names are including in the log so that we can add the Trimmomatic logs to a MultiQC report in a snakemake pipeline? Thank you!

Attached trimmomatic.zip folder includes raw toy data, the resulting output from running the TrimmomaticPE rule (trimmomatic.smk), and the associated log file.

trimmomatic.zip

@oligomyeggo oligomyeggo added the bug Something isn't working label Dec 6, 2022
@fgvieira
Copy link
Collaborator

fgvieira commented Dec 7, 2022

That is because the wrapper uses pigz to speed up input/output file compression/decompression.
Just to be sure, can you try running the trimmomatic wrapper with all input and output files unzipped (raw fastq)?

On those two examples you showed, do you know how long each one took? Just to have an idea of how much faster using pigz is...

@oligomyeggo
Copy link
Author

Hi @fgvieira , thank you for your quick response! I ran the trimmomatic wrapper on one of my toy samples with all the input and output files unzipped. The log looks like this:

TrimmomaticPE: Started with arguments:
 -threads 10 /home/cwinkler/trimmomatic_test/raw_reads/SRR17984708_S1_L001_R1_001.fastq /home/cwinkler/trimmomatic_test/raw_reads/SRR17984708_S1_L001_R2_001.fastq results/reads/trimmed/SRR17984708_R1_trimmed.fastq results/reads/trimmed/SRR17984708_R1_trimmed.unpaired.fastq results/reads/trimmed/SRR17984708_R2_trimmed.fastq results/reads/trimmed/SRR17984708_R2_trimmed.unpaired.fastq ILLUMINACLIP:/reference/adapters/RNAseq/TruSeq2-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:30 MINLEN:36
Using PrefixPair: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACAC'
Using Long Clipping Sequence: 'TTTTTTTTTTCAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 6 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 100000 Both Surviving: 76326 (76.33%) Forward Only Surviving: 8103 (8.10%) Reverse Only Surviving: 5775 (5.78%) Dropped: 9796 (9.80%)
TrimmomaticPE: Completed successfully

So I guess that is confirmation that using pigz is changing the file paths in the log file. Honestly, I have only been working with small, toy data sets so far while constructing my pipeline, so the trimming step is moving quite quickly both using pigz on compressed files and using uncompressed files.

I think once I get into my real experiment (100+ human samples), any speed provided by pigz would be appreciated. Do you think there is a way to edit the wrapper to rename the input and output files in the log, or should I maybe think about writing up a little snakemake bash rule to edit the log files so that I can include them in my MultiQC report?

@fgvieira
Copy link
Collaborator

fgvieira commented Dec 7, 2022

I was just looking into Trimmomatic's repo, and it seems version 0.40 will have support for parallel Gzip/Bzip2. If so, then we can just remove all pigz code from the wrapper!

However, not sure when they will release version 0.40.

@oligomyeggo
Copy link
Author

Ah, good to know about Trimmomatic v0.40! Seems like an easy fix once that's available. In the meantime I'll just figure out a short rule/script to add the file names back into my logs. Thank you so much!

@fgvieira
Copy link
Collaborator

fgvieira commented Dec 8, 2022

If you come up with a way to do it in python, I can try to see if I can add it to the wrapper...

@oligomyeggo
Copy link
Author

oligomyeggo commented Dec 14, 2022

Hi @fgvieira, I don't know if this is the most elegant solution (and it is potentially a bit too hard-coded for my pipeline and file naming conventions), but it seems to be working for my pipeline and getting the Trimmomatic logs into my MultiQC report.

I have the following fix_log.py script:

import re

def fix_log(log_file, corrected_log_file, file_paths):
    """Adds input file paths back into Trimmomatic log generated by snakemake wrapper.

    :param log_file: path to original log file generated by Trimmomatic
    :type log_file: str
    :param corrected_log_file: path to fixed Trimmomatic log file
    :type corrected_log_file: str
    :param file_paths: list of file paths; [r1, r2, r1_trimmed, r1_unpaired, r2_trimmed, r2_unpaired]
    :type file_paths: list[]
    """

    search_text = r"/dev/fd/\d+"

    with open(log_file, "r") as in_file:
        log_data = in_file.read()
        for file_path in file_paths:
            log_data = re.sub(search_text, file_path, log_data, 1)

    with open(corrected_log_file, "w") as out_file:
        out_file.write(log_data)

fix_log(snakemake.input.log,
        snakemake.output.corrected_log,
        [snakemake.input.r1, 
         snakemake.input.r2, 
         snakemake.input.r1_trimmed, 
         snakemake.input.r1_unpaired,
         snakemake.input.r2_trimmed,
         snakemake.input.r2_unpaired])

Which I am calling from a rule in my trimmomatic.smk file:

rule trimmomatic:
    input:
        r1=os.path.join(config["RAW_DATA"], "{sample}_S1_L001_R1_001.fastq.gz"),
        r2=os.path.join(config["RAW_DATA"], "{sample}_S1_L001_R2_001.fastq.gz"),
    output:
        r1="results/reads/trimmed/{sample}_R1_trimmed.fastq.gz",
        r2="results/reads/trimmed/{sample}_R2_trimmed.fastq.gz",
        r1_unpaired="results/reads/trimmed/{sample}_R1_trimmed.unpaired.fastq.gz",
        r2_unpaired="results/reads/trimmed/{sample}_R2_trimmed.unpaired.fastq.gz",
    params:
        trimmer=config["TRIMMOMATIC"]["trimmer"],
        extra=config["TRIMMOMATIC"]["extra"],
    log:
        "logs/trimmomatic/{sample}.log",
    message:
        "Trimming adapters for sample {wildcards.sample} using Trimmomatic..."
    threads: 32
    wrapper:
        "v1.19.2/bio/trimmomatic/pe"


rule fix_log:
    input:
        log="logs/trimmomatic/{sample}.log",
        r1=os.path.join(config["RAW_DATA"], "{sample}_S1_L001_R1_001.fastq.gz"),
        r2=os.path.join(config["RAW_DATA"], "{sample}_S1_L001_R2_001.fastq.gz"),
        r1_trimmed="results/reads/trimmed/{sample}_R1_trimmed.fastq.gz",
        r1_unpaired="results/reads/trimmed/{sample}_R1_trimmed.unpaired.fastq.gz",
        r2_trimmed="results/reads/trimmed/{sample}_R2_trimmed.fastq.gz",
        r2_unpaired="results/reads/trimmed/{sample}_R2_trimmed.unpaired.fastq.gz",
    output:
        corrected_log="logs/trimmomatic/{sample}.corrected.log",
    script:
        "../scripts/fix_log.py"

I looked for the {sample}.corrected.log files in my MultiQC rule, and it is finding them and incorporating them nicely. I am not sure if the above code is something that you can work with/incorporate into the Trimmomatic wrapper, but it is at least a workable solution for now.

Copy link
Contributor

github-actions bot commented Nov 1, 2023

This issue was marked as stale because it has been open for 6 months with no activity.

@github-actions github-actions bot added the Stale label Nov 1, 2023
Copy link
Contributor

github-actions bot commented Dec 1, 2023

This issue was closed because it has been inactive for 1 month since being marked as stale. Feel free to re-open it if you have any further comments.

@github-actions github-actions bot closed this as completed Dec 1, 2023
@fgvieira fgvieira reopened this Dec 1, 2023
@github-actions github-actions bot removed the Stale label Jan 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants