Trimmomatic
Trimmomatic is a tool developed to trim poor quality and adaptors from Illumina data. If you do not have Trimmomatic installed, please check Installing Software.
Let’s check and record the version we are using by typing:
trimmomatic -version
0.39
Here, we are analysing paired-end NGS data. Therefore, in this case, Trimmomatic uses as input the forward (ending in _1
) and reverse (ending in _2
) reads and outputs four files: forward paired, forward unpaired, reverse paired and reverse unpaired. In the paired output files we will have the reads for which both mates survived the trimming, and in the unpaired ones we will have the reads for which only one of the mates survived. Here is an example of a Trimmomatic command for a single accession:
trimmomatic PE -phred33 FG35_1.fastq.gz FG35_2.fastq.gz FG35_forward_paired.fq.gz FG35_forward_unpaired.fq.gz FG35_reverse_paired.fq.gz FG35_reverse_unpaired.fq.gz ILLUMINACLIP:/path/to/the/adaptor/sequence/file/e.g./TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Let’s go through each part of this long command, with adapted information from the manual:
PE
: indicated our data is paired-end-phred33
: -phred33 or -phred64 specifies the base quality encoding. If no quality encoding is specified, it will be determined automatically (since version 0.32). In our case, if your data is yourger than 5 years-old, it is very likely that it is following Phred+33. Here is more information about it: https://www.drive5.com/usearch/manual/quality_score.html.ILLUMINACLIP:TruSeq3-PE.fa:2:30:10
will use the file TruSeq3-PE.fa to remove adapters. Depending on how you have installed Trimmomatic, this file will be in different places. An alternative is to download it directly from their github page: https://github.com/usadellab/Trimmomatic. You can see that they have several adapter files available, and the correct one will depend on which type of data you have. From their manual: Suggested adapter sequences are provided for TruSeq2 (as used in GAII machines) and TruSeq3 (as used by HiSeq and MiSeq machines), for both single-end and paired-end mode. These sequences have not been extensively tested, and depending on specific issues which may occur in library preparation, other sequences may work better for a given dataset. And see some notes from Illumina here https://support.illumina.com/bulletins/2016/12/what-sequences-do-i-use-for-adapter-trimming.html.Once you downloaded it, you can take a quick look to see what it looks like withcat TruSeq3-PE.fa
for example.LEADING:3
removes leading low quality or N bases (below quality 3).TRAILING:3
removes trailing low quality or N bases (below quality 3).SLIDINGWINDOW:4:15
scans the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15.MINLEN:36
drops reads below the 36 bases long.
To save time, here is the same command but using a loop and our acc
file:
while read f ; do
trimmomatic PE -phred33 "$f"_1.fastq.gz "$f"_2.fastq.gz "$f"_forward_paired.fq.gz "$f"_forward_unpaired.fq.gz "$f"_reverse_paired.fq.gz "$f"_reverse_unpaired.fq.gz ILLUMINACLIP:/path/to/the/adaptor/sequence/file/e.g./TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 ; done < acc
Don’t forget to read the manual for more details and options: http://www.usadellab.org/cms/?page=trimmomatic.
This is what you should have in your folder after Trimmomatic finishes:
FG113 FG113_reverse_unpaired.fq.gz FG35_reverse_paired.fq.gz FGIntype_forward_unpaired.fq.gz KGD465_forward_paired.fq.gz zygia917_1_fastqc.zip
FG113_1.fastq.gz FG35 FG35_reverse_unpaired.fq.gz FGIntype_reverse_paired.fq.gz KGD465_forward_unpaired.fq.gz zygia917_2.fastq.gz
FG113_1_fastqc.html FG35_1.fastq.gz FGIntype FGIntype_reverse_unpaired.fq.gz KGD465_reverse_paired.fq.gz zygia917_2_fastqc.html
FG113_1_fastqc.zip FG35_1_fastqc.html FGIntype_1.fastq.gz KGD465 KGD465_reverse_unpaired.fq.gz zygia917_2_fastqc.zip
FG113_2.fastq.gz FG35_1_fastqc.zip FGIntype_1_fastqc.html KGD465_1.fastq.gz acc zygia917_forward_paired.fq.gz
FG113_2_fastqc.html FG35_2.fastq.gz FGIntype_1_fastqc.zip KGD465_1_fastqc.html fastqcfiles zygia917_forward_unpaired.fq.gz
FG113_2_fastqc.zip FG35_2_fastqc.html FGIntype_2.fastq.gz KGD465_1_fastqc.zip renaming.sh zygia917_reverse_paired.fq.gz
FG113_forward_paired.fq.gz FG35_2_fastqc.zip FGIntype_2_fastqc.html KGD465_2.fastq.gz zygia917 zygia917_reverse_unpaired.fq.gz
FG113_forward_unpaired.fq.gz FG35_forward_paired.fq.gz FGIntype_2_fastqc.zip KGD465_2_fastqc.html zygia917_1.fastq.gz
FG113_reverse_paired.fq.gz FG35_forward_unpaired.fq.gz FGIntype_forward_paired.fq.gz KGD465_2_fastqc.zip zygia917_1_fastqc.html
Let’s take a quick look in one of the output files. The .gz
extension indicated that those files are compressed. We can use zless
in this case:
zless FG35_forward_paired.fq.gz
@HWI-M01145:124:000000000-A78G3:1:1101:15521:1338 1:N:0:ACAGTG
GTTCATACAACACATAACTCAGAAAAATGGATTCAGCTGTTCACCTCCGCCAATACAAATTGATAAATGAAGAATTCTGCATCAGAATGAAAGTATGAAATTACATTTTAGAAGGGTGTTACCTCAGATTGAGTACACGCAGCAGTTGCAAGGACTTCCCAATTCCAAAAGACAGATGGGAATCACTCAAAGGCCTCCGCTTAAAGTCAGATAGCCATCC
TGCTATCTACAAGTGTGTGAAACTTTACGG
+
CCCCCFFFFFFDGGGGGGGGGGHHHHGHHHHHHHHHHHHHHHHHHHHHGGGGGHHHHHHHHHHHGHHHHHHHHHHHHHHHGHHHHHHHHHHHHHGHHHGHHHHHGHHHHHHHHHHGGFGGHHHHHHHFHHHGHHHHHHGGGGGHHHHHHHHHGHHHHHHHHHHHHHHHHHHGHHHFHHHHHHHHHHHHHHHHHHHHHHGGGGGHHHHHHGHHGHGHHHGHHHHHHHHHHHHHGFHFFFGGGGGGGGGGGG
@HWI-M01145:124:000000000-A78G3:1:1101:16121:1366 1:N:0:ACAGTG
CAGTCAACTGATCTCCCAACAATATTGAAGTGCCAAAAGTGATAGCCGAAACAAATATAGGAGAGCTCCAAAAGATAAAAGTGATGAAAGCCTGAGAATAGAGGGCTTTACGTAGCCACTTGAACTCTACTCCACGCATTTCTTCCAGCTGTATTCGATATCGATCTTCCCAAGCTTGCAACTTGAGGATCCTCATATTTCTCAAGCATTCAGATGTTTTTCTCATCCTTTCATCCTTAGCAGCCATTAA
+
BBCACFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHGHHHHHHHHHHHGGGGGHGHHHHHHHHHHGHHHHHHHHHGHHHHHHHGHHHGHHHHHHHHHHHHHHHHHGGHHHHHGGHHEHHHHHHHHHHHHHHHHHHHGGGGGHHHHHHHHHHHHHHHHHGHGGHGHHHHHHHHHHHGGHHGHHGHHGFHHGHHGGHHHHHHHHHGHHHGHGHHHHHHHHHHGGHHHHHHHHHHHHHFHGHGGGGGGGGGGFF
Each read has four lines of information, so here we see two reads. In the first line beginning with @
we see information from the sequencing machine (in this case information about the flowcell in a Illumina machine), the nucleotide sequence is stored on the second line, the third line has a +
and the fourth line stores the quality scores.
Now, let’s run FastQC on the groups of reads output from Trimmomatic. First let’s create a file with the names of Trimmomatic output:
ls *fq.gz > fastqctrimfiles
cat fastqctrimfiles
FG113_forward_paired.fq.gz
FG113_forward_unpaired.fq.gz
FG113_reverse_paired.fq.gz
FG113_reverse_unpaired.fq.gz
FG35_forward_paired.fq.gz
FG35_forward_unpaired.fq.gz
FG35_reverse_paired.fq.gz
FG35_reverse_unpaired.fq.gz
FGIntype_forward_paired.fq.gz
FGIntype_forward_unpaired.fq.gz
FGIntype_reverse_paired.fq.gz
FGIntype_reverse_unpaired.fq.gz
KGD465_forward_paired.fq.gz
KGD465_forward_unpaired.fq.gz
KGD465_reverse_paired.fq.gz
KGD465_reverse_unpaired.fq.gz
zygia917_forward_paired.fq.gz
zygia917_forward_unpaired.fq.gz
zygia917_reverse_paired.fq.gz
zygia917_reverse_unpaired.fq.gz
And run:
while read f; do fastqc "$f" ; done < fastqctrimfiles
Once it is done, inspect the html
files visually to check the impact of trimming (and whether files look good to carry on if you are running this analysis with your own data). We are now going to the next step, further trimming with Cutadapt.