Raw reads

Warning

Never work directly on your raw data. Make a copy of it to a working directory (i.e. where you will run your analysis) and keep backup copies of your raw data elsewhere (see Where do I store my data?).

For this tutorial, we will follow the Hyb_baits_pipeline from the paper Nicholls et al. 2015, with scripts available on github: https://github.com/ckidner/Targeted_enrichment. We will use five accessions used in the manuscript, available in European Nucleotide Archive, study ID ERP009747: https://www.ebi.ac.uk/ena/browser/view/PRJEB8722?show=reads.

Here we have a list of five folders that came from the sequencing facility. The name of the folders represent the name of the accessions we are working with. Inside each folder are the forward and reverse reads of the accessions. Let’s see what their names are:

ls FG35/
140428_M01145_0124_000000000-A78G3_1_IL-TP-005_1.sanfastq.gz
140428_M01145_0124_000000000-A78G3_1_IL-TP-005_2.sanfastq.gz

The files have complicated names. Let’s change their names to more meaningful ones (i.e., the name of the accessions, which is the same name of their folders):

Let’s start by making a list of folders we have in our working directory:

find . -type d

This command is saying find here (the dot .) all the directories -type d. Here is what you will see in your terminal:

./FGIntype
./KGD465
./FG35
./zygia917
./FG113

Let’s delete the symbols that we don’t need ( . and / ). We will use a pipe | and the command cut:

find . -type d | cut -c 3-
FG113
FG35
FGIntype
KGD465
zygia917

Now let’s save their names in a file. The symbol > will redirect our output to a file called acc:

find . -type d | cut -c 3- > acc

Take a look at acc:

cat acc
FG113
FG35
FGIntype
KGD465
zygia917

We will now use the script renaming.sh to rename our raw read and move them to the folder we are working on.

You can see the content of the script renaming.sh in this dropdown menu:
#! /bin/bash -x
# Renaming the files by the folder names, getting them out of the folders
# For dealing with sequencing facility output

# In directory with the folders in make a list of folder names, to run through this on a while loop

acc=$1

input1=${acc}/*_1.sanfastq.gz
output1=${acc}_1.fastq.gz

input2=${acc}/*_2.sanfastq.gz
output2=${acc}_2.fastq.gz

mv $input1 $output1
mv $input2 $output2
./renaming.sh FG35

If you have many samples, this task can take a while. Let’s use the script renaming.sh but in a loop in order to change the names of all the accessions with a single command. Loops are nice to automate repetitive tasks:

while read f; do ./renaming.sh "$f"; done < acc

Note

In this loop, the acc is the input file that will be read one line at a time, and that value replaces the variable $f. So if our acc files has five lines, the command ./renaming.sh will be executed five times, each time replacing $f with a line from our input file acc.

This is what you should see printed in your terminal:
+ acc=FG113
+ input1='FG113/*_1.sanfastq.gz'
+ output1=FG113_1.fastq.gz
+ input2='FG113/*_2.sanfastq.gz'
+ output2=FG113_2.fastq.gz
+ mv FG113/140428_M01145_0124_000000000-A78G3_1_IL-TP-019_1.sanfastq.gz FG113_1.fastq.gz
+ mv FG113/140428_M01145_0124_000000000-A78G3_1_IL-TP-019_2.sanfastq.gz FG113_2.fastq.gz
+ acc=FG35
+ input1='FG35/*_1.sanfastq.gz'
+ output1=FG35_1.fastq.gz
+ input2='FG35/*_2.sanfastq.gz'
+ output2=FG35_2.fastq.gz
+ mv FG35/140428_M01145_0124_000000000-A78G3_1_IL-TP-005_1.sanfastq.gz FG35_1.fastq.gz
+ mv FG35/140428_M01145_0124_000000000-A78G3_1_IL-TP-005_2.sanfastq.gz FG35_2.fastq.gz
+ acc=FGIntype
+ input1='FGIntype/*_1.sanfastq.gz'
+ output1=FGIntype_1.fastq.gz
+ input2='FGIntype/*_2.sanfastq.gz'
+ output2=FGIntype_2.fastq.gz
+ mv FGIntype/140428_M01145_0124_000000000-A78G3_1_IL-TP-009_1.sanfastq.gz FGIntype_1.fastq.gz
+ mv FGIntype/140428_M01145_0124_000000000-A78G3_1_IL-TP-009_2.sanfastq.gz FGIntype_2.fastq.gz
+ acc=KGD465
+ input1='KGD465/*_1.sanfastq.gz'
+ output1=KGD465_1.fastq.gz
+ input2='KGD465/*_2.sanfastq.gz'
+ output2=KGD465_2.fastq.gz
+ mv KGD465/140428_M01145_0124_000000000-A78G3_1_IL-TP-020_1.sanfastq.gz KGD465_1.fastq.gz
+ mv KGD465/140428_M01145_0124_000000000-A78G3_1_IL-TP-020_2.sanfastq.gz KGD465_2.fastq.gz
+ acc=zygia917
+ input1='zygia917/*_1.sanfastq.gz'
+ output1=zygia917_1.fastq.gz
+ input2='zygia917/*_2.sanfastq.gz'
+ output2=zygia917_2.fastq.gz
+ mv zygia917/140428_M01145_0124_000000000-A78G3_1_IL-TP-027_1.sanfastq.gz zygia917_1.fastq.gz
+ mv zygia917/140428_M01145_0124_000000000-A78G3_1_IL-TP-027_2.sanfastq.gz zygia917_2.fastq.gz

Now let’s see what we have in our working directory:

ls
FG113               FG35_2.fastq.gz     KGD465_1.fastq.gz   zygia917
FG113_1.fastq.gz    FGIntype            KGD465_2.fastq.gz   zygia917_1.fastq.gz
FG113_2.fastq.gz    FGIntype_1.fastq.gz acc                 zygia917_2.fastq.gz
FG35                FGIntype_2.fastq.gz FG35_1.fastq.gz     KGD465              renaming.sh

The raw reads are in fastq.gz format. fastq is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. The gz means the files are compressed. We can peek into compressed files with zless. Let’s look into the forward reads (forward reads end with _1 in our case) of the accession FG113.

zless FG113_1.fastq.gz
@HWI-M01145:124:000000000-A78G3:1:1101:15884:1359 1:N:0:GTGAAA
AATCTAAAACCACCTCAATAAATGTAATCTTCATCTATCTTATCTATTTAAAGTCCTTATTATTCCTTTTGTTTCTTCAAATCCAGCTCCCTCTTCTCTCCTTTCTCAAGTATCAATATGTTGAGATGCATCAATACTTCTTCAGTTCGTTCTCCTTTTCCTCTTCCGCCTCTACAGCCTCTACCGCATCTACCTCCTCTACTGCAACCGTCAGAGTGACAAGTTCTTGACCACCTCAATGCGAGGGTGG
+
BBBBBFFFFFABGGGGGGGGGGGHHFAFAFGFGHHHHHHHHHGHHHHHHHHHHGGHHHHHHHHHHHHHHHHHHHHHHHGHHHHGHHGHHHGHHHHHGHHHHHHHHHHGHHFHHHHHHHHHHHGFHHGHHGHGHHHHHGHHHHHHHHHHGGHHGHHHHHHHHHHHHHHGGGGGHHHHGHHHHHHHHGGGGGHHHHHHHHHHHHHHHHHHHGGGGGGHFHHHGHHHHHHHHHHHHHGHHHHHHHGGGGGE?D
@HWI-M01145:124:000000000-A78G3:1:1101:14180:1422 1:N:0:GTGAAA
TTGAGATCAACTTGGTCATCCTTGTCGCGTCCCAATATCATCTTATAAACAATATATCTTCTAAAAGCCCATGCATTTTCTCACTAACGTCCACAATCTCACCCATCTTAGCAGCATTATCCAGTGACTTCACCACCATTTTCAACTCCTCCGATCCCTCCTAAGATGACCAAACGGCTCTTGATTTGAAGCACCAAAAAGATGTTGGAGGCTGAGTTTCTTCATGTCACGCAAGTATGAACCATGCCCA
+
>AAAAAFFFF1DEGGFGFFFGG311FEEAFGEA1FGGHHDEGGEAEHF2GDCFHHHHHHFHHBAHFGFFEHHBF1AGHHH12GHHHHFCFEFEFHHFGHHGHFGGGHHHHGHBGEGHHHHHGHH2GFFFGGFHGGHGHHHHHHHHGHHFFHHFCAC/0FFECGHH1GD1FGFH/G<EC?CHHD<GGHFBFGFGHEFC..<A.D<GDCE0C?.GFHHHFHHHHBGHFF09FB?@AG0FFGFFFAABFFGGE

To exit zless just type q.

Each read has four lines of information, so here we see two reads. In the first line beginning with @ we see information from the sequencing machine (in this case information about the flowcell in a Illumina machine), the nucleotide sequence is stored on the second line, the third line has a + and the fourth line stores the quality scores.

We are ready to the next step. Let’s run FastQC on them.