Chapter 7 Basic bioinformatics
The next chapters focus on the analysis of molecular sequence data in using command line. Nothing too fancy, just some basic commands to look at the data files and summarize information from the sequences.
7.1 Section 1: Basic commands for sequence data
7.1.1 Downloading the datasets
We are going to use two main datasets:
https://raw.githubusercontent.com/Tabima/MBB101/master/Lab_3/AAB21188.seq
https://raw.githubusercontent.com/Tabima/MBB101/master/Lab_3/S79522.seq
- Create a folder called
Lab_3
in your home folder at the cluster - Download the files. Answer the following question:
Question 1
- What commands you used to download the files?
- Check the files. Answer the following questions:
Question 2
-
What kind of sequence (DNA/RNA/Protein) is the
S79522.seq
file? Add the command you used to answer the question.
Question 3
-
What kind of of sequence (DNA/RNA/Protein) is the
AAB21188.seq
file? Add the command you used to answer the question.
7.1.2 Counting basic patterns for DNA data
Ok, after checking the files we know now that the S79522.seq
file contains a DNA sequence.
Lets do some command line exercises with it. Answer all of these questions:
Question 4
- How many lines are found on this file?
Question 5
- How many characters are found on this file? (Or, what is the length of this sequence?)
Question 6
- How many A, C, G, and T can you find in this sequence?
Question 7
-
Convert this sequence to RNA and save it in a file called
S79522_RNA.seq
. Please add the command you used to create the file. (Don’t create a complementary strand, just convert this one from DNA to RNA)
Question 8
-
How can we test that this new file is an RNA sequence
S79522_RNA.seq
. Please add a command you’d use to check if this is DNA or RNA.
Question 9
-
Finally, using your RNA sequence, check how many start codons
(
AUG
) or stop codons (UAA
,UAG
, orUGA
) may exist in your sequence
7.1.3 Counting basic patterns for Aminoacid data
Ok, after checking the files we know now that the AAB21188.seq
file contains a aminoacid sequence.
Question 10
- How many characters are found on this file? (Or, what is the length of this sequence?)
Question 11
- Select two aminoacids of interest and count them. Please add the command used to count them.
We have four candidate patterns for this gene:
VLKYYKVD
(A common pattern found in the Ribosomal S27 subunit domains)ITLEVE
(A common pattern found in Ubiquitin domains)KWCPRP
(A common pattern found in In Between Ring fingers (IBR) domains)LSVLSQKM
(A common pattern found in DNA polymerase unit A domains)
Question 12
Which patterns do you find in the sequence? Fill in this table:
Sequence | Ribosomal 27 | Ubiquitin | IBR | DNA poly A |
---|---|---|---|---|
AAB21188 | Yes/No | Yes/No | Yes/No | Yes/No |
7.2 Section 2: Creating pipelines
In this section we will learn something new and exciting:
We will make a simple pipeline to automate counting nucleotide sequences.
7.2.1 What is a pipeline?
Pipelines are sequential lines of code that can be used to use multiple commands at once.
For example, if you want to count all the U
in a DNA sequence that you have transformed into DNA, you can actually do this in one simple line of code.
Think about this as a logical set of steps (or, a pseudocode):
- Read the DNA file
- Replace the
T
forU
- Extract the matches to
U
- Count those matches
Using the command line the intructions would be like:
cat file.txt
sed 's/T/U/g' file.txt > file_rna.txt
grep -o U file_rna.txt > U_ocurrences.txt
wc -l U_ocurrences.txt
This code works, but its quite useless as it forces us to create two independent files: file_rna.txt
and U_ocurrences.txt
.
Imagine that the output of these files were large and you had no space left in your hard drive! You would have to run everything again from the beginning!
One of the options is to use pipelines, using the pipe (|
) modifier. This modifier works as an an then
.
So for example, a command that is like this: ls -lrt | grep seq | wc -l
can be read as:
Make a list of all the files and then extract all the matches to seq
and then count the number of occurences.
So, we can totally use this with out data.
7.2.1.1 Pipeline example using DNA data
Before we start, let me introduce you to two elements:
- Variables
- the
echo
command.
7.2.1.1.1 Variables
Variables can be defined as a place to temporarly store a piece of information.
For example, you wallet is a variable that is full of your money.
So, if you have $10 in your wallet, then your variable is:
wallet=10
So, that is how you define a variable in bash
. To see what the value of that variable is, you use the echo
command
7.2.1.1.2 The echo
command
The echo
command will print to your screen whatever you tell it to print.
For example, if you tell the command to echo 'hello'
, this will happen:
echo 'hello'
## hello
Meaning that you can tell it to say whatever you want to say. If you want to add spaces or other weird characters, make sure you surrond the statement with single quotes
echo 'This is the best course ever'
## This is the best course ever
7.2.1.1.3 Mixing variables and echo
So, we want to mix these two because then you can store a variable and print it in the screen, for example:
statement='This is the best course ever'
echo $statement
## This is the best course ever
Look at this! You can simplify the instructions and print variables!
Now, did you notice that the variable statement
had a $
in front? That is how you tell bash
that you are using a variable!
If you don’t add it this is what happens:
echo statement
## statement
So, you are just printing the word statement. Interesting ah?
7.2.1.2 Using a pipeline as an example of variables and pipes
Now, lets say we want to count all the nucelotides in our DNA file independently.
So, we want to extract each nucleotides, starting from the A
, using grep -c A
, and then counting lines with wc -l
:
grep -o A Lab_3/S79522.seq | wc -l
## grep: Lab_3/S79522.seq: No such file or directory
## 0
Question 12
- Did these results match what you did at the beginning of the lab? Expand briefly your answer.
Ok, now, to save this on a variable (you can call the variable anything as long as it doesn’t have any spaces or weird characters) we just put the entire pipeline inside a $()
, like this:
nucA=$(grep -o A Lab_3/S79522.seq | wc -l)
echo $nucA
## grep: Lab_3/S79522.seq: No such file or directory
## 0
WOW! that was easy!
Now, if you want to add more variables you can do it like this:
nucA=$(grep -o A Lab_3/S79522.seq | wc -l)
nucC=$(grep -o C Lab_3/S79522.seq | wc -l)
echo 'Number of A:'$nucA
echo 'Number of C:'$nucC
## grep: Lab_3/S79522.seq: No such file or directory
## grep: Lab_3/S79522.seq: No such file or directory
## Number of A: 0
## Number of C: 0
Right here, stop and make the same sound a rooster makes. Then answer the following questions:
Question 13
- How would you count the other nucleotides?. Add the code below:
Question 14
-
Write a set of instructions to count ten of the aminoacids in the
AAB21188.seq
file.
7.3 Section 3: Processing local files into the cluster
So, we know how to download files from different sources into the cluster. Today, we are going to learn how to move files from and to the cluster.
Download the
sequence_1
andsequence_2
files from the moodle.Create a new folder called
mystery_genes
in yourLab_3
folderUpload the files into the folder
Check if the files are there.
Question 15
- How did you check if the files were correctly uploaded?
7.3.1 Checking file types
If you remember, we saw some file types for sequence data in the theory class. These are FASTA
and GenBank
.
We are going to deal with these files today, and see how we can extract important information from these files.
Question 16
- Briefly summarize what is a FASTA and what is a GenBank file (one sentence each)
Question 17
- Can these formats include RNA, DNA and Aminoacid data? (Yes/No answer)
Now, use your knowledge of the command line to answer these questions:
Question 18
-
What format is the
sequence_1
file in? (FASTA or GenBank) Briefly explain your answer. -
What format is the
sequence_2
file in? (FASTA or GenBank) Briefly explain your answer.
If you answered correclty, you have seen that sequence_1
is in FASTA format, while sequence_2
is in GenBank format.
7.3.2 FASTA: Data handling
Question 19
-
How many lines does the
sequence_1
file has? -
How can you count the number of SEQUENCES that exist in the
sequence_1
file? -
What is the number of sequences in the
sequence_1
file? Is this result different than the number of lines in the file? Why?
Now, we know that the number of sequences (five) can be recognized by counting the number of >
symbols in the sequence, right?
This means that for this particular FASTA file, you can extract each sequence individually if you know the lines.
For example: - lines 1 and 2 have the first sequence - lines 2 and 3 the second
and so on.
That means we can extract each sequence. You can do this using the sed
command in the following way
sed -n 'staring_line,ending_line p' file_with_sequences
So, for example, to extract the first sequence in FASTA format:
sed -n '1,2 p' sequence_1
Now, because this is extracting in a fasta file, you will get the header and the sequence data. To only see the sequence data, use grep
to extract all lines that DON’T have a >
symbol. You can pipe the results of sed
into grep
!
sed -n '1,2 p' sequence_1 | grep -v ">"
in this case, we had to put the >
symbol between quotes because without them then the command line creates a new file.
Finally, we can even count the length of the sequence:
sed -n '1,2 p' sequence_1 | grep -v ">" | wc -c
This first sequence is 7478 nucleotides long! Ain’t this cool?
Question 21
- How can you save the new sequence in a new FASTA file?
- Extract all other sequences from the file and save them in 5 different FASTA files. Add the code you used below
STOP HERE!
We will learn about how to write a script to answer the take home HW. On the meantime, finish the Section 4 questions
Take Home HW
Can you create code that will do the following for each extracted FASTA sequence?
- Count the number of A,G,C and T?
- Count the length of each sequence?
- Count the number of open reading frames (ORF)?
Present the code and the results
7.4 Section 4: GenBank Files
GenBank files are more complex than FASTA files.
- These files are supremely complex to analyse only using the command line. However, we can take a look a this data and answer a simple question using the command line:
Question 21
-
How many lines does the
sequence_2
file has?
- Open the file using
less -NS
and answer the following questions:
Question 22
-
If you were looking for the information on the gene that codes for
lacZ
, what line would you use? - What species is this gene taken from?
- What genomic region does this represent?
- How many other genes can you find?
Question 23
Genic regions on GenBank files are represented by three different
types: mRNA
, gene
, CDS
.
-
What are the differences between these
mRNA
,gene
, andCDS
based in your GenBank and the NCBI help?
7.5 For loops and Solution to the HW
- Extracting all the sequences from
sequence_1
into different files (As asked in Question 21)
We have noticed that - Gene 1 is in lines one and two
sed -n '1,2 p' sequence_1 > gene_1.fasta
cat gene_1.fasta
## >J01636.1 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
## GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGGTGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTCCCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAGCTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCACCTCCAGTCTGGCCCTGCACGCGCCGTCGCAAATTGTCGCGGCGATTAAATCTCGCGCCGATCAACTGGGTGCCAGCGTGGTGGTGTCGATGGTAGAACGAAGCGGCGTCGAAGCCTGTAAAGCGGCGGTGCACAATCTTCTCGCGCAACGCGTCAGTGGGCTGATCATTAACTATCCGCTGGATGACCAGGATGCCATTGCTGTGGAAGCTGCCTGCACTAATGTTCCGGCGTTATTTCTTGATGTCTCTGACCAGACACCCATCAACAGTATTATTTTCTCCCATGAAGACGGTACGCGACTGGGCGTGGAGCATCTGGTCGCATTGGGTCACCAGCAAATCGCGCTGTTAGCGGGCCCATTAAGTTCTGTCTCGGCGCGTCTGCGTCTGGCTGGCTGGCATAAATATCTCACTCGCAATCAAATTCAGCCGATAGCGGAACGGGAAGGCGACTGGAGTGCCATGTCCGGTTTTCAACAAACCATGCAAATGCTGAATGAGGGCATCGTTCCCACTGCGATGCTGGTTGCCAACGATCAGATGGCGCTGGGCGCAATGCGCGCCATTACCGAGTCCGGGCTGCGCGTTGGTGCGGATATCTCGGTAGTGGGATACGACGATACCGAAGACAGCTCATGTTATATCCCGCCGTCAACCACCATCAAACAGGATTTTCGCCTGCTGGGGCAAACCAGCGTGGACCGCTTGCTGCAACTCTCTCAGGGCCAGGCGGTGAAGGGCAATCAGCTGTTGCCCGTCTCACTGGTGAAAAGAAAAACCACCCTGGCGCCCAATACGCAAACCGCCTCTCCCCGCGCGTTGGCCGATTCATTAATGCAGCTGGCACGACAGGTTTCCCGACTGGAAAGCGGGCAGTGAGCGCAACGCAATTAATGTGAGTTAGCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCATGATTACGGATTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCTTTGCCTGGTTTCCGGCACCAGAAGCGGTGCCGGAAAGCTGGCTGGAGTGCGATCTTCCTGAGGCCGATACTGTCGTCGTCCCCTCAAACTGGCAGATGCACGGTTACGATGCGCCCATCTACACCAACGTAACCTATCCCATTACGGTCAATCCGCCGTTTGTTCCCACGGAGAATCCGACGGGTTGTTACTCGCTCACATTTAATGTTGATGAAAGCTGGCTACAGGAAGGCCAGACGCGAATTATTTTTGATGGCGTTAACTCGGCGTTTCATCTGTGGTGCAACGGGCGCTGGGTCGGTTACGGCCAGGACAGTCGTTTGCCGTCTGAATTTGACCTGAGCGCATTTTTACGCGCCGGAGAAAACCGCCTCGCGGTGATGGTGCTGCGTTGGAGTGACGGCAGTTATCTGGAAGATCAGGATATGTGGCGGATGAGCGGCATTTTCCGTGACGTCTCGTTGCTGCATAAACCGACTACACAAATCAGCGATTTCCATGTTGCCACTCGCTTTAATGATGATTTCAGCCGCGCTGTACTGGAGGCTGAAGTTCAGATGTGCGGCGAGTTGCGTGACTACCTACGGGTAACAGTTTCTTTATGGCAGGGTGAAACGCAGGTCGCCAGCGGCACCGCGCCTTTCGGCGGTGAAATTATCGATGAGCGTGGTGGTTATGCCGATCGCGTCACACTACGTCTGAACGTCGAAAACCCGAAACTGTGGAGCGCCGAAATCCCGAATCTCTATCGTGCGGTGGTTGAACTGCACACCGCCGACGGCACGCTGATTGAAGCAGAAGCCTGCGATGTCGGTTTCCGCGAGGTGCGGATTGAAAATGGTCTGCTGCTGCTGAACGGCAAGCCGTTGCTGATTCGAGGCGTTAACCGTCACGAGCATCATCCTCTGCATGGTCAGGTCATGGATGAGCAGACGATGGTGCAGGATATCCTGCTGATGAAGCAGAACAACTTTAACGCCGTGCGCTGTTCGCATTATCCGAACCATCCGCTGTGGTACACGCTGTGCGACCGCTACGGCCTGTATGTGGTGGATGAAGCCAATATTGAAACCCACGGCATGGTGCCAATGAATCGTCTGACCGATGATCCGCGCTGGCTACCGGCGATGAGCGAACGCGTAACGCGAATGGTGCAGCGCGATCGTAATCACCCGAGTGTGATCATCTGGTCGCTGGGGAATGAATCAGGCCACGGCGCTAATCACGACGCGCTGTATCGCTGGATCAAATCTGTCGATCCTTCCCGCCCGGTGCAGTATGAAGGCGGCGGAGCCGACACCACGGCCACCGATATTATTTGCCCGATGTACGCGCGCGTGGATGAAGACCAGCCCTTCCCGGCTGTGCCGAAATGGTCCATCAAAAAATGGCTTTCGCTACCTGGAGAGACGCGCCCGCTGATCCTTTGCGAATACGCCCACGCGATGGGTAACAGTCTTGGCGGTTTCGCTAAATACTGGCAGGCGTTTCGTCAGTATCCCCGTTTACAGGGCGGCTTCGTCTGGGACTGGGTGGATCAGTCGCTGATTAAATATGATGAAAACGGCAACCCGTGGTCGGCTTACGGCGGTGATTTTGGCGATACGCCGAACGATCGCCAGTTCTGTATGAACGGTCTGGTCTTTGCCGACCGCACGCCGCATCCAGCGCTGACGGAAGCAAAACACCAGCAGCAGTTTTTCCAGTTCCGTTTATCCGGGCAAACCATCGAAGTGACCAGCGAATACCTGTTCCGTCATAGCGATAACGAGCTCCTGCACTGGATGGTGGCGCTGGATGGTAAGCCGCTGGCAAGCGGTGAAGTGCCTCTGGATGTCGCTCCACAAGGTAAACAGTTGATTGAACTGCCTGAACTACCGCAGCCGGAGAGCGCCGGGCAACTCTGGCTCACAGTACGCGTAGTGCAACCGAACGCGACCGCATGGTCAGAAGCCGGGCACATCAGCGCCTGGCAGCAGTGGCGTCTGGCGGAAAACCTCAGTGTGACGCTCCCCGCCGCGTCCCACGCCATCCCGCATCTGACCACCAGCGAAATGGATTTTTGCATCGAGCTGGGTAATAAGCGTTGGCAATTTAACCGCCAGTCAGGCTTTCTTTCACAGATGTGGATTGGCGATAAAAAACAACTGCTGACGCCGCTGCGCGATCAGTTCACCCGTGCACCGCTGGATAACGACATTGGCGTAAGTGAAGCGACCCGCATTGACCCTAACGCCTGGGTCGAACGCTGGAAGGCGGCGGGCCATTACCAGGCCGAAGCAGCGTTGTTGCAGTGCACGGCAGATACACTTGCTGATGCGGTGCTGATTACGACCGCTCACGCGTGGCAGCATCAGGGGAAAACCTTATTTATCAGCCGGAAAACCTACCGGATTGATGGTAGTGGTCAAATGGCGATTACCGTTGATGTTGAAGTGGCGAGCGATACACCGCATCCGGCGCGGATTGGCCTGAACTGCCAGCTGGCGCAGGTAGCAGAGCGGGTAAACTGGCTCGGATTAGGGCCGCAAGAAAACTATCCCGACCGCCTTACTGCCGCCTGTTTTGACCGCTGGGATCTGCCATTGTCAGACATGTATACCCCGTACGTCTTCCCGAGCGAAAACGGTCTGCGCTGCGGGACGCGCGAATTGAATTATGGCCCACACCAGTGGCGCGGCGACTTCCAGTTCAACATCAGCCGCTACAGTCAACAGCAACTGATGGAAACCAGCCATCGCCATCTGCTGCACGCGGAAGAAGGCACATGGCTGAATATCGACGGTTTCCATATGGGGATTGGTGGCGACGACTCCTGGAGCCCGTCAGTATCGGCGGAATTCCAGCTGAGCGCCGGTCGCTACCATTACCAGTTGGTCTGGTGTCAAAAATAATAATAACCGGGCAGGCCATGTCTGCCCGTATTTCGCGTAAGGAAATCCATTATGTACTATTTAAAAAACACAAACTTTTGGATGTTCGGTTTATTCTTTTTCTTTTACTTTTTTATCATGGGAGCCTACTTCCCGTTTTTCCCGATTTGGCTACATGACATCAACCATATCAGCAAAAGTGATACGGGTATTATTTTTGCCGCTATTTCTCTGTTCTCGCTATTATTCCAACCGCTGTTTGGTCTGCTTTCTGACAAACTCGGGCTGCGCAAATACCTGCTGTGGATTATTACCGGCATGTTAGTGATGTTTGCGCCGTTCTTTATTTTTATCTTCGGGCCACTGTTACAATACAACATTTTAGTAGGATCGATTGTTGGTGGTATTTATCTAGGCTTTTGTTTTAACGCCGGTGCGCCAGCAGTAGAGGCATTTATTGAGAAAGTCAGCCGTCGCAGTAATTTCGAATTTGGTCGCGCGCGGATGTTTGGCTGTGTTGGCTGGGCGCTGTGTGCCTCGATTGTCGGCATCATGTTCACCATCAATAATCAGTTTGTTTTCTGGCTGGGCTCTGGCTGTGCACTCATCCTCGCCGTTTTACTCTTTTTCGCCAAAACGGATGCGCCCTCTTCTGCCACGGTTGCCAATGCGGTAGGTGCCAACCATTCGGCATTTAGCCTTAAGCTGGCACTGGAACTGTTCAGACAGCCAAAACTGTGGTTTTTGTCACTGTATGTTATTGGCGTTTCCTGCACCTACGATGTTTTTGACCAACAGTTTGCTAATTTCTTTACTTCGTTCTTTGCTACCGGTGAACAGGGTACGCGGGTATTTGGCTACGTAACGACAATGGGCGAATTACTTAACGCCTCGATTATGTTCTTTGCGCCACTGATCATTAATCGCATCGGTGGGAAAAACGCCCTGCTGCTGGCTGGCACTATTATGTCTGTACGTATTATTGGCTCATCGTTCGCCACCTCAGCGCTGGAAGTGGTTATTCTGAAAACGCTGCATATGTTTGAAGTACCGTTCCTGCTGGTGGGCTGCTTTAAATATATTACCAGCCAGTTTGAAGTGCGTTTTTCAGCGACGATTTATCTGGTCTGTTTCTGCTTCTTTAAGCAACTGGCGATGATTTTTATGTCTGTACTGGCGGGCAATATGTATGAAAGCATCGGTTTCCAGGGCGCTTATCTGGTGCTGGGTCTGGTGGCGCTGGGCTTCACCTTAATTTCCGTGTTCACGCTTAGCGGCCCCGGCCCGCTTTCCCTGCTGCGTCGTCAGGTGAATGAAGTCGCTTAAGCAATCAATGTCGGATGCGGCGCGACGCTTATCCGACCAACATATCATAACGGAGTGATCGCATTGAACATGCCAATGACCGAAAGAATAAGAGCAGGCAAGCTATTTACCGATATGTGCGAAGGCTTACCGGAAAAAAGACTTCGTGGGAAAACGTTAATGTATGAGTTTAATCACTCGCATCCATCAGAAGTTGAAAAAAGAGAAAGCCTGATTAAAGAAATGTTTGCCACGGTAGGGGAAAACGCCTGGGTAGAACCGCCTGTCTATTTCTCTTACGGTTCCAACATCCATATAGGCCGCAATTTTTATGCAAATTTCAATTTAACCATTGTCGATGACTACACGGTAACAATCGGTGATAACGTACTGATTGCACCCAACGTTACTCTTTCCGTTACGGGACACCCTGTACACCATGAATTGAGAAAAAACGGCGAGATGTACTCTTTTCCGATAACGATTGGCAATAACGTCTGGATCGGAAGTCATGTGGTTATTAATCCAGGCGTCACCATCGGGGATAATTCTGTTATTGGCGCGGGTAGTATCGTCACAAAAGACATTCCACCAAACGTCGTGGCGGCTGGCGTTCCTTGTCGGGTTATTCGCGAAATAAACGACCGGGATAAGCACTATTATTTCAAAGATTATAAAGTTGAATCGTCAGTTTAAATTATAAAAATTGCCTGATACGCTGCGCTTATCAGGCCTACAAGTTCAGCGATCTACATTAGCCGCATCCGGCATGAACAAAGCGCAGGAACAAGCGTCGCATCATGCCTCTTTGACCCACAGCTGCGGAAAACGTACTGGTGCAAAACGCAGGGTTATGATCATCAGCCCAACGACGCACAGCGCATGAAATGCCCAGTCCATCAGGTAATTGCCGCTGATACTACGCAGCACGCCAGAAAACCACGGGGCAAGCCCGGCGATGATAAAACCGATTCCCTGCATAAACGCCACCAGCTTGCCAGCAATAGCCGGTTGCACAGAGTGATCGAGCGCCAGCAGCAAACAGAGCGGAAACGCGCCGCCCAGACCTAACCCACACACCATCGCCCACAATACCGGCAATTGCATCGGCAGCCAGATAAAGCCGCAGAACCCCACCAGTTGTAACACCAGCGCCAGCATTAACAGTTTGCGCCGATCCTGATGGCGAGCCATAGCAGGCATCAGCAAAGCTCCTGCGGCTTGCCCAAGCGTCATCAATGCCAGTAAGGAACCGCTGTACTGCGCGCTGGCACCAATCTCAATATAGAAAGCGGGTAACCAGGCAATCAGGCTGGCGTAACCGCCGTTAATCAGACCGAAGTAAACACCCAGCGTCCACGCGCGGGGAGTGAATACCACGCGAACCGGAGTGGTTGTTGTCTTGTGGGAAGAGGCGACCTCGCGGGCGCTTTGCCACCACCAGGCAAAGAGCGCAACAACGGCAGGCAGCGCCACCAGGCGAGTGTTTGATACCAGGTTTCGCTATGTTGAACTAACCAGGGCGTTATGGCGGCACCAAGCCCACCGCCGCCCATCAGAGCCGCGGACCACAGCCCCATCACCAGTGGCGTGCGCTGCTGAAACCGCCGTTTAATCACCGAAGCATCACCGCCTGAATGATGCCGATCCCCACCCCACCAAGCAGTGCGCTGCTAAGCAGCAGCGCACTTTGCGGGTAAAGCTCACGCATCAATGCACCGACGGCAATCAGCAACAGACTGATGGCGACACTGCGACGTTCGCTGACATGCTGATGAAGCCAGCTTCCGGCCAGCGCCAGCCCGCCCATGGTAACCACCGGCAGAGCGGTCGAC
- Gene 2 is in lines three and four
sed -n '3,4 p' sequence_1 > gene_2.fasta
cat gene_2.fasta
## >lcl|J01636.1_cds_AAA24052.1_1 [gene=lacI] [protein_id=AAA24052.1] [location=79..1161] [gbkey=CDS]
## GTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTCCCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAGCTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCACCTCCAGTCTGGCCCTGCACGCGCCGTCGCAAATTGTCGCGGCGATTAAATCTCGCGCCGATCAACTGGGTGCCAGCGTGGTGGTGTCGATGGTAGAACGAAGCGGCGTCGAAGCCTGTAAAGCGGCGGTGCACAATCTTCTCGCGCAACGCGTCAGTGGGCTGATCATTAACTATCCGCTGGATGACCAGGATGCCATTGCTGTGGAAGCTGCCTGCACTAATGTTCCGGCGTTATTTCTTGATGTCTCTGACCAGACACCCATCAACAGTATTATTTTCTCCCATGAAGACGGTACGCGACTGGGCGTGGAGCATCTGGTCGCATTGGGTCACCAGCAAATCGCGCTGTTAGCGGGCCCATTAAGTTCTGTCTCGGCGCGTCTGCGTCTGGCTGGCTGGCATAAATATCTCACTCGCAATCAAATTCAGCCGATAGCGGAACGGGAAGGCGACTGGAGTGCCATGTCCGGTTTTCAACAAACCATGCAAATGCTGAATGAGGGCATCGTTCCCACTGCGATGCTGGTTGCCAACGATCAGATGGCGCTGGGCGCAATGCGCGCCATTACCGAGTCCGGGCTGCGCGTTGGTGCGGATATCTCGGTAGTGGGATACGACGATACCGAAGACAGCTCATGTTATATCCCGCCGTCAACCACCATCAAACAGGATTTTCGCCTGCTGGGGCAAACCAGCGTGGACCGCTTGCTGCAACTCTCTCAGGGCCAGGCGGTGAAGGGCAATCAGCTGTTGCCCGTCTCACTGGTGAAAAGAAAAACCACCCTGGCGCCCAATACGCAAACCGCCTCTCCCCGCGCGTTGGCCGATTCATTAATGCAGCTGGCACGACAGGTTTCCCGACTGGAAAGCGGGCAGTGA
- Gene 3 is in lines five and six
sed -n '5,6 p' sequence_1 > gene_3.fasta
cat gene_3.fasta
## >lcl|J01636.1_cds_AAA24053.1_2 [gene=lacZ] [protein_id=AAA24053.1] [location=1284..4358] [gbkey=CDS]
## ATGACCATGATTACGGATTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCTTTGCCTGGTTTCCGGCACCAGAAGCGGTGCCGGAAAGCTGGCTGGAGTGCGATCTTCCTGAGGCCGATACTGTCGTCGTCCCCTCAAACTGGCAGATGCACGGTTACGATGCGCCCATCTACACCAACGTAACCTATCCCATTACGGTCAATCCGCCGTTTGTTCCCACGGAGAATCCGACGGGTTGTTACTCGCTCACATTTAATGTTGATGAAAGCTGGCTACAGGAAGGCCAGACGCGAATTATTTTTGATGGCGTTAACTCGGCGTTTCATCTGTGGTGCAACGGGCGCTGGGTCGGTTACGGCCAGGACAGTCGTTTGCCGTCTGAATTTGACCTGAGCGCATTTTTACGCGCCGGAGAAAACCGCCTCGCGGTGATGGTGCTGCGTTGGAGTGACGGCAGTTATCTGGAAGATCAGGATATGTGGCGGATGAGCGGCATTTTCCGTGACGTCTCGTTGCTGCATAAACCGACTACACAAATCAGCGATTTCCATGTTGCCACTCGCTTTAATGATGATTTCAGCCGCGCTGTACTGGAGGCTGAAGTTCAGATGTGCGGCGAGTTGCGTGACTACCTACGGGTAACAGTTTCTTTATGGCAGGGTGAAACGCAGGTCGCCAGCGGCACCGCGCCTTTCGGCGGTGAAATTATCGATGAGCGTGGTGGTTATGCCGATCGCGTCACACTACGTCTGAACGTCGAAAACCCGAAACTGTGGAGCGCCGAAATCCCGAATCTCTATCGTGCGGTGGTTGAACTGCACACCGCCGACGGCACGCTGATTGAAGCAGAAGCCTGCGATGTCGGTTTCCGCGAGGTGCGGATTGAAAATGGTCTGCTGCTGCTGAACGGCAAGCCGTTGCTGATTCGAGGCGTTAACCGTCACGAGCATCATCCTCTGCATGGTCAGGTCATGGATGAGCAGACGATGGTGCAGGATATCCTGCTGATGAAGCAGAACAACTTTAACGCCGTGCGCTGTTCGCATTATCCGAACCATCCGCTGTGGTACACGCTGTGCGACCGCTACGGCCTGTATGTGGTGGATGAAGCCAATATTGAAACCCACGGCATGGTGCCAATGAATCGTCTGACCGATGATCCGCGCTGGCTACCGGCGATGAGCGAACGCGTAACGCGAATGGTGCAGCGCGATCGTAATCACCCGAGTGTGATCATCTGGTCGCTGGGGAATGAATCAGGCCACGGCGCTAATCACGACGCGCTGTATCGCTGGATCAAATCTGTCGATCCTTCCCGCCCGGTGCAGTATGAAGGCGGCGGAGCCGACACCACGGCCACCGATATTATTTGCCCGATGTACGCGCGCGTGGATGAAGACCAGCCCTTCCCGGCTGTGCCGAAATGGTCCATCAAAAAATGGCTTTCGCTACCTGGAGAGACGCGCCCGCTGATCCTTTGCGAATACGCCCACGCGATGGGTAACAGTCTTGGCGGTTTCGCTAAATACTGGCAGGCGTTTCGTCAGTATCCCCGTTTACAGGGCGGCTTCGTCTGGGACTGGGTGGATCAGTCGCTGATTAAATATGATGAAAACGGCAACCCGTGGTCGGCTTACGGCGGTGATTTTGGCGATACGCCGAACGATCGCCAGTTCTGTATGAACGGTCTGGTCTTTGCCGACCGCACGCCGCATCCAGCGCTGACGGAAGCAAAACACCAGCAGCAGTTTTTCCAGTTCCGTTTATCCGGGCAAACCATCGAAGTGACCAGCGAATACCTGTTCCGTCATAGCGATAACGAGCTCCTGCACTGGATGGTGGCGCTGGATGGTAAGCCGCTGGCAAGCGGTGAAGTGCCTCTGGATGTCGCTCCACAAGGTAAACAGTTGATTGAACTGCCTGAACTACCGCAGCCGGAGAGCGCCGGGCAACTCTGGCTCACAGTACGCGTAGTGCAACCGAACGCGACCGCATGGTCAGAAGCCGGGCACATCAGCGCCTGGCAGCAGTGGCGTCTGGCGGAAAACCTCAGTGTGACGCTCCCCGCCGCGTCCCACGCCATCCCGCATCTGACCACCAGCGAAATGGATTTTTGCATCGAGCTGGGTAATAAGCGTTGGCAATTTAACCGCCAGTCAGGCTTTCTTTCACAGATGTGGATTGGCGATAAAAAACAACTGCTGACGCCGCTGCGCGATCAGTTCACCCGTGCACCGCTGGATAACGACATTGGCGTAAGTGAAGCGACCCGCATTGACCCTAACGCCTGGGTCGAACGCTGGAAGGCGGCGGGCCATTACCAGGCCGAAGCAGCGTTGTTGCAGTGCACGGCAGATACACTTGCTGATGCGGTGCTGATTACGACCGCTCACGCGTGGCAGCATCAGGGGAAAACCTTATTTATCAGCCGGAAAACCTACCGGATTGATGGTAGTGGTCAAATGGCGATTACCGTTGATGTTGAAGTGGCGAGCGATACACCGCATCCGGCGCGGATTGGCCTGAACTGCCAGCTGGCGCAGGTAGCAGAGCGGGTAAACTGGCTCGGATTAGGGCCGCAAGAAAACTATCCCGACCGCCTTACTGCCGCCTGTTTTGACCGCTGGGATCTGCCATTGTCAGACATGTATACCCCGTACGTCTTCCCGAGCGAAAACGGTCTGCGCTGCGGGACGCGCGAATTGAATTATGGCCCACACCAGTGGCGCGGCGACTTCCAGTTCAACATCAGCCGCTACAGTCAACAGCAACTGATGGAAACCAGCCATCGCCATCTGCTGCACGCGGAAGAAGGCACATGGCTGAATATCGACGGTTTCCATATGGGGATTGGTGGCGACGACTCCTGGAGCCCGTCAGTATCGGCGGAATTCCAGCTGAGCGCCGGTCGCTACCATTACCAGTTGGTCTGGTGTCAAAAATAA
- Gene 4 is in lines seven and eight
sed -n '7,8 p' sequence_1 > gene_4.fasta
cat gene_4.fasta
## >lcl|J01636.1_cds_AAA24054.1_3 [gene=lacY] [protein_id=AAA24054.1] [location=4410..5663] [gbkey=CDS]
## ATGTACTATTTAAAAAACACAAACTTTTGGATGTTCGGTTTATTCTTTTTCTTTTACTTTTTTATCATGGGAGCCTACTTCCCGTTTTTCCCGATTTGGCTACATGACATCAACCATATCAGCAAAAGTGATACGGGTATTATTTTTGCCGCTATTTCTCTGTTCTCGCTATTATTCCAACCGCTGTTTGGTCTGCTTTCTGACAAACTCGGGCTGCGCAAATACCTGCTGTGGATTATTACCGGCATGTTAGTGATGTTTGCGCCGTTCTTTATTTTTATCTTCGGGCCACTGTTACAATACAACATTTTAGTAGGATCGATTGTTGGTGGTATTTATCTAGGCTTTTGTTTTAACGCCGGTGCGCCAGCAGTAGAGGCATTTATTGAGAAAGTCAGCCGTCGCAGTAATTTCGAATTTGGTCGCGCGCGGATGTTTGGCTGTGTTGGCTGGGCGCTGTGTGCCTCGATTGTCGGCATCATGTTCACCATCAATAATCAGTTTGTTTTCTGGCTGGGCTCTGGCTGTGCACTCATCCTCGCCGTTTTACTCTTTTTCGCCAAAACGGATGCGCCCTCTTCTGCCACGGTTGCCAATGCGGTAGGTGCCAACCATTCGGCATTTAGCCTTAAGCTGGCACTGGAACTGTTCAGACAGCCAAAACTGTGGTTTTTGTCACTGTATGTTATTGGCGTTTCCTGCACCTACGATGTTTTTGACCAACAGTTTGCTAATTTCTTTACTTCGTTCTTTGCTACCGGTGAACAGGGTACGCGGGTATTTGGCTACGTAACGACAATGGGCGAATTACTTAACGCCTCGATTATGTTCTTTGCGCCACTGATCATTAATCGCATCGGTGGGAAAAACGCCCTGCTGCTGGCTGGCACTATTATGTCTGTACGTATTATTGGCTCATCGTTCGCCACCTCAGCGCTGGAAGTGGTTATTCTGAAAACGCTGCATATGTTTGAAGTACCGTTCCTGCTGGTGGGCTGCTTTAAATATATTACCAGCCAGTTTGAAGTGCGTTTTTCAGCGACGATTTATCTGGTCTGTTTCTGCTTCTTTAAGCAACTGGCGATGATTTTTATGTCTGTACTGGCGGGCAATATGTATGAAAGCATCGGTTTCCAGGGCGCTTATCTGGTGCTGGGTCTGGTGGCGCTGGGCTTCACCTTAATTTCCGTGTTCACGCTTAGCGGCCCCGGCCCGCTTTCCCTGCTGCGTCGTCAGGTGAATGAAGTCGCTTAA
- Gene 5 is in lines nine and ten
sed -n '9,10 p' sequence_1 > gene_5.fasta
cat gene_5.fasta
## >lcl|J01636.1_cds_AAA24055.1_4 [gene=lacA] [protein_id=AAA24055.1] [location=5727..6338] [gbkey=CDS]
## TTGAACATGCCAATGACCGAAAGAATAAGAGCAGGCAAGCTATTTACCGATATGTGCGAAGGCTTACCGGAAAAAAGACTTCGTGGGAAAACGTTAATGTATGAGTTTAATCACTCGCATCCATCAGAAGTTGAAAAAAGAGAAAGCCTGATTAAAGAAATGTTTGCCACGGTAGGGGAAAACGCCTGGGTAGAACCGCCTGTCTATTTCTCTTACGGTTCCAACATCCATATAGGCCGCAATTTTTATGCAAATTTCAATTTAACCATTGTCGATGACTACACGGTAACAATCGGTGATAACGTACTGATTGCACCCAACGTTACTCTTTCCGTTACGGGACACCCTGTACACCATGAATTGAGAAAAAACGGCGAGATGTACTCTTTTCCGATAACGATTGGCAATAACGTCTGGATCGGAAGTCATGTGGTTATTAATCCAGGCGTCACCATCGGGGATAATTCTGTTATTGGCGCGGGTAGTATCGTCACAAAAGACATTCCACCAAACGTCGTGGCGGCTGGCGTTCCTTGTCGGGTTATTCGCGAAATAAACGACCGGGATAAGCACTATTATTTCAAAGATTATAAAGTTGAATCGTCAGTTTAA
- The homework said: Can you create code that will do the following for each extracted FASTA sequence?
- Count the number of A,G,C and T?
# Number of A
grep -v ">" gene_1.fasta | grep -o "A" | wc -l
grep -v ">" gene_2.fasta | grep -o "A" | wc -l
grep -v ">" gene_3.fasta | grep -o "A" | wc -l
grep -v ">" gene_4.fasta | grep -o "A" | wc -l
grep -v ">" gene_5.fasta | grep -o "A" | wc -l
## 1739
## 240
## 679
## 240
## 189
In this case, because the command grep -v ">" VARIABLE | grep -o "A" | wc -l
is the same, the only thing that changes (THE VARIABLE) is the gene name
7.5.1 How do we fill up variables?
What is the variable? A list of the files of interest. What do all these files have in common?
- They start with the word
gene
- They end with the word
.fasta
So, what this means, is that we can create a list of these elements that have things in common!
ls
## 01-intro.Rmd
## 02-cross-refs.Rmd
## 03-parts.Rmd
## 04-citations.Rmd
## 05-blocks.Rmd
## 06-share.Rmd
## 07-references.Rmd
## 08-phylogenetics.Rmd
## 09-script.Rmd
## Lab0
## Lab1
## Lab2
## Lab4
## Lab_3
## Lab_5
## Lab_notebook.Rproj
## README.md
## Rplots.pdf
## _book
## _bookdown.yml
## _main.Rmd
## _main.pdf
## _main.tex
## _output.yml
## badada
## book.bib
## docs
## font-awesome.min.css
## gene_1.fasta
## gene_2.fasta
## gene_3.fasta
## gene_4.fasta
## gene_5.fasta
## index.Rmd
## preamble.tex
## sequence_1
## style.css
A wildcard (*
: means ANYTHING) allows you to select something that has a matching pattern with a variable:
- They start with the word
gene
:gene*
ls gene*
## gene_1.fasta
## gene_2.fasta
## gene_3.fasta
## gene_4.fasta
## gene_5.fasta
- They end with the word
.fasta
”*.fasta
ls *.fasta
## gene_1.fasta
## gene_2.fasta
## gene_3.fasta
## gene_4.fasta
## gene_5.fasta
You can put them on a list with any name. In this case, the variable name will be seqfile
:
seqfile=$(ls *.fasta)
echo $seqfile
## gene_1.fasta gene_2.fasta gene_3.fasta gene_4.fasta gene_5.fasta
Now, you want to do a simple for loop
:
FOR EACH ELEMENT IN VARIABLE_LIST;
DO COMMAND;
END
You command is to say “The gene file is: $VARIABLE
”. The $
means EACH element of the variable list called $seqfile
for seqfile in *.fasta;
do
echo "The gene file is: $seqfile"
done
## The gene file is: gene_1.fasta
## The gene file is: gene_2.fasta
## The gene file is: gene_3.fasta
## The gene file is: gene_4.fasta
## The gene file is: gene_5.fasta
Now that we know how a loop works, we can use real commands:
- Lets count the number of A’s
for seqfile in *.fasta;
do
seqname=$(grep ">" $seqfile)
echo "Sequence in progress: $seqname"
numA=$(grep -v ">" $seqfile | grep -o "A" | wc -l)
echo "The number of A's in: $numA"
numC=$(grep -v ">" $seqfile | grep -o "C" | wc -l)
echo "The number of C's is: $numC"
numT=$(grep -v ">" $seqfile | grep -o "T" | wc -l)
echo "The number of T's is: $numT"
numG=$(grep -v ">" $seqfile | grep -o "G" | wc -l)
echo "The number of G's is: $numG"
echo "End of sequence $seqname"
echo
done
## Sequence in progress: >J01636.1 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
## The number of A's in: 1739
## The number of C's is: 1991
## The number of T's is: 1743
## The number of G's is: 2004
## End of sequence >J01636.1 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
##
## Sequence in progress: >lcl|J01636.1_cds_AAA24052.1_1 [gene=lacI] [protein_id=AAA24052.1] [location=79..1161] [gbkey=CDS]
## The number of A's in: 240
## The number of C's is: 300
## The number of T's is: 232
## The number of G's is: 311
## End of sequence >lcl|J01636.1_cds_AAA24052.1_1 [gene=lacI] [protein_id=AAA24052.1] [location=79..1161] [gbkey=CDS]
##
## Sequence in progress: >lcl|J01636.1_cds_AAA24053.1_2 [gene=lacZ] [protein_id=AAA24053.1] [location=1284..4358] [gbkey=CDS]
## The number of A's in: 679
## The number of C's is: 841
## The number of T's is: 668
## The number of G's is: 887
## End of sequence >lcl|J01636.1_cds_AAA24053.1_2 [gene=lacZ] [protein_id=AAA24053.1] [location=1284..4358] [gbkey=CDS]
##
## Sequence in progress: >lcl|J01636.1_cds_AAA24054.1_3 [gene=lacY] [protein_id=AAA24054.1] [location=4410..5663] [gbkey=CDS]
## The number of A's in: 240
## The number of C's is: 284
## The number of T's is: 433
## The number of G's is: 297
## End of sequence >lcl|J01636.1_cds_AAA24054.1_3 [gene=lacY] [protein_id=AAA24054.1] [location=4410..5663] [gbkey=CDS]
##
## Sequence in progress: >lcl|J01636.1_cds_AAA24055.1_4 [gene=lacA] [protein_id=AAA24055.1] [location=5727..6338] [gbkey=CDS]
## The number of A's in: 189
## The number of C's is: 125
## The number of T's is: 161
## The number of G's is: 137
## End of sequence >lcl|J01636.1_cds_AAA24055.1_4 [gene=lacA] [protein_id=AAA24055.1] [location=5727..6338] [gbkey=CDS]