Chapter 7 Basic bioinformatics

The next chapters focus on the analysis of molecular sequence data in using command line. Nothing too fancy, just some basic commands to look at the data files and summarize information from the sequences.

7.1 Section 1: Basic commands for sequence data

7.1.1 Downloading the datasets

We are going to use two main datasets:

https://raw.githubusercontent.com/Tabima/MBB101/master/Lab_3/AAB21188.seq
https://raw.githubusercontent.com/Tabima/MBB101/master/Lab_3/S79522.seq
  1. Create a folder called Lab_3 in your home folder at the cluster
  2. Download the files. Answer the following question:

Question 1

  • What commands you used to download the files?
  1. Check the files. Answer the following questions:

Question 2

  • What kind of sequence (DNA/RNA/Protein) is the S79522.seq file? Add the command you used to answer the question.

Question 3

  • What kind of of sequence (DNA/RNA/Protein) is the AAB21188.seq file? Add the command you used to answer the question.

7.1.2 Counting basic patterns for DNA data

Ok, after checking the files we know now that the S79522.seq file contains a DNA sequence.

Lets do some command line exercises with it. Answer all of these questions:

Question 4

  • How many lines are found on this file?

Question 5

  • How many characters are found on this file? (Or, what is the length of this sequence?)

Question 6

  • How many A, C, G, and T can you find in this sequence?

Question 7

  • Convert this sequence to RNA and save it in a file called S79522_RNA.seq. Please add the command you used to create the file. (Don’t create a complementary strand, just convert this one from DNA to RNA)

Question 8

  • How can we test that this new file is an RNA sequence S79522_RNA.seq. Please add a command you’d use to check if this is DNA or RNA.

Question 9

  • Finally, using your RNA sequence, check how many start codons (AUG) or stop codons (UAA,UAG, or UGA) may exist in your sequence

7.1.3 Counting basic patterns for Aminoacid data

Ok, after checking the files we know now that the AAB21188.seq file contains a aminoacid sequence.

Question 10

  • How many characters are found on this file? (Or, what is the length of this sequence?)

Question 11

  • Select two aminoacids of interest and count them. Please add the command used to count them.

We have four candidate patterns for this gene:

  • VLKYYKVD (A common pattern found in the Ribosomal S27 subunit domains)
  • ITLEVE (A common pattern found in Ubiquitin domains)
  • KWCPRP (A common pattern found in In Between Ring fingers (IBR) domains)
  • LSVLSQKM (A common pattern found in DNA polymerase unit A domains)

Question 12

Which patterns do you find in the sequence? Fill in this table:

Sequence Ribosomal 27 Ubiquitin IBR DNA poly A
AAB21188 Yes/No Yes/No Yes/No Yes/No

7.2 Section 2: Creating pipelines

In this section we will learn something new and exciting:

We will make a simple pipeline to automate counting nucleotide sequences.

7.2.1 What is a pipeline?

Pipelines are sequential lines of code that can be used to use multiple commands at once.

For example, if you want to count all the U in a DNA sequence that you have transformed into DNA, you can actually do this in one simple line of code.

Think about this as a logical set of steps (or, a pseudocode):

  1. Read the DNA file
  2. Replace the T for U
  3. Extract the matches to U
  4. Count those matches

Using the command line the intructions would be like:

cat file.txt
sed 's/T/U/g' file.txt > file_rna.txt
grep -o U file_rna.txt > U_ocurrences.txt
wc -l U_ocurrences.txt

This code works, but its quite useless as it forces us to create two independent files: file_rna.txt and U_ocurrences.txt.

Imagine that the output of these files were large and you had no space left in your hard drive! You would have to run everything again from the beginning!

One of the options is to use pipelines, using the pipe (|) modifier. This modifier works as an an then.

So for example, a command that is like this: ls -lrt | grep seq | wc -l can be read as:

Make a list of all the files and then extract all the matches to seq and then count the number of occurences.

So, we can totally use this with out data.

7.2.1.1 Pipeline example using DNA data

Before we start, let me introduce you to two elements:

  1. Variables
  2. the echo command.
7.2.1.1.1 Variables

Variables can be defined as a place to temporarly store a piece of information.

For example, you wallet is a variable that is full of your money.

So, if you have $10 in your wallet, then your variable is:

wallet=10

So, that is how you define a variable in bash. To see what the value of that variable is, you use the echo command

7.2.1.1.2 The echo command

The echo command will print to your screen whatever you tell it to print.

For example, if you tell the command to echo 'hello', this will happen:

echo 'hello'
## hello

Meaning that you can tell it to say whatever you want to say. If you want to add spaces or other weird characters, make sure you surrond the statement with single quotes

echo 'This is the best course ever'
## This is the best course ever
7.2.1.1.3 Mixing variables and echo

So, we want to mix these two because then you can store a variable and print it in the screen, for example:

statement='This is the best course ever'
echo $statement
## This is the best course ever

Look at this! You can simplify the instructions and print variables!

Now, did you notice that the variable statement had a $ in front? That is how you tell bash that you are using a variable!

If you don’t add it this is what happens:

echo statement
## statement

So, you are just printing the word statement. Interesting ah?

7.2.1.2 Using a pipeline as an example of variables and pipes

Now, lets say we want to count all the nucelotides in our DNA file independently.

So, we want to extract each nucleotides, starting from the A, using grep -c A, and then counting lines with wc -l:

grep -o A Lab_3/S79522.seq | wc -l
## grep: Lab_3/S79522.seq: No such file or directory
##        0

Question 12

  • Did these results match what you did at the beginning of the lab? Expand briefly your answer.

Ok, now, to save this on a variable (you can call the variable anything as long as it doesn’t have any spaces or weird characters) we just put the entire pipeline inside a $(), like this:

nucA=$(grep -o A Lab_3/S79522.seq | wc -l)
echo $nucA
## grep: Lab_3/S79522.seq: No such file or directory
## 0

WOW! that was easy!

Now, if you want to add more variables you can do it like this:

nucA=$(grep -o A Lab_3/S79522.seq | wc -l)
nucC=$(grep -o C Lab_3/S79522.seq | wc -l)

echo 'Number of A:'$nucA
echo 'Number of C:'$nucC
## grep: Lab_3/S79522.seq: No such file or directory
## grep: Lab_3/S79522.seq: No such file or directory
## Number of A: 0
## Number of C: 0

Right here, stop and make the same sound a rooster makes. Then answer the following questions:

Question 13

  • How would you count the other nucleotides?. Add the code below:

Question 14

  • Write a set of instructions to count ten of the aminoacids in the AAB21188.seq file.

7.3 Section 3: Processing local files into the cluster

So, we know how to download files from different sources into the cluster. Today, we are going to learn how to move files from and to the cluster.

  1. Download the sequence_1 and sequence_2 files from the moodle.

  2. Create a new folder called mystery_genes in your Lab_3 folder

  3. Upload the files into the folder

  4. Check if the files are there.

Question 15

  • How did you check if the files were correctly uploaded?

7.3.1 Checking file types

If you remember, we saw some file types for sequence data in the theory class. These are FASTA and GenBank.

We are going to deal with these files today, and see how we can extract important information from these files.

Question 16

  • Briefly summarize what is a FASTA and what is a GenBank file (one sentence each)

Question 17

  • Can these formats include RNA, DNA and Aminoacid data? (Yes/No answer)

Now, use your knowledge of the command line to answer these questions:

Question 18

  • What format is the sequence_1 file in? (FASTA or GenBank) Briefly explain your answer.
  • What format is the sequence_2 file in? (FASTA or GenBank) Briefly explain your answer.

If you answered correclty, you have seen that sequence_1 is in FASTA format, while sequence_2 is in GenBank format.

7.3.2 FASTA: Data handling

Question 19

  • How many lines does the sequence_1 file has?
  • How can you count the number of SEQUENCES that exist in the sequence_1 file?
  • What is the number of sequences in the sequence_1 file? Is this result different than the number of lines in the file? Why?

Now, we know that the number of sequences (five) can be recognized by counting the number of > symbols in the sequence, right?

This means that for this particular FASTA file, you can extract each sequence individually if you know the lines.

For example: - lines 1 and 2 have the first sequence - lines 2 and 3 the second

and so on.

That means we can extract each sequence. You can do this using the sed command in the following way

sed -n 'staring_line,ending_line p' file_with_sequences

So, for example, to extract the first sequence in FASTA format:

sed -n '1,2 p' sequence_1

Now, because this is extracting in a fasta file, you will get the header and the sequence data. To only see the sequence data, use grep to extract all lines that DON’T have a > symbol. You can pipe the results of sed into grep!

sed -n '1,2 p' sequence_1 | grep -v ">" 

in this case, we had to put the > symbol between quotes because without them then the command line creates a new file.

Finally, we can even count the length of the sequence:

sed -n '1,2 p' sequence_1 | grep -v ">" | wc -c 

This first sequence is 7478 nucleotides long! Ain’t this cool?

Question 21

  • How can you save the new sequence in a new FASTA file?
  • Extract all other sequences from the file and save them in 5 different FASTA files. Add the code you used below

STOP HERE!

We will learn about how to write a script to answer the take home HW. On the meantime, finish the Section 4 questions

Take Home HW

Can you create code that will do the following for each extracted FASTA sequence?

  1. Count the number of A,G,C and T?
  2. Count the length of each sequence?
  3. Count the number of open reading frames (ORF)?

Present the code and the results


7.4 Section 4: GenBank Files

GenBank files are more complex than FASTA files.

  1. These files are supremely complex to analyse only using the command line. However, we can take a look a this data and answer a simple question using the command line:

Question 21

  • How many lines does the sequence_2 file has?
  1. Open the file using less -NS and answer the following questions:

Question 22

  • If you were looking for the information on the gene that codes for lacZ, what line would you use?
  • What species is this gene taken from?
  • What genomic region does this represent?
  • How many other genes can you find?

Question 23

Genic regions on GenBank files are represented by three different types: mRNA, gene, CDS.

  • What are the differences between these mRNA, gene, and CDS based in your GenBank and the NCBI help?

7.5 For loops and Solution to the HW

  1. Extracting all the sequences from sequence_1 into different files (As asked in Question 21)

We have noticed that - Gene 1 is in lines one and two

sed -n '1,2 p' sequence_1 > gene_1.fasta
cat gene_1.fasta
## >J01636.1 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
## GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGGTGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTCCCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAGCTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCACCTCCAGTCTGGCCCTGCACGCGCCGTCGCAAATTGTCGCGGCGATTAAATCTCGCGCCGATCAACTGGGTGCCAGCGTGGTGGTGTCGATGGTAGAACGAAGCGGCGTCGAAGCCTGTAAAGCGGCGGTGCACAATCTTCTCGCGCAACGCGTCAGTGGGCTGATCATTAACTATCCGCTGGATGACCAGGATGCCATTGCTGTGGAAGCTGCCTGCACTAATGTTCCGGCGTTATTTCTTGATGTCTCTGACCAGACACCCATCAACAGTATTATTTTCTCCCATGAAGACGGTACGCGACTGGGCGTGGAGCATCTGGTCGCATTGGGTCACCAGCAAATCGCGCTGTTAGCGGGCCCATTAAGTTCTGTCTCGGCGCGTCTGCGTCTGGCTGGCTGGCATAAATATCTCACTCGCAATCAAATTCAGCCGATAGCGGAACGGGAAGGCGACTGGAGTGCCATGTCCGGTTTTCAACAAACCATGCAAATGCTGAATGAGGGCATCGTTCCCACTGCGATGCTGGTTGCCAACGATCAGATGGCGCTGGGCGCAATGCGCGCCATTACCGAGTCCGGGCTGCGCGTTGGTGCGGATATCTCGGTAGTGGGATACGACGATACCGAAGACAGCTCATGTTATATCCCGCCGTCAACCACCATCAAACAGGATTTTCGCCTGCTGGGGCAAACCAGCGTGGACCGCTTGCTGCAACTCTCTCAGGGCCAGGCGGTGAAGGGCAATCAGCTGTTGCCCGTCTCACTGGTGAAAAGAAAAACCACCCTGGCGCCCAATACGCAAACCGCCTCTCCCCGCGCGTTGGCCGATTCATTAATGCAGCTGGCACGACAGGTTTCCCGACTGGAAAGCGGGCAGTGAGCGCAACGCAATTAATGTGAGTTAGCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCATGATTACGGATTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCTTTGCCTGGTTTCCGGCACCAGAAGCGGTGCCGGAAAGCTGGCTGGAGTGCGATCTTCCTGAGGCCGATACTGTCGTCGTCCCCTCAAACTGGCAGATGCACGGTTACGATGCGCCCATCTACACCAACGTAACCTATCCCATTACGGTCAATCCGCCGTTTGTTCCCACGGAGAATCCGACGGGTTGTTACTCGCTCACATTTAATGTTGATGAAAGCTGGCTACAGGAAGGCCAGACGCGAATTATTTTTGATGGCGTTAACTCGGCGTTTCATCTGTGGTGCAACGGGCGCTGGGTCGGTTACGGCCAGGACAGTCGTTTGCCGTCTGAATTTGACCTGAGCGCATTTTTACGCGCCGGAGAAAACCGCCTCGCGGTGATGGTGCTGCGTTGGAGTGACGGCAGTTATCTGGAAGATCAGGATATGTGGCGGATGAGCGGCATTTTCCGTGACGTCTCGTTGCTGCATAAACCGACTACACAAATCAGCGATTTCCATGTTGCCACTCGCTTTAATGATGATTTCAGCCGCGCTGTACTGGAGGCTGAAGTTCAGATGTGCGGCGAGTTGCGTGACTACCTACGGGTAACAGTTTCTTTATGGCAGGGTGAAACGCAGGTCGCCAGCGGCACCGCGCCTTTCGGCGGTGAAATTATCGATGAGCGTGGTGGTTATGCCGATCGCGTCACACTACGTCTGAACGTCGAAAACCCGAAACTGTGGAGCGCCGAAATCCCGAATCTCTATCGTGCGGTGGTTGAACTGCACACCGCCGACGGCACGCTGATTGAAGCAGAAGCCTGCGATGTCGGTTTCCGCGAGGTGCGGATTGAAAATGGTCTGCTGCTGCTGAACGGCAAGCCGTTGCTGATTCGAGGCGTTAACCGTCACGAGCATCATCCTCTGCATGGTCAGGTCATGGATGAGCAGACGATGGTGCAGGATATCCTGCTGATGAAGCAGAACAACTTTAACGCCGTGCGCTGTTCGCATTATCCGAACCATCCGCTGTGGTACACGCTGTGCGACCGCTACGGCCTGTATGTGGTGGATGAAGCCAATATTGAAACCCACGGCATGGTGCCAATGAATCGTCTGACCGATGATCCGCGCTGGCTACCGGCGATGAGCGAACGCGTAACGCGAATGGTGCAGCGCGATCGTAATCACCCGAGTGTGATCATCTGGTCGCTGGGGAATGAATCAGGCCACGGCGCTAATCACGACGCGCTGTATCGCTGGATCAAATCTGTCGATCCTTCCCGCCCGGTGCAGTATGAAGGCGGCGGAGCCGACACCACGGCCACCGATATTATTTGCCCGATGTACGCGCGCGTGGATGAAGACCAGCCCTTCCCGGCTGTGCCGAAATGGTCCATCAAAAAATGGCTTTCGCTACCTGGAGAGACGCGCCCGCTGATCCTTTGCGAATACGCCCACGCGATGGGTAACAGTCTTGGCGGTTTCGCTAAATACTGGCAGGCGTTTCGTCAGTATCCCCGTTTACAGGGCGGCTTCGTCTGGGACTGGGTGGATCAGTCGCTGATTAAATATGATGAAAACGGCAACCCGTGGTCGGCTTACGGCGGTGATTTTGGCGATACGCCGAACGATCGCCAGTTCTGTATGAACGGTCTGGTCTTTGCCGACCGCACGCCGCATCCAGCGCTGACGGAAGCAAAACACCAGCAGCAGTTTTTCCAGTTCCGTTTATCCGGGCAAACCATCGAAGTGACCAGCGAATACCTGTTCCGTCATAGCGATAACGAGCTCCTGCACTGGATGGTGGCGCTGGATGGTAAGCCGCTGGCAAGCGGTGAAGTGCCTCTGGATGTCGCTCCACAAGGTAAACAGTTGATTGAACTGCCTGAACTACCGCAGCCGGAGAGCGCCGGGCAACTCTGGCTCACAGTACGCGTAGTGCAACCGAACGCGACCGCATGGTCAGAAGCCGGGCACATCAGCGCCTGGCAGCAGTGGCGTCTGGCGGAAAACCTCAGTGTGACGCTCCCCGCCGCGTCCCACGCCATCCCGCATCTGACCACCAGCGAAATGGATTTTTGCATCGAGCTGGGTAATAAGCGTTGGCAATTTAACCGCCAGTCAGGCTTTCTTTCACAGATGTGGATTGGCGATAAAAAACAACTGCTGACGCCGCTGCGCGATCAGTTCACCCGTGCACCGCTGGATAACGACATTGGCGTAAGTGAAGCGACCCGCATTGACCCTAACGCCTGGGTCGAACGCTGGAAGGCGGCGGGCCATTACCAGGCCGAAGCAGCGTTGTTGCAGTGCACGGCAGATACACTTGCTGATGCGGTGCTGATTACGACCGCTCACGCGTGGCAGCATCAGGGGAAAACCTTATTTATCAGCCGGAAAACCTACCGGATTGATGGTAGTGGTCAAATGGCGATTACCGTTGATGTTGAAGTGGCGAGCGATACACCGCATCCGGCGCGGATTGGCCTGAACTGCCAGCTGGCGCAGGTAGCAGAGCGGGTAAACTGGCTCGGATTAGGGCCGCAAGAAAACTATCCCGACCGCCTTACTGCCGCCTGTTTTGACCGCTGGGATCTGCCATTGTCAGACATGTATACCCCGTACGTCTTCCCGAGCGAAAACGGTCTGCGCTGCGGGACGCGCGAATTGAATTATGGCCCACACCAGTGGCGCGGCGACTTCCAGTTCAACATCAGCCGCTACAGTCAACAGCAACTGATGGAAACCAGCCATCGCCATCTGCTGCACGCGGAAGAAGGCACATGGCTGAATATCGACGGTTTCCATATGGGGATTGGTGGCGACGACTCCTGGAGCCCGTCAGTATCGGCGGAATTCCAGCTGAGCGCCGGTCGCTACCATTACCAGTTGGTCTGGTGTCAAAAATAATAATAACCGGGCAGGCCATGTCTGCCCGTATTTCGCGTAAGGAAATCCATTATGTACTATTTAAAAAACACAAACTTTTGGATGTTCGGTTTATTCTTTTTCTTTTACTTTTTTATCATGGGAGCCTACTTCCCGTTTTTCCCGATTTGGCTACATGACATCAACCATATCAGCAAAAGTGATACGGGTATTATTTTTGCCGCTATTTCTCTGTTCTCGCTATTATTCCAACCGCTGTTTGGTCTGCTTTCTGACAAACTCGGGCTGCGCAAATACCTGCTGTGGATTATTACCGGCATGTTAGTGATGTTTGCGCCGTTCTTTATTTTTATCTTCGGGCCACTGTTACAATACAACATTTTAGTAGGATCGATTGTTGGTGGTATTTATCTAGGCTTTTGTTTTAACGCCGGTGCGCCAGCAGTAGAGGCATTTATTGAGAAAGTCAGCCGTCGCAGTAATTTCGAATTTGGTCGCGCGCGGATGTTTGGCTGTGTTGGCTGGGCGCTGTGTGCCTCGATTGTCGGCATCATGTTCACCATCAATAATCAGTTTGTTTTCTGGCTGGGCTCTGGCTGTGCACTCATCCTCGCCGTTTTACTCTTTTTCGCCAAAACGGATGCGCCCTCTTCTGCCACGGTTGCCAATGCGGTAGGTGCCAACCATTCGGCATTTAGCCTTAAGCTGGCACTGGAACTGTTCAGACAGCCAAAACTGTGGTTTTTGTCACTGTATGTTATTGGCGTTTCCTGCACCTACGATGTTTTTGACCAACAGTTTGCTAATTTCTTTACTTCGTTCTTTGCTACCGGTGAACAGGGTACGCGGGTATTTGGCTACGTAACGACAATGGGCGAATTACTTAACGCCTCGATTATGTTCTTTGCGCCACTGATCATTAATCGCATCGGTGGGAAAAACGCCCTGCTGCTGGCTGGCACTATTATGTCTGTACGTATTATTGGCTCATCGTTCGCCACCTCAGCGCTGGAAGTGGTTATTCTGAAAACGCTGCATATGTTTGAAGTACCGTTCCTGCTGGTGGGCTGCTTTAAATATATTACCAGCCAGTTTGAAGTGCGTTTTTCAGCGACGATTTATCTGGTCTGTTTCTGCTTCTTTAAGCAACTGGCGATGATTTTTATGTCTGTACTGGCGGGCAATATGTATGAAAGCATCGGTTTCCAGGGCGCTTATCTGGTGCTGGGTCTGGTGGCGCTGGGCTTCACCTTAATTTCCGTGTTCACGCTTAGCGGCCCCGGCCCGCTTTCCCTGCTGCGTCGTCAGGTGAATGAAGTCGCTTAAGCAATCAATGTCGGATGCGGCGCGACGCTTATCCGACCAACATATCATAACGGAGTGATCGCATTGAACATGCCAATGACCGAAAGAATAAGAGCAGGCAAGCTATTTACCGATATGTGCGAAGGCTTACCGGAAAAAAGACTTCGTGGGAAAACGTTAATGTATGAGTTTAATCACTCGCATCCATCAGAAGTTGAAAAAAGAGAAAGCCTGATTAAAGAAATGTTTGCCACGGTAGGGGAAAACGCCTGGGTAGAACCGCCTGTCTATTTCTCTTACGGTTCCAACATCCATATAGGCCGCAATTTTTATGCAAATTTCAATTTAACCATTGTCGATGACTACACGGTAACAATCGGTGATAACGTACTGATTGCACCCAACGTTACTCTTTCCGTTACGGGACACCCTGTACACCATGAATTGAGAAAAAACGGCGAGATGTACTCTTTTCCGATAACGATTGGCAATAACGTCTGGATCGGAAGTCATGTGGTTATTAATCCAGGCGTCACCATCGGGGATAATTCTGTTATTGGCGCGGGTAGTATCGTCACAAAAGACATTCCACCAAACGTCGTGGCGGCTGGCGTTCCTTGTCGGGTTATTCGCGAAATAAACGACCGGGATAAGCACTATTATTTCAAAGATTATAAAGTTGAATCGTCAGTTTAAATTATAAAAATTGCCTGATACGCTGCGCTTATCAGGCCTACAAGTTCAGCGATCTACATTAGCCGCATCCGGCATGAACAAAGCGCAGGAACAAGCGTCGCATCATGCCTCTTTGACCCACAGCTGCGGAAAACGTACTGGTGCAAAACGCAGGGTTATGATCATCAGCCCAACGACGCACAGCGCATGAAATGCCCAGTCCATCAGGTAATTGCCGCTGATACTACGCAGCACGCCAGAAAACCACGGGGCAAGCCCGGCGATGATAAAACCGATTCCCTGCATAAACGCCACCAGCTTGCCAGCAATAGCCGGTTGCACAGAGTGATCGAGCGCCAGCAGCAAACAGAGCGGAAACGCGCCGCCCAGACCTAACCCACACACCATCGCCCACAATACCGGCAATTGCATCGGCAGCCAGATAAAGCCGCAGAACCCCACCAGTTGTAACACCAGCGCCAGCATTAACAGTTTGCGCCGATCCTGATGGCGAGCCATAGCAGGCATCAGCAAAGCTCCTGCGGCTTGCCCAAGCGTCATCAATGCCAGTAAGGAACCGCTGTACTGCGCGCTGGCACCAATCTCAATATAGAAAGCGGGTAACCAGGCAATCAGGCTGGCGTAACCGCCGTTAATCAGACCGAAGTAAACACCCAGCGTCCACGCGCGGGGAGTGAATACCACGCGAACCGGAGTGGTTGTTGTCTTGTGGGAAGAGGCGACCTCGCGGGCGCTTTGCCACCACCAGGCAAAGAGCGCAACAACGGCAGGCAGCGCCACCAGGCGAGTGTTTGATACCAGGTTTCGCTATGTTGAACTAACCAGGGCGTTATGGCGGCACCAAGCCCACCGCCGCCCATCAGAGCCGCGGACCACAGCCCCATCACCAGTGGCGTGCGCTGCTGAAACCGCCGTTTAATCACCGAAGCATCACCGCCTGAATGATGCCGATCCCCACCCCACCAAGCAGTGCGCTGCTAAGCAGCAGCGCACTTTGCGGGTAAAGCTCACGCATCAATGCACCGACGGCAATCAGCAACAGACTGATGGCGACACTGCGACGTTCGCTGACATGCTGATGAAGCCAGCTTCCGGCCAGCGCCAGCCCGCCCATGGTAACCACCGGCAGAGCGGTCGAC
  • Gene 2 is in lines three and four
sed -n '3,4 p' sequence_1 > gene_2.fasta
cat gene_2.fasta
## >lcl|J01636.1_cds_AAA24052.1_1 [gene=lacI] [protein_id=AAA24052.1] [location=79..1161] [gbkey=CDS]
## GTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTCCCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAGCTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCACCTCCAGTCTGGCCCTGCACGCGCCGTCGCAAATTGTCGCGGCGATTAAATCTCGCGCCGATCAACTGGGTGCCAGCGTGGTGGTGTCGATGGTAGAACGAAGCGGCGTCGAAGCCTGTAAAGCGGCGGTGCACAATCTTCTCGCGCAACGCGTCAGTGGGCTGATCATTAACTATCCGCTGGATGACCAGGATGCCATTGCTGTGGAAGCTGCCTGCACTAATGTTCCGGCGTTATTTCTTGATGTCTCTGACCAGACACCCATCAACAGTATTATTTTCTCCCATGAAGACGGTACGCGACTGGGCGTGGAGCATCTGGTCGCATTGGGTCACCAGCAAATCGCGCTGTTAGCGGGCCCATTAAGTTCTGTCTCGGCGCGTCTGCGTCTGGCTGGCTGGCATAAATATCTCACTCGCAATCAAATTCAGCCGATAGCGGAACGGGAAGGCGACTGGAGTGCCATGTCCGGTTTTCAACAAACCATGCAAATGCTGAATGAGGGCATCGTTCCCACTGCGATGCTGGTTGCCAACGATCAGATGGCGCTGGGCGCAATGCGCGCCATTACCGAGTCCGGGCTGCGCGTTGGTGCGGATATCTCGGTAGTGGGATACGACGATACCGAAGACAGCTCATGTTATATCCCGCCGTCAACCACCATCAAACAGGATTTTCGCCTGCTGGGGCAAACCAGCGTGGACCGCTTGCTGCAACTCTCTCAGGGCCAGGCGGTGAAGGGCAATCAGCTGTTGCCCGTCTCACTGGTGAAAAGAAAAACCACCCTGGCGCCCAATACGCAAACCGCCTCTCCCCGCGCGTTGGCCGATTCATTAATGCAGCTGGCACGACAGGTTTCCCGACTGGAAAGCGGGCAGTGA
  • Gene 3 is in lines five and six
sed -n '5,6 p' sequence_1 > gene_3.fasta
cat gene_3.fasta
## >lcl|J01636.1_cds_AAA24053.1_2 [gene=lacZ] [protein_id=AAA24053.1] [location=1284..4358] [gbkey=CDS]
## ATGACCATGATTACGGATTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCTTTGCCTGGTTTCCGGCACCAGAAGCGGTGCCGGAAAGCTGGCTGGAGTGCGATCTTCCTGAGGCCGATACTGTCGTCGTCCCCTCAAACTGGCAGATGCACGGTTACGATGCGCCCATCTACACCAACGTAACCTATCCCATTACGGTCAATCCGCCGTTTGTTCCCACGGAGAATCCGACGGGTTGTTACTCGCTCACATTTAATGTTGATGAAAGCTGGCTACAGGAAGGCCAGACGCGAATTATTTTTGATGGCGTTAACTCGGCGTTTCATCTGTGGTGCAACGGGCGCTGGGTCGGTTACGGCCAGGACAGTCGTTTGCCGTCTGAATTTGACCTGAGCGCATTTTTACGCGCCGGAGAAAACCGCCTCGCGGTGATGGTGCTGCGTTGGAGTGACGGCAGTTATCTGGAAGATCAGGATATGTGGCGGATGAGCGGCATTTTCCGTGACGTCTCGTTGCTGCATAAACCGACTACACAAATCAGCGATTTCCATGTTGCCACTCGCTTTAATGATGATTTCAGCCGCGCTGTACTGGAGGCTGAAGTTCAGATGTGCGGCGAGTTGCGTGACTACCTACGGGTAACAGTTTCTTTATGGCAGGGTGAAACGCAGGTCGCCAGCGGCACCGCGCCTTTCGGCGGTGAAATTATCGATGAGCGTGGTGGTTATGCCGATCGCGTCACACTACGTCTGAACGTCGAAAACCCGAAACTGTGGAGCGCCGAAATCCCGAATCTCTATCGTGCGGTGGTTGAACTGCACACCGCCGACGGCACGCTGATTGAAGCAGAAGCCTGCGATGTCGGTTTCCGCGAGGTGCGGATTGAAAATGGTCTGCTGCTGCTGAACGGCAAGCCGTTGCTGATTCGAGGCGTTAACCGTCACGAGCATCATCCTCTGCATGGTCAGGTCATGGATGAGCAGACGATGGTGCAGGATATCCTGCTGATGAAGCAGAACAACTTTAACGCCGTGCGCTGTTCGCATTATCCGAACCATCCGCTGTGGTACACGCTGTGCGACCGCTACGGCCTGTATGTGGTGGATGAAGCCAATATTGAAACCCACGGCATGGTGCCAATGAATCGTCTGACCGATGATCCGCGCTGGCTACCGGCGATGAGCGAACGCGTAACGCGAATGGTGCAGCGCGATCGTAATCACCCGAGTGTGATCATCTGGTCGCTGGGGAATGAATCAGGCCACGGCGCTAATCACGACGCGCTGTATCGCTGGATCAAATCTGTCGATCCTTCCCGCCCGGTGCAGTATGAAGGCGGCGGAGCCGACACCACGGCCACCGATATTATTTGCCCGATGTACGCGCGCGTGGATGAAGACCAGCCCTTCCCGGCTGTGCCGAAATGGTCCATCAAAAAATGGCTTTCGCTACCTGGAGAGACGCGCCCGCTGATCCTTTGCGAATACGCCCACGCGATGGGTAACAGTCTTGGCGGTTTCGCTAAATACTGGCAGGCGTTTCGTCAGTATCCCCGTTTACAGGGCGGCTTCGTCTGGGACTGGGTGGATCAGTCGCTGATTAAATATGATGAAAACGGCAACCCGTGGTCGGCTTACGGCGGTGATTTTGGCGATACGCCGAACGATCGCCAGTTCTGTATGAACGGTCTGGTCTTTGCCGACCGCACGCCGCATCCAGCGCTGACGGAAGCAAAACACCAGCAGCAGTTTTTCCAGTTCCGTTTATCCGGGCAAACCATCGAAGTGACCAGCGAATACCTGTTCCGTCATAGCGATAACGAGCTCCTGCACTGGATGGTGGCGCTGGATGGTAAGCCGCTGGCAAGCGGTGAAGTGCCTCTGGATGTCGCTCCACAAGGTAAACAGTTGATTGAACTGCCTGAACTACCGCAGCCGGAGAGCGCCGGGCAACTCTGGCTCACAGTACGCGTAGTGCAACCGAACGCGACCGCATGGTCAGAAGCCGGGCACATCAGCGCCTGGCAGCAGTGGCGTCTGGCGGAAAACCTCAGTGTGACGCTCCCCGCCGCGTCCCACGCCATCCCGCATCTGACCACCAGCGAAATGGATTTTTGCATCGAGCTGGGTAATAAGCGTTGGCAATTTAACCGCCAGTCAGGCTTTCTTTCACAGATGTGGATTGGCGATAAAAAACAACTGCTGACGCCGCTGCGCGATCAGTTCACCCGTGCACCGCTGGATAACGACATTGGCGTAAGTGAAGCGACCCGCATTGACCCTAACGCCTGGGTCGAACGCTGGAAGGCGGCGGGCCATTACCAGGCCGAAGCAGCGTTGTTGCAGTGCACGGCAGATACACTTGCTGATGCGGTGCTGATTACGACCGCTCACGCGTGGCAGCATCAGGGGAAAACCTTATTTATCAGCCGGAAAACCTACCGGATTGATGGTAGTGGTCAAATGGCGATTACCGTTGATGTTGAAGTGGCGAGCGATACACCGCATCCGGCGCGGATTGGCCTGAACTGCCAGCTGGCGCAGGTAGCAGAGCGGGTAAACTGGCTCGGATTAGGGCCGCAAGAAAACTATCCCGACCGCCTTACTGCCGCCTGTTTTGACCGCTGGGATCTGCCATTGTCAGACATGTATACCCCGTACGTCTTCCCGAGCGAAAACGGTCTGCGCTGCGGGACGCGCGAATTGAATTATGGCCCACACCAGTGGCGCGGCGACTTCCAGTTCAACATCAGCCGCTACAGTCAACAGCAACTGATGGAAACCAGCCATCGCCATCTGCTGCACGCGGAAGAAGGCACATGGCTGAATATCGACGGTTTCCATATGGGGATTGGTGGCGACGACTCCTGGAGCCCGTCAGTATCGGCGGAATTCCAGCTGAGCGCCGGTCGCTACCATTACCAGTTGGTCTGGTGTCAAAAATAA
  • Gene 4 is in lines seven and eight
sed -n '7,8 p' sequence_1 > gene_4.fasta
cat gene_4.fasta
## >lcl|J01636.1_cds_AAA24054.1_3 [gene=lacY] [protein_id=AAA24054.1] [location=4410..5663] [gbkey=CDS]
## ATGTACTATTTAAAAAACACAAACTTTTGGATGTTCGGTTTATTCTTTTTCTTTTACTTTTTTATCATGGGAGCCTACTTCCCGTTTTTCCCGATTTGGCTACATGACATCAACCATATCAGCAAAAGTGATACGGGTATTATTTTTGCCGCTATTTCTCTGTTCTCGCTATTATTCCAACCGCTGTTTGGTCTGCTTTCTGACAAACTCGGGCTGCGCAAATACCTGCTGTGGATTATTACCGGCATGTTAGTGATGTTTGCGCCGTTCTTTATTTTTATCTTCGGGCCACTGTTACAATACAACATTTTAGTAGGATCGATTGTTGGTGGTATTTATCTAGGCTTTTGTTTTAACGCCGGTGCGCCAGCAGTAGAGGCATTTATTGAGAAAGTCAGCCGTCGCAGTAATTTCGAATTTGGTCGCGCGCGGATGTTTGGCTGTGTTGGCTGGGCGCTGTGTGCCTCGATTGTCGGCATCATGTTCACCATCAATAATCAGTTTGTTTTCTGGCTGGGCTCTGGCTGTGCACTCATCCTCGCCGTTTTACTCTTTTTCGCCAAAACGGATGCGCCCTCTTCTGCCACGGTTGCCAATGCGGTAGGTGCCAACCATTCGGCATTTAGCCTTAAGCTGGCACTGGAACTGTTCAGACAGCCAAAACTGTGGTTTTTGTCACTGTATGTTATTGGCGTTTCCTGCACCTACGATGTTTTTGACCAACAGTTTGCTAATTTCTTTACTTCGTTCTTTGCTACCGGTGAACAGGGTACGCGGGTATTTGGCTACGTAACGACAATGGGCGAATTACTTAACGCCTCGATTATGTTCTTTGCGCCACTGATCATTAATCGCATCGGTGGGAAAAACGCCCTGCTGCTGGCTGGCACTATTATGTCTGTACGTATTATTGGCTCATCGTTCGCCACCTCAGCGCTGGAAGTGGTTATTCTGAAAACGCTGCATATGTTTGAAGTACCGTTCCTGCTGGTGGGCTGCTTTAAATATATTACCAGCCAGTTTGAAGTGCGTTTTTCAGCGACGATTTATCTGGTCTGTTTCTGCTTCTTTAAGCAACTGGCGATGATTTTTATGTCTGTACTGGCGGGCAATATGTATGAAAGCATCGGTTTCCAGGGCGCTTATCTGGTGCTGGGTCTGGTGGCGCTGGGCTTCACCTTAATTTCCGTGTTCACGCTTAGCGGCCCCGGCCCGCTTTCCCTGCTGCGTCGTCAGGTGAATGAAGTCGCTTAA
  • Gene 5 is in lines nine and ten
sed -n '9,10 p' sequence_1 > gene_5.fasta
cat gene_5.fasta
## >lcl|J01636.1_cds_AAA24055.1_4 [gene=lacA] [protein_id=AAA24055.1] [location=5727..6338] [gbkey=CDS]
## TTGAACATGCCAATGACCGAAAGAATAAGAGCAGGCAAGCTATTTACCGATATGTGCGAAGGCTTACCGGAAAAAAGACTTCGTGGGAAAACGTTAATGTATGAGTTTAATCACTCGCATCCATCAGAAGTTGAAAAAAGAGAAAGCCTGATTAAAGAAATGTTTGCCACGGTAGGGGAAAACGCCTGGGTAGAACCGCCTGTCTATTTCTCTTACGGTTCCAACATCCATATAGGCCGCAATTTTTATGCAAATTTCAATTTAACCATTGTCGATGACTACACGGTAACAATCGGTGATAACGTACTGATTGCACCCAACGTTACTCTTTCCGTTACGGGACACCCTGTACACCATGAATTGAGAAAAAACGGCGAGATGTACTCTTTTCCGATAACGATTGGCAATAACGTCTGGATCGGAAGTCATGTGGTTATTAATCCAGGCGTCACCATCGGGGATAATTCTGTTATTGGCGCGGGTAGTATCGTCACAAAAGACATTCCACCAAACGTCGTGGCGGCTGGCGTTCCTTGTCGGGTTATTCGCGAAATAAACGACCGGGATAAGCACTATTATTTCAAAGATTATAAAGTTGAATCGTCAGTTTAA
  1. The homework said: Can you create code that will do the following for each extracted FASTA sequence?
  • Count the number of A,G,C and T?
# Number of A
grep -v ">" gene_1.fasta | grep -o "A" | wc -l
grep -v ">" gene_2.fasta | grep -o "A" | wc -l
grep -v ">" gene_3.fasta | grep -o "A" | wc -l
grep -v ">" gene_4.fasta | grep -o "A" | wc -l
grep -v ">" gene_5.fasta | grep -o "A" | wc -l
##     1739
##      240
##      679
##      240
##      189

In this case, because the command grep -v ">" VARIABLE | grep -o "A" | wc -l is the same, the only thing that changes (THE VARIABLE) is the gene name

7.5.1 How do we fill up variables?

What is the variable? A list of the files of interest. What do all these files have in common?

  1. They start with the word gene
  2. They end with the word .fasta

So, what this means, is that we can create a list of these elements that have things in common!

ls
## 01-intro.Rmd
## 02-cross-refs.Rmd
## 03-parts.Rmd
## 04-citations.Rmd
## 05-blocks.Rmd
## 06-share.Rmd
## 07-references.Rmd
## 08-phylogenetics.Rmd
## 09-script.Rmd
## Lab0
## Lab1
## Lab2
## Lab4
## Lab_3
## Lab_5
## Lab_notebook.Rproj
## README.md
## Rplots.pdf
## _book
## _bookdown.yml
## _main.Rmd
## _main.pdf
## _main.tex
## _output.yml
## badada
## book.bib
## docs
## font-awesome.min.css
## gene_1.fasta
## gene_2.fasta
## gene_3.fasta
## gene_4.fasta
## gene_5.fasta
## index.Rmd
## preamble.tex
## sequence_1
## style.css

A wildcard (*: means ANYTHING) allows you to select something that has a matching pattern with a variable:

  1. They start with the word gene: gene*
ls gene*
## gene_1.fasta
## gene_2.fasta
## gene_3.fasta
## gene_4.fasta
## gene_5.fasta
  1. They end with the word .fasta*.fasta
ls *.fasta
## gene_1.fasta
## gene_2.fasta
## gene_3.fasta
## gene_4.fasta
## gene_5.fasta

You can put them on a list with any name. In this case, the variable name will be seqfile:

seqfile=$(ls *.fasta)
echo $seqfile
## gene_1.fasta gene_2.fasta gene_3.fasta gene_4.fasta gene_5.fasta

Now, you want to do a simple for loop:

FOR EACH ELEMENT IN VARIABLE_LIST;
  DO COMMAND;
  END

You command is to say “The gene file is: $VARIABLE”. The $ means EACH element of the variable list called $seqfile

for seqfile in *.fasta;
  do
  echo "The gene file is: $seqfile"
  done
## The gene file is: gene_1.fasta
## The gene file is: gene_2.fasta
## The gene file is: gene_3.fasta
## The gene file is: gene_4.fasta
## The gene file is: gene_5.fasta

Now that we know how a loop works, we can use real commands:

  • Lets count the number of A’s
for seqfile in *.fasta;
  do
  seqname=$(grep ">" $seqfile)
  echo "Sequence in progress: $seqname"
  numA=$(grep -v ">" $seqfile | grep -o "A" | wc -l)
  echo "The number of A's in: $numA"
  numC=$(grep -v ">" $seqfile | grep -o "C" | wc -l)
  echo "The number of C's is: $numC"
  numT=$(grep -v ">" $seqfile | grep -o "T" | wc -l)
  echo "The number of T's is: $numT"
  numG=$(grep -v ">" $seqfile | grep -o "G" | wc -l)
  echo "The number of G's is: $numG"
  echo "End of sequence $seqname"
  echo
  done
## Sequence in progress: >J01636.1 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
## The number of A's in:     1739
## The number of C's is:     1991
## The number of T's is:     1743
## The number of G's is:     2004
## End of sequence >J01636.1 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
## 
## Sequence in progress: >lcl|J01636.1_cds_AAA24052.1_1 [gene=lacI] [protein_id=AAA24052.1] [location=79..1161] [gbkey=CDS]
## The number of A's in:      240
## The number of C's is:      300
## The number of T's is:      232
## The number of G's is:      311
## End of sequence >lcl|J01636.1_cds_AAA24052.1_1 [gene=lacI] [protein_id=AAA24052.1] [location=79..1161] [gbkey=CDS]
## 
## Sequence in progress: >lcl|J01636.1_cds_AAA24053.1_2 [gene=lacZ] [protein_id=AAA24053.1] [location=1284..4358] [gbkey=CDS]
## The number of A's in:      679
## The number of C's is:      841
## The number of T's is:      668
## The number of G's is:      887
## End of sequence >lcl|J01636.1_cds_AAA24053.1_2 [gene=lacZ] [protein_id=AAA24053.1] [location=1284..4358] [gbkey=CDS]
## 
## Sequence in progress: >lcl|J01636.1_cds_AAA24054.1_3 [gene=lacY] [protein_id=AAA24054.1] [location=4410..5663] [gbkey=CDS]
## The number of A's in:      240
## The number of C's is:      284
## The number of T's is:      433
## The number of G's is:      297
## End of sequence >lcl|J01636.1_cds_AAA24054.1_3 [gene=lacY] [protein_id=AAA24054.1] [location=4410..5663] [gbkey=CDS]
## 
## Sequence in progress: >lcl|J01636.1_cds_AAA24055.1_4 [gene=lacA] [protein_id=AAA24055.1] [location=5727..6338] [gbkey=CDS]
## The number of A's in:      189
## The number of C's is:      125
## The number of T's is:      161
## The number of G's is:      137
## End of sequence >lcl|J01636.1_cds_AAA24055.1_4 [gene=lacA] [protein_id=AAA24055.1] [location=5727..6338] [gbkey=CDS]

7.5.2 Complete, for extra points if desired:

  • Count the length of each sequence?
  • Count the number of open reading frames (ORF)?