Chapter 8 Advanced Bioinformatics

Objectives:

Section 1 will focus on how to identify open reading frames and a gene based on searches on the ORF finder, translate the largest ORF and navigate through NCBI as we did on the theory class
Section 2 will focus on how to use the cluster to create a BLAST and HMMER database of genes of interest.

8.1 Section 1: Open reading frames, translating sequences and BLAST searches to identify the role and functionality of a mystery gene

Download the mystery_gene.fasta file and open it on R. This is a nucleotide strand from a segment of a chromosome of a cool species that we will try to identify.
Go to NCBI’s ORF Finder and upload your sequence by copying and pasting it. Keep the parameters by default and submit the form. Answer the following questions:

Questions 1 - 5

How many ORF do you find in the sequence?
Which ORF is the longest, how long it is (Both in NT and AA), and in which strand (Plus or minus) is located?
In what frame does the longest ORF is in? What does this mean?
Click on Display ORF and copy and paste the Protein sequence and the nucleotide sequence of the longest ORF here
Is there any ORF on the backwards direction? (-)

Now, take your original sequence and go to NCBI BLAST. Answer the following question:

Question 6

This is DNA data, so what kinds of BLAST’s can you use?

First, lets do a nucleotide BLAST. Click on nucleotide BLAST, and copy and paste your original sequence. Now, select the Nucleotide collection database and run a megablast. Answer the following questions:

Questions 7-8

What are the best 3 hits. Why do you consider them the best hits?
Include the accession numbers of the best three hits.

Go to the Graphic Summary of the BLAST results. Take a screenshot and add it below and answer the following questions

Question 9

Does this graphic summary support your results from question 7? Explain why briefly.

Go to the Taxonomy window of the BLAST results. Answer the following question:

Questions 10

What species has the highest score? What about the other species? Are they closely related to each other?

After doing the BLAST, click on the hit with the best score. You should find the accession number there (AY190508.1). Click on it and answer the following questions

Questions 11 - 12

What kind of is this? A FASTA or a GenBank
Copy and paste the protein region coded by the CDS

On the panel on the right, you’ll find a link to the More about the gene LOC106470356. This means we have a whole gene associated to our query gene!

Questions 13 - 15

What’s the name of the gene?
Check the organism’s name. Is this what you found at the beginning of the lab?
Go to the genomic regions? How many exons does the gene has? How did you come to this conclusion?

Go down to the NCBI Reference Sequences (RefSeq) - mRNA and Proteins section.
Click on the UniProtKB/Swiss-Prot link. Answer the following questions

Questions 16- 19

What database are we in?
Is the organism still the same?
What is the proposed function for this gene?
What are the molecular functions and biological processes that this gene is involved with?

Extract the FASTA protein sequence from the LOC106470356 gene (You can find it on the RefSeq submenu, on the mRNA and Protein segment, click on the NP_001301089.1 link). Then, go to Pfam scan and search the sequence you just copied on the Sequence Search window. In the Output format menu, select Text.

After the search is done, click on the link under Tool Output. You will see a new page with the output in simple text form.

Answer the following questions based on the simple text form table (I would copy and paste the table that starts with # <seq id> in Excel or Google Spreadsheets):

Questions 20 - 23

How many protein domains can we find in this sequence? You can find this info on the column
What’s the name of the domain according to the the column?
What’s the name of the PFAM accession according to the the column?
Is this a good E-value to determine a good hit?

Select the domain name with the best E-value. Go to InterPRO’s database and paste the domain name in the text box in front of the jump to link and press GO. Answer these questions:

Questions 24 - 25

What is the name of this PFAM domain? How many protein sequences are found that contain this domain?
How many architectures of this domain are found?

Click on the the Taxonomy link on the menu on the left. Answer these questions:

Question 26

What are the kingdom, phylum and class with more sequences that contain this domain?

Question 27

Based on all the information that we found today, write a summary of the gene and all the information we were able to obtain from the search. Include gene ID, accession numbers for mRNA and protein sequence. Two paragraphs are more than enough.

8.2 Section 2: Creating our own searches using BLAST and HMMER in the cluster.

8.2.1 Creating databases using BLAST

Today we are going to learn how to create our own databases in the cluster based on a list of sequences and search to identify query sequences from species of interest.

In this case, let’s do something similar to what most labs have been doing since the COVID pandemic: Lets try and identify several sequences of a region from the spike protein from samples across the world: One from Oregon, one from Massachussetts and one from the UK.

The idea is to figure out if each of the sequences belongs to COVID or any other related respiratory viruses such as H1N1, common cold (Rhinovirus) or actual SARS-CoVID-2.

I have added 11 sequences of idfferent samples of COVID and other related viruses into a file called coronavirus_references.fasta.

This is an evolutionary tree of the reference sequences as described by Sahu et al. (2021)

Phylogenetic tree

So, to do the analysis:

Create a new folder inside your home folder called Lab_4
Create a second folder inside of Lab_4 called blastDB and go into it.
Copy the sequences from /home/jtabima/MBB101/lab4/coronavirus_references.fasta into your Lab_4/blastDB folder
We are going to use the /home/jtabima/MBB101/lab4/unknown_locations.fasta file as query for the unknown files. Copy it into your folder.

Question 28 - 29

How many sequences are in the unknown_locations.fasta and the coronavirus_references.fasta files? Add the code you used to check the sequences.
Are these files nucleotide or protein sequences? Add the code you added to check the sequences.

Now, we are going to create the BLAST database. To create this we will use the following template of code:

makeblastdb -in list_of_reference_seq -dbtype type_of_molecule -out database_name

Question 30

What is a BLAST database? Why do we care about them?
Modify the template code by changing the list_of_reference_seq, type_of_molecule and database_name and add it below.

Run the code after modification and make sure it worked by doing an ls and checking that the database_name.nhn, database_name.nir, database_name.nsq files are there.
Now, we are going to run a BLAST search to identify the unknown sequences in the unknown_locations.fasta

The following is template code to run BLAST:

blast_program -query query_file -db database_name -out output_file -evalue 1e-10 -outfmt 6 -max_target_seqs 7

Questions 31 - 33

Modify the template code by changing the blast_program to the blast program we should be using (blastp, blastn, tblastn, tblastx). Explain your answer briefly.
Complete the template code by modifying the query_file and database_name and add it below.
Explain briefly what the -evalue 1e-10, -outfmt 6, -max_target_seqs 7 and flags mean.

Run the modified code. Check that you have some results by making sure that your output_file has content on it.

Question 34

-In the following table, add the top 3 hits for each of the query sequences.

| qseqid  | sseqid | pident | length | mismatch | gapopen | qstart | qend | sstart | send | evalue | bitscore |
|---------|--------|--------|--------|----------|---------|--------|------|--------|------|--------|----------|
| Query_1 |        |        |        |          |         |        |      |        |      |        |          |
| Query_1 |        |        |        |          |         |        |      |        |      |        |          |
| Query_1 |        |        |        |          |         |        |      |        |      |        |          |

| qseqid  | sseqid | pident | length | mismatch | gapopen | qstart | qend | sstart | send | evalue | bitscore |
|---------|--------|--------|--------|----------|---------|--------|------|--------|------|--------|----------|
| Query_2 |        |        |        |          |         |        |      |        |      |        |          |
| Query_2 |        |        |        |          |         |        |      |        |      |        |          |
| Query_2 |        |        |        |          |         |        |      |        |      |        |          |

Questions 35 - 36

Based on your results, what are the most likely identities of each query? Explain why.
Within the reference FASTA file you can extract the names of each of the best hits of the database. What are the species of each of the most likely sequences?

8.2.2 Using HMMER to search for proteins with domains of interest

We are going to learn also how to use the information from databases to identify genes of interest that can serve us.

In this case, you are a Veterinarian Bioinformatitian.

You have detected that a community of three animals (A chicken, Gallus gallus; a dog, Canis familiaris; and a cow, Bos taurus) that may have a disease associated with a mutation on the Sonic The Hedgehog gene.

You sequenced one copy of each the genes per species, and translated them intro proteins.

However, due to an error in the machine, your sequences have been mixed with another hundred protein sequences from other species and genes in a file called oh_this_is_a_mess.fasta

A BLAST would be too much to run as you have a very small machine in your office, but you remember (thanks to your bioinformatics course ;) ) that you can run HMMER, a faster way of identifying protein sequences using Hidden Markov Models (HMM).

You also remember that Pfam, in Interpro, is a great database of protein domains that also contains HMM for these proteins.

Using your bioinformatics software, you discover that there is a reference gene for Sonic The Hedgehog in humans, in NCBI, in the accession number NP_000184.1

Copy the FASTA file and search for the domains in [InterPro]. When the result appears, answer these questions:

Questions 35-38

To which InterPro family does this protein belong?
Look at the Domains part of the sequence. Which domains do you see? To which source database do they belong (SMART, PFAM)?
Which are the two domains from PFAM you can find in this protein?
Go to the GO terms part of the results. What do you think is the function of this protein based on the Biological Process information?

EXTRA CREDIT (2 points)

Go to the Predictions section of your results. Hover over each of the annotations and do a summary of the different predictions in this protein (i.e. Is this protein moved outside of the cytoplasm? Where does this protein reside in the cell?)

** If you draw and include the drawing of the protein you get 5 points total**

Under the InterProScan Search Result title, you will see a sub-menu that says entries. Click on it and in the 1-8 entries .. click and select PFAM. You should see two domains:

Question 39:

Are these two domains the same domains as in Q37?

Click on the PFAM profile, go to Curation and Download the raw HMM for each domain into a new hmm_example folder inside your Lab_4 folder.
InterPro now saves the files as compressed files (gzip files). We need to uncompressed them. To do this, you need to rename your files by adding a .gz, for example:

mv file.hmm file.hmm.gz

and then uncompress them by using gunzip:

gunzip file.hmm.gz

Upload the oh_this_is_a_mess.fasta file from moodle into the new folder
Merge both HMM files into a single file called SHH.hmm. Answer the following question:

Question 36

How would you merge the two files? Add the code below.

The following is a template to run HMMER searches:

hmmsearch --tblout output_table HMM_file query_sequences

Question 37

Modify the template code by changing the output_table, HMM_file and query_sequences and add it below.

Run the modified code.
Check the output_table file.

Question 38

Could you find the sequences of interest? Explain the results below:
How can you do a further check if your candidate sequences are indeed the Sonic The Hedgehog genes? (i.e. Using BLAST? Number of domains? Type of domains?)