Chapter 9 Homology and multiple sequence alignments
Homologous genes are genes that share common ancestry and are often functionally similar. They can be identified across different samples, species, etc., based on sequence similarity and on clustering parameters such as the inflation value in MCL.
In this section, we will learn how to obtain sets of orthologous genes from online repositories and how to examine their differences and similarities with multiple sequence alignments.
9.1 Downloading sequences from Homologene
There are many ways to obtain homologous genes. For example, running OrthoMCL/OrthoFinder on a set of genomes will identify the groups of homologous genes shared among those genomes.
In other cases, we can obtain genes of interest from online databases, such as the OrthoMCL DB or NCBI Homologene.
Today, we will use NCBI Homologene.
- Go to this link
Questions 1 - 4
- What is the Homologene ID number of this entry?
- What gene/genic family is represented in this set of sequences?
- Based on the schematics on the right, which domains of the protein are conserved across the different species?
- Fill in the following table using all the sequences from the Homologene entry
Species | Common Name | Protein ID | Protein length | Number of domains found |
---|---|---|---|---|
Homo sapiens | Human | NP_000242.2 | | Three (MutS_I, MutS_II, ABC_ATPase) |
Go to the Download links and download the mRNA and protein sequences in FASTA format.
Connect to the cluster and upload the sequences into a new folder called Lab_5 inside MBB101.
Answer the following questions:
Questions 5 - 7
- What is the number of sequences in each of the files uploaded?
- Did you expect this number to be similar or different across the files?
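One way to count the sequences in a FASTA file is to count its header lines, since every record starts with a line beginning with >. The sketch below is only an illustration: it builds a small made-up file (toy.fasta, not part of the lab data) so that it can be run anywhere, and the function name count_fasta_records is an assumption of this example. On the cluster you would pass your own downloaded mRNA and protein files instead.

```shell
# Count FASTA records by counting header lines (lines that start with ">").
count_fasta_records() {
  grep -c '^>' "$1"
}

# Toy file for illustration only; the names and sequences are made up.
workdir=$(mktemp -d)
cat > "$workdir/toy.fasta" <<'EOF'
>seq1
MKTLLV
>seq2
MKTALV
>seq3
MRTLLV
EOF

n_records=$(count_fasta_records "$workdir/toy.fasta")
echo "toy.fasta contains $n_records sequences"
```

Anchoring the pattern with ^ matters: a bare > could also match characters inside a description line and inflate the count.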
9.2 Multiple sequence alignment in MAFFT
Now, we will do some multiple sequence alignments using MAFFT.
As discussed in class, MAFFT is one of the most comprehensive multiple sequence alignment programs available.
It allows us to do many different kinds of alignments, such as local alignments, global alignments, and iterative and progressive alignments focused on aligning highly conserved domains.
- Before you proceed with the alignment, go to the MAFFT algorithms page and answer the following questions based on the information about the algorithms MAFFT uses:
Questions 8 - 11
- Which MAFFT algorithm (E-INS-i, L-INS-i or G-INS-i) would you use for each of the following datasets? Briefly explain your answers:
- Proteins with multiple conserved domains
- Proteins with single conserved domains
- Proteins with a single short domain
- Based on your answer above, align the protein sequences you downloaded using the algorithm of your choice in the cluster.
To do this, in your Lab_5 folder do the following:
- Use the following code for your alignment:
- For L-INS-i
mafft --localpair --maxiterate 1000 input_file > output_file
- For E-INS-i
mafft --genafpair --maxiterate 1000 input_file > output_file
For example:
mafft --genafpair --maxiterate 1000 homologene_prot.fasta > alin.fasta
- For G-INS-i
mafft --globalpair --maxiterate 1000 input_file > output_file
Remember to replace input_file with the name of your FASTA file and output_file with the name you want for the alignment (also in FASTA format).
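The three commands above differ only in the pairwise-alignment flag, so the choice of algorithm can be kept in one place. The following is a minimal sketch, not part of MAFFT itself: the helper names mafft_flags and run_mafft are hypothetical, and the wrapper assumes that mafft is available on your PATH on the cluster.

```shell
# Map a MAFFT strategy name to the flags listed above.
mafft_flags() {
  case "$1" in
    L-INS-i) echo "--localpair --maxiterate 1000" ;;
    E-INS-i) echo "--genafpair --maxiterate 1000" ;;
    G-INS-i) echo "--globalpair --maxiterate 1000" ;;
    *) echo "unknown algorithm: $1" >&2; return 1 ;;
  esac
}

# Hypothetical wrapper: run_mafft ALGORITHM INPUT_FASTA OUTPUT_FASTA
run_mafft() {
  flags=$(mafft_flags "$1") || return 1
  # $flags is intentionally unquoted so it splits into separate options.
  mafft $flags "$2" > "$3"
}

# Show which flags a given strategy would use (does not run MAFFT).
echo "E-INS-i uses: $(mafft_flags E-INS-i)"
```

With these helpers, run_mafft E-INS-i homologene_prot.fasta alin.fasta would be equivalent to the E-INS-i example given earlier.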
- Download your output_file to your local computer. Open the NCBI MSA viewer page in your browser and import your alignment using the Upload button and the Data file link.
Question 12
- Add the image of the MSA
- Write a short summary of the MSA by exploring the options in the MSA viewer (Are there regions that are highly conserved? Regions that are less conserved? Which species look most similar to each other?)
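To complement the visual inspection, a rough numerical summary can help. The sketch below counts the alignment columns in which every sequence has exactly the same residue and no gaps. This strict definition of "conserved" is a simplifying assumption of the example, and the coloring schemes in the viewer may define conservation differently. The toy alignment is made up, and the function name conserved_columns is hypothetical; on the cluster you would pass your own MAFFT output file.

```shell
# Print "<fully conserved columns> <alignment length>" for an aligned FASTA file.
conserved_columns() {
  awk '
    /^>/ { n++; next }                     # each header starts a new sequence
    { seq[n] = seq[n] toupper($0) }        # join wrapped sequence lines
    END {
      len = length(seq[1]); conserved = 0
      for (i = 1; i <= len; i++) {
        c = substr(seq[1], i, 1)
        same = (c != "-")                  # a gap column is never conserved
        for (j = 2; j <= n && same; j++)
          if (substr(seq[j], i, 1) != c) same = 0
        conserved += same
      }
      print conserved, len
    }' "$1"
}

# Toy alignment for illustration only; the sequences are made up.
aln_dir=$(mktemp -d)
cat > "$aln_dir/toy_aln.fasta" <<'EOF'
>a
MKT-LV
>b
MKTALV
>c
MRT-LV
EOF

summary=$(conserved_columns "$aln_dir/toy_aln.fasta")
echo "conserved columns / alignment length: $summary"
```

The function assumes that all sequences have the same aligned length, which holds for MAFFT output.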
9.3 Creating and searching an HMM model using our alignment
Now, we will search a data set of protein sequences in FASTA format. It is the file we used last week, updated with some human proteins:
https://raw.githubusercontent.com/Tabima/MBB101/master/Lab_5/mess_with_human_proteins.fasta
- Download the new FASTA file to your Lab_5 folder. Count the number of sequences and answer the next question:
Question 13
- What is the number of sequences in the downloaded FASTA file?
Now, we want to find out whether the new FASTA file contains proteins that match our MSH homologs. That means we will have to create an HMM profile of our MSH homologs and then search the sequences with it!
To create the HMM profile you need your alignment file of the MSH homologs in protein format, and the hmmbuild program.
The hmmbuild program creates an HMM model of your alignment (so, of your conserved domains!!) that you can use to search with hmmsearch, as we did in the last class!
To run it use the following command:
hmmbuild MSH_homolog.hmm alignment_file.fasta
Answer the following question:
Question 14
- Summarize the process of building an HMM model
- Finally, run hmmsearch and identify the proteins with positive hits. Add them to the following table:
Target Name | Full Sequence E-value | Full Seq. Score | Best 1 domain E-value | Best 1 domain score |
---|---|---|---|---|
Query_1 |
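The columns of this table correspond to fields in the tabular output that hmmsearch writes when given the --tblout option: target name (field 1), full sequence E-value and score (fields 5 and 6), and best 1 domain E-value and score (fields 8 and 9). The sketch below shows one way to pull those fields out with awk. The file names hits.tbl and hmmsearch.out are hypothetical, and the two hit lines are made-up values used only so that the example runs without HMMER installed.

```shell
# On the cluster, the table file could be produced with something like:
#   hmmsearch --tblout hits.tbl MSH_homolog.hmm mess_with_human_proteins.fasta > hmmsearch.out

# Toy --tblout content for illustration only; all names and numbers are made up.
hits_dir=$(mktemp -d)
cat > "$hits_dir/hits.tbl" <<'EOF'
# target name  accession  query name  accession  E-value  score  bias  E-value  score  bias  exp reg clu ov env dom rep inc description
toy_protein_A  -  alin  -  1.2e-150  500.1  0.3  2.0e-150  499.4  0.3  1.0 1 0 0 1 1 1 1 made-up example
toy_protein_B  -  alin  -  3.4e-20   70.5   0.1  5.0e-20   69.9   0.1  1.0 1 0 0 1 1 1 1 made-up example
EOF

# Skip comment lines and print the five fields as pipe-separated table rows.
table_rows=$(awk '!/^#/ && NF { printf "%s | %s | %s | %s | %s |\n", $1, $5, $6, $8, $9 }' "$hits_dir/hits.tbl")
echo "$table_rows"
```

Each printed row can be pasted directly under the table header above.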