Chapter 9 Homology and multiple sequence alignments

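Orthologous genes (orthologs) are genes in different species that descend from a single gene in their last common ancestor, and they often retain similar functions. Orthologs can be identified across samples or species based on sequence similarity combined with clustering, for example with the MCL algorithm, whose inflation parameter controls how finely the genes are split into groups.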

In this section, we will learn how to obtain sets of orthologous genes from online repositories and how to examine their similarities and differences with multiple sequence alignments.

9.1 Downloading Orthologous Sequences

There are many ways to obtain homologous genes. For example, running OrthoMCL or OrthoFinder on a set of genomes will group the genes of those genomes into sets of orthologs, from which we can pick the genes of interest.

In other cases, we can obtain genes of interest from online databases, such as the OrthoMCL DB or NCBI.

Today, we will use a set of genes of unknown function that can be found on our Canvas page. Look for the homologene_mRNA.fasta and homologene_protein.fasta files in Week 10, Lab Data.

Connect to the cluster and upload the sequences into a new folder called Lab_5 inside MBB101.
Then answer the following questions:

Questions 1 - 2

How many sequences are in each of the uploaded files?
Choose one of the sequences and look up its function in InterPro. What is it?
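One way to answer the first question from the command line is to count header lines, since every FASTA record starts with a line beginning with ">". The sketch below assumes the two files sit in the current folder under the names given above; count_fasta is a helper name made up for this example.

```shell
# Count FASTA records: every record starts with a header line beginning with ">".
count_fasta() {
  grep -c '^>' "$1"
}

# File names are the ones given in this lab; files that are missing are skipped.
for f in homologene_mRNA.fasta homologene_protein.fasta; do
  if [ -f "$f" ]; then
    printf '%s\t%s\n' "$f" "$(count_fasta "$f")"
  fi
done
```

Sequences that wrap over several lines are still counted once, because only the header lines are matched.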

9.2 Multiple sequence alignment in MAFFT

Now, we will do some multiple sequence alignments using MAFFT.

As we saw in class, MAFFT is one of the most versatile multiple sequence alignment programs available.

It supports many different alignment strategies, such as local and global pairwise alignment, as well as progressive and iterative refinement methods, some of which are designed for sequences with highly conserved domains.

Before you proceed with the alignment, go to the MAFFT algorithms page and answer the following questions based on the information about the algorithms MAFFT uses:

Questions 3 - 6

Which MAFFT algorithm (E-INS-i, L-INS-i or G-INS-i) would you use for each of the following datasets? Briefly explain your answers:
Proteins with multiple conserved domains
Proteins with single conserved domains
Proteins with a single short domain

Based on your answers above, align the protein sequences you downloaded on the cluster, using the algorithm of your choice.

To do this, move into your Lab_5 folder and use one of the following commands for your alignment:

For L-INS-i

mafft --localpair --maxiterate 1000 input_file > output_file

For E-INS-i

mafft --genafpair --maxiterate 1000 input_file > output_file

For example, with the protein file from this lab:

mafft --genafpair --maxiterate 1000 homologene_protein.fasta > alin.fasta

For G-INS-i

mafft --globalpair --maxiterate 1000 input_file > output_file

Remember to replace input_file with the name of your input FASTA file and output_file with the name you want for the alignment (which will also be in FASTA format).
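If you expect to try more than one algorithm, the three commands above can be wrapped in a small helper. This is only a sketch: mafft_flag and run_mafft are names invented here, and it assumes mafft is available on the cluster; the flags themselves are the ones listed above.

```shell
# Map a MAFFT strategy name to the pairwise-alignment flag listed above.
mafft_flag() {
  case "$1" in
    L-INS-i) echo "--localpair" ;;
    E-INS-i) echo "--genafpair" ;;
    G-INS-i) echo "--globalpair" ;;
    *) echo "unknown algorithm: $1" >&2; return 1 ;;
  esac
}

# Usage: run_mafft ALGORITHM INPUT_FASTA OUTPUT_FASTA
# Stops before touching the output file if the algorithm name is not recognized.
run_mafft() {
  flag=$(mafft_flag "$1") || return 1
  mafft "$flag" --maxiterate 1000 "$2" > "$3"
}
```

For example, run_mafft L-INS-i homologene_protein.fasta aligned_linsi.fasta would run the L-INS-i command on the protein file (aligned_linsi.fasta is a placeholder output name).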

Download your output_file to your local computer. Open the NCBI MSA Viewer page in your browser and import your alignment using the Upload button and the Data file link.
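Before importing the file into the viewer, a quick sanity check can confirm that it really is an alignment: in an aligned FASTA file, every sequence has the same length once gap characters are counted. The sketch below uses awk for this; alignment_lengths and is_aligned are helper names made up for this example.

```shell
# Print "name<TAB>length" for every record, counting gaps ("-") as positions.
alignment_lengths() {
  awk '
    /^>/ { if (seen) print name "\t" len; seen = 1; name = substr($0, 2); len = 0; next }
    { gsub(/[ \t\r]/, ""); len += length($0) }
    END { if (seen) print name "\t" len }
  ' "$1"
}

# Succeed only if the file has at least one record and all lengths are equal.
is_aligned() {
  alignment_lengths "$1" |
    awk -F '\t' 'NR == 1 { first = $NF } $NF != first { bad = 1 } END { exit (NR == 0 || bad) }'
}
```

Running is_aligned output_file && echo "looks aligned" prints the message only when all sequences have the same length; if it does not, the file is probably the unaligned input or an incomplete output.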

Question 7

Add an image of the MSA.
Write a short summary of the MSA after exploring the options in the MSA viewer (Are there regions that are highly conserved? Regions that are less conserved? Which species look most similar to each other?)

9.3 Creating and searching with an HMM profile using our alignment

Now, we will search a data set of protein sequences in FASTA format. It is the file we used last week, updated with some human proteins:

https://raw.githubusercontent.com/Tabima/MBB101/master/Lab_5/mess_with_human_proteins.fasta

Download the new FASTA file to your Lab_5 folder. Count the number of sequences and answer the next question:

Question 8

What is the number of sequences in the downloaded FASTA file?
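The download and the count can both be done from the command line on the cluster. The sketch below assumes curl is installed (wget would work as well); fetch_fasta and count_fasta are helper names made up for this example, and the URL is the one given above.

```shell
# URL given in this section.
FASTA_URL="https://raw.githubusercontent.com/Tabima/MBB101/master/Lab_5/mess_with_human_proteins.fasta"

# Download a URL into the current folder, keeping the remote file name.
# -f makes curl fail on HTTP errors instead of saving an error page.
fetch_fasta() {
  curl -fL -o "$(basename "$1")" "$1"
}

# Count FASTA records (header lines start with ">").
count_fasta() {
  grep -c '^>' "$1"
}
```

Running fetch_fasta "$FASTA_URL" inside Lab_5, followed by count_fasta mess_with_human_proteins.fasta, would download the file and print the number of sequences.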

Now, we want to find out whether the new FASTA file contains proteins that match our MSH homologs. That means we will have to create an HMM profile of our MSH homologs and then search the new file with it!
To create the HMM profile you need the protein alignment file of your MSH homologs and the hmmbuild program.

The hmmbuild program creates an HMM profile of your alignment (so, of your conserved domains!) that you can then use with hmmsearch, as we did in the last class.

To run it, use the following command:

hmmbuild MSH_homolog.hmm alignment_file.fasta

Answer the following question:

Question 9

Summarize the process of building an HMM profile.

Finally, run hmmsearch and identify the proteins with positive hits. Add them to the following table:

Target name	Full sequence E-value	Full sequence score	Best 1 domain E-value	Best 1 domain score
Query_1
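The columns of this table correspond to fields in the tabular output that hmmsearch writes when given the --tblout option. The sketch below chains the two HMMER steps and then pulls out those fields. It assumes HMMER 3 is installed; run_search and summarize_tblout are helper names made up for this example, and hits.tbl and hmmsearch.out are placeholder file names.

```shell
# Usage: run_search ALIGNMENT_FASTA PROTEIN_FASTA
# Builds the profile, then searches the protein set and saves a per-sequence table.
run_search() {
  hmmbuild MSH_homolog.hmm "$1" &&
    hmmsearch --tblout hits.tbl MSH_homolog.hmm "$2" > hmmsearch.out
}

# Print target name, full-sequence E-value and score, and best-domain
# E-value and score from a --tblout file (fields 1, 5, 6, 8 and 9).
# Comment lines start with "#" and are skipped.
summarize_tblout() {
  awk '!/^#/ && NF >= 9 { print $1 "\t" $5 "\t" $6 "\t" $8 "\t" $9 }' "$1"
}
```

After run_search alin.fasta mess_with_human_proteins.fasta, calling summarize_tblout hits.tbl would print one tab-separated row per hit in the same column order as the table above.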