Chapter 9 Homology and multiple sequence alignments

Homologous genes are genes that share a common ancestry and are often functionally similar. They can be identified across different samples, species, etc., based on sequence similarity and on clustering methods such as MCL, whose inflation parameter controls how finely the genes are grouped.

In this section, we will learn how to obtain sets of orthologous genes from online repositories and how to examine their differences and similarities using multiple sequence alignments.

9.1 Downloading sequences from Homologene

There are many ways to obtain homologous genes. For example, running OrthoMCL or OrthoFinder on a set of genomes will identify groups of homologous genes shared among those genomes.

In other cases, we can obtain genes of interest from online databases, such as OrthoMCL DB or NCBI Homologene.

Today, we will use NCBI Homologene.

  1. Go to this link

Questions 1 - 4

  • What is the ID number of this Homologene entry?
  • What gene/genic family is represented in this set of sequences?
  • Based on the schematics on the right, which domains of the protein are conserved across the different species?
  • Fill in the following table using all the sequences from the Homologene entry:

Species | Common Name | Protein ID | Protein length | Number of domains found
Homo sapiens | Human | NP_000242.2 | | Three (MutS_I, MutS_II, ABC_ATPase)

  1. Go to the Download links and download the mRNA and Protein sequences in FASTA format

  2. Connect to the cluster and upload the sequences into a new folder inside of MBB101 called Lab_5

  3. Answer the following questions:

Questions 5 - 7

  • What is the number of sequences in each of the files uploaded?
  • Did you expect this number to be similar or different across the files?
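
One way to count the sequences, sketched here under the assumption that every record in the file starts with a ">" header line (as in standard FASTA), is to count those header lines. The function name count_seqs and the mRNA file name below are made up for illustration; use the names of the files you actually uploaded.

```shell
# Count FASTA records: each record begins with a header line starting with ">".
count_seqs() {
    grep -c '^>' "$1"
}

# Hypothetical usage; replace the file names with your own:
# count_seqs homologene_mrna.fasta
# count_seqs homologene_prot.fasta
```

Run it from inside your Lab_5 folder, once for each file.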

9.2 Multiple sequence alignment in MAFFT

Now, we will do some multiple sequence alignments using MAFFT.

As we saw in class, MAFFT is one of the most full-featured multiple sequence alignment programs available.

It allows us to build many different kinds of alignments, such as local alignments, global alignments, and iterative and progressive alignments focused on aligning highly conserved domains.

  1. Before you proceed with the alignment, go to the MAFFT algorithms page and answer the following questions based on the information about the algorithms MAFFT uses:

Questions 8 - 11

  • Which MAFFT algorithm (E-INS-i, L-INS-i or G-INS-i) would you use for each of the following datasets? Briefly explain your answers:
    • Proteins with multiple conserved domains
    • Proteins with single conserved domains
    • Proteins with a single short domain
  1. Based on your answer above, align the protein sequences you downloaded on the cluster using the algorithm of your choice.

To do this, from inside your Lab_5 folder, do the following:

  • Use the following code for your alignment:
    • For L-INS-i
    mafft --localpair --maxiterate 1000 input_file > output_file
    • For E-INS-i
    mafft --genafpair --maxiterate 1000 input_file > output_file
    # Example with concrete file names (yours may differ):
    mafft --genafpair --maxiterate 1000 homologene_prot.fasta > alin.fasta
    • For G-INS-i
    mafft --globalpair --maxiterate 1000 input_file > output_file

Remember to replace input_file with the name of your FASTA file and output_file with the name you want for the alignment (which will also be in FASTA format).
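
Before downloading the alignment, it can help to check that MAFFT produced a valid result. In an aligned FASTA file, every sequence should have the same length once gaps are counted. The sketch below is one possible check; the function name aligned_lengths is hypothetical, and alin.fasta is only the example output name used above.

```shell
# Print "name<TAB>length" for every record in an aligned FASTA file.
# Gap characters ("-") count toward the length.
aligned_lengths() {
    awk '/^>/ { if (name != "") print name "\t" len
                name = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (name != "") print name "\t" len }' "$1"
}

# Hypothetical usage; a valid alignment shows a single distinct length:
# aligned_lengths alin.fasta | cut -f2 | sort -u
```

If more than one length is printed, the file is probably not a finished alignment (for example, the input file was used by mistake).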

  1. Download your output_file to your local computer. Open the NCBI MSA viewer page in your browser and import your alignment using the Upload button and the Data file link.

Question 12

  • Add the image of the MSA

  • Write a short summary of the MSA after exploring the options in the MSA viewer (Are there regions that are highly conserved? Regions that are less conserved? Which species look most similar to each other?)


9.3 Creating an HMM profile from our alignment and searching with it

Now, we will search a data set of protein sequences in FASTA format. It is the file we used last week, updated with some human proteins:

https://raw.githubusercontent.com/Tabima/MBB101/master/Lab_5/mess_with_human_proteins.fasta
  1. Download the new FASTA file to your Lab_5 folder. Count the number of sequences and answer the next question:

Question 13

  • What is the number of sequences in the downloaded FASTA file?
  1. Now, we want to find out whether the new FASTA file contains proteins that match our MSH homologs. That means we will have to create an HMM profile of our MSH homologs and then search the sequences with it!

  2. To create the HMM profile you need your alignment file of the MSH homologs in protein format, and the hmmbuild program.

The hmmbuild program creates an HMM of your alignment (that is, of your conserved domains!) that you can then use with hmmsearch, as we did in the last class.

To run it use the following command:

hmmbuild MSH_homolog.hmm alignment_file.fasta

Answer the following question:

Question 14

  • Summarize the process of building an HMM
  1. Finally, run hmmsearch and identify the proteins with positive hits. Add them to the following table:

Target Name | Full Sequence E-value | Full Seq. Score | Best 1 domain E-value | Best 1 domain score
Query_1 | | | |
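
The table columns above correspond to fields in the tabular output that hmmsearch writes when given the --tblout option. The sketch below is one possible way to pull out just those fields. It assumes the standard tblout layout, where field 1 is the target name, 5 and 6 are the full-sequence E-value and score, and 8 and 9 are the best-domain E-value and score. The function name summarize_tblout and the file names MSH_hits.tbl and MSH_hits.out are placeholders, not part of the lab materials.

```shell
# Print target name, full-sequence E-value and score, and best-domain
# E-value and score from an hmmsearch --tblout file (tab-separated).
# Comment lines (starting with "#") are skipped.
summarize_tblout() {
    awk '!/^#/ && NF >= 9 { print $1 "\t" $5 "\t" $6 "\t" $8 "\t" $9 }' "$1"
}

# Assumed usage (output file names are placeholders):
# hmmsearch --tblout MSH_hits.tbl MSH_homolog.hmm mess_with_human_proteins.fasta > MSH_hits.out
# summarize_tblout MSH_hits.tbl
```

The full human-readable report is still saved in the redirected output file, so you can compare it against the summarized values.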