Creating a simple multiFASTA analyzer program in the command line

10.5 Answering each part of the question:

  1. Count the number of A, C, G and T.
  2. Count the length of each sequence.
  3. Count the number of open reading frames (ORF).

10.5.1 Number of nucleotides

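    # $line holds one sequence (one line of the file without its header)
    # grep -o prints every match on its own line; wc -l counts those lines
    A=$(echo "$line" | grep -o "A" | wc -l)
    C=$(echo "$line" | grep -o "C" | wc -l)
    G=$(echo "$line" | grep -o "G" | wc -l)
    T=$(echo "$line" | grep -o "T" | wc -l)
    echo "Number of A:" $A
    echo "Number of C:" $C
    echo "Number of G:" $G
    echo "Number of T:" $T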
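
As a quick check of the counting idea, here is a minimal sketch run on a made-up sequence. The variable name toy_seq and its contents are assumptions for illustration only; they are not part of the lab data.

```shell
# Hypothetical toy sequence, used only to illustrate the counting step
toy_seq="ATGCGATTAGC"

# grep -o prints each match on its own line; wc -l counts those lines
A=$(printf '%s\n' "$toy_seq" | grep -o "A" | wc -l)
C=$(printf '%s\n' "$toy_seq" | grep -o "C" | wc -l)
G=$(printf '%s\n' "$toy_seq" | grep -o "G" | wc -l)
T=$(printf '%s\n' "$toy_seq" | grep -o "T" | wc -l)

echo "Number of A:" $A
echo "Number of C:" $C
echo "Number of G:" $G
echo "Number of T:" $T
```

The four counts should add up to the length of the sequence as long as it contains only A, C, G and T.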

10.5.2 Length of sequence

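   # printf '%s' writes the sequence without the trailing newline that echo adds,
   # so wc -c counts only the nucleotides
   seqlen=$(printf '%s' "$line" | wc -c)
   echo "Length:" $seqlen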
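
The trailing newline is easy to overlook when measuring length with wc -c. The sketch below compares three ways of measuring a made-up sequence; toy_seq, with_newline and builtin_len are hypothetical names chosen for this example.

```shell
# Hypothetical toy sequence, used only for illustration
toy_seq="ATGCGATTAGC"

# echo appends a newline, so wc -c reports one byte more than the sequence has
with_newline=$(echo "$toy_seq" | wc -c)

# printf '%s' writes the sequence without a trailing newline
seqlen=$(printf '%s' "$toy_seq" | wc -c)

# The shell can also report the length of a variable directly
builtin_len=${#toy_seq}

echo "Length with echo:" $with_newline
echo "Length with printf:" $seqlen
echo "Length with \${#toy_seq}:" $builtin_len
```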

10.5.3 Number of ORF

   # Every ATG (start codon) is counted as the start of a possible ORF
   orf=$(echo "$line" | grep -o "ATG" | wc -l)
   echo "Number of ORF:" $orf
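
Counting ATG gives the number of start codons, which this lab uses as a simple stand-in for the number of ORFs; it does not check for an in-frame stop codon. A minimal sketch on a made-up sequence (toy_seq is a hypothetical name and value):

```shell
# Hypothetical toy sequence containing three ATG start codons
toy_seq="ATGAAATGATAGATGCC"

# grep -o prints each ATG match on its own line; wc -l counts them
orf=$(printf '%s\n' "$toy_seq" | grep -o "ATG" | wc -l)

echo "Number of ORF:" $orf
```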

How do we put this together to avoid the header and only use the sequence?
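
One way to see the idea before building the full script is to try it on a tiny file. The sketch below writes a made-up two-record FASTA file to a temporary location; the file contents and the names toy_fasta and seq_lines are assumptions for illustration, not the lab data.

```shell
# Hypothetical two-record FASTA file written to a temporary location
toy_fasta=$(mktemp)
printf '>gene_1\nATGCGA\n>gene_2\nTTAGC\n' > "$toy_fasta"

# grep -c counts the header lines, i.e. the number of sequences
total_num=$(grep -c ">" "$toy_fasta")

# grep -v drops the header lines and keeps only the sequence lines
seq_lines=$(grep -v ">" "$toy_fasta")

echo "Total number of sequences:" $total_num
echo "$seq_lines"

# Remove the temporary file
rm -f "$toy_fasta"
```

Each line that survives grep -v can then be handed to a while read loop, one sequence at a time.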


10.6 Total script

echo "Welcome to Sequence Analyzer by Prof. Tabima"
echo
sleep 1

total_num=$(grep -c ">" Lab_3/mystery_genes/sequence_1) 
echo "### Total Number of sequences in the file ###"
echo $total_num
echo

grep -v ">" Lab_3/mystery_genes/sequence_1 | while read -r line ; do
    echo "############## Processing Sequence ##############"
    echo
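    # grep -o prints every match on its own line; wc -l counts those lines
    A=$(echo "$line" | grep -o "A" | wc -l)
    C=$(echo "$line" | grep -o "C" | wc -l)
    G=$(echo "$line" | grep -o "G" | wc -l)
    T=$(echo "$line" | grep -o "T" | wc -l)
    # Length of the whole line; printf '%s' leaves out the trailing newline
    seqlen=$(printf '%s' "$line" | wc -c)
    # Every ATG (start codon) is counted as the start of a possible ORF
    orf=$(echo "$line" | grep -o "ATG" | wc -l)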
    echo "--- Count per nucleotide ---"
    echo "Number of A": $A;
    echo "Number of C": $C;
    echo "Number of G": $G;
    echo "Number of T": $T;
    echo "---"
    echo
    echo "--- Total length of the sequence ---"
    echo $seqlen
    echo "---"
    echo
    echo "--- Number of ORF ---"
    echo $orf
    echo "---"
    echo 
    echo "############### Sequence processed #############"
    echo 
    sleep 2
done

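echo "Program done!"

Output of the script (each reported length is exactly twice the T count, e.g. 3486 = 2 × 1743, because this run measured the length with grep -o "T" | wc -c rather than counting the whole line):
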
## Welcome to Sequence Analyzer by Prof. Tabima
## 
## ### Total Number of sequences in the file ###
## 5
## 
## ############## Processing Sequence ##############
## 
## --- Count per nucleotide ---
## Number of A: 1739
## Number of C: 1991
## Number of G: 2004
## Number of T: 1743
## ---
## 
## --- Total length of the sequence ---
## 3486
## ---
## 
## --- Number of ORF ---
## 135
## ---
## 
## ############### Sequence processed #############
## 
## ############## Processing Sequence ##############
## 
## --- Count per nucleotide ---
## Number of A: 240
## Number of C: 300
## Number of G: 311
## Number of T: 232
## ---
## 
## --- Total length of the sequence ---
## 464
## ---
## 
## --- Number of ORF ---
## 18
## ---
## 
## ############### Sequence processed #############
## 
## ############## Processing Sequence ##############
## 
## --- Count per nucleotide ---
## Number of A: 679
## Number of C: 841
## Number of G: 887
## Number of T: 668
## ---
## 
## --- Total length of the sequence ---
## 1336
## ---
## 
## --- Number of ORF ---
## 58
## ---
## 
## ############### Sequence processed #############
## 
## ############## Processing Sequence ##############
## 
## --- Count per nucleotide ---
## Number of A: 240
## Number of C: 284
## Number of G: 297
## Number of T: 433
## ---
## 
## --- Total length of the sequence ---
## 866
## ---
## 
## --- Number of ORF ---
## 21
## ---
## 
## ############### Sequence processed #############
## 
## ############## Processing Sequence ##############
## 
## --- Count per nucleotide ---
## Number of A: 189
## Number of C: 125
## Number of G: 137
## Number of T: 161
## ---
## 
## --- Total length of the sequence ---
## 322
## ---
## 
## --- Number of ORF ---
## 11
## ---
## 
## ############### Sequence processed #############
## 
## Program done!
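
The script in 10.6 treats every non-header line as one complete sequence, which matches this file (five headers, five sequences processed). Many FASTA files wrap a sequence over several lines. The sketch below shows one possible way to join wrapped lines with awk before the loop; the toy file and the names toy_fasta and joined are assumptions for illustration, not part of the lab.

```shell
# Hypothetical FASTA file in which the first sequence is wrapped over two lines
toy_fasta=$(mktemp)
printf '>gene_1\nATGC\nGATT\n>gene_2\nTTAGC\n' > "$toy_fasta"

# On a header line, print the sequence collected so far and start a new one;
# on any other line, append it to the current sequence
joined=$(awk '/^>/ { if (seq != "") print seq; seq = ""; next }
              { seq = seq $0 }
              END { if (seq != "") print seq }' "$toy_fasta")

# Each line of $joined is now one full sequence, ready for the same loop
printf '%s\n' "$joined" | while read -r line ; do
    echo "Length:" ${#line}
done

# Remove the temporary file
rm -f "$toy_fasta"
```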