To extract the fragments product of restriction enzymes, we took the P. rubi genome and extracted all the regions that matched to the gc[at]gc cut site of ApeKi.
library(ape)
library(stringr)
#Reading the data, transforming DNAbin to char and concatenating
#rubi <- read.dna("~/Documents/rubi/maker/rubi/Pr4671.fa",format = "fasta")
rubi <- read.dna("~/Documents/P_rubi/Genomes/Genomes/rubi/Pr4671.fa",format = "fasta")
contig <- as.character(rubi)
contig <- lapply(contig, function(x)paste(x,collapse = ""))
coord <- lapply(contig, function(x) str_locate_all(x,"gc[at]gc"))
coord <- lapply(coord,as.data.frame)
#Adiing 100bp to each side
coord_plus <- lapply(coord,function (x) (data.frame(x[,1]-100,x[,1]+100)))
#Binding with names
bind_plus <- do.call(rbind, coord_plus)
#Creating a row with the scaffolds
bind_plus <- data.frame(sapply(strsplit(rownames(bind_plus),split = ".",fixed = T), "[",1), bind_plus)
colnames(bind_plus) <- c("chrom","start","end")
full_coord <- bind_plus
#getting rid of the negative values
full_coord$start[full_coord$start < 0] <- 0
#Exporting the table
write.table(full_coord,"~/Documents/P_rubi/GBS/GBS_Data/coverage/coordinates.bed",sep = "\t",row.names = F,col.names = F,quote = F)
We have a total of 359657 fragments WITH OVERLAP. To determine the regions without overlap we used the mergeBED tool from Bed Tools and recovered the full data set:
tn <- read.table("~/Documents/P_rubi/GBS/GBS_Data/coverage/overlap_coords.bed")
total_nuc <- sum(tn$V3 - tn$V2)
The ApeKI enzyme will have 123677 non-overlapping cut sites, and a total of 42725124 nucleotides. When compared to the total number of nucleotides of P. rubi (74’863,594), we have approx. 57% of the total genome covered.
First, we combined the old reads (GBS_023) with the newly sequenced regions (GBS_0023 with GBS_0039) using the /nfs0/Grunwald_Lab/home/tabimaj/GBS/GBS_AllData/join_fastq.sh script. Then we mapped the GBS reads using bowtie to the reference genome (script found in /nfs0/Grunwald_Lab/home/tabimaj/GBS/GBS_AllData/bt2b.sh), and used samtools with the cut-site coordinate file (/nfs0/Grunwald_Lab/home/tabimaj/GBS/GBS_AllData/coverage/coordinates.bed) to only get the stats from the cut-site fragments (script: /nfs0/Grunwald_Lab/home/tabimaj/GBS/GBS_AllData/coverage/pileup_BED.sh)
Using R and the *.pileups I created a script to determine the Percentage per-base coverage for the entire genome. The logic to do it this way instead of fragment wide is:
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## 17
## 18
## 19
## 20
## 21
## 22
## 23
## 24
## 25
## 26
## 27
## 28
## 29
## 30
## 31
## 32
## 33
## 34
## 35
## 36
## 37
## 38
## 39
## 40
## 41
## 42
## 43
## 44
## 45
## 46
## 47
## 48
## 49
## 50
## 51
## 52
## 53
## 54
## 55
## 56
## 57
## 58
## 59
## 60
To determine if there is a lineal distribution of number of nucleotides with an associated read, we plotted the following figure:
library(ggplot2)
library(reshape)
m.pt <- melt(percent_table, id=c("Coverage"))
ggplot(m.pt[m.pt$variable %in% "Nucleotides", ],aes(x=Coverage,y=value)) + geom_point(aes(colour=variable)) + stat_smooth(method = "lm", formula = y ~ log(x))
We can see that there is an stabilization curve around 20x coverage.
| Coverage | 1x | 2x | 3x | 4x | 5x | More | Nucleotides | |
|---|---|---|---|---|---|---|---|---|
| 67.count.gz | 27.007145 | 99.43423 | 89.85944 | 82.74950 | 77.08869 | 72.233658 | 68.009499 | 34470792 |
| 1.count.gz | 26.457935 | 99.39702 | 90.00193 | 82.74105 | 76.85816 | 71.825399 | 67.490299 | 35795761 |
| 78.count.gz | 24.238156 | 99.49045 | 88.66940 | 80.87201 | 74.70216 | 69.580735 | 65.104175 | 33925961 |
| 79.count.gz | 22.437104 | 99.42087 | 87.72088 | 79.52779 | 73.23027 | 68.000345 | 63.530731 | 32567639 |
| 75.count.gz | 17.050890 | 99.41300 | 85.65763 | 76.22972 | 68.82014 | 62.804007 | 57.823250 | 31495492 |
| 76.count.gz | 15.293737 | 99.40367 | 85.24969 | 75.16324 | 67.35481 | 61.015667 | 55.864646 | 31654673 |
| 85.count.gz | 14.757131 | 99.36930 | 84.11736 | 73.78436 | 65.93461 | 59.711240 | 54.612830 | 30040721 |
| High_29.count.gz | 14.692067 | 99.74608 | 83.36598 | 72.77794 | 64.91098 | 58.666175 | 53.505434 | 27989355 |
| High_41.count.gz | 14.000667 | 99.78390 | 83.07087 | 72.09353 | 63.98004 | 57.637527 | 52.454588 | 28591560 |
| High_14.count.gz | 13.107033 | 99.75090 | 81.59400 | 70.05939 | 61.68530 | 55.304267 | 50.156312 | 27143479 |
| 31.count.gz | 12.968872 | 99.31352 | 83.22468 | 72.01508 | 63.64846 | 56.968595 | 51.472726 | 30078988 |
| 51.count.gz | 12.934093 | 99.45264 | 82.42267 | 70.96581 | 62.50878 | 55.924023 | 50.692116 | 30238479 |
| High_30.count.gz | 12.698345 | 99.78198 | 81.85155 | 70.80637 | 62.86455 | 56.693584 | 51.638468 | 23478574 |
| High_59.count.gz | 11.893600 | 99.80816 | 81.34917 | 69.74445 | 61.39723 | 54.898057 | 49.569993 | 24119848 |
| High_65.count.gz | 11.520750 | 99.74339 | 80.87797 | 69.05864 | 60.57598 | 54.018144 | 48.732308 | 25406084 |
| High_24.count.gz | 11.389265 | 99.73145 | 82.80158 | 71.73534 | 63.39961 | 56.748214 | 51.149416 | 27013978 |
| 56.count.gz | 11.033276 | 99.43679 | 79.63105 | 66.95136 | 57.94832 | 51.271121 | 45.869824 | 28126225 |
| High_87.count.gz | 10.836501 | 99.72687 | 79.64680 | 67.40109 | 58.55860 | 51.726947 | 46.240795 | 24372548 |
| High_68.count.gz | 10.654542 | 99.76586 | 79.82599 | 67.56783 | 58.93113 | 52.254931 | 46.809959 | 22742983 |
| 10.count.gz | 9.972904 | 99.36495 | 80.12445 | 67.23835 | 57.82938 | 50.781386 | 45.233381 | 29872629 |
| 40.count.gz | 9.888200 | 99.32548 | 80.52204 | 67.83267 | 58.45472 | 51.223917 | 45.406787 | 29711541 |
| 71.count.gz | 9.724286 | 99.37934 | 79.88620 | 66.95471 | 57.61360 | 50.288901 | 44.347746 | 27768135 |
| High_19.count.gz | 9.264691 | 99.75454 | 79.47367 | 66.70873 | 57.60416 | 50.526671 | 44.822954 | 25752977 |
| High_36.count.gz | 9.077055 | 99.71765 | 77.39482 | 64.14093 | 55.02009 | 48.165543 | 42.684185 | 23666674 |
| 18.count.gz | 9.049187 | 99.36320 | 79.60463 | 66.50628 | 57.03210 | 49.715575 | 43.879888 | 28935600 |
| High_86.count.gz | 8.807016 | 99.77361 | 78.32886 | 65.31199 | 55.81293 | 48.466278 | 42.513733 | 22206013 |
| High_12.count.gz | 8.489625 | 99.74482 | 77.04250 | 63.40515 | 53.93752 | 46.813569 | 41.126373 | 24155174 |
| 39.count.gz | 8.106941 | 99.38185 | 76.17511 | 62.16131 | 52.44992 | 45.157326 | 39.403970 | 25978872 |
| Int_82.count.gz | 7.975941 | 99.73883 | 77.05455 | 63.09000 | 53.00807 | 45.232154 | 39.020945 | 21797273 |
| 54.count.gz | 7.937086 | 99.33145 | 76.88406 | 63.01676 | 53.02568 | 45.485615 | 39.283285 | 24817965 |
| High_33.count.gz | 7.874688 | 99.72497 | 75.97681 | 61.67296 | 51.79950 | 44.318601 | 38.386002 | 23380174 |
| High_80.count.gz | 7.773513 | 99.75079 | 76.16999 | 61.69398 | 51.66878 | 44.111138 | 38.088524 | 24190394 |
| High_49.count.gz | 7.731831 | 99.73604 | 76.00834 | 61.59852 | 51.39520 | 43.656254 | 37.631693 | 24191125 |
| Int_35.count.gz | 7.541201 | 99.75522 | 75.40206 | 60.88233 | 50.68652 | 43.047977 | 37.068597 | 23172436 |
| Int_17.count.gz | 7.281554 | 99.73633 | 75.47211 | 60.98990 | 50.94023 | 43.341045 | 37.305141 | 22553157 |
| 34.count.gz | 7.011751 | 99.40971 | 75.50422 | 60.54673 | 50.10957 | 42.204942 | 35.868151 | 25660988 |
| Int_21.count.gz | 6.886382 | 99.71423 | 74.66032 | 59.64688 | 49.21456 | 41.378999 | 35.190951 | 23214914 |
| 53.count.gz | 6.829299 | 99.35784 | 75.49648 | 60.20942 | 49.44772 | 41.301633 | 34.905403 | 26345583 |
| Int_23.count.gz | 6.813986 | 99.67825 | 73.49436 | 58.53009 | 48.32360 | 40.669133 | 34.623157 | 20425353 |
| Int_57.count.gz | 6.751986 | 99.72788 | 74.04793 | 58.74863 | 48.20546 | 40.381586 | 34.155247 | 23312945 |
| Int_69.count.gz | 6.639904 | 99.73374 | 73.91159 | 58.60106 | 48.07823 | 40.159171 | 33.904106 | 21184974 |
| Int_63.count.gz | 6.605658 | 99.71895 | 74.19397 | 58.75059 | 47.98614 | 39.736271 | 33.323698 | 19768166 |
| Int_81.count.gz | 6.378622 | 99.71889 | 73.40641 | 57.81837 | 47.06416 | 38.942709 | 32.598843 | 21257656 |
| Int_70.count.gz | 6.236685 | 99.68415 | 73.40455 | 58.01739 | 47.36094 | 39.360356 | 32.816137 | 21557376 |
| Int_47.count.gz | 6.134337 | 99.68388 | 73.15736 | 57.27252 | 46.38116 | 38.150014 | 31.785113 | 20929927 |
| Int_32.count.gz | 6.131886 | 99.71908 | 72.93141 | 57.35605 | 46.48672 | 38.413883 | 32.040661 | 21564758 |
| High_44.count.gz | 6.124445 | 99.71604 | 72.88332 | 57.04761 | 46.12693 | 38.026150 | 31.614380 | 20680374 |
| Int_83.count.gz | 6.084004 | 99.68110 | 72.30882 | 56.40473 | 45.59403 | 37.558769 | 31.230161 | 20318589 |
| Int_3.count.gz | 6.053316 | 97.68903 | 71.20957 | 55.65841 | 44.98270 | 37.049904 | 30.913633 | 22827498 |
| Int_62.count.gz | 5.901835 | 99.71699 | 71.91089 | 55.60938 | 44.41563 | 36.121537 | 29.746277 | 19381343 |
| Int_25.count.gz | 5.752339 | 99.72551 | 71.22117 | 54.60333 | 43.39072 | 35.146765 | 28.928368 | 21912913 |
| Int_38.count.gz | 5.653055 | 99.67303 | 71.23987 | 54.72050 | 43.40350 | 35.103084 | 28.727079 | 20497148 |
| 5.count.gz | 5.611495 | 98.76246 | 69.94545 | 53.32955 | 42.22566 | 34.112293 | 27.883923 | 21881537 |
| Int_16.count.gz | 5.272909 | 99.67877 | 69.26190 | 52.01544 | 40.74871 | 32.498832 | 26.254742 | 19365702 |
| 50.count.gz | 4.760655 | 99.32428 | 67.62335 | 49.38116 | 37.53467 | 29.139765 | 22.944734 | 21953482 |
| 6.count.gz | 4.548846 | 98.79816 | 67.41767 | 49.26991 | 37.24652 | 28.609672 | 22.106061 | 22408719 |
| Int_74.count.gz | 4.478580 | 99.72826 | 66.97881 | 47.88613 | 35.60765 | 27.045765 | 20.897400 | 15393556 |
| 48.count.gz | 4.325888 | 99.26088 | 64.83695 | 45.96816 | 33.89807 | 25.585511 | 19.539376 | 19971323 |
| 55.count.gz | 4.258633 | 99.31280 | 65.80320 | 46.93586 | 34.74120 | 26.142438 | 19.851242 | 19320650 |
| Int_84.count.gz | 2.169845 | 99.36514 | 46.05961 | 23.25649 | 12.39669 | 7.030094 | 4.208773 | 7774832 |
The results show that, to obtain a 2x coverage in more than 90% of the bases we need a coverage of >20x. Using a >10x coverage will give us a 2x read coverage on ~80% of the bases, while a 5/8X coverage will give us a 2x read coverage in 70-76% of the bases.
If we want to have at least 2 reads per allele in a diploid organism, we will need 4x coverage per site. we can see that on the >20x samples we have a 77% of the sites with 4x covered, while in the >10x samples there is a ~60% of the sites with 4x coverage per base. On the 5-8x range, we see that 50% of the sites are covered with a depth of 4x, so half of the sites can be called.
I wanted to see if there was a linear distribution between the coverage and the per-base coverage, so I constructed the following figure:
We can see that there is n exponential grown until we reach a point of stasis around 20x. That would mean that, in order to get higher per-base coverage in our data we would need a MUCH higher depth than 20x.
So far, with the depths that we have we cannot have 95% of the bases covered by at least 2x. Increasing the depth will become more and more expensive. I would recommend using a depth of 20x if possible or 10x if the resources are scarse, as we will have more than 50% of the bases called with 4x depth and ~90% of the bases with 2x coverage.