Extracting the cut-sites coordinates

To extract the fragments product of restriction enzymes, we took the P. rubi genome and extracted all the regions that matched to the gc[at]gc cut site of ApeKi.

library(ape)
library(stringr)

#Reading the data, transforming DNAbin to char and concatenating
#rubi <- read.dna("~/Documents/rubi/maker/rubi/Pr4671.fa",format = "fasta")
rubi <- read.dna("~/Documents/P_rubi/Genomes/Genomes/rubi/Pr4671.fa",format = "fasta")


contig <- as.character(rubi)
contig <- lapply(contig, function(x)paste(x,collapse = ""))

coord <- lapply(contig, function(x) str_locate_all(x,"gc[at]gc"))
coord <- lapply(coord,as.data.frame)

#Adiing 100bp to each side
coord_plus <- lapply(coord,function (x) (data.frame(x[,1]-100,x[,1]+100)))

#Binding with names
bind_plus <- do.call(rbind, coord_plus)


#Creating a row with the scaffolds
bind_plus <- data.frame(sapply(strsplit(rownames(bind_plus),split = ".",fixed = T), "[",1), bind_plus)
colnames(bind_plus) <- c("chrom","start","end")
full_coord <- bind_plus


#getting rid of the negative values 
full_coord$start[full_coord$start < 0] <- 0

#Exporting the table
write.table(full_coord,"~/Documents/P_rubi/GBS/GBS_Data/coverage/coordinates.bed",sep = "\t",row.names = F,col.names = F,quote = F)

We have a total of 359657 fragments WITH OVERLAP. To determine the regions without overlap we used the mergeBED tool from Bed Tools and recovered the full data set:

Total nucleotides with an associated read

tn <- read.table("~/Documents/P_rubi/GBS/GBS_Data/coverage/overlap_coords.bed")
total_nuc <- sum(tn$V3 - tn$V2)

The ApeKI enzyme will have 123677 non-overlapping cut sites, and a total of 42725124 nucleotides. When compared to the total number of nucleotides of P. rubi (74’863,594), we have approx. 57% of the total genome covered.

Generating the Per-Base depth

First, we combined the old reads (GBS_023) with the newly sequenced regions (GBS_0023 with GBS_0039) using the /nfs0/Grunwald_Lab/home/tabimaj/GBS/GBS_AllData/join_fastq.sh script. Then we mapped the GBS reads using bowtie to the reference genome (script found in /nfs0/Grunwald_Lab/home/tabimaj/GBS/GBS_AllData/bt2b.sh), and used samtools with the cut-site coordinate file (/nfs0/Grunwald_Lab/home/tabimaj/GBS/GBS_AllData/coverage/coordinates.bed) to only get the stats from the cut-site fragments (script: /nfs0/Grunwald_Lab/home/tabimaj/GBS/GBS_AllData/coverage/pileup_BED.sh)

Importing read counts from pileup

Using R and the *.pileups I created a script to determine the Percentage per-base coverage for the entire genome. The logic to do it this way instead of fragment wide is:

  1. If the SNP exist, it has to exist in any of the existent sites, and the pileup coverage includes insertions-deletions as well.
  2. This process is 10x fold faster than a per-fragment coverage series.
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## 17
## 18
## 19
## 20
## 21
## 22
## 23
## 24
## 25
## 26
## 27
## 28
## 29
## 30
## 31
## 32
## 33
## 34
## 35
## 36
## 37
## 38
## 39
## 40
## 41
## 42
## 43
## 44
## 45
## 46
## 47
## 48
## 49
## 50
## 51
## 52
## 53
## 54
## 55
## 56
## 57
## 58
## 59
## 60

Results:

Nucleotides sequenced

To determine if there is a lineal distribution of number of nucleotides with an associated read, we plotted the following figure:

library(ggplot2)
library(reshape)
m.pt <- melt(percent_table, id=c("Coverage"))
ggplot(m.pt[m.pt$variable %in% "Nucleotides", ],aes(x=Coverage,y=value)) + geom_point(aes(colour=variable)) + stat_smooth(method = "lm", formula = y ~ log(x))

We can see that there is an stabilization curve around 20x coverage.

Coverage and depth percentage per sample using GBS data.

Coverage 1x 2x 3x 4x 5x More Nucleotides
67.count.gz 27.007145 99.43423 89.85944 82.74950 77.08869 72.233658 68.009499 34470792
1.count.gz 26.457935 99.39702 90.00193 82.74105 76.85816 71.825399 67.490299 35795761
78.count.gz 24.238156 99.49045 88.66940 80.87201 74.70216 69.580735 65.104175 33925961
79.count.gz 22.437104 99.42087 87.72088 79.52779 73.23027 68.000345 63.530731 32567639
75.count.gz 17.050890 99.41300 85.65763 76.22972 68.82014 62.804007 57.823250 31495492
76.count.gz 15.293737 99.40367 85.24969 75.16324 67.35481 61.015667 55.864646 31654673
85.count.gz 14.757131 99.36930 84.11736 73.78436 65.93461 59.711240 54.612830 30040721
High_29.count.gz 14.692067 99.74608 83.36598 72.77794 64.91098 58.666175 53.505434 27989355
High_41.count.gz 14.000667 99.78390 83.07087 72.09353 63.98004 57.637527 52.454588 28591560
High_14.count.gz 13.107033 99.75090 81.59400 70.05939 61.68530 55.304267 50.156312 27143479
31.count.gz 12.968872 99.31352 83.22468 72.01508 63.64846 56.968595 51.472726 30078988
51.count.gz 12.934093 99.45264 82.42267 70.96581 62.50878 55.924023 50.692116 30238479
High_30.count.gz 12.698345 99.78198 81.85155 70.80637 62.86455 56.693584 51.638468 23478574
High_59.count.gz 11.893600 99.80816 81.34917 69.74445 61.39723 54.898057 49.569993 24119848
High_65.count.gz 11.520750 99.74339 80.87797 69.05864 60.57598 54.018144 48.732308 25406084
High_24.count.gz 11.389265 99.73145 82.80158 71.73534 63.39961 56.748214 51.149416 27013978
56.count.gz 11.033276 99.43679 79.63105 66.95136 57.94832 51.271121 45.869824 28126225
High_87.count.gz 10.836501 99.72687 79.64680 67.40109 58.55860 51.726947 46.240795 24372548
High_68.count.gz 10.654542 99.76586 79.82599 67.56783 58.93113 52.254931 46.809959 22742983
10.count.gz 9.972904 99.36495 80.12445 67.23835 57.82938 50.781386 45.233381 29872629
40.count.gz 9.888200 99.32548 80.52204 67.83267 58.45472 51.223917 45.406787 29711541
71.count.gz 9.724286 99.37934 79.88620 66.95471 57.61360 50.288901 44.347746 27768135
High_19.count.gz 9.264691 99.75454 79.47367 66.70873 57.60416 50.526671 44.822954 25752977
High_36.count.gz 9.077055 99.71765 77.39482 64.14093 55.02009 48.165543 42.684185 23666674
18.count.gz 9.049187 99.36320 79.60463 66.50628 57.03210 49.715575 43.879888 28935600
High_86.count.gz 8.807016 99.77361 78.32886 65.31199 55.81293 48.466278 42.513733 22206013
High_12.count.gz 8.489625 99.74482 77.04250 63.40515 53.93752 46.813569 41.126373 24155174
39.count.gz 8.106941 99.38185 76.17511 62.16131 52.44992 45.157326 39.403970 25978872
Int_82.count.gz 7.975941 99.73883 77.05455 63.09000 53.00807 45.232154 39.020945 21797273
54.count.gz 7.937086 99.33145 76.88406 63.01676 53.02568 45.485615 39.283285 24817965
High_33.count.gz 7.874688 99.72497 75.97681 61.67296 51.79950 44.318601 38.386002 23380174
High_80.count.gz 7.773513 99.75079 76.16999 61.69398 51.66878 44.111138 38.088524 24190394
High_49.count.gz 7.731831 99.73604 76.00834 61.59852 51.39520 43.656254 37.631693 24191125
Int_35.count.gz 7.541201 99.75522 75.40206 60.88233 50.68652 43.047977 37.068597 23172436
Int_17.count.gz 7.281554 99.73633 75.47211 60.98990 50.94023 43.341045 37.305141 22553157
34.count.gz 7.011751 99.40971 75.50422 60.54673 50.10957 42.204942 35.868151 25660988
Int_21.count.gz 6.886382 99.71423 74.66032 59.64688 49.21456 41.378999 35.190951 23214914
53.count.gz 6.829299 99.35784 75.49648 60.20942 49.44772 41.301633 34.905403 26345583
Int_23.count.gz 6.813986 99.67825 73.49436 58.53009 48.32360 40.669133 34.623157 20425353
Int_57.count.gz 6.751986 99.72788 74.04793 58.74863 48.20546 40.381586 34.155247 23312945
Int_69.count.gz 6.639904 99.73374 73.91159 58.60106 48.07823 40.159171 33.904106 21184974
Int_63.count.gz 6.605658 99.71895 74.19397 58.75059 47.98614 39.736271 33.323698 19768166
Int_81.count.gz 6.378622 99.71889 73.40641 57.81837 47.06416 38.942709 32.598843 21257656
Int_70.count.gz 6.236685 99.68415 73.40455 58.01739 47.36094 39.360356 32.816137 21557376
Int_47.count.gz 6.134337 99.68388 73.15736 57.27252 46.38116 38.150014 31.785113 20929927
Int_32.count.gz 6.131886 99.71908 72.93141 57.35605 46.48672 38.413883 32.040661 21564758
High_44.count.gz 6.124445 99.71604 72.88332 57.04761 46.12693 38.026150 31.614380 20680374
Int_83.count.gz 6.084004 99.68110 72.30882 56.40473 45.59403 37.558769 31.230161 20318589
Int_3.count.gz 6.053316 97.68903 71.20957 55.65841 44.98270 37.049904 30.913633 22827498
Int_62.count.gz 5.901835 99.71699 71.91089 55.60938 44.41563 36.121537 29.746277 19381343
Int_25.count.gz 5.752339 99.72551 71.22117 54.60333 43.39072 35.146765 28.928368 21912913
Int_38.count.gz 5.653055 99.67303 71.23987 54.72050 43.40350 35.103084 28.727079 20497148
5.count.gz 5.611495 98.76246 69.94545 53.32955 42.22566 34.112293 27.883923 21881537
Int_16.count.gz 5.272909 99.67877 69.26190 52.01544 40.74871 32.498832 26.254742 19365702
50.count.gz 4.760655 99.32428 67.62335 49.38116 37.53467 29.139765 22.944734 21953482
6.count.gz 4.548846 98.79816 67.41767 49.26991 37.24652 28.609672 22.106061 22408719
Int_74.count.gz 4.478580 99.72826 66.97881 47.88613 35.60765 27.045765 20.897400 15393556
48.count.gz 4.325888 99.26088 64.83695 45.96816 33.89807 25.585511 19.539376 19971323
55.count.gz 4.258633 99.31280 65.80320 46.93586 34.74120 26.142438 19.851242 19320650
Int_84.count.gz 2.169845 99.36514 46.05961 23.25649 12.39669 7.030094 4.208773 7774832

The results show that, to obtain a 2x coverage in more than 90% of the bases we need a coverage of >20x. Using a >10x coverage will give us a 2x read coverage on ~80% of the bases, while a 5/8X coverage will give us a 2x read coverage in 70-76% of the bases.

If we want to have at least 2 reads per allele in a diploid organism, we will need 4x coverage per site. we can see that on the >20x samples we have a 77% of the sites with 4x covered, while in the >10x samples there is a ~60% of the sites with 4x coverage per base. On the 5-8x range, we see that 50% of the sites are covered with a depth of 4x, so half of the sites can be called.

I wanted to see if there was a linear distribution between the coverage and the per-base coverage, so I constructed the following figure:

We can see that there is n exponential grown until we reach a point of stasis around 20x. That would mean that, in order to get higher per-base coverage in our data we would need a MUCH higher depth than 20x.

Conclusions

So far, with the depths that we have we cannot have 95% of the bases covered by at least 2x. Increasing the depth will become more and more expensive. I would recommend using a depth of 20x if possible or 10x if the resources are scarse, as we will have more than 50% of the bases called with 4x depth and ~90% of the bases with 2x coverage.