|
Large single species data sets, like the one found in strepneumo-sybil,
are becoming more and more common as ultra-high throughput sequencing
technology has become ubiquitous. The goal of the pan-genome is to
estimate the gene repertoire size of a set of organisms (typically
a single species) given a large number of sequenced strains.
The analysis involves 6 steps as described below:
- All vs. All blastp and tblastn search. The tblastn search helps to
eliminate annotation irregularities.
- Aggregation of the blastp/tblastn results into a single hit graph.
- Taking a random sample of all of the possible genome combinations
then adding an out genome to that set and counting the number of
new, core and shared genes the out genome contributes
(table file below).
- Creation of a bit matrix profile representation of the blast graph
.(profile file below).
- A random sampling of all possible genome combinations are taken and
the total number of non-redundant* genes are counted as the
pan-genome (pan-genome file below).
- Plotting the results and fitting a model to estimate the
overall genomic diversity.
*-There is a small amount of redundancy introduced by gene duplication events (paralogs). |
The new genes graph here shows each genome as a different colored circle when that genome is chosen as the
'out group'. The error bars show the 1st and 3rd quartiles and the inverted triangle is the mean. The model is
drawn as a solid line fit to the means. The model was fit only on genome counts above 6. This strategy was
implemented to avoid the left side bias that could skew the results.
This figure shows that as the number of sequenced strains increases the number of new genes decreases.
The power-law model has a slope of -1.03 which (since it is greater than 1) indicates that the number of new genes
will continue to decrease and this appears to be a slightly closed pan-genome. The right side of the graph has
several instances where the error bars extend all the way to the bottom. This indicates that there are several
genomes that contribute 0 or close to 0 new genes to the remaining set. The last mean value is ~7 and by the time
50 sequenced genomes are available this model suggests that there would be ~5 new genes per genome. This reduces
to ~2.5 at 100 genomes. While this is technically a 'closed' pan-genome based on the method suggested by
Tettelin et al. it is closing slowly such that new genes will still be found out to 100 sequenced genomes.
|
This new genes graph adds the out-group mitis genome. Like the first graph it shows each genome as a different colored circle when that
genome is chosen as the 'out group'. The error bars show the 1st and 3rd quartiles and the inverted triangle is the mean. The model is
drawn as a solid line fit to the means. The model was fit only on genome counts above 6. This strategy was
implemented to avoid the left side bias that could skew the results.
This plot has been provided to show how a genome from a different but highly related species (S. mitis shown in yellow) can provide enough
diversity to change the plot from showing a nearly closed pan-genome to a confidently open one. Here the slope of -0.846 is less than 1 indicating
that the number of new genes found for each additional genome could remain high. at 34 the average number of new genes found per genome is ~12. This
number is expected to fall to ~9 at 50 genomes and ~5 by 100 genomes. This means that if S. mitis genomes are sampled along with pneumo genomes one
could expect to find new genes at approximately twice the rate of just sampling S. pneumo genomes.
|
The core genes graph estimates the size of the core genome as more genomes are sequenced. The colored circles
represent the count of core genes present when that particular genome is used as an out group. The error bars
again show the 1st and 3rd quartiles and the triangles show the means. An power law model is used and is
fit to the means above 6 genomes.
In this figure the size of the core genome appears to decrease as more genomes are
sequenced. However the slope of the fitted power-law function indicates that the decrease is very slow. In this case
the core genome contains ~ 1858 genes when 34 genomes are considered. At 50 sequenced genomes this number is
estimated to drop just slightly to ~ 1836 and at 100 genomes the model estimates the number of core genes could drop
slightly more to ~1797 genes. This nearly flat slope suggests that the core-genome of S. pneumo is somewhat stable.
Since the average S. pneumo genome is ~2100 genes this leaves ~300 dispensible genes to account for the variability
of the species.
|
This core genes graph adds the S. mitis genome but uses the same method as the pneumo only one. The colored circles
represent the count of core genes present when that particular genome is used as an out group. The error bars
again show the 1st and 3rd quartiles and the triangles show the means. An power law model is used and is
fit to the means above 6 genomes.
The addition of S. mitis (shown as yellow circles) significantly increases the speed at which the core-genome shrinks.
At 34 genomes the size of the core genome is estimated to be ~1672. This is significantly less than the estimate derived using pneumo-only data
(1858). At 50 genomes that number drops to ~1619 and at 100 it ~1528. This increases the number of dispensible genes from ~300 for pneumo only
to ~500 if S. mitis is included.
|
The pan-genome estimates the trend of the entire gene repertoire of the species. This figure shows each sampled
combination of genomes as a black ring. Because there is no 'out genome' in this analysis the rings cannot be
colored. The error bars represent the 1st and third quartiles and the red triangle represents the median. The power
law function is fit on the medians and is fit to the means above 6 genomes.
This figure shows a growing pan-genome as more genomes are sequenced. However, the slope
(~0.1) of the power law again tells us that this growth is slow. At 34 genomes this value is estimated to be
~3033 genes. At 50 genomes this value is estimated to grow to ~3142 and at 100 genomes ~3343. This somewhat
contradicts what the new genes graph shows in that a slope above 0 (0.1) indicates an open pan-genome. This can be
attributed to the nature of the S. pneumo data set being a fringe case. More sequenced genomes have the potential
to provide additional genomic diversity but this return will most likely diminish as the number approaches 100
genomes.
|
This plot incorporates the S. mitis genome while utilizing the same procedure described above. Each sampled
combination of genomes is show as a black ring. Because there is no 'out genome' in this analysis the rings cannot be
colored. The error bars represent the 1st and third quartiles and the red triangle represents the median. The power
law function is fit on the medians and is fit to the means above 6 genomes.
The rate of pan-genome growth increases significantly, as one would expect, with the addition of S. mitis. The slope of
the regression is ~0.1215. At 34 genomes this comes out to ~3229 genes vs. ~3033 with pneumo only. At 50 genomes the gene repertoire grows to
~3384 (3142 with pneumo only) and at 100 the prediction is ~3681 (3343 with pneumo only). This correlates well to the new genes and core genes
graph and shows what an out group genome can show.
|
The data files used to generate the above figures are available for download below:
|
Tettelin H., Riley D., Cattuto C., Medini D. Comparative genomics: the bacterial pan-genome, Current Opinion in Microbiology, Volume 11, Issue 5, Antimicrobials/Genomics, October 2008, Pages 472-477, ISSN 1369-5274, DOI: 10.1016/j.mib.2008.09.006.
Medini D., Donati C., Tettelin H., Masignani V., and Rappuoli R. (2005) The microbial pan-genome. Curr. Opin. Genet. Dev. 15, 589-594.
Tettelin H., Masignani M., Cieslewicz M.J., Donati C., Medini D., Ward N.L., Angiuoli S.V., Crabtree J., Jones A., Durkin A.S., DeBoy R.T., Davidsen T.M., Mora M., Scarselli M., Margarit y Ros I., Peterson J.D., Hauser C.R., Sundaram J.P., Nelson W.C., Madupu R., Brinkac L.M., Dodson R.J., Rosovitz M.J., Sullivan S.A., Daugherty S.C., Haft D.H., Selengut J., Gwinn M.L., Zhou L., Zafar N., Khouri H., Radune D., Dimitrov G., Watkins K., O'Connor K.J.B., Smith S., Utterback T.R., White O., Rubens C.E., Grandi G., Madoff L.C., Kasper D.L., Telford J.L., Wessels M.R., Rappuoli R., and Fraser C.M. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome." Proc. Natl. Acad. Sci. USA 102, 13950-13955. Erratum in: Proc. Natl. Acad. Sci. USA 102, 16530. Featured in Nature Reviews Microbiology. Featured in The Scientist.
|
|