|
A new version of strepneumo exists at http://strepneumo-sybil.igs.umaryland.edu. This site will be removed in the coming months. Please bookmark http://strepneumo-sybil.igs.umaryland.edu as this URL will always point to the most recent version of the site.
Large single species data sets, like the one found in strepneumo-sybil,
are becoming more and more common as ultra-high throughput sequencing
technology has become ubiquitous. The goal of the pan-genome is to
estimate the gene repertoire size of a set of organisms (typically
a single species) given a large number of sequenced strains.
The analysis involves 6 steps as described below:
- All vs. All blastp and tblastn search. The tblastn search helps to
eliminate annotation irregularities.
- Aggregation of the blastp/tblastn results into a single hit graph.
- Taking a random sample of all of the possible genome combinations
then adding an out genome to that set and counting the number of
new, core and shared genes the out genome contributes
(table file below).
- Creation of a bit matrix profile representation of the blast graph
.(profile file below).
- A random sampling of all possible genome combinations are taken and
the total number of non-redundant* genes are counted as the
pan-genome (pan-genome file below).
- Plotting the results and fitting a model to estimate the
overall genomic diversity.
*-There is a small amount of redundancy introduced by gene duplication events (paralogs). |
The new genes graph here shows each genome as a different colored circle when that genome is chosen as the
'out group'. The error bars show the 1st and 3rd quartiles and the inverted triangle is the mean. The model is
drawn as a solid line fit to the means. The model was fit only on genome counts above 6. This strategy was
implemented to avoid the left side bias that could skew the results.
This figure shows that as the number of sequenced strains increases the number of new genes decreases.
The power-law model has a slope of -1.02 which (since it is greater than 1) indicates that the number of new genes
will continue to decrease and this appears to be a slightly closed pan-genome. The right side of the graph has
several instances where the error bars extend all the way to the bottom. This indicates that there are several
genomes that contribute 0 or close to 0 new genes to the remaining set. The last mean value is ~8 and by the time
50 sequenced genomes are available this model suggests that there would be ~5 new genes per genome. This reduces
to ~2.5 at 100 genomes. While this is technically a 'closed' pan-genome based on the method suggested by
Tettelin et al. it is closing slowly such that new genes will still be found out to 100 sequenced genomes.
|
The core genes graph estimates the size of the core genome as more genomes are sequenced. The colored circles
represent the count of core genes present when that particular genome is used as an out group. The error bars
again show the 1st and 3rd quartiles and the diamonds show the medians. The power-law model is again used and is
fit to the medians above 6 genomes.
In this figure the size of the core genome appears to decrease as more genomes are
sequenced. However the slope of the fitted power-law function indicates that the decrease is very slow. In this case
the core genome contains ~ 1865 genes when 32 genomes are considered. At 50 sequenced genomes this number is
estimated to drop just slightly to ~ 1840 and at 100 genomes the model estimates the number of core genes could drop
slightly more to ~1803 genes. This nearly flat slope suggests that the core-genome of S. pneumo is somewhat stable.
Since the average S. pneumo genome is ~2100 genes this leaves ~300 dispensible genes to account for the variability
of the species.
|
The pan-genome estimates the trend of the entire gene repertoire of the species. This figure shows each sampled
combination of genomes as a black ring. Because there is no 'out genome' in this analysis the rings cannot be
colored. The error bars represent the 1st and third quartiles and the red triangle represents the median. The power
law function is fit on the medians and is fit for all values.
This figure shows a growing pan-genome as more genomes are sequenced. However, the slope
(0.1) of the power law again tells us that this growth is only slight. At 32 genomes this value is estimated to be
~3053 genes. At 50 genomes this value is estimated to grow to ~3192 and at 100 genomes ~3421. This somewhat
contradicts what the new genes graph shows in that a slope above 0 (0.1) indicates an open pan-genome. This can be
attributed to the nature of the S. pneumo data set being a fringe case. More sequenced genomes have the potential
to provide additional genomic diversity but this return will most likely diminish as the number approaches 100
genomes.
|
The data files used to generate the above figures are available for download below:
- Pangenome Table: This file contains the
following columns: genome count, core genes, shared genes, new genes, , out genome
- Pangenome Profile: This file contains
a bit-matrix representation of the blast hits. Each genome is listed along the top and each gene is listed along
the left.
- Pangenome Output: This file is derived
from the profile and contains the data used to draw the pan-genome curve. The columns are: Number of genomes,
Pan-genome size, Genomes used.
|
Tettelin H., Riley D., Cattuto C., Medini D. Comparative genomics: the bacterial pan-genome, Current Opinion in Microbiology, Volume 11, Issue 5, Antimicrobials/Genomics, October 2008, Pages 472-477, ISSN 1369-5274, DOI: 10.1016/j.mib.2008.09.006.
Medini D., Donati C., Tettelin H., Masignani V., and Rappuoli R. (2005) The microbial pan-genome. Curr. Opin. Genet. Dev. 15, 589-594.
Tettelin H., Masignani M., Cieslewicz M.J., Donati C., Medini D., Ward N.L., Angiuoli S.V., Crabtree J., Jones A., Durkin A.S., DeBoy R.T., Davidsen T.M., Mora M., Scarselli M., Margarit y Ros I., Peterson J.D., Hauser C.R., Sundaram J.P., Nelson W.C., Madupu R., Brinkac L.M., Dodson R.J., Rosovitz M.J., Sullivan S.A., Daugherty S.C., Haft D.H., Selengut J., Gwinn M.L., Zhou L., Zafar N., Khouri H., Radune D., Dimitrov G., Watkins K., O'Connor K.J.B., Smith S., Utterback T.R., White O., Rubens C.E., Grandi G., Madoff L.C., Kasper D.L., Telford J.L., Wessels M.R., Rappuoli R., and Fraser C.M. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome." Proc. Natl. Acad. Sci. USA 102, 13950-13955. Erratum in: Proc. Natl. Acad. Sci. USA 102, 16530. Featured in Nature Reviews Microbiology. Featured in The Scientist.
|
|