sybil: strepneumo: pangenome

strepneumo-sybil.igs.umaryland.edu as this URL will always point to the most recent version of the site.

Background

Large single species data sets, like the one found in strepneumo-sybil, are becoming more and more common as ultra-high throughput sequencing technology has become ubiquitous. The goal of the pan-genome is to estimate the gene repertoire size of a set of organisms (typically a single species) given a large number of sequenced strains.

The analysis involves 6 steps as described below:

All vs. All blastp and tblastn search. The tblastn search helps to eliminate annotation irregularities.
Aggregation of the blastp/tblastn results into a single hit graph.
Taking a random sample of all of the possible genome combinations then adding an out genome to that set and counting the number of new, core and shared genes the out genome contributes (table file below).
Creation of a bit matrix profile representation of the blast graph .(profile file below).
A random sampling of all possible genome combinations are taken and the total number of non-redundant* genes are counted as the pan-genome (pan-genome file below).
Plotting the results and fitting a model to estimate the overall genomic diversity.

*-There is a small amount of redundancy introduced by gene duplication events (paralogs).

New Genes

The new genes graph here shows each genome as a different colored circle when that genome is chosen as the 'out group'. The error bars show the 1st and 3rd quartiles and the inverted triangle is the mean. The model is drawn as a solid line fit to the means. The model was fit only on genome counts above 6. This strategy was implemented to avoid the left side bias that could skew the results.

This figure shows that as the number of sequenced strains increases the number of new genes decreases. The power-law model has a slope of -1.02 which (since it is greater than 1) indicates that the number of new genes will continue to decrease and this appears to be a slightly closed pan-genome. The right side of the graph has several instances where the error bars extend all the way to the bottom. This indicates that there are several genomes that contribute 0 or close to 0 new genes to the remaining set. The last mean value is ~8 and by the time 50 sequenced genomes are available this model suggests that there would be ~5 new genes per genome. This reduces to ~2.5 at 100 genomes. While this is technically a 'closed' pan-genome based on the method suggested by Tettelin et al. it is closing slowly such that new genes will still be found out to 100 sequenced genomes.

Core Genes

The core genes graph estimates the size of the core genome as more genomes are sequenced. The colored circles represent the count of core genes present when that particular genome is used as an out group. The error bars again show the 1st and 3rd quartiles and the diamonds show the medians. The power-law model is again used and is fit to the medians above 6 genomes.

In this figure the size of the core genome appears to decrease as more genomes are sequenced. However the slope of the fitted power-law function indicates that the decrease is very slow. In this case the core genome contains ~ 1865 genes when 32 genomes are considered. At 50 sequenced genomes this number is estimated to drop just slightly to ~ 1840 and at 100 genomes the model estimates the number of core genes could drop slightly more to ~1803 genes. This nearly flat slope suggests that the core-genome of S. pneumo is somewhat stable. Since the average S. pneumo genome is ~2100 genes this leaves ~300 dispensible genes to account for the variability of the species.

Pan-genome

The pan-genome estimates the trend of the entire gene repertoire of the species. This figure shows each sampled combination of genomes as a black ring. Because there is no 'out genome' in this analysis the rings cannot be colored. The error bars represent the 1st and third quartiles and the red triangle represents the median. The power law function is fit on the medians and is fit for all values.

This figure shows a growing pan-genome as more genomes are sequenced. However, the slope (0.1) of the power law again tells us that this growth is only slight. At 32 genomes this value is estimated to be ~3053 genes. At 50 genomes this value is estimated to grow to ~3192 and at 100 genomes ~3421. This somewhat contradicts what the new genes graph shows in that a slope above 0 (0.1) indicates an open pan-genome. This can be attributed to the nature of the S. pneumo data set being a fringe case. More sequenced genomes have the potential to provide additional genomic diversity but this return will most likely diminish as the number approaches 100 genomes.

Raw Data

The data files used to generate the above figures are available for download below:

Pangenome Table: This file contains the following columns: genome count, core genes, shared genes, new genes, , out genome
Pangenome Profile: This file contains a bit-matrix representation of the blast hits. Each genome is listed along the top and each gene is listed along the left.
Pangenome Output: This file is derived from the profile and contains the data used to draw the pan-genome curve. The columns are: Number of genomes, Pan-genome size, Genomes used.

References

Tettelin H., Riley D., Cattuto C., Medini D. Comparative genomics: the bacterial pan-genome, Current Opinion in Microbiology, Volume 11, Issue 5, Antimicrobials/Genomics, October 2008, Pages 472-477, ISSN 1369-5274, DOI: 10.1016/j.mib.2008.09.006.

Medini D., Donati C., Tettelin H., Masignani V., and Rappuoli R. (2005) The microbial pan-genome. Curr. Opin. Genet. Dev. 15, 589-594.

Tettelin H., Masignani M., Cieslewicz M.J., Donati C., Medini D., Ward N.L., Angiuoli S.V., Crabtree J., Jones A., Durkin A.S., DeBoy R.T., Davidsen T.M., Mora M., Scarselli M., Margarit y Ros I., Peterson J.D., Hauser C.R., Sundaram J.P., Nelson W.C., Madupu R., Brinkac L.M., Dodson R.J., Rosovitz M.J., Sullivan S.A., Daugherty S.C., Haft D.H., Selengut J., Gwinn M.L., Zhou L., Zafar N., Khouri H., Radune D., Dimitrov G., Watkins K., O'Connor K.J.B., Smith S., Utterback T.R., White O., Rubens C.E., Grandi G., Madoff L.C., Kasper D.L., Telford J.L., Wessels M.R., Rappuoli R., and Fraser C.M. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome." Proc. Natl. Acad. Sci. USA 102, 13950-13955. Erratum in: Proc. Natl. Acad. Sci. USA 102, 16530. Featured in Nature Reviews Microbiology. Featured in The Scientist.

sybil web site: sybil.sourceforge.net

e-mail: driley@som.umaryland.edu