NobleBlocks
Wellcome Sanger Institute

Nonprofit · Cambridge, United Kingdom

Research output, citation impact, and the most-cited recent papers from Wellcome Sanger Institute (United Kingdom). Aggregated across the NobleBlocks index of 300M+ scholarly works.

Total works
22.0K
Citations
20.7M
h-index
1758
i10-index
43.5K
Also known as
The Sanger Centre, Wellcome Sanger Institute, Wellcome Trust Sanger Institute
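
For readers unfamiliar with the two index metrics listed above, the following is a minimal sketch of their usual definitions: the h-index is the largest h such that h works have at least h citations each, and the i10-index is the number of works with at least 10 citations. The function names and the sample citation counts are hypothetical, and how NobleBlocks computes its own figures is not stated on this page.

```python
def h_index(citations):
    """Largest h such that at least h works have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of works with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


# Hypothetical citation counts for five works (not taken from this page).
sample_counts = [25, 8, 5, 3, 3]
sample_h = h_index(sample_counts)
sample_i10 = i10_index(sample_counts)
```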

Top-cited papers from Wellcome Sanger Institute

The Sequence Alignment/Map format and SAMtools
Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell +4 more
2009 · Bioinformatics · 67.0K citations · doi:10.1093/bioinformatics/btp352

SUMMARY: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. AVAILABILITY: http://samtools.sourceforge.net.
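
To make the format concrete, here is a minimal sketch of reading one SAM alignment line in plain Python. It relies on the eleven mandatory tab-separated fields of the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) followed by optional TAG:TYPE:VALUE fields. The `SamRecord` type, the `parse_sam_line` helper, and the example read are hypothetical and are not part of SAMtools.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    tags = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        # Only integer tags are converted here; other types stay as strings.
        tags[tag] = int(value) if value_type == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# Made-up example read, for illustration only.
example_sam = "read1\t16\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0"
sam_record = parse_sam_line(example_sam)
is_reverse = bool(sam_record.flag & 0x10)  # FLAG bit 0x10: reverse strand
is_unmapped = bool(sam_record.flag & 0x4)  # FLAG bit 0x4: read unmapped
```

For real data, an established parser such as the one shipped with SAMtools is the better choice; this sketch only shows the field layout.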

Fast and accurate short read alignment with Burrows–Wheeler transform
Heng Li, Richard Durbin
2009 · Bioinformatics · 62.4K citations · doi:10.1093/bioinformatics/btp324

MOTIVATION: The enormous amount of short reads generated by the new DNA sequencing technologies calls for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. RESULTS: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows-Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is approximately 10-20x faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. AVAILABILITY: http://maq.sourceforge.net.
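
The idea of backward search with the Burrows-Wheeler Transform can be sketched in a few lines of Python. This is a toy illustration under clear assumptions: it handles exact matches only (BWA additionally allows mismatches and gaps), it builds the suffix array by naive sorting, and it stores full occurrence tables, none of which would scale to a human genome. The function names and the reference string are hypothetical.

```python
def build_bwt_index(text):
    """Return (suffix array, C table, occurrence table) for text + '$'."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII order
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' for i == 0).
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # c_table[c]: number of characters in text strictly smaller than c.
    c_table, total = {}, 0
    for c in alphabet:
        c_table[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return suffix_array, c_table, occ


def backward_search(pattern, suffix_array, c_table, occ):
    """Return sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(suffix_array)  # half-open suffix array interval
    for ch in reversed(pattern):   # extend the match one character leftwards
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[i] for i in range(lo, hi))


reference = "GATTACAGATTACA"  # made-up reference sequence
bwt_index = build_bwt_index(reference)
hits_tta = backward_search("TTA", *bwt_index)
hits_none = backward_search("GGG", *bwt_index)
```

Each pattern character narrows the suffix array interval in constant time given the tables, so the search cost depends on the read length rather than on the reference length.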

Initial sequencing and analysis of the human genome
Eric S. Lander, Lauren Linton, Bruce W. Birren, Chad Nusbaum +4 more
2001 · Nature · 24.5K citations · doi:10.1038/35057062

The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

A global reference for human genetic variation
Adam Auton, Gonçalo R. Abecasis, David M. Altshuler +4 more
2015 · Nature · 19.8K citations · doi:10.1038/nature15393

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies. Results for the final phase of the 1000 Genomes Project are presented including whole-genome sequencing, targeted exome sequencing, and genotyping on high-density SNP arrays for 2,504 individuals across 26 populations, providing a global reference data set to support biomedical genetics. The 1000 Genomes Project has sought to comprehensively catalogue human genetic variation across populations, providing a valuable public genomic resource. The data obtained so far have found applications ranging from association studies and fine mapping studies to the filtering of likely neutral variants in rare-disease cohorts. The authors now report on the final phase of the project, phase 3, which covers previously uncharacterized areas of human genetic diversity in terms of the populations sampled and categories of characterized variation. The sample now includes more than 2,500 individuals from 26 global populations, with low coverage whole-genome and deep exome sequencing, as well as dense microarray genotyping. 
They find that while most common variants are shared across populations, rarer variants are often restricted to closely related populations. The authors also demonstrate the use of the phase 3 dataset as a reference panel for imputation to improve the resolution in genetic association studies.

The variant call format and VCFtools
Petr Danecek, Adam Auton, Gonçalo R. Abecasis, Cornelis A. Albers +4 more
2011 · Bioinformatics · 17.6K citations · doi:10.1093/bioinformatics/btr330

SUMMARY: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. AVAILABILITY: http://vcftools.sourceforge.net
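
As an illustration of the format, the sketch below parses one VCF data line into its eight fixed, tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), as laid out in the VCF specification. The `parse_vcf_line` helper is hypothetical and is not part of VCFtools; it ignores header lines and per-sample genotype columns, and it keeps INFO values as strings.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line into a dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    chrom, pos, variant_id, ref, alt, qual, filt, info = columns[:8]
    info_fields = {}
    if info != ".":
        for entry in info.split(";"):
            key, _, value = entry.partition("=")
            # Flag-style INFO keys (no '=') are stored as True.
            info_fields[key] = value if value else True
    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position on the reference
        "id": None if variant_id == "." else variant_id,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_fields,
    }


# Example data line in the style of the VCF specification.
example_vcf = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
variant = parse_vcf_line(example_vcf)
```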

The FAIR Guiding Principles for scientific data management and stewardship
Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton +4 more
2016 · Scientific Data · 17.4K citations · doi:10.1038/sdata.2016.18

There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders, representing academia, industry, funding agencies, and scholarly publishers, have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.

Twelve years of SAMtools and BCFtools
Petr Danecek, James Bonfield, Jennifer Liddle, John Marshall +4 more
2021 · GigaScience · 15.5K citations · doi:10.1093/gigascience/giab008

BACKGROUND: SAMtools and BCFtools are widely used programs for processing and analysing high-throughput sequencing data. They include tools for file format conversion and manipulation, sorting, querying, statistics, variant calling, and effect analysis amongst other methods. FINDINGS: The first version appeared online 12 years ago and has been maintained and further developed ever since, with many new features and improvements added over the years. The SAMtools and BCFtools packages represent a unique collection of tools that have been used in numerous other software projects and countless genomic pipelines. CONCLUSION: Both SAMtools and BCFtools are freely available on GitHub under the permissive MIT licence, free for both non-commercial and commercial use. Both packages have been installed >1 million times via Bioconda. The source code and documentation are available from https://www.htslib.org.

The Pfam Protein Families Database
Alex Bateman
2002 · Nucleic Acids Research · 14.2K citations · doi:10.1093/nar/30.1.276

Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. Pfam is available on the World Wide Web in the UK at http://www.sanger.ac.uk/Software/Pfam/, in Sweden at http://www.cgb.ki.se/Pfam/, in France at http://pfam.jouy.inra.fr/ and in the US at http://pfam.wustl.edu/. The latest version (6.6) of Pfam contains 3071 families, which match 69% of proteins in SWISS-PROT 39 and TrEMBL 14. Structural data, where available, have been utilised to ensure that Pfam families correspond with structural domains, and to improve domain-based annotation. Predictions of non-domain regions are now also included. In addition to secondary structure, Pfam multiple sequence alignments now contain active site residue mark-up. New search tools, including taxonomy search and domain query, greatly add to the functionality and usability of the Pfam resource.

Fast and accurate long-read alignment with Burrows–Wheeler transform
Heng Li, Richard Durbin
2010 · Bioinformatics · 12.8K citations · doi:10.1093/bioinformatics/btp698

Motivation: Many programs for aligning short sequencing reads to a reference genome have been developed in the last 2 years. Most of them are very efficient for short reads but inefficient or not applicable for reads >200 bp because the algorithms are heavily and specifically tuned for short queries with low sequencing error rate. However, some sequencing platforms already produce longer reads and others are expected to become available soon. For longer reads, hashing-based software such as BLAT and SSAHA2 remain the only choices. Nonetheless, these methods are substantially slower than short-read aligners in terms of aligned bases per unit time. Results: We designed and implemented a new algorithm, Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW), to align long sequences up to 1 Mb against a large sequence database (e.g. the human genome) with a few gigabytes of memory. The algorithm is as accurate as SSAHA2, more accurate than BLAT, and is several to tens of times faster than both. Availability: http://bio-bwa.sourceforge.net Contact: rd@sanger.ac.uk
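
The Smith-Waterman part of the name refers to local alignment by dynamic programming. The sketch below computes only the best local alignment score between two short sequences. It is the plain quadratic-time recurrence with a linear gap penalty, not the BWA-SW algorithm itself, and the scoring values and function name are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=1, mismatch=-3, gap=-5):
    """Best local alignment score of a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# Made-up sequences sharing the substring "ACGT".
shared_score = smith_waterman_score("TTTACGTTTT", "GGACGTGG")
unrelated_score = smith_waterman_score("AAAA", "CCCC")
```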

A human gut microbial gene catalogue established by metagenomic sequencing
Junjie Qin, Ruiqiang Li, Jeroen Raes, Manimozhiyan Arumugam +4 more
2010 · Nature · 11.6K citations · doi:10.1038/nature08821

To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million non-redundant microbial genes, derived from 576.7 gigabases of sequence, from faecal samples of 124 European individuals. The gene set, ∼150 times larger than the human gene complement, contains an overwhelming majority of the prevalent (more frequent) microbial genes of the cohort and probably includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, indicating that the entire cohort harbours between 1,000 and 1,150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions present in all individuals and most bacteria, respectively. The human body plays host to an estimated 100 trillion microbial cells, most of them in the gut where they have a profound influence on human physiology and nutrition — and are now regarded as crucial for human life. Gut microbes contribute to the energy harvest from food, and changes of gut microbiome may be associated with bowel diseases or obesity. Now the international MetaHIT (Metagenomics of the Human Intestinal Tract) project has published a gene catalogue of the human gut microbiome derived from 124 healthy, overweight and obese human adults, as well as inflammatory disease patients, from Denmark and Spain. The resulting data provide the first insights into this gene set — which is over 150 times larger than the human gene complement — and show that the genes are largely shared among individuals. 
Based on the variety of functions encoded by the gene set, it is possible to define both a minimal gut metagenome and a minimal gut bacterial genome. Deep metagenomic sequencing and characterization of the human gut microbiome from healthy and obese individuals, as well as those suffering from inflammatory bowel disease, provide the first insights into this gene set and how much of it is shared among individuals. The minimal gut metagenome as well as the minimal gut bacterial genome is also described.

The mutational constraint spectrum quantified from variation in 141,456 humans
Konrad J. Karczewski, Laurent C. Francioli, Grace Tiao, Beryl B. Cummings +4 more
2020 · Nature · 10.0K citations · doi:10.1038/s41586-020-2308-7

Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes. Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human protein-coding genes along a spectrum that represents tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve the power of gene discovery for both common and rare diseases.

InterProScan 5: genome-scale protein function classification
Philip Jones, David Binns, Hsin-Yu Chang, Matthew Fraser +4 more
2014 · Bioinformatics · 9.9K citations · doi:10.1093/bioinformatics/btu031

Motivation: Robust large-scale sequence analysis is a major challenge in modern genomic science, where biologists are frequently trying to characterize many millions of sequences. Here, we describe a new Java-based architecture for the widely used protein function prediction software package InterProScan. Developments include improvements and additions to the outputs of the software and the complete reimplementation of the software framework, resulting in a flexible and stable system that is able to use both multiprocessor machines and/or conventional clusters to achieve scalable distributed data analysis. InterProScan is freely available for download from the EMBL-EBI FTP site and the open source code is hosted at Google Code. Availability and implementation: InterProScan is distributed via FTP at ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/ and the source code is available from http://code.google.com/p/interproscan/. Contact: http://www.ebi.ac.uk/support or interhelp@ebi.ac.uk or mitchell@ebi.ac.uk

Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls
Paul R. Burton, David G. Clayton, Lon R. Cardon, Nick Craddock +4 more
2007 · Nature · 9.6K citations · doi:10.1038/nature05911

There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined approximately 2,000 individuals for each of 7 major diseases and a shared set of approximately 3,000 controls. Case-control comparisons identified 24 independent association signals at P < 5 × 10⁻⁷: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn's disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a large number of further signals (including 58 loci with single-point P values between 10⁻⁵ and 5 × 10⁻⁷) likely to yield additional susceptibility loci. The importance of appropriately large samples was confirmed by the modest effect sizes observed at most loci identified. This study thus represents a thorough validation of the GWA approach. It has also demonstrated that careful use of a shared control group represents a safe and effective approach to GWA analyses of multiple disease phenotypes; has generated a genome-wide genotype database for future studies of common diseases in the British population; and shown that, provided individuals with non-European ancestry are excluded, the extent of population stratification in the British population is generally modest. Our findings offer new avenues for exploring the pathophysiology of these important disorders. 
We anticipate that our data, results and software, which will be widely available to other investigators, will provide a powerful resource for human genetics research.

An integrated map of genetic variation from 1,092 human genomes
Zamin Iqbal, Andy Rimmer, Anjali Gupta-Hinch +4 more
2012 · Nature · 8.2K citations · doi:10.1038/nature11632

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations. This report from the 1000 Genomes Project describes the genomes of 1,092 individuals from 14 human populations, providing a resource for common and low-frequency variant analysis in individuals from diverse populations; hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites, can be found in each individual. 
This report by the 1000 Genomes Project describes the genomes of 1,092 individuals from 14 human populations, providing a resource for common and low-frequency variant analysis in individuals from diverse populations. Integrative analyses reveal profiles of rare and common variants in different populations. The frequencies of rare variants vary across biological pathways, and hundreds of rare, non-coding variants at conserved sites — such as changes disrupting transcription-factor motifs — can be established for each individual.

A map of human genome variation from population-scale sequencing
Min Hu, Yuan Chen, James Stalker, Richard M. Durbin +4 more
2010 · Nature · 8.1K citations · doi:10.1038/nature09534

The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother–father–child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research. This issue of Nature contains the first publication from The 1000 Genomes Project, an international collaboration that will produce an extensive public catalogue of human genetic variation. 
The plan, in fact, is to sequence about 2,000 unidentified individuals from 20 populations around the world. This first paper presents the results from the project's pilot phase, testing three different strategies for genome-wide sequencing with high-throughput platforms: low-coverage whole-genome sequencing of 179 individuals in three population groups, high-coverage sequencing of two mother–father–child trios, and exon-targeted sequencing of 697 individuals from seven populations. The goal of the 1000 Genomes Project is to provide in-depth information on variation in human genome sequences. In the pilot phase reported here, different strategies for genome-wide sequencing, using high-throughput sequencing platforms, were developed and compared. The resulting data set includes more than 95% of the currently accessible variants found in any individual, and can be used to inform association and functional studies.

Intratumor Heterogeneity and Branched Evolution Revealed by Multiregion Sequencing
Marco Gerlinger, Andrew J. Rowan, Stuart Horswell, James Larkin +4 more
2012 · New England Journal of Medicine · 7.8K citations · doi:10.1056/nejmoa1113205

BACKGROUND: Intratumor heterogeneity may foster tumor evolution and adaptation and hinder personalized-medicine strategies that depend on results from single tumor-biopsy samples. METHODS: To examine intratumor heterogeneity, we performed exome sequencing, chromosome aberration analysis, and ploidy profiling on multiple spatially separated samples obtained from primary renal carcinomas and associated metastatic sites. We characterized the consequences of intratumor heterogeneity using immunohistochemical analysis, mutation functional analysis, and profiling of messenger RNA expression. RESULTS: Phylogenetic reconstruction revealed branched evolutionary tumor growth, with 63 to 69% of all somatic mutations not detectable across every tumor region. Intratumor heterogeneity was observed for a mutation within an autoinhibitory domain of the mammalian target of rapamycin (mTOR) kinase, correlating with S6 and 4EBP phosphorylation in vivo and constitutive activation of mTOR kinase activity in vitro. Mutational intratumor heterogeneity was seen for multiple tumor-suppressor genes converging on loss of function; SETD2, PTEN, and KDM5C underwent multiple distinct and spatially separated inactivating mutations within a single tumor, suggesting convergent phenotypic evolution. Gene-expression signatures of good and poor prognosis were detected in different regions of the same tumor. Allelic composition and ploidy profiling analysis revealed extensive intratumor heterogeneity, with 26 of 30 tumor samples from four tumors harboring divergent allelic-imbalance profiles and with ploidy heterogeneity in two of four tumors. CONCLUSIONS: Intratumor heterogeneity can lead to underestimation of the tumor genomics landscape portrayed from single tumor-biopsy samples and may present major challenges to personalized-medicine and biomarker development. 
Intratumor heterogeneity, associated with heterogeneous protein function, may foster tumor adaptation and therapeutic failure through Darwinian selection. (Funded by the Medical Research Council and others.).

Pfam: the protein families database
Robert Finn, Alex Bateman, Jody Clements, Penelope Coggill +4 more
2013 · Nucleic Acids Research · 6.5K citations · doi:10.1093/nar/gkt1223

Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface. We also discuss the mapping between Pfam and known 3D structures.

The Pfam protein families database: towards a more sustainable future
Robert Finn, Penelope Coggill, Ruth Y. Eberhardt, Sean R. Eddy +4 more
2015 · Nucleic Acids Research · 6.4K citations · doi:10.1093/nar/gkv1344

In the last two years the Pfam database (http://pfam.xfam.org) has undergone a substantial reorganisation to reduce the effort involved in making a release, thereby permitting more frequent releases. Arguably the most significant of these changes is that Pfam is now primarily based on the UniProtKB reference proteomes, with the counts of matched sequences and species reported on the website restricted to this smaller set. Building families on reference proteomes sequences brings greater stability, which decreases the amount of manual curation required to maintain them. It also reduces the number of sequences displayed on the website, whilst still providing access to many important model organisms. Matches to the full UniProtKB database are, however, still available and Pfam annotations for individual UniProtKB sequences can still be retrieved. Some Pfam entries (1.6%) which have no matches to reference proteomes remain; we are working with UniProt to see if sequences from them can be incorporated into reference proteomes. Pfam-B, the automatically-generated supplement to Pfam, has been removed. The current release (Pfam 29.0) includes 16 295 entries and 559 clans. The facility to view the relationship between families within a clan has been improved by the introduction of a new tool.

The Genome Sequence of Drosophila melanogaster
Mark D. Adams, S Celniker, Robert A. Holt, Cheryl Evans +4 more
2000 · Science · 6.0K citations · doi:10.1126/science.287.5461.2185

The fly Drosophila melanogaster is one of the most intensively studied organisms in biology and serves as a model system for the investigation of many developmental and cellular processes common to higher eukaryotes, including humans. We have determined the nucleotide sequence of nearly all of the approximately 120-megabase euchromatic portion of the Drosophila genome using a whole-genome shotgun sequencing strategy supported by extensive clone-based sequence and a high-quality bacterial artificial chromosome physical map. Efforts are under way to close the remaining gaps; however, the sequence is of sufficient accuracy and contiguity to be declared substantially complete and to support an initial analysis of genome structure and preliminary gene annotation and interpretation. The genome encodes approximately 13,600 genes, somewhat fewer than the smaller Caenorhabditis elegans genome, but with comparable functional diversity.

Roary: rapid large-scale prokaryote pan genome analysis
Andrew J. Page, Carla Cummins, Martin Hunt, Vanessa Wong +4 more
2015 · Bioinformatics · 5.8K citations · doi:10.1093/bioinformatics/btv421

UNLABELLED: A typical prokaryote population sequencing study can now consist of hundreds or thousands of isolates. Interrogating these datasets can provide detailed insights into the genetic structure of prokaryotic genomes. We introduce Roary, a tool that rapidly builds large-scale pan genomes, identifying the core and accessory genes. Roary makes construction of the pan genome of thousands of prokaryote samples possible on a standard desktop without compromising on the accuracy of results. Using a single CPU Roary can produce a pan genome consisting of 1000 isolates in 4.5 hours using 13 GB of RAM, with further speedups possible using multiple processors. AVAILABILITY AND IMPLEMENTATION: Roary is implemented in Perl and is freely available under an open source GPLv3 license from http://sanger-pathogens.github.io/Roary CONTACT: roary@sanger.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
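
The distinction between core and accessory genes can be sketched as a simple counting step over a gene presence/absence table. This is only an illustration of the final classification, assuming genes have already been clustered into families; it does not reproduce Roary's clustering pipeline. The `split_pan_genome` helper, the `core_fraction` threshold, and the isolate and gene names are all hypothetical.

```python
def split_pan_genome(presence, core_fraction=0.99):
    """Split gene families into core and accessory sets.

    presence maps isolate name -> iterable of gene families found in it.
    A family is 'core' if it occurs in at least core_fraction of isolates
    (the threshold value is an assumption for illustration).
    """
    isolate_count = len(presence)
    if isolate_count == 0:
        return set(), set()
    counts = {}
    for genes in presence.values():
        for gene in set(genes):  # count each family once per isolate
            counts[gene] = counts.get(gene, 0) + 1
    core = {g for g, n in counts.items() if n / isolate_count >= core_fraction}
    accessory = set(counts) - core
    return core, accessory


# Made-up presence/absence data for three isolates.
example_presence = {
    "isolate_1": ["dnaA", "gyrB", "blaTEM"],
    "isolate_2": ["dnaA", "gyrB"],
    "isolate_3": ["dnaA", "gyrB", "mecA"],
}
core_genes, accessory_genes = split_pan_genome(example_presence)
```

The pan genome is the union of the two sets, so `core_genes | accessory_genes` recovers every gene family seen in any isolate.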