Gist: Generative Inference of Sequence Taxonomy
Samantha Halliday
John Parkinson

The study of entire microbial communities using mRNA shotgun sequencing (metatranscriptomics) offers a unique view of gene activity across a large number of strains simultaneously. For a metatranscriptomic analysis to be thorough, both a taxonomic and a functional identity must be assigned to each read or transcript. In many studies, this information is computed using a single alignment or alignment pipeline, such as MG-RAST, but these approaches face a number of challenges arising from the enormous diversity of bacteria. High-quality taxonomic assignments make it possible to build compartmentalized gene network reconstructions (which properly isolate the cytosol of each cell), enable spore detection (e.g. of vancomycin-resistant Clostridians, which may instigate disorders such as autism and obesity), provide cleanly isolated bacterial exomes for easy assembly, and open the door to other sophisticated analyses that depend on understanding microbial gene activity in situ.

Gist (Generative Inference of Sequence Taxonomy) began as a project to develop a sequence classifier based on arbitrary input classes, and has since grown into a high-precision, noise-tolerant taxonomic classifier focused specifically on the problem of annotating short (76 nt or longer), unassembled reads with little or no quality filtering. Gist implements many of the techniques that have been published in recent years as promising methods for classifying metatranscriptomic data (alignment via BWA, composition analysis using naive Bayes and nearest neighbour) as well as a number of new or unusual techniques (on-the-fly gene translation using FragGeneScan, support for priors specified as expected abundances, composition analysis using Gaussian mixture models and Expected Codelta Correlation, and a neural-network-based approach for balancing method weights). Together, these allow Gist to match or exceed the performance of all existing methods on the datasets tested thus far.
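
To make two of the ideas above concrete (naive Bayes composition analysis, and priors given as expected abundances), here is a minimal Python sketch. It is not Gist's implementation and does not use Gist's code or file formats. The k-mer length, the pseudocount, the function names, and the two toy reference sequences are all assumptions chosen purely for illustration.

```python
import math
from collections import Counter
from itertools import product

K = 3  # k-mer length; an illustrative choice, not a Gist setting
ALPHABET = "ACGT"
ALL_KMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]


def kmer_counts(seq, k=K):
    """Count overlapping k-mers, skipping any with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(c in ALPHABET for c in kmer):
            counts[kmer] += 1
    return counts


def train(references, pseudocount=1.0):
    """Estimate per-class log k-mer probabilities with additive smoothing."""
    model = {}
    for label, seq in references.items():
        counts = kmer_counts(seq)
        total = sum(counts.values()) + pseudocount * len(ALL_KMERS)
        model[label] = {
            km: math.log((counts[km] + pseudocount) / total)
            for km in ALL_KMERS
        }
    return model


def classify(read, model, abundances):
    """Return (best_label, posteriors) for a single read.

    `abundances` maps each class to its expected relative abundance;
    the values are normalised and used as the class prior.
    """
    counts = kmer_counts(read)
    total_abundance = sum(abundances.values())
    log_scores = {}
    for label, log_probs in model.items():
        log_prior = math.log(abundances[label] / total_abundance)
        log_like = sum(n * log_probs[km] for km, n in counts.items())
        log_scores[label] = log_prior + log_like
    # Normalise in log space (log-sum-exp) to avoid underflow.
    peak = max(log_scores.values())
    norm = peak + math.log(sum(math.exp(s - peak) for s in log_scores.values()))
    posteriors = {label: math.exp(s - norm) for label, s in log_scores.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Toy reference sequences (made up for this sketch, not real genomes).
references = {
    "taxon_AT_rich": "ATATTTAAATATTATAAATTTATATAATATTTAAAT",
    "taxon_GC_rich": "GCGGCCGCGCCGGGCGCGCCGCGGCCCGCGGCGC",
}
model = train(references)
best, posteriors = classify(
    "ATTATATAAATTT", model, {"taxon_AT_rich": 0.5, "taxon_GC_rich": 0.5}
)
```

In this sketch the expected abundances enter only through the prior term, so a read that carries no usable k-mers falls back to the abundances themselves, while a long, compositionally distinctive read is dominated by its likelihood.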

Getting Gist


The current version of Gist is 0.7.17. Its source code can be downloaded from GitHub. This version is functionally complete, but not yet fully documented. Due to minor implementation restrictions, it runs only under Linux. See README.md for bare-bones installation instructions, usage tips, and licence information.

Classes and Datasets


A comprehensive database for Gist is still in development. In the meantime, users will need to construct smaller databases based on their expectations about the environment being sampled. The very-soon-to-be-released Genepuddle pipeline will expedite this; until then, users are encouraged to get acquainted with Flux Simulator.

Citing Gist


A manuscript is currently in preparation for submission. An early preprint can be found at bioRxiv.

Credits


Gist was written and is maintained by Samantha Halliday (rhetorica@cs.toronto.edu) with advice from her supervisor, John Parkinson (john.parkinson@utoronto.ca). Please feel free to send Samantha mail if you have any questions or requests.
