
Analysis of GC content:

As an initial screen for candidate genes that may have arisen through horizontal gene transfer, we analyzed GC content and searched for genes that deviated significantly from the organismal mean. We acquired gene sequences from 384 gastrointestinal bacteria from the Human Microbiome Project and calculated the GC content of each gene. Genes shorter than 300 bases were excluded to improve data quality. The mean GC content of an organism was defined as the arithmetic mean of the GC content of all genes in that organism. To determine which genes deviated significantly from the organism mean, which would suggest that the gene originated in another organism, we used an E-value cutoff. The E-value was defined as the two-sided p-value of a given gene (the probability of finding a gene with GC content at least as extreme as the one observed) multiplied by the total number of genes examined across all organisms. The p-value for the GC content of each gene was calculated by assuming that GC content is normally distributed within a given genome, with the distribution parameterized by the mean and standard deviation of the GC content of all genes in that genome. The E-value represents the expected number of false positives; genes with an E-value below 0.05 were taken as candidates that may have originated through horizontal gene transfer.
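The screening procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study. The function names, the input layout (a dictionary mapping each organism to its gene sequences), and the use of the sample standard deviation are assumptions made for the example. Ambiguous bases are simply counted as non-GC.

```python
import math
import statistics

MIN_GENE_LENGTH = 300   # genes shorter than this are excluded
E_VALUE_CUTOFF = 0.05   # expected number of false positives


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def two_sided_p_value(x, mean, sd):
    """P(|X - mean| >= |x - mean|) for X ~ Normal(mean, sd)."""
    z = abs(x - mean) / sd
    return math.erfc(z / math.sqrt(2))


def flag_candidates(genomes, cutoff=E_VALUE_CUTOFF):
    """Return (organism, gene_id, gc, e_value) tuples with e_value < cutoff.

    `genomes` is a hypothetical input format: {organism: {gene_id: sequence}}.
    """
    # GC content per gene, after dropping short genes.
    gc_by_org = {
        org: {gid: gc_content(s) for gid, s in genes.items()
              if len(s) >= MIN_GENE_LENGTH}
        for org, genes in genomes.items()
    }
    # The E-value multiplier is the number of examined genes in all organisms.
    total_genes = sum(len(gcs) for gcs in gc_by_org.values())

    candidates = []
    for org, gcs in gc_by_org.items():
        if len(gcs) < 2:
            continue  # a standard deviation needs at least two genes
        values = list(gcs.values())
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)  # assumption: sample standard deviation
        if sd == 0:
            continue  # no variation, so no gene can deviate
        for gid, gc in gcs.items():
            e_value = two_sided_p_value(gc, mean, sd) * total_genes
            if e_value < cutoff:
                candidates.append((org, gid, gc, e_value))
    # Most significant deviations first.
    return sorted(candidates, key=lambda c: c[3])
```

Multiplying each p-value by the total gene count is a Bonferroni-style correction, so the cutoff of 0.05 bounds the expected number of false positives across the whole dataset rather than per genome.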

 

Phylogenetic analysis using DarkHorse:

The top 49 candidate genes flagged by the GC analysis were cross-referenced against DarkHorse, a sequence and phylogenetic analysis tool. DarkHorse moves beyond parametric methods by building lineage trees that are used to calculate ‘lineage probability index’ (LPI) scores. These scores are inversely proportional to phylogenetic distance, which allows us to identify genes conserved between distantly related organisms. The DarkHorse database set was used for the analysis, with the search parameters set to species-level phylogenetic granularity and LPI values below 0.6. The flagged proteins were run through an advanced search feature that compares query sequences with proteins in the DarkHorse database. Matches with an E-value below 1E-5 were then retrieved from the database together with their precomputed DarkHorse analysis. This cross-analysis is summarized in a table in the Results section.
