Skip to the content.

Welcome to Evolink

Overview


Evolink is a phylogeny-based tool to detect genotypes (both positively and negatively associated ones) associated with a phenotype present in multi-species datasets (e.g. resistance, virulence, host and colony).

Identification of genotype-phenotype associations is a fundamental task not only in microbiology but also across all of biology. Yet as the amount of microbial data is rapidly increasing, the amount of identified gene families (~10^6) and size of phylogenetic trees (with > 10^5 leaves) make it difficult to apply current methods to link genotypes to phenotypes.

Utilizing phylogenetic information is a good way to control for population structure in microbial genotype-phenotype association analyses helping to avoid spurious findings. That’s why Evolink was developed based on the use of phylogeny.

Evolink is able to outperform other similar tools in terms of accuracy and efficiency, being able to handle large datasets in just minutes.

Installation


To install Evolink is easy and you have two choices.

Use Mamba

To expereince a much faster installation: Please install mamba first. If you use Linux, installing mamba is very easy:
wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
bash Mambaforge-$(uname)-$(uname -m).sh

The installation is very similar to using conda.

Use Conda

Alternatives

In addition to a local installation of Evolink, we provide both a docker image and webserver that can be used to run Evolink analyses.

Docker: https://hub.docker.com/r/nlmirpjianglab/evolink Webserver: https://jianglabnlm.com/evolink/

Input


Evolink takes 3 essential input files: 1) A species tree (in newick format). It is recommended that the tree is rooted. Internal node names are not necessary. For example:

(species_1:1,(species_2:1,(species_3:1,species_4:1)Internal_1:0.5)Internal_2:0.5)Root:0.1;

2) A trait/phenotype binary file (tab separated file). The column headers should be “Tip” and “Status”. The tip column contains the tip names the same as the tree, while the Status column contains the presence (1) and absence (0) of the phenotype for each leaf. So far only 1 or 0 is accepted and all leaves should be labeled with a 0/1 status. For example:

Tip Status
species_1 0
species_2 1
species_3 1
species_4 0

3) The gene presence/absence matrix file (tab separated file). Each row is the binary (0/1) status of each gene across all species. Each gene should appear in a species for at least one time. The first colname could be any word, but “orthoID” (orthogroup ID) is a nice choice to be shown here. For example:

orthoID species_1 species_2 species_3 species_4
gene_1 0 1 1 0
gene_2 1 1 0 0
gene_3 1 0 1 0
gene_4 0 1 0 1

Usage


usage: Evolink.py [-h] -g GENE_TABLE -t TRAIT_TABLE -n TREE [-m MODE] [-c]
                  [-p THRESHOLD] [-s SEED]
                  [--min-outlier-score-threshold THRESHOLD]
                  [--n-estimators NUMBER] [--max-samples PERCENTAGE]
                  [--gesd-mc-method METHOD] [--gesd-pval-threshold THRESHOLD]
                  [--gesd-padj-threshold THRESHOLD] [-z THRESHOLD]
                  [-e THRESHOLD] [--fold-times FOLD_TIMES]
                  [--perm-mc-method METHOD] [--perm-padj-threshold THRESHOLD]
                  [-v] [-N TOP_GENES] [-d {1,2}] [-f] -o OUTPUT

Evolink is designed to find gene families associated with a trait by explicitly
using phylogeny information.

optional arguments:
  -h, --help            show this help message and exit
  -g GENE_TABLE, --genotype GENE_TABLE
                        Tab-delimited gene presence/absence or copy number
                        table. Columns are gene families, while rows are tip
                        names/species/genomes in the phylogenetic tree. If
                        copy number table is provided, please use -c option so
                        that it will be internally converted to binary table.
                        Presence=1, Absence=0.
  -t TRAIT_TABLE, --phenotype TRAIT_TABLE
                        Two-column (so far only one trait is allowed each
                        time) tab-delimited trait presence/absence table. The
                        first column is tip names and the second column is the
                        presence/absence of this trait on the
                        tips/species/genomes. Presence=1, Absence=0.
  -n TREE, --phylogeny TREE
                        A phylogentic tree in newick format. The tip names
                        should be the same in the gene table and trait table.
  -m MODE, --mode MODE  Evolink has four modes insofar to detect phenotype-
                        assoicated genotypes: isolation_forest, gesd_test,
                        (modified) z_score, and cutoff. Running time:
                        isolation_forest > gesd_test > z_score > cutoff.
                        [Choices: isolation_forest, gesd_test, z_score,
                        cutoff; Default: isolation_forest]
  -c, --copy-number     The given gene table stores numbers (e.g. gene copy
                        numbers) instead of presence/absence binary values.
                        [Default: True]
  -p THRESHOLD, --p-threshold THRESHOLD
                        Absolute Prevalence index threshold to filter genes
                        and get Evolink index distribution [Range: 0-1;
                        Default: 0.9]
  -s SEED, --seed SEED  Set seed for simulation for reproducibility of the
                        results [Default: 1]
  --min-outlier-score-threshold THRESHOLD
                        A minimal threshold to determine outliers by
                        IsolationForest [Default: 0.7; Range: 0.5-1]
  --n-estimators NUMBER
                        Number of tree estimators used in IsolationForest
                        [Default: 200]
  --max-samples PERCENTAGE
                        Percentage of training samples for each tree in
                        IsolationForest [Default: 0.1]
  --gesd-mc-method METHOD
                        Multitest correction method [Choices: none,
                        bonferroni, fdr, holm, hommel; Default: none]
  --gesd-pval-threshold THRESHOLD
                        Original p-value threshold [Default:0.1]
  --gesd-padj-threshold THRESHOLD
                        Adjusted p-value threshold [Default:0.2]
  -z THRESHOLD, --z-score-threshold THRESHOLD
                        Absolute modified z-score threshold [Default:3.5]
  -e THRESHOLD, --e-threshold THRESHOLD
                        Absolute Evolink index threshold to select significant
                        genes. [Range: 0-1; Default: 0.375]
  --fold-times FOLD_TIMES
                        Simulate N*the bumber of genotype input provided by
                        users after filtering (namely simulate N*nrow(gene
                        matrix after filtering) times) [Default: 10]
  --perm-mc-method METHOD
                        Multitest correction method [Choices: bonferroni,
                        fdr_bh, holm, hommel; Default: fdr_bh]
  --perm-padj-threshold THRESHOLD
                        Adjusted p-value threshold [Default:0.001]
  -v, --visualization   Whether to generate plots
  -N TOP_GENES, --top-genes TOP_GENES
                        Top positively and negatively associated genes mapped
                        to tree. [Default: 5,5 for top 5 pos genes and top 5
                        neg genes.]
  -d {1,2}, --display-mode {1,2}
                        Tree display mode. [1: circular, 2: rectangular;
                        Default: 1]
  -f, --force           Force to overwrite output folder. [Default: False]
  -o OUTPUT, --output OUTPUT
                        output directory

Examples


  1. Unzip test data first
    unzip test_data/test_data.zip
    
  2. With a binary gene presence/absence table and no plots by default (the most common usage):
    python Evolink.py -g test_data/gene.tsv -t test_data/trait.tsv -n test_data/tree.nwk -o output_dir
    
  3. With a gene copy number table (add “-c” option):
    python Evolink.py -g test_data/gene_CN.tsv -c -t test_data/trait.tsv -n test_data/tree.nwk -o output_dir_CN
    
  4. Enable the plot function (add “-v” option. To save time, Evolink will not generate figures by default):
    python Evolink.py -g test_data/gene.tsv -t test_data/trait.tsv -n test_data/tree.nwk -o output_plot_dir -v
    
  5. “-N” is to map top nine positively and top eight negatively associated genotypes in the plot; “-d” is to use circular layout for the tree; “-f” is to force overwrite the output directory if it already exists:
    python Evolink.py -g test/gene.tsv -c -t test/trait.tsv -n test/tree.nwk -o output_dir -f -v -N 9,8 -d 1
    

Output


The basic output file from Evolink is named “result.tsv” and is generated in the output directory provided by the user. It includes “Evolink_index”, “Prevelance_index”, “significance” and “z_score”. The “Evolink_index” and “significance” are the most useful values for identifying positively and negatively associated genes. For example:

orthoID Prevalence_index Evolink_index pvalue pvalue.adj significance
g5777 -0.4625054136846366 0.10671872772220814 1.0 1.0 NA
g5778 -0.042440298419339606 0.5233140665345326 0.0030094935692830003 0.0030094935692830003 sig
g6658 -0.9507252999311486 -0.011385526945414837 NA NA NA
g6659 -0.07100522867170539 0.5214738941543501 0.003255689458338912 0.003255689458338912 sig
g6660 -0.9527124863232308 -0.009398340553332718 NA NA NA
g8004 0.19435017310947744 0.619534971390268 3.0051525400727557e-05 3.0051525400727557e-05 sig

When enabling the plot function (with -v or –visualization option), Evolink generates four types of figures in the output directory (see example outputs here and here):

1) iTOL website input

A tree file (input.tree) and annotation file (binary.txt) as well as a zipped file called Evolink_itol_input.zip are provided for users to visualize their results on the Tree of Life (iTOL).
a) You can simply upload the input.tree to iTOL website and drag binary.txt onto the webpage for visualization.
b) Or if you have a iTOL subscription API key and have installed iTOL API, you can use the following command line to upload and annotate your tree.
itol Evolink_itol_input.zip -i <your iTOL upload API key> -p <project_name>

2) ggtree plot for positively and negatively associated genotypes

Positive_genotypes_ggtree_plot.pdf ggtree_pos Negative_genotypes_ggtree_plot.pdf ggtree_neg

Evolink_plot.pdf Evolink_plot

4) Manhattan plot

Manhattan_plot.pdf Manhattan_plot

In addition, we also provided a script “Evolink_plot.R” to individually generate local figures after you get the result from Evolink (i.e. result.tsv):
Rscript --vanilla Evolink_plot.R -g test_data/gene.tsv -t test_data/trait.tsv -n test_data/tree.nwk -r test_data/output_dir/result.tsv -o test_data/plot_dir

Use case


We further provide a use case with Flagella data as an example. Click here for more details.

FAQ