Superphysics

Contingency, predictability in the evolution of a prokaryotic pangenome

Significance

Different strains of the same prokaryotic species often show significant variation in gene content.

We do not know whether this variation is due to:

  • genetic drift, or
  • selection
    • Selection predicts that sets of genes are consistently and repeatedly gained or lost together, or sequentially.

We used machine learning to predict variable genes in a large set of Escherichia coli strains, using other variable genes as predictors.

Most genes are predictable. This suggests selection plays a role in their acquisition, loss, and maintenance.

Some genes are consistently associated with the presence or absence of others.

These results have implications for understanding evolutionary dynamics in prokaryotic genomes.

Abstract

Prokaryotic species:

  • have remarkably variable pangenomes
  • maintain this variability through horizontal gene transfer and gene loss.

Repeated acquisitions of near-identical homologs can easily be observed across pangenomes.

Do these parallel events have similar evolutionary trajectories, or do they end up quite differently because of the different genetic backgrounds of the postacquisition recipients?

In this study, we present a machine learning method that predicts the presence or absence of genes in the Escherichia coli pangenome based on complex patterns of the presence or absence of other accessory genes within a genome.

Our analysis leverages the repeated transfer of genes through the E. coli pangenome to observe patterns of repeated evolution following similar events.

The presence or absence of genes is highly predictable from other genes alone. This shows that selection:

  • deterministically maintains gene–gene co-occurrence and avoidance relationships over long-term bacterial evolution
  • is robust to differences in host evolutionary history.

The pangenome is a set of genes with relationships* that govern their likely cohabitants, analogous to an ecosystem’s set of interacting organisms.

Superphysics Note
We call this part of Cartesian Relationality. The ecosystem is the system in Poincare’s Law of Relativity.

Intragenomic gene fitness effects may be the key drivers of prokaryotic evolution.

  • They influence the repeated emergence of complex gene–gene relationships across the pangenome.

Horizontal gene transfer

Evolution by horizontal gene transfer (HGT) and differential loss causes remarkable variation in gene content in bacterial genomes, both within and between populations (1–5).

Core genome - the genes that are present in all genomes in a collection

Accessory genes - the genes that are found only in some lineages

Pangenome - the union of core and accessory genes
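
The three definitions above map directly onto set operations. The following is a minimal sketch, assuming each genome is represented as a set of gene-family identifiers; the strain and gene names are hypothetical and are not taken from the study.

```python
def partition_pangenome(genomes):
    """Split a collection of genomes into core, accessory, and pangenome gene sets.

    `genomes` maps a strain name to the set of gene families it carries
    (an assumed representation, chosen for illustration).
    """
    gene_sets = [set(genes) for genes in genomes.values()]
    if not gene_sets:
        raise ValueError("at least one genome is required")
    pangenome = set().union(*gene_sets)      # every gene seen in any genome
    core = set.intersection(*gene_sets)      # genes present in all genomes
    accessory = pangenome - core             # genes found only in some genomes
    return core, accessory, pangenome


# Hypothetical toy collection of three strains.
genomes = {
    "strain_1": {"g1", "g2", "g3"},
    "strain_2": {"g1", "g2", "g4"},
    "strain_3": {"g1", "g3", "g5"},
}
core, accessory, pangenome = partition_pangenome(genomes)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
print("pangenome size:", len(pangenome))
```

Note that the core genome is defined relative to the collection: adding a genome that lacks a gene moves that gene from core to accessory.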

Horizontal gene transfer into a genome is mediated mainly by plasmids, phage, and transformation.

The presence or absence of specific genes (genetic background) can influence the presence or absence of others (6–8).

Consequently, the content of a prokaryotic genome:

  • is an outcome of its history of vertical and horizontal gene transmission
  • has emerged via a combination of internal (intragenomic) and external (ecological) fitness effects (9) in addition to stochastic, nonadaptive evolution (genetic drift).

It is also unclear whether evolutionary responses to the acquisition of a gene by HGT are sensitive or robust to differences in evolutionary history.

Evolutionary paths depend on unpredictable events.

Stephen J. Gould suggested that if we could replay history, it would not result in the same outcome each time.

This view is too rigid.

Parallel evolution experiments mimic replaying history. They suggest that:

  • historical contingency does have an effect
  • some aspects of evolution are deterministic and are likely to recur each time we replay the tape (11–15).

In prokaryote pangenome evolution, repeated HGT can introduce homologs of the same gene family into divergent genomes that contain unique but overlapping sets of genes.

The incorporation of these genes into different genetic backgrounds allows us to address the contingency-determinism question through retrospective analysis of the subsequent outcomes.

A deterministic outcome is when all, or most, recipient lineages evolve in similar ways after gene acquisition.

A non-deterministic outcome is one in which:

  • prior events, such as divergence in the gene content of the recipient genomes, play the more important role
  • postacquisition evolution therefore differs between lineages.

A deterministic evolutionary trajectory is one in which:

  • the acquisition of a gene in turn potentiates the acquisition, avoidance, retention, or loss of one or more other genes
  • evolutionary outcomes become highly likely due to the influence of intragenomic selection on genotypes.

Repeated acquisition and loss of a gene is insufficient to imply deterministic evolution.

Hallmarks of determinism would include:

  • the emergence of repeated biases in gene content, including the selective recruitment of another gene, or
  • selective loss of another gene, following horizontal transfer.

Evolution is stochastic. It is unlikely that gene content evolution is entirely deterministic or entirely driven by contingency. Instead, it falls somewhere on the spectrum between both extremes.

The question is which end of the spectrum it lies closer to.

Several thousand complete prokaryotic genomes are available, providing enough data to address the issue.

Therefore, we can ask whether a gene’s presence or absence in a genome is predictable, based solely on the gene content of the rest of a genome.

Predictability would imply deterministic evolution. Alternatively, if the presence or absence of a gene is not predictable, this is because it is either contingent on unaccounted-for differences in evolutionary history or driven solely by genetic drift.

To capture patterns more complex and subtle than pairwise gene associations, we used a Random Forest approach (23).

Random Forests aggregate information from individual decision trees, each of which summarises the conjunctions of features, not just pairwise comparisons, that lead to predictions of gene presence or absence.

A Random Forest approach can assess whether inferences are generalisable.
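
To make the idea concrete, here is a minimal pure-Python sketch of a Random Forest applied to presence/absence data. It is not the study's pipeline: the simulated genomes, the rule that the target gene is present only when genes 0 and 1 co-occur, and all function names and parameters are assumptions chosen for illustration. It shows the three ingredients described above: trees that capture conjunctions of features, aggregation by majority vote, and a check on held-out genomes to see whether the learned pattern generalises.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def build_tree(rows, labels, n_try, rng, depth=0, max_depth=8):
    """Grow one tree; each node considers only a random subset of genes."""
    # Genes that still vary among the genomes reaching this node.
    candidates = [f for f in range(len(rows[0])) if len({r[f] for r in rows}) == 2]
    if depth >= max_depth or len(set(labels)) == 1 or not candidates:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    best = None
    for f in rng.sample(candidates, min(n_try, len(candidates))):
        absent = [i for i, r in enumerate(rows) if r[f] == 0]
        present = [i for i, r in enumerate(rows) if r[f] == 1]
        score = (len(absent) * gini([labels[i] for i in absent])
                 + len(present) * gini([labels[i] for i in present])) / len(rows)
        if best is None or score < best[0]:
            best = (score, f, absent, present)
    _, f, absent, present = best
    branches = [
        build_tree([rows[i] for i in side], [labels[i] for i in side],
                   n_try, rng, depth + 1, max_depth)
        for side in (absent, present)
    ]
    return (f, branches[0], branches[1])


def predict_tree(tree, row):
    """Follow presence/absence splits down to a leaf label."""
    while isinstance(tree, tuple):
        f, absent_branch, present_branch = tree
        tree = present_branch if row[f] else absent_branch
    return tree


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample of the genomes."""
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], n_try, rng))
    return forest


def predict_forest(forest, row):
    """Majority vote across all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


# Simulated data (an assumption): four accessory genes per genome; the target
# gene is present only when genes 0 and 1 are both present.
data_rng = random.Random(42)
rows = [[data_rng.randint(0, 1) for _ in range(4)] for _ in range(250)]
labels = [r[0] & r[1] for r in rows]

forest = fit_forest(rows[:200], labels[:200])
test_rows, test_labels = rows[200:], labels[200:]
correct = sum(predict_forest(forest, r) == y for r, y in zip(test_rows, test_labels))
accuracy = correct / len(test_rows)
print(f"held-out accuracy: {accuracy:.2f}")
```

Scoring the forest on genomes it never saw during training is what separates a generalisable gene-gene relationship from one that only fits the training set. In practice an established library implementation would normally be preferred over a hand-written forest like this one.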

A substantial proportion of Escherichia coli accessory genes can be predicted by the other genes present.

E. coli has a large accessory genome (25, 26) and occupies a wide range of niches (27).

The E. coli pangenome has evolved divergent gene content over time. A gene that is horizontally transferred from one E. coli to another will often find itself in a considerably different ensemble genetic background.

We have analysed the predictability of gene content evolution following the repeated transfer of genes into these diverse genetic backgrounds. This is a natural equivalent of what Blount, Lenski, and Losos called a “historical difference experiment” (11).

We have classified the effects of an accessory gene’s presence on the presence or absence of other genes into three categories typically used by macroecologists to describe interactions between species. McInerney defined mutualism as a situation where two or more genes benefit from the association (9).

Here, we define putative mutualism as two genes predicting the presence of one another and each gene similarly influencing the likelihood of the other’s occurrence. This could be due to a genuine beneficial relationship between the two genes.

However, they might also both benefit from a common factor, which doesn’t necessarily have to be another gene.

Commensalism refers to the situation where one gene strongly depends on the presence of another, but the reverse dependence is much weaker or nonexistent. Competition is when two genes appear to avoid being in the same genome.

Note that we are not attributing specific behaviours to genes; these categories merely serve to describe observed patterns.
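
The three categories can be expressed as a simple rule over a pair of directional influence scores. The sketch below is illustrative only: the signed scores (positive when one gene's presence predicts the other's presence, negative when it predicts absence), the `strong` and `weak` thresholds, and the function name are all hypothetical. The text above does not state the criteria actually used.

```python
def classify_relationship(a_on_b, b_on_a, strong=0.5, weak=0.1):
    """Label a gene pair from two signed, directional influence scores.

    a_on_b: how strongly gene A's presence predicts gene B's presence
            (negative values mean A predicts B's absence).
    b_on_a: the same, in the other direction.
    The thresholds are arbitrary placeholders, not values from the study.
    """
    # Both genes predict the other's absence: apparent avoidance.
    if a_on_b <= -strong and b_on_a <= -strong:
        return "competition"
    # Both genes strongly predict the other's presence.
    if a_on_b >= strong and b_on_a >= strong:
        return "mutualism"
    # One direction is strong, the reverse is negligible.
    high, low = max(a_on_b, b_on_a), min(a_on_b, b_on_a)
    if high >= strong and abs(low) <= weak:
        return "commensalism"
    return "unclassified"


print(classify_relationship(0.8, 0.7))
print(classify_relationship(0.8, 0.05))
print(classify_relationship(-0.6, -0.7))
```

As the note above stresses, these labels describe observed patterns only; a "mutualism" label here could equally reflect a shared dependence on some common factor.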
