Key Features:

  • Enterotype Classification: XGBoost regression and LASSO classification models, trained with a 10-times repeated 10-fold cross-validation procedure and tested on an independent validation dataset, for accurate classification of microbiome samples into enterotypes.
  • Different Enterotype Models: By default, Enterotyper uses the 3-enterotype model derived from FKM (fuzzy k-means) clustering, but it also offers a 2-enterotype model as well as 2-, 3-, and 4-enterotype models derived from PAM (partitioning around medoids) clustering.
  • Enterotype Dysbiosis Score (EDS): Quantifies the dysbiosis state of microbiome samples within the enterotype landscape using a novel scoring system to facilitate health-related microbial studies. The EDS is calculated only when using the default 3-enterotype FKM model. The maximum enterotype strength of a sample is inverted, z-score normalized using the center and standard deviation of the global enterotype training data as a reference, and scaled to obtain values between 0 and 1. A higher EDS indicates a more dysbiotic microbial community for the sample.
  • Comprehensive Dataset: Built on a large-scale global dataset of 16,772 fecal metagenomic samples from 129 studies. Take a closer look at the global training dataset on this world map.
  • Compatible with various taxonomic classifiers: Enterotyper accepts taxonomic profiles from a variety of taxonomic classifiers. See below [link to Usage/Input part] for more information.
  • Data Visualization: Evaluate enterotype classifications and EDS with built-in visualization output for a quick first assessment.
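
The EDS steps described in the list above (take the maximum strength, invert, z-score against the training reference, scale to 0..1) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Enterotyper's actual code: the function name, the use of negation for the inversion, the clipping bounds z_low and z_high, and the example center and standard deviation are all hypothetical.

```python
def eds_sketch(strengths, center, sd, z_low=-3.0, z_high=3.0):
    """Illustrative EDS-like score in [0, 1] (hypothetical, not the tool's code).

    strengths: predicted strengths for the three FKM enterotypes of one sample.
    center, sd: reference values taken from the training data (assumed given).
    z_low, z_high: assumed clipping bounds used for the final 0..1 scaling.
    """
    max_strength = max(strengths)         # strongest enterotype membership
    z = (max_strength - center) / sd      # z-score against the training reference
    inverted = -z                         # invert: weak membership gives a high value
    clipped = min(max(inverted, z_low), z_high)
    return (clipped - z_low) / (z_high - z_low)


# A confidently classified sample versus an ambiguous one (made-up numbers).
confident = eds_sketch([0.9, 0.05, 0.05], center=0.7, sd=0.1)
ambiguous = eds_sketch([0.4, 0.3, 0.3], center=0.7, sd=0.1)
print(confident < ambiguous)
```

With these made-up numbers, the ambiguous sample receives the higher score, matching the statement that a higher EDS indicates a more dysbiotic community.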

How it works

The enterotype assignments provided by Enterotyper are independent of de novo clustering. This allows for robust identification across studies, including in datasets that are too small for de novo clustering.

We constructed XGBoost regression models to predict the strength of each enterotype determined by FKM clustering. The prediction models were trained on the GTDB genus-level taxonomic profiles of the dataset using the train function of the caret package in R with a 10-times repeated 10-fold cross-validation procedure. Before model construction, minor genera with an average relative abundance of <1E-4 were excluded, and only genera at or above this threshold were used for training. To avoid overfitting due to multiple samples derived from the same individual, we ensured that these samples were assigned exclusively to either the training or the evaluation data in each cross-validation fold. Model accuracy was evaluated by applying the prediction models to a held-out validation dataset of 347 samples from three studies. Separate models were constructed for each enterotype (i.e., two models for the two-enterotype clustering and three models for the three-enterotype clustering), and the highest strength obtained from these models was used for the enterotype classification and EDS of each sample.
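
Two of the preprocessing steps above, filtering minor genera and keeping all samples of one individual in the same cross-validation fold, can be illustrated in plain Python. The original models were trained with caret in R, so this is only a sketch: the function names, the dictionary-based data layout, and the greedy fold-balancing strategy are assumptions made for illustration.

```python
from collections import defaultdict


def filter_abundant_genera(profiles, threshold=1e-4):
    """Return genera whose mean relative abundance is at or above the threshold.

    profiles: dict mapping sample id -> dict of genus -> relative abundance
    (hypothetical layout; genera missing from a sample count as 0).
    """
    genera = {genus for profile in profiles.values() for genus in profile}
    n_samples = len(profiles)
    return sorted(
        genus for genus in genera
        if sum(p.get(genus, 0.0) for p in profiles.values()) / n_samples >= threshold
    )


def grouped_folds(sample_to_individual, k):
    """Split samples into k folds so that no individual spans two folds."""
    groups = defaultdict(list)
    for sample, individual in sample_to_individual.items():
        groups[individual].append(sample)
    folds = [[] for _ in range(k)]
    # Greedy balancing (an assumption): place the largest groups first,
    # each into the currently smallest fold.
    for individual in sorted(groups, key=lambda i: (-len(groups[i]), i)):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(groups[individual])
    return folds


profiles = {
    "s1": {"Bacteroides": 0.6, "RareGenus": 0.00001},
    "s2": {"Bacteroides": 0.4, "Prevotella": 0.3},
}
print(filter_abundant_genera(profiles))
print(grouped_folds({"a1": "A", "a2": "A", "b1": "B", "c1": "C"}, k=2))
```

Each fold would then serve once as evaluation data while the remaining folds are used for training, and the whole split would be repeated with different groupings for a repeated cross-validation.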

Additionally, binary LASSO classification models were constructed to predict enterotypes determined by PAM clustering. These models were trained using the same procedure as described above. The highest score obtained from the LASSO classification models was used for enterotype assignment.
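
Because one model is built per enterotype, the final assignment reduces to picking the enterotype whose model returns the highest value. A minimal sketch of that step, assuming the per-enterotype scores are already available (the function name and the labels "ET1" to "ET3" are placeholders, not Enterotyper's output format):

```python
def assign_enterotype(scores):
    """Pick the enterotype with the highest model output.

    scores: dict mapping an enterotype label -> strength or score predicted
    by that enterotype's model for a single sample (hypothetical layout).
    """
    label = max(scores, key=scores.get)
    return label, scores[label]


label, value = assign_enterotype({"ET1": 0.2, "ET2": 0.7, "ET3": 0.1})
print(label, value)
```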


Enterotype concept

In an attempt to simplify the complex structure of the fecal microbiome, taxonomic profiles have been grouped into distinct, reproducible microbial community clusters (often at the genus level) called ‘enterotypes,’ which are dominated by and typically named after specific taxa. Enterotypes were introduced in 2011 by Arumugam et al. based on 33 samples using partitioning around medoids (PAM) clustering and have been repeatedly confirmed in continuously growing datasets (Costea et al. 2017). The latest work used 16,772 metagenomes for PAM clustering and introduced fuzzy k-means (FKM) clustering-based enterotyping, which accounts for the inherently continuous nature of the microbiome by allowing overlapping clusters. Fuzzy clustering reports a classification strength for each sample and each enterotype, reflecting the consistency of the enterotype classification across multiple clustering iterations. The Enterotype Dysbiosis Score (EDS) is calculated from this classification strength, with lower strength indicating higher dysbiosis.

In addition, we recommend the following papers for becoming familiar with the enterotype concept:

Other work explores alternative methods to identify subclusters in fecal microbiome composition. These align well with the enterotype concept: