ML algorithms are primarily categorised into supervised and unsupervised learning based on the presence or absence of labels in the given data. Supervised learning uses labelled data, while unsupervised learning uses unlabelled data to discover patterns or clusters. This study focuses on four families of unsupervised learning methods in the disease prediction domain: partitioning clustering, model-based clustering, hierarchical clustering and density-based clustering.
2.1 Unsupervised machine learning algorithms
Unsupervised ML, also known as clustering, involves grouping data into clusters so that objects within the same cluster are similar to one another while being dissimilar to objects in other clusters [8]. Clustering is a type of unsupervised classification since there are no predefined classes.
Figure 1 shows how unsupervised ML techniques identify three groups in a two-dimensional dataset. The dataset consists of 100 randomly generated data points divided into three groups based on their similarity. Different colours represent the clusters on the scatter plot, and circular bounds are drawn around each cluster to make their boundaries easier to see.
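A dataset of this kind can be generated with a few lines of code. The sketch below is a minimal illustration using scikit-learn's make_blobs with assumed parameter values (100 samples, three centres, a fixed seed); it is not the exact generator behind Figure 1.

```python
from sklearn.datasets import make_blobs

# 100 random two-dimensional points in three groups (illustrative values).
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)
print(X.shape)                  # (100, 2): 100 points, 2 dimensions
print(sorted(set(y.tolist())))  # [0, 1, 2]: the three groups
```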
2.1.1 Partitioning clustering
Partitioning clustering requires the analyst to specify the number of clusters to be generated. K-means is the most widely used partitioning clustering algorithm [18]. Figure 2 illustrates the standard k-means procedure. The first step selects \(k\) points as the initial centroids. Each data point is then assigned to the cluster with the nearest centroid, after which the centroid of each cluster is recomputed from its assigned points. These assignment and update steps are repeated until the centroids no longer change. This study uses two popular k-means variants: classic k-means and mini-batch k-means [19].
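As a concrete illustration, the following sketch runs both variants with scikit-learn; the synthetic data, the choice of \(k = 3\), and the remaining parameter values are assumptions for illustration rather than the settings used in this study.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Illustrative two-dimensional data with three groups.
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

# Classic k-means: assign each point to the nearest centroid, recompute the
# centroids from the assigned points, repeat until the centroids converge.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mini-batch k-means: the same idea, but centroids are updated from small
# random batches of the data, which is faster on large datasets.
mbk_labels = MiniBatchKMeans(n_clusters=3, batch_size=32, n_init=10,
                             random_state=0).fit_predict(X)
print(kmeans_labels[:10], mbk_labels[:10])
```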
2.1.2 Model-based clustering
Model-based clustering is another unsupervised ML method. It is a probabilistic approach to clustering that uses Gaussian mixture models (GMMs) to represent data as a mixture of Gaussian distributions [20]. A GMM is a probabilistic model that attempts to fit a dataset with a combination of Gaussian distributions. It evaluates the likelihood of each data point belonging to each cluster, in contrast to classic k-means clustering, which allocates each data point to a single cluster. This enables a more flexible and accurate representation of data distributions, particularly when dealing with overlapping or non-spherical clusters [20]. Figure 3 shows how a GMM with four components is fitted to the data, with the resulting clusters coloured. Ellipses depict the spread and orientation of each Gaussian component, illustrating the probabilistic character of the clustering process. We also use the Bayesian Gaussian Mixture (BGM) [21] for performance comparison.
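A minimal sketch of this procedure, assuming scikit-learn's GaussianMixture and BayesianGaussianMixture with illustrative data and four components as in Figure 3, could look as follows.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture

# Illustrative data with four groups, matching the four components of Figure 3.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# GMM with full covariance matrices: each component is an ellipse-shaped
# Gaussian, and every point gets a probability of belonging to each component.
gmm = GaussianMixture(n_components=4, covariance_type="full",
                      random_state=0).fit(X)
hard_labels = gmm.predict(X)        # most likely component per point
soft_labels = gmm.predict_proba(X)  # per-component membership probabilities

# Bayesian Gaussian Mixture, used in this study for comparison.
bgm = BayesianGaussianMixture(n_components=4, random_state=0).fit(X)
bgm_labels = bgm.predict(X)
```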
2.1.3 Hierarchical clustering
Hierarchical clustering generates a group of nested clusters arranged in a hierarchical tree structure. This can be represented through a dendrogram, a tree-like diagram that records the sequence of merges or splits [22]. Figure 4 shows an example of a dendrogram for hierarchical clustering. There are two main types of hierarchical clustering: agglomerative and divisive. Agglomerative clustering starts with each point as its own cluster; at each step the nearest pair of clusters is merged, and the merging continues until a single cluster or a specified number of clusters remains, depending on the parameters set at the outset [22]. Divisive clustering, by contrast, begins with a single, all-encompassing cluster; at each step a cluster is split, and the splitting continues until every cluster contains only an individual point or a predetermined number of clusters is reached, depending on the initial setup of the procedure [22]. This study uses both agglomerative and divisive clustering for comparison.
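A minimal sketch of the agglomerative side, assuming scikit-learn's AgglomerativeClustering and SciPy's dendrogram with illustrative parameters, is shown below; divisive clustering has no direct scikit-learn counterpart and is therefore omitted from the sketch.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

# Agglomerative (bottom-up): start with every point as its own cluster and
# repeatedly merge the closest pair until three clusters remain.
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# The sequence of merges can be visualised as a dendrogram (cf. Figure 4).
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()
```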
2.1.4 Density-based clustering
The density-based method relies on density as the local cluster criterion, for example treating a cluster as a maximal set of density-connected points. Density-based clustering can discover clusters of arbitrary shape and handles noise in the data effectively. It requires only a single scan of the data, examining the local region around each point to validate its density, but it needs density parameters to be specified as a termination condition [22]. Density-based spatial clustering of applications with noise (DBSCAN) is a well-known example of a density-based method [23]. It labels high-density regions as clusters and points in low-density regions as outliers, which allows it to discover clusters of varied shapes and to handle noise without requiring the number of clusters to be set in advance [23]. Figure 5 visualises the clusters obtained with DBSCAN.
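The following is a minimal sketch of DBSCAN with scikit-learn; the moon-shaped synthetic data and the eps and min_samples values are illustrative assumptions, not settings from this study.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Non-spherical (moon-shaped) clusters with a little noise, where centroid-based
# methods struggle but density-based clustering works well.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# eps and min_samples are the density parameters; points in low-density
# regions are labelled -1 (noise/outliers).
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)  # number of clusters found and number of noise points
```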
2.2 Performance comparison measures
The performance of the various unsupervised ML methods is assessed using several evaluation measures: Adjusted Rand Index (ARI), Adjusted Mutual Information (AMI), Homogeneity, Completeness, V-measure, and the Silhouette Coefficient. These measures provide the comparative performance metrics used in this study.
2.2.1 Adjusted Rand Index
The Adjusted Rand Index (ARI) is a chance-corrected modification of the Rand index. It calculates a similarity measure between two clusterings by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters in both the predicted and the true clusterings [24]. The formula for ARI is
$$\mathrm{ARI} = \frac{\mathrm{RI} - \mathrm{Expected\ RI}}{\mathrm{Max(RI)} - \mathrm{Expected\ RI}}$$
where \(RI\) is the Rand Index, \(Expected \,RI\) is the Expected Rand Index, and \(Max\,(RI)\) is the maximum possible Rand Index.
The ARI value lies between -1 and 1, where 1 means identical clusterings and -1 means completely dissimilar clusterings. An ARI of 0 indicates random labelling.
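In practice the ARI is computed directly from two label vectors; the sketch below uses scikit-learn's adjusted_rand_score with made-up labels for illustration.

```python
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]   # ground-truth classes
y_perm = [1, 1, 0, 0, 2, 2]   # same grouping, clusters merely renamed
y_bad  = [0, 1, 2, 0, 1, 2]   # no pair grouped as in the ground truth

print(adjusted_rand_score(y_true, y_perm))  # 1.0: identical clusterings
print(adjusted_rand_score(y_true, y_bad))   # negative: below chance level
```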
2.2.2 Adjusted Mutual Information
Adjusted Mutual Information (AMI) modifies the Mutual Information (MI) score to account for chance [25]. It acknowledges that MI tends to increase as the number of clusters grows, independent of the actual amount of shared information between the clusterings. The formula for AMI is
$$\mathrm{AMI}(X, Y) = \frac{\mathrm{MI} - \mathrm{Expected\ MI}}{\mathrm{Max}(H(X), H(Y)) - \mathrm{Expected\ MI}}$$
where \(MI\) is the mutual information, \(H(X)\) and \(H(Y)\) are the entropies of \(X\) and \(Y\), and \(Expected\,MI\) is the expected mutual information. AMI takes a maximum value of 1 when the two clusterings agree perfectly. A score close to 0 (occasionally slightly negative) suggests largely independent clusterings, i.e. a result no better than random chance.
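A minimal sketch of the chance correction, using scikit-learn's mutual_info_score and adjusted_mutual_info_score on made-up random labels, is given below.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, mutual_info_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=1000)   # 3 "true" classes
y_rand = rng.integers(0, 30, size=1000)  # 30 random, unrelated clusters

# Plain MI is positive even for unrelated labelings with many clusters,
# whereas AMI stays close to 0 for chance-level agreement.
print(mutual_info_score(y_true, y_rand))           # small but positive
print(adjusted_mutual_info_score(y_true, y_rand))  # close to 0
```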
2.2.3 Homogeneity
Homogeneity is a clustering measure that compares the clustering result to a ground truth. A cluster is homogeneous if it is made up entirely of data points from a single class [26]. The formula for Homogeneity is
$$\mathrm{Homogeneity} = 1 - \frac{H(Y_{\mathrm{true}} \mid Y_{\mathrm{predict}})}{H(Y_{\mathrm{true}})}$$
where \({Y}_{true}\) is the ground truth and \({Y}_{predict}\) is the predicted clustering. \(H({Y}_{true}|{Y}_{predict})\) is the conditional entropy of the ground truth given the cluster predictions, and \(H({Y}_{true})\) is the entropy of the ground truth. Homogeneity also varies between 0 and 1. A score of 1 means that each cluster contains only members of a single class, signifying perfect Homogeneity. A score of 0 indicates that the clusters mix the classes as thoroughly as random assignment, lacking any homogeneity.
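The following sketch, using scikit-learn's homogeneity_score on made-up labels, illustrates the definition.

```python
from sklearn.metrics import homogeneity_score

y_true = [0, 0, 1, 1]
# Every cluster is pure (even though each class is split): homogeneity = 1.0.
print(homogeneity_score(y_true, [0, 1, 2, 3]))
# One cluster mixing both classes: homogeneity = 0.0.
print(homogeneity_score(y_true, [0, 0, 0, 0]))
```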
2.2.4 Completeness
Completeness is another clustering evaluation metric, determining whether all data points belonging to a given class are placed in the same cluster. The clustering result is deemed complete when each class is contained within a single cluster [26]. The formula for this measure is
$$\mathrm{Completeness} = 1 - \frac{H(Y_{\mathrm{predict}} \mid Y_{\mathrm{true}})}{H(Y_{\mathrm{predict}})}$$
where \({Y}_{true}\) is the ground truth and \({Y}_{predict}\) is the predicted clustering. \(H({Y}_{predict}|{Y}_{true})\) is the conditional entropy of the cluster predictions given the ground truth, and \(H({Y}_{predict})\) is the entropy of the cluster predictions. Completeness ranges from 0 to 1. A score of 1 is achieved when all members of each class are assigned to the same cluster, indicating complete capture of the classes within the clusters. A score of 0 implies that the cluster assignments are scattered across classes without capturing the class structure.
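A complementary sketch with scikit-learn's completeness_score and made-up labels:

```python
from sklearn.metrics import completeness_score

y_true = [0, 0, 1, 1]
# All members of each class share a cluster (even a merged one): completeness = 1.0.
print(completeness_score(y_true, [0, 0, 0, 0]))
# Cluster labels carry no information about the classes: completeness = 0.0.
print(completeness_score(y_true, [0, 1, 0, 1]))
```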
2.2.5 V-measure
The V-measure is the harmonic mean between Homogeneity and Completeness [26], and the formula is
$$\mathrm{V\text{-}measure} = \frac{2 \times \mathrm{Homogeneity} \times \mathrm{Completeness}}{\mathrm{Homogeneity} + \mathrm{Completeness}}$$
The V-measure lies between 0 and 1. A score of 1 represents perfect clustering, with every class captured completely within a cluster and each cluster containing only members of a single class. A score of 0 indicates that the clustering fails on both homogeneity and completeness grounds.
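The relation can be checked numerically; the sketch below assumes scikit-learn's metric functions and made-up labels for illustration.

```python
from sklearn.metrics import (completeness_score, homogeneity_score,
                             v_measure_score)

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 1]

h = homogeneity_score(y_true, y_pred)
c = completeness_score(y_true, y_pred)
v = v_measure_score(y_true, y_pred)
print(v, 2 * h * c / (h + c))  # the two values agree (harmonic mean)
```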
2.2.6 Silhouette coefficient
The silhouette coefficient is used in cluster analysis to assess clustering quality. For each data point it compares the average distance to points in its own cluster with the average distance to points in the nearest neighbouring cluster, measuring how well the point fits its allocated cluster [27]. The formula is
$$\mathrm{Silhouette} = \frac{b_{i} - a_{i}}{\max(a_{i}, b_{i})}$$
where \({a}_{i}\) is the average distance from point \(i\) to the other points in its own cluster, and \({b}_{i}\) is the average distance from point \(i\) to the points in the nearest neighbouring cluster. Silhouette Coefficient values range from -1 to 1. A score of 1 denotes that the clusters are well apart from each other and clearly distinguished. A score of 0 indicates overlapping clusters. A negative value suggests that data points might have been assigned to the wrong clusters. This metric therefore gives a perspective on the separation between the formed clusters.
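Unlike the label-comparison measures above, the silhouette coefficient is computed from the feature matrix and the predicted labels alone, with no ground truth required; the sketch below assumes scikit-learn and illustrative data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette over all points: near 1 means well-separated clusters,
# near 0 overlapping clusters, negative values suggest misassigned points.
print(silhouette_score(X, labels))
```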