From grouping users of a streaming by their choice of movies and TV shows to the search for protein families in bioinformatics, clustering algorithms are a form of unsupervised learning that is much more present in our lives than we think.
Imagine a set of scattered points in a space. We call clustering the process of forming groups with the characteristic that those points which are in the same cluster are close together, and at the same time distant to those in a different cluster.
So far doesn’t seem to be a complex concept. However, this changes when taking into account that the “points” mentioned above may be representing concepts such as: supermarket customers, goals scored in World Cup matches from 1930 to date, writers of the Lost Generation, and virtually anything we need grouped according to some given similiarity criteria.
Moreover, the space mentioned in the second paragraph needn’t be a plane and could have as many dimensions as we like to imagine (it could even be the kind of space referred to as non-Euclidean in geometry), and the definition of distance used isn’t necessarily the one associated with putting a rule between two points.
Having already introduced what clustering is, to the rest of this article will explain in further depth some basic concepts related to this technique. After that, we’ll will introduce the two main strategies used in these types of algorithm.
Distance between clusters
The traditional concept of distance that we all know is just a particular case of a more general concept. To make a function a “distance function”, it must accomplish a series of logical conditions which are out of the scope of this article.
However, it’s important that we understand that depending the entity that represent the points in our space and its nature, we can choose one of many distance functions that are relevant to the problem. We can even define a function of our own, but it must meet the conditions mentioned in the previous paragraph.
For example, let’s suppose that the points of our space represent text strings. One distance function that we could use is the “edition distance”, which is the least number of insertions and removal of individual characters that is required to transform one string in another. Using this definition of distance, the distance between the word “bank” and “bacon” would be three, because it requires removing the letter “k” and adding the letters “c” and “o”.
When measuring the distance between two clusters, it becomes necessary to define a reference point in each cluster. The most common choice euclidean spaces is to choose a point that is the geometrical center of those that integrate the cluster, which we call “centroid”.
Nevertheless, when the space is non-Eucliudean (like in the text strings example), it isn’t possible to calculate the “centroid”. Because of this, the reference point is chosen between the points that belong to the cluster and is referred to as “clustroid”. This choice is made following a distance minimization criterion, for example one valid choice would be the point that has the minimum sum of distances to all other members of the cluster.
The algorithms that follow this technique start by making each point into a single-point cluster. Then by an iterative process in each cycle a selection process selects clusters that will be merged.
This selection can be made following different criteria, being the most common to choose two clusters which distance (explained before) is minimal. The repetition will continue until to the looping is stopped for some reason. Two common ones are:
- Reaching a predetermined number of total clusters. An example use case for this could be a car Company that plans to launch five cars of different models, and has an old database with client data. Its intention will be to put together five groups to study them and learn which common characteristics gather each member in each group. It’s possible that this information might be very useful to make cars so they adapt better to their potential owner’s needs and to design marketing campaigns focused on the audience they want, among other things.
- Another reason we could stop the cycle is when we detect a resulting cluster that will be formed when joining other two will be unsuitable for the purpose it’s needed. Usually this is determined by measuring the compactness of the new cluster and checking if it exceeds a pre-established maximum or an average metric that might have been met by the other clusters already formed. For instance, when defining the diameter of a cluster as the maximum distance between any two points, the repetition may be stopped when the diameter of a resulting cluster would be greater than a pre-established top diameter.
Even though the hierarchical clustering algorithms produce high quality clusters, they have a high algorithmic complexity that makes them inadequate for a big data set.
That is why in this cases is more common to use non–hierarchical algorithms. In the following section we will talk about the most popular of that category.
Let’s suppose we already know the number of clusters that we wish to end up with, that we will call “K”. The first thing to do is to find K points that probably they belong to different clusters and build a cluster for each one of them. This can be easily achieved by selecting the first point at random and then choosing K-1 points which are as far as possible from all previously selected points..
Then the algorithm makes a repetition for each one of the points left and searches for the minimum distance between an unassigned point and all the centroids, adding the aforementioned point to the closest centroid and re-calculating the “centroid” of the clusterto which the point was added.
The main disadvantages that this family of algorithms present are that they are only applicable to “euclidean” spaces and to use them we must start with determining a target amount of clusters.
The previous’ issue impact can be mitigated by a simple procedure that consists of taking multiple runs of the algorithm with increasing values of K. For each one of this solutions a quality measure is performed based on the compactness on the clusters that were formed, as it was explained before in this article.
The expected behavior is that the clusters’ quality will increase between one run and the next one, until a certain point in which it doesn’t change anymore. We can assume then that the first value K that achieved this quality asymptote measure is the correct value.
This article introduced the clustering concept and explained the two most popular strategies at the time of facing this kind of process. There are also many ways to use this technique to face a bigger universe of challenges.
For readers interested in a deeper understanding of this topic, we will mention two extra examples: the CURE algorithm (Clustering Using Representatives), efficient for big data groups; and the “Fuzzy Clustering” algorithms, which are capable of assigning to each point a probability of belonging to each cluster.
In the next article we are going to go deeper in one of the applications of this technique. For this, we are going to detail the procedure that we must follow to represent some entities through the abstract concept of points in space. Once this is done, we will be able to make use of the algorithms explained in this article.