Drowning in Data and Starving for Insights - Episode #2: What is Topological Data Analysis?
Topology – The Stratosphere of Human Thought
In the previous episode, we raised the need in networking to use methodologies that enable data mining services in complex environments and allow getting insights directly from data. To this aim, we will introduce in this post the concepts behind Topological Data Analysis.
First, let’s introduce topology. For a network engineer, a topology is a physical configuration of a network that defines how devices are connected. But topology is also an important area of mathematics concerned with spatial properties preserved under continuous deformations of objects. This property known as homeomorphism is at the origin of the layman definition of topology as the rubber-sheet geometry, including stretching, twisting, crumpling, bending, but no tearing nor gluing. For example, in the world of topology, a doughnut is like a coffee cup since both are 3D objects with a hole going through them.
Topology emerged along with the investigations of concepts from geometry and set theory, only in the late part of the 19th century due to eminent mathematicians such as Riemann, Betti, Poincare, and others. Until recently topology was pure mathematics, and nobody was seriously thinking of any possible application in the real world. Since then, things are dramatically changed. The Field medalist Michael Freedman is working on the next generation of quantum computers based on topological quantum computing theory. In 2016, M. Kosterlitz, D. Haldane, and D. Thouless were awarded the Nobel Prize in physics for theoretical discoveries of topological phase transitions and topological phases of matter. But for industrial applications, the most credit shall be deserved to Prof. Gunnar Carlsson that has contributed to topological research that made the eventual commercial solutions possible. Topological methods have shown efficient usage in several diverse fields such as healthcare, computational biology, control theory, industrial fault analysis, information security, and many others. So, why are some network engineers interested in topology? Because it provides them with a collection of powerful tools that can quantify the shape and the structure in data. Topology applications include large-scale exploratory data analysis and data mining, inference and prediction on massive datasets, predictive long-term planning, and early warning signals detection, and ultimately topological inferences to go beyond data.
Topological Data Analysis
Among all the flavors of topology, a technique from algebraic topology, called Topological Data Analysis (TDA), has been of particular interest to our team, as it can be successfully applied to support large scale data mining services for telemetry data generated by sensors networks. Larry Wasserman describes TDA as “a collection of data analysis methods that find structure in data. This includes clustering, manifold estimation, nonlinear dimension reduction, mode estimation, ridge estimation, and persistent homology”. TDA is then already equipped with the usual methods a data scientist knows. What is new here is how large datasets are analyzed with minimal assumptions, unlike traditional methods. There are two principal methodologies in TDA that we will describe a bit further. Persistent Homology is concerned with the classification and analysis of topological invariants associated with datasets, and Mapper a powerful algorithm for creating great visualizations of high-dimensional data into graphs.
Homology is a tool from algebraic topology designed to measure geometrical space properties that do not change after deformation. Persistent homology is an adaptation of these ideas to discrete collections of points, called point clouds.
For example, given a set of noisy points sampled from unknown space, persistent homology recovers the structure of the point clouds and builds a topological representation of the data counting how many holes can be found in the set of points. It may look silly to count down the number of holes discovered at a different scale in the point cloud, but the single persistent hole is what makes a doughnut and a coffee cup be the same topological object and why they have the same topological shape. Persistence homology is a complement to standard feature representation techniques in ML that offers the advantage to be applicable to sparse datasets as well.
The procedure to compute persistent homology associated to point clouds involves the construction of the Vietoris-Rips filtration of simplicial complexes ordered to the scaling parameter ε. Point clouds have an inherent 0-dimensional structure and lack interesting topological properties when considered simply as a finite collection of points. The trick to gain more information is to turn the point clouds into a higher-dimensional simplicial complex. Why? A simplicial complex is a high-dimensional generalization of the concept of a graph. It is a combinatorial object used to represent and discretize a continuous space. It is a multiple scale viewpoint of data realized via persistent homology through a sequence of simplicial complexes connected by inclusion, [see Figure 1]. In the topological jargon, this process is called a filtration on data. The overall process looks very complex but is in fact straightforward. To convince, let us assume to have a 2-dim point cloud, starting with circles of increasing radius ε there exists a pair of values ε1 < ε2 such that a hole is born at ε1 and is dead at ε2. We can represent the persistence of the hole as the pair (ε1, ε2) to represent the birth and death of the hole [see Figure 2]. The construction/destruction of topological features in each dimension as ε changes is encoded in a 2D diagram known as the persistence diagram. Persistence diagrams provide a concise description of the topological changes over all scales of the data. The topological features discovered (i.e., connected components, holes, voids, etc.) are persistent because independent of the construction process.
We can discover the related topological features of the point cloud that are persistent computing persistence homology associated with a point cloud with the filtration of a simplicial complex. In turn, we gain more information about the shape of this collection of points. In other words, we extract structure from data!
Persistence Homology is a general method that can be used either as a stand-alone analysis tool to extract topological invariants from a point cloud or comparing persistence diagrams using similarity metrics such as the Wasserstein distance.
Figure 1. Examples of simplex and simplicial complex
Figure 2. The Rips complex at different scales for a 2-dim point cloud [Gidea, M. & Katz, Y. Topological data analysis of financial time series: Landscapes of crashes. Physica A: Statistical Mechanics and its Applications (2018)]
The Mapper Algorithm
Persistence diagrams provide a concise description of the topological changes over all scales of the data but lack metric properties. The second methodology in TDA called Mapper is an algorithm designed for extracting global information from high-dimensional local data. It enables the simple descriptions of point clouds as simplicial complexes, abstracting from exact distances on a metric space.
Mapper replaces a topological space with a simpler one represented by a graph (the Reeb graph), which provides a compact, global representation of the data. It is an unsupervised multi-step process applied to a point cloud as follows:
- Define a filter function (the lenses)
- Construct an overlapped covering
- Apply a clustering algorithm to each cover
- Represent groups of similar objects as nodes of a graph, connect similar nodes by edges, and color nodes mapping to a custom function of interest (coloring function).
Besides graph visualization of high-dimensional datasets, Mapper can be used as an exploratory data analysis tool to extract hidden structures and relations. These structures can be used to improve a supervised classification task or for feature selection and predictive analysis.
In topological data analysis, the structure of data is treated topologically. A topological property is just a particular feature of an object that persists through continuous deformations. When performing topological data analysis, we search for all those properties or attributes that persist in a point cloud under different scales. By doing this, we can determine the structure of the data underneath and distinguish it from any dependence on how the data was observed.
This is the reason why topological methods are worth exploring in networking, to generate rich signatures of input data by reducing large and raw datasets into a compressed representation, in lower dimensions, and extract unexpected relationships and structures from the data because data has shape and shape has meaning.
In the next episode, we will dive into the application of TDA to telemetry data.