UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction, constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical, scalable algorithm that applies to real-world data. The UMAP algorithm is competitive with t-SNE for visualization quality and arguably preserves more of the global structure, with superior run time performance. Furthermore, UMAP has no computational restrictions on embedding dimension, which makes it viable as a general purpose dimension reduction technique for machine learning.
Dimension reduction produces a low dimensional representation of high dimensional data that preserves the relevant structure. It is an important problem in data science, both for visualization and as a potential pre-processing step for machine learning. As a fundamental technique for both, dimension reduction is being applied to increasingly large datasets in a broad range of fields. It is thus desirable to have an algorithm that is both scalable to massive data and able to cope with the diversity of data available.
Dimension reduction algorithms usually fall into two categories:
Those that seek to preserve the distance structure within the data;
Those that favour the preservation of local distances over global distance.
UMAP builds upon mathematical foundations related to the work on Laplacian eigenmaps by Belkin and Niyogi, but seeks to provide results similar to t-SNE. The authors introduce a novel manifold learning technique for dimension reduction, and provide a sound mathematical theory grounding the technique together with a practical, scalable algorithm that applies to real-world data. t-SNE is the current state of the art in dimension reduction for visualization. The UMAP algorithm is competitive with t-SNE for visualization quality and arguably preserves more of the global structure, with superior runtime performance.
The UMAP algorithm:
The algorithm uses local manifold approximations and patches together their local fuzzy simplicial set representations to construct a topological representation of the high dimensional data. Given a low dimensional representation of the data, a similar process can be used to construct an equivalent topological representation. UMAP then optimizes the layout of the data representation in the low dimensional space, minimizing the cross-entropy between the two topological representations.
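To make the optimization target concrete, here is a minimal sketch of a cross-entropy between two fuzzy sets defined over the same edges. It assumes each topological representation has been reduced to a dictionary mapping an edge (i, j) to a membership strength in [0, 1]. That data layout, the function name fuzzy_cross_entropy and the eps clipping are illustrative assumptions, not the reference implementation.

```python
import math


def fuzzy_cross_entropy(mu, nu, eps=1e-12):
    """Cross-entropy between two fuzzy sets of edges.

    mu, nu: dicts mapping an edge (i, j) to a membership strength in [0, 1].
    mu plays the role of the high dimensional representation and nu the
    low dimensional one. Edges missing from nu count as membership 0.
    The sum runs over the edges of mu only (an assumption of this sketch).
    """
    total = 0.0
    for edge, m in mu.items():
        n = nu.get(edge, 0.0)
        # Clip both strengths away from 0 and 1 so the logarithms stay finite.
        m = min(max(m, eps), 1.0 - eps)
        n = min(max(n, eps), 1.0 - eps)
        # First term penalizes edges present in mu but weak in nu;
        # second term penalizes edges absent in mu but strong in nu.
        total += m * math.log(m / n) + (1.0 - m) * math.log((1.0 - m) / (1.0 - n))
    return total


high = {(0, 1): 0.9, (1, 2): 0.2}
good_layout = {(0, 1): 0.9, (1, 2): 0.2}
bad_layout = {(0, 1): 0.1, (1, 2): 0.8}
print(fuzzy_cross_entropy(high, good_layout))
print(fuzzy_cross_entropy(high, bad_layout))
```

A layout whose memberships match the high dimensional ones scores zero, and the score grows as the two representations disagree. An optimizer would move the low dimensional points to reduce this quantity.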
The construction of fuzzy topological representations can be broken down into two problems:
Approximating a manifold on which the data is assumed to lie;
Constructing a fuzzy simplicial set representation of the approximated manifold.
In explaining the algorithm, the paper first discusses the method of approximating the manifold for the source data. Next, it discusses how to construct a fuzzy simplicial set structure from the manifold approximation.
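The two steps can be sketched in plain Python under simplifying assumptions. The details below are not stated in this summary and follow the construction described in the UMAP paper as best understood here: an exponential kernel shifted by the distance rho to the nearest neighbour, a per-point scale sigma chosen so the memberships sum to log2(k), and a fuzzy union a + b - a*b to patch the local views together. The function names are hypothetical, and a brute-force neighbour search stands in for the approximate search a scalable implementation would use.

```python
import math


def knn(points, k):
    """For each point, return its k nearest neighbours as sorted (distance, index) pairs."""
    result = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        result.append(dists[:k])
    return result


def smooth_sigma(dists, rho, target, n_iter=64):
    """Bisect for sigma so that sum(exp(-(d - rho) / sigma)) is close to target."""
    lo, hi = 1e-6, 1e6
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        total = sum(math.exp(-max(d - rho, 0.0) / mid) for d in dists)
        if total > target:
            hi = mid  # memberships too large: shrink the scale
        else:
            lo = mid  # memberships too small: grow the scale
    return (lo + hi) / 2.0


def fuzzy_graph(points, k):
    """Build symmetric edge memberships from local, per-point distance scales."""
    directed = {}
    for i, nbrs in enumerate(knn(points, k)):
        dists = [d for d, _ in nbrs]
        rho = dists[0]  # distance to the nearest neighbour
        sigma = smooth_sigma(dists, rho, math.log2(k))
        for d, j in nbrs:
            directed[(i, j)] = math.exp(-max(d - rho, 0.0) / sigma)
    # Patch the local views together with a fuzzy union: a + b - a*b.
    symmetric = {}
    for (i, j), a in directed.items():
        b = directed.get((j, i), 0.0)
        symmetric[(min(i, j), max(i, j))] = a + b - a * b
    return symmetric


pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (4.0, 4.0)]
for edge, weight in sorted(fuzzy_graph(pts, 3).items()):
    print(edge, round(weight, 3))
```

Because each point measures distance on its own scale, the edge to a point's nearest neighbour always receives full membership, which is one way of reading the "local manifold approximation" idea in code.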
The paper develops a general purpose dimension reduction technique that is grounded in strong mathematical foundations. The algorithm is demonstrably faster than t-SNE and provides better scaling. This allows the generation of high-quality embeddings of larger data sets than had previously been attainable.
For more information on the implementation, experimental results and more, refer to the links below:
GitHub Link: GitHub
Link To The PDF: Click Here
UMAP Uniform Manifold Approximation and Projection for Dimension Reduction | SciPy 2018:
Video Source: Enthought