## t-SNE Python Example

t-SNE has recently been merged into the master branch of scikit-learn. It is a nice tool to visualize and understand high-dimensional data: it reduces the dimensionality of the data to 2 or 3 dimensions so that it can be plotted easily, while preserving local similarities. In this post I will explain the basic idea of the algorithm, show how the scikit-learn implementation can be used, and show some examples. First, conditional probabilities between pairs of points are computed; this procedure can be influenced by setting the perplexity of the algorithm. Note that the cost function is not convex and multiple runs might yield different results.

Here we can see that the 3 classes of the Iris dataset can be separated quite easily. They can even be separated linearly, which we can conclude from the low-dimensional embedding produced by PCA. In high-dimensional and nonlinear domains, however, PCA is no longer applicable, and many other manifold learning algorithms do not yield good visualizations either, because they try to preserve the global data structure. For high-dimensional sparse data it is helpful to first reduce the data to 50 dimensions with TruncatedSVD and then perform t-SNE; this will usually improve the visualization. Several modifications of t-SNE that address these and other issues have already been published.

t-SNE (t-distributed Stochastic Neighbor Embedding) converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. It is highly recommended to use another dimensionality reduction method first (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50); this will suppress some noise and speed up the computation of pairwise distances between samples.
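A minimal sketch of how the scikit-learn implementation can be used on the Iris dataset, with a PCA embedding for comparison (the parameter values here are illustrative, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

iris = load_iris()
X, y = iris.data, iris.target

# t-SNE embedding to 2 dimensions; perplexity roughly controls the
# effective number of neighbors each point considers
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne = tsne.fit_transform(X)

# PCA embedding for comparison: a purely linear projection
X_pca = PCA(n_components=2).fit_transform(X)

print(X_tsne.shape)  # (150, 2)
print(X_pca.shape)   # (150, 2)
```

Both embeddings can then be scatter-plotted with the class labels as colors to compare how well the three Iris classes separate.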
Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50; the choice is not extremely critical since t-SNE is quite insensitive to this parameter. The early exaggeration factor controls how much space there will be between natural clusters: for larger values, the space between natural clusters will be larger in the embedded space. Again, the choice of this parameter is not very critical. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high. The learning rate is usually in the range [10.0, 1000.0]; if the cost function gets stuck in a bad local minimum, increasing the learning rate sometimes helps. The number of optimization iterations should be at least 250. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist. If metric is "precomputed", X is assumed to be a distance matrix.
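The TruncatedSVD-then-t-SNE recipe for sparse data mentioned above can be sketched like this; the random sparse matrix merely stands in for real data such as a tf-idf matrix, and the parameter values are illustrative:

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Stand-in sparse dataset: 200 samples, 10000 features, 1% nonzero
X_sparse = sparse_random(200, 10000, density=0.01, random_state=0)

# Step 1: reduce the sparse data to 50 dense dimensions
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

# Step 2: run t-SNE on the reduced representation
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200.0,
            random_state=0)
X_embedded = tsne.fit_transform(X_reduced)
print(X_embedded.shape)  # (200, 2)
```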

## What is t-SNE?

Being a data scientist at SAS allows me to learn and try out new algorithms and functionalities that we regularly release to our customers. What is t-SNE? In simpler terms, t-SNE gives you a feel or intuition for how the data is arranged in a high-dimensional space. It was developed by Laurens van der Maaten and Geoffrey Hinton in 2008. A lot has changed in the world of data science since then, mainly in the realm of compute and the size of data.

PCA, by contrast, is a linear dimension-reduction technique that seeks to maximize variance and preserves large pairwise distances. In other words, things that are different end up far apart. This can lead to poor visualization, especially when dealing with non-linear manifold structures. Think of a manifold structure as any geometric shape: a cylinder, a ball, a curve, etc. You can see that, due to the non-linearity of this toy dataset's manifold and the preservation of large distances, PCA would incorrectly preserve the structure of the data.

How t-SNE works. The t-SNE algorithm calculates a similarity measure between pairs of instances in the high-dimensional space and in the low-dimensional space. It then tries to optimize these two similarity measures using a cost function.

Step 1: measure similarities between points in the high-dimensional space. Think of a bunch of data points scattered on a 2D space (Figure 2). For each data point xi we center a Gaussian distribution over it and measure the density of all other points xj under that Gaussian, then renormalize over all points. This gives us a set of probabilities Pij for all pairs of points, and those probabilities are proportional to the similarities. All that means is: if data points x1 and x2 have equal values under this Gaussian circle, then their proportions and similarities are equal, and hence you have local similarities in the structure of this high-dimensional space. The size of the Gaussian is governed by the perplexity; the normal range for perplexity is between 5 and 50 [2].
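Step 1 can be sketched in NumPy. This simplified version uses a single fixed bandwidth sigma for every point, whereas the real algorithm searches for a per-point sigma_i that matches the requested perplexity:

```python
import numpy as np

def high_dim_similarities(X, sigma=1.0):
    """Sketch of t-SNE step 1: Gaussian conditional probabilities
    p_{j|i}, symmetrized into joint probabilities Pij.
    Simplification: one fixed sigma instead of a per-point search
    that matches a target perplexity."""
    n = X.shape[0]
    # squared Euclidean distances between all pairs of points
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # density of every point under a Gaussian centered on each x_i;
    # a point is never counted as its own neighbor
    P = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    # renormalize each row to get conditional probabilities p_{j|i}
    P = P / P.sum(axis=1, keepdims=True)
    # symmetrize to joint probabilities Pij (sums to 1 overall)
    return (P + P.T) / (2.0 * n)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
P = high_dim_similarities(X)
print(round(P.sum(), 6))  # 1.0: a valid joint distribution
```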
Step 2 is similar to step 1, but instead of a Gaussian distribution you use a Student t-distribution with one degree of freedom, which is also known as the Cauchy distribution (Figure 3). This gives us a second set of probabilities, Qij, in the low-dimensional space. As you can see, the Student t-distribution has heavier tails than the normal distribution, and the heavy tails allow for better modeling of far-apart distances.

The last step is that we want this set of probabilities from the low-dimensional space, Qij, to reflect those of the high-dimensional space, Pij, as well as possible: we want the two map structures to be similar. We measure the difference between the probability distributions of the two spaces using the Kullback-Leibler (KL) divergence. Finally, we use gradient descent to minimize the KL cost function.

Use cases for t-SNE. Laurens van der Maaten shows a lot of examples in his video presentation [1]. He mentions the use of t-SNE in areas like climate research, computer security, bioinformatics, cancer research, etc. t-SNE could also be used to investigate, learn, or evaluate segmentation. Oftentimes we select the number of segments prior to modeling, or iterate after seeing results; t-SNE can be used before segmentation modeling to pick a cluster number, or afterwards to evaluate whether your segments actually hold up.

Code example.
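Steps 2 and 3 can be sketched as follows. The joint probabilities P from step 1 are assumed given; here a toy uniform P stands in for them, purely for illustration:

```python
import numpy as np

def low_dim_similarities(Y):
    """Sketch of t-SNE step 2: joint probabilities Qij under a
    Student t-distribution with one degree of freedom (Cauchy
    kernel) in the low-dimensional embedding Y."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)   # heavy-tailed kernel
    np.fill_diagonal(inv, 0.0)     # no self-similarity
    return inv / inv.sum()

def kl_divergence(P, Q, eps=1e-12):
    """Step 3's cost: KL(P || Q), which t-SNE then minimizes by
    gradient descent on the embedding coordinates Y."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 2))        # a candidate 2D embedding
Q = low_dim_similarities(Y)

# Toy stand-in for the high-dimensional Pij: uniform off-diagonal
P = np.full((5, 5), 1.0 / 20)
np.fill_diagonal(P, 0.0)

cost = kl_divergence(P, Q)
print(cost > 0)  # True: the distributions differ, so KL > 0
```

In the real algorithm this cost is differentiated with respect to Y, and gradient descent repeatedly nudges the embedded points until Qij matches Pij as closely as possible.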

## sklearn.manifold.TSNE

t-SNE converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. It is highly recommended to use another dimensionality reduction method first (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50); this will suppress some noise and speed up the computation of pairwise distances between samples. Read more in the User Guide.

perplexity: related to the number of nearest neighbors used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50; the choice is not extremely critical since t-SNE is quite insensitive to this parameter.

early_exaggeration: controls how tight natural clusters in the original space are in the embedded space and how much space will be between them. For larger values, the space between natural clusters will be larger in the embedded space. Again, the choice of this parameter is not very critical. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high.

learning_rate: can be a critical parameter. It is usually in the range [10.0, 1000.0]; if the cost function gets stuck in a bad local minimum, increasing the learning rate sometimes helps.

metric: the metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded; the callable should take two arrays from X as input and return a value indicating the distance between them.

init: initialization of the embedding. PCA initialization cannot be used with precomputed distances and is usually more globally stable than random initialization.

random_state: pseudo-random number generator seed control. If None, the numpy.random singleton is used.
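A short sketch of these initialization and metric options on random data (parameter values are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# PCA initialization: usually more globally stable than random
tsne = TSNE(n_components=2, init="pca", perplexity=20, random_state=0)
emb = tsne.fit_transform(X)

# A precomputed distance matrix; PCA init cannot be used here,
# so initialization must be random
D = pairwise_distances(X, metric="euclidean")
tsne_pre = TSNE(n_components=2, metric="precomputed", init="random",
                perplexity=20, random_state=0)
emb_pre = tsne_pre.fit_transform(D)

print(emb.shape, emb_pre.shape)  # (100, 2) (100, 2)
```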
Note that different initializations might result in different local minima of the cost function. The exact method is accurate but cannot scale to millions of examples; the Barnes-Hut approximation trades some accuracy for speed via the angle parameter. The method is not very sensitive to changes in this parameter in the range of 0.2 - 0.8: angle less than 0.2 has quickly increasing computation time, and angle greater than 0.8 has quickly increasing error.

For fit, if metric is "precomputed", X must be a square distance matrix; otherwise it contains a sample per row. For get_params, if deep is True, it will return the parameters for this estimator and contained subobjects that are estimators; set_params works on simple estimators as well as on nested objects such as pipelines.

Examples using sklearn.manifold.TSNE: Comparison of Manifold Learning methods; Manifold Learning methods on a severed sphere.
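The two optimization methods and the angle trade-off can be sketched like this (values are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))

# Exact method: accurate, but does not scale to very large datasets
tsne_exact = TSNE(n_components=2, method="exact", perplexity=25,
                  random_state=0)

# Barnes-Hut approximation: angle trades speed against accuracy;
# values in roughly 0.2 - 0.8 behave similarly (0.5 is the default)
tsne_bh = TSNE(n_components=2, method="barnes_hut", angle=0.5,
               perplexity=25, random_state=0)
emb = tsne_bh.fit_transform(X)

# get_params returns the estimator's configuration as a dict
print(tsne_bh.get_params()["angle"])  # 0.5
print(emb.shape)  # (150, 2)
```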