- An Optimized Computational Framework for Isolation Forest
- Outlier Detection with Isolation Forest
- Multivariate Outlier Detection with Isolation Forests
- Spotting Outliers With Isolation Forest Using sklearn
An Optimized Computational Framework for Isolation Forest

For the purpose of this post, I have combined the individual datasets for red and white wine and added an extra column to distinguish the color of the wine, where 0 represents a red wine and 1 represents a white wine. The purpose of this classification model is to determine whether a wine is red or white. To optimize this model for the most accurate predictions, I will focus solely on hyperparameter adjustment and selection. Most generally, a hyperparameter is a parameter of the model that is set prior to the start of the learning process, and different models have different hyperparameters that can be set. For a Random Forest Classifier, there are several hyperparameters that can be adjusted; in this post, I will be investigating the following four parameters: A pure leaf is one where all of the data on the leaf comes from the same class. The default value of min_samples_split is 2, which means that an internal node must have at least two samples before it can be split to give a more specific classification. The default value of min_samples_leaf is 1, which means that every leaf must have at least one sample that it classifies. More documentation regarding the hyperparameters of a RandomForestClassifier can be found here. Hyperparameters can be adjusted manually when you call the function that creates the model. The different hyperparameter values are tested on the training set; once the optimized values are chosen, a model is constructed with those parameters on the training set and then evaluated on the testing set to see how accurately it classifies the types of wine. When tested with the default values for the hyperparameters, the values of the testing set were predicted with an accuracy of 0.

Validation Curves
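As a sketch of the baseline setup described above, a Random Forest Classifier with its default hyperparameters can be trained and scored like this. Since the post's actual wine table isn't reproduced here, a synthetic stand-in dataset from `make_classification` is used as an assumption in its place:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for the combined red/white wine data (0 = red, 1 = white);
# the post uses the real combined wine tables instead.
X, y = make_classification(n_samples=1000, n_features=12, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Default hyperparameters: min_samples_split=2, min_samples_leaf=1
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```

Passing different values for `min_samples_split` or `min_samples_leaf` to the constructor is all that manual hyperparameter adjustment requires.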
A good way to visually check for potentially optimized values of model hyperparameters is with a validation curve. A validation curve can be plotted on a graph to show how well a model performs across different values of a single hyperparameter. In the first plot, we can read off which of the tested values gives the best score; in the second, we can see where the highest cross-validation accuracy lies. As we choose higher values for the minimum number of samples required before splitting an internal node, we will have more general leaf nodes, which has a negative effect on the overall accuracy of our model. It is important to note that, when constructing the validation curves, the other parameters were held at their default values. For the purpose of this post, we will be using all of the optimized values together in a single model. A new Random Forest Classifier was constructed with the chosen values; this model resulted in an accuracy of 0.

Exhaustive Grid Search

Another way to choose which hyperparameters to adjust is by conducting an exhaustive grid search or randomized search. Randomized searches will not be discussed in this post, but further documentation regarding their implementation can be found here. An exhaustive grid search takes in as many hyperparameters as you would like and tries every possible combination of their values, performing as many cross-validation folds as you specify.
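Both ideas can be sketched with scikit-learn's `validation_curve` and `GridSearchCV` helpers. The parameter ranges below are illustrative assumptions, not the post's exact search space, and the same synthetic stand-in data is used in place of the wine table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, validation_curve

X, y = make_classification(n_samples=500, n_features=12, random_state=42)

# Validation curve: vary one hyperparameter, hold the others at defaults
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(random_state=42), X, y,
    param_name="min_samples_split", param_range=[2, 5, 10, 20], cv=3)

# Exhaustive grid search: try every combination in the grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "min_samples_leaf": [1, 2]},
    cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

`cv_scores` holds one row per tested value (one column per fold), which is exactly what gets plotted as the cross-validation curve; `grid.best_params_` reports the winning combination.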
Outlier Detection with Isolation Forest
Isolation Forest, or iForest, is one of the outstanding outlier detectors proposed in recent years. Yet its model setting is mainly based on randomization and, as a result, it is not clear how to select a proper attribute and how to locate an optimized split point on a given attribute while building an isolation tree. To address these two issues, we propose an improved computational framework that allows us to seek the most separable attributes and spot the corresponding optimized split points effectively. According to the experimental results, the proposed model achieves overall better outlier-detection accuracy than the original model and its related variants.

In the information society, tremendous amounts of data are generated every second to record events. A few of these records are special because they do not turn out as we expect, and these can be treated as anomalies. Anomalies exist in nearly every field: a spam email, an incidental breakdown of a machine, or a hacking attempt on the web can all be regarded as anomalous events in real life. According to the definition given by Hawkins [1], an anomaly, sometimes also called an outlier, deviates far from the observed normal objects and is suspected to be generated by a different mechanism. So anomalies should be few and distinct, and detecting them accurately is the main challenge in this field, commonly called the study of anomaly detection or outlier detection. In reality, data annotated with normal or abnormal labels, which would provide the prior knowledge needed to identify anomalies, are commonly not available. Meanwhile, the anomalies in a dataset are commonly rare, which means the numbers of normal and abnormal instances tend to be heavily unbalanced.
For these two reasons, researchers cannot treat anomaly detection as a typical classification problem by simply applying traditional supervised machine-learning methodologies. As a result, most models proposed in the past decades are unsupervised, including statistical models, distance-based models, density-based models, and tree-based models. If a dataset with multivariate attributes contains normal and anomalous points then, in contrast to the traditional viewpoints, we consider that the essential distinction between the anomalies and the normal instances lies in their discrepant values on each attribute. In other words, the different value distributions of normal and anomalous points on each attribute are what make them separable. Yet it is not clear how to select the proper split point in this problem setting. Despite the advantages of the iForest, we consider that it still has room for improvement. For example, the selection of the attribute and the determination of the split value on the chosen attribute are completely arbitrary while building the iForest, and smarter methods are worth studying; this arbitrariness in each tree of the iForest, i.e., each iTree, is not considered in the previous studies. Aiming at the issues mentioned above, we propose a new solution that optimizes the tree-building process by solving three key questions. A split point refers to a chosen attribute value that partitions the data into two sets on that attribute. As there are only two kinds of labels in anomaly detection, we can mark a leaf node with label 1 for a normal instance and 0 for an anomaly. For a specific detection process in an iTree, when the decision procedure reaches one of the leaf nodes, the label of that leaf (1 or 0) determines whether the instance is identified as normal or not. The layout of the article is organized as follows. In Section 2, related studies in the field of outlier detection are reviewed.
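The paper's exact procedure is not reproduced here, but a minimal sketch of what "seeking the most separable attribute and an optimized split point" could look like is shown below, assuming a simple widest-gap heuristic. The function name `best_split` and the heuristic itself are illustrative stand-ins, not the authors' method:

```python
import numpy as np

def best_split(values):
    """Hypothetical separability heuristic: place the split in the widest
    gap between consecutive distinct values of one attribute, so the two
    resulting sides are as far apart as possible on that attribute."""
    v = np.unique(values)               # sorted distinct values
    gaps = np.diff(v)
    i = int(np.argmax(gaps))            # index of the widest gap
    return (v[i] + v[i + 1]) / 2.0, float(gaps[i])

# The "most separable" attribute is then the one with the widest gap
data = np.array([[1.0, 10.0],
                 [1.2, 10.5],
                 [0.9, 30.0]])
splits = [best_split(data[:, j]) for j in range(data.shape[1])]
attr = int(np.argmax([gap for _, gap in splits]))
print(attr, splits[attr][0])   # attribute 1, split at 20.25
```

Replacing the iForest's uniformly random attribute and split-value choices with a score of this kind is the general shape of the optimization the abstract describes.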
In Section 3, we first introduce some symbol definitions used in the rest of this paper. Then the motivation of our study and some quantified analysis of the proposed model are given in Section 4. Some detailed algorithm descriptions are presented in Section 5.
Multivariate Outlier Detection with Isolation Forests
Update: Part 2 describing the Extended Isolation Forest is available here. During a recent project, I was working on a clustering problem with data collected from users of a mobile app. The goal was to classify the users in terms of their behavior, potentially with the use of K-means clustering. However, after inspecting the data it turned out that some users represented abnormal behavior — they were outliers. A lot of machine learning algorithms suffer in terms of their performance when outliers are not taken care of. In order to avoid this kind of problem you could, for example, drop them from your sample, cap the values at some reasonable point based on domain knowledge, or transform the data. However, in this article, I would like to focus on identifying them and leave the possible solutions for another time. As I took a lot of features into consideration in my case, I ideally wanted an algorithm that would identify the outliers in a multidimensional space. That is when I came across Isolation Forest, a method which is in principle similar to the well-known and popular Random Forest. In this article, I will focus on the Isolation Forest, without describing in detail the ideas behind decision trees and ensembles, as there is already a plethora of good sources available. The main idea, which is different from other popular outlier detection methods, is that Isolation Forest explicitly identifies anomalies instead of profiling normal data points. Isolation Forest, like any tree ensemble method, is built on the basis of decision trees. In these trees, partitions are created by first randomly selecting a feature and then selecting a random split value between the minimum and maximum value of the selected feature. In principle, outliers are less frequent than regular observations and differ from them in terms of their values: they lie further away from the regular observations in the feature space.
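This random partitioning can be made concrete with a toy single-tree sketch. The function `isolation_depth` below is an illustrative toy, not sklearn's implementation: it counts how many random feature/split choices are needed before a point is alone in its partition.

```python
import numpy as np

rng = np.random.default_rng(0)

def isolation_depth(x, X, max_depth=50):
    """Count the random splits needed to isolate point x from sample X
    (a shorter path suggests a more anomalous point)."""
    depth = 0
    while len(X) > 1 and depth < max_depth:
        j = rng.integers(X.shape[1])           # pick a random feature
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:                           # cannot split further
            break
        s = rng.uniform(lo, hi)                # random split value
        # keep only the side of the split that contains x
        X = X[X[:, j] < s] if x[j] < s else X[X[:, j] >= s]
        depth += 1
    return depth

# Toy demo: a far-away point isolates in fewer splits than a cluster point
X = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])
print(isolation_depth(np.array([10.0, 10.0]), X), isolation_depth(X[0], X))
```

Averaged over many such random trees, the outlier's depth is consistently smaller, which is exactly the signal the Isolation Forest aggregates.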
That is why, with such random partitioning, outliers should be isolated closer to the root of the tree, i.e., with a shorter average path length (the number of splits needed to reach an observation from the root). The idea is illustrated below: a normal point (left) requires more partitions to be isolated than an abnormal point (right). As with other outlier detection methods, an anomaly score is required for decision making. In the case of Isolation Forest, it is defined as s(x, n) = 2^(-E(h(x)) / c(n)), where E(h(x)) is the average path length of observation x over the isolation trees and c(n) is the average path length of an unsuccessful search in a binary search tree built on n observations, used for normalization. More on the anomaly score and its components can be read in . Each observation is given an anomaly score, and the following decision can be made on its basis: a score close to 1 indicates an anomaly, a score much smaller than 0.5 indicates a normal observation, and scores close to 0.5 for all observations suggest the sample has no distinct anomalies. For simplicity, I will work on an artificial, 2-dimensional dataset. This way we can monitor the outlier identification process on a plot. First, I need to generate three groups of observations: the first group is the training set, the second group is new observations coming from the same distribution as the training ones, and lastly I generate outliers. Figure 2 presents the generated dataset. Now I need to train the Isolation Forest on the training set. I am using the default settings here. Okay, so now we have the predictions. How to assess the performance? We know that the test set contains only observations from the same distribution as the normal observations, so all of the test set observations should be classified as normal, and vice versa for the outlier set. At first, this looks pretty good, especially considering the default settings; however, there is one issue still to consider. As the outlier data was generated randomly, some of the outliers are actually located within the normal observations. To inspect this more carefully, I will plot the normal observation dataset together with the labeled outlier set. We can see that some of the outliers lying within the cluster of normal observations were, reasonably, classified as regular observations, while a few others were misclassified. What we could do is try different parameter specifications (contamination, number of estimators, number of samples to draw for training the base estimators, etc.).
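The walkthrough above can be sketched end to end with scikit-learn. The sample sizes and the 0.3-scaled / uniform(-4, 4) data-generating choices below are assumptions standing in for the article's figures:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Three groups: training data, test data from the same distribution, outliers
X_train = 0.3 * rng.randn(100, 2)
X_test = 0.3 * rng.randn(50, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(50, 2))

# Default settings, as in the article
clf = IsolationForest(random_state=42)
clf.fit(X_train)

# predict() returns +1 for inliers and -1 for outliers
frac_test_normal = (clf.predict(X_test) == 1).mean()
frac_outliers_flagged = (clf.predict(X_outliers) == -1).mean()
print(frac_test_normal, frac_outliers_flagged)
```

Because some randomly generated outliers land inside the normal cluster, `frac_outliers_flagged` will generally stay below 1.0, which is the misclassification issue discussed above; tuning `contamination`, `n_estimators`, or `max_samples` is the natural next step.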