Part 3 – Local Outlier Factor scores & Conclusion
In Part 2, I demonstrated how to compute the local reachability density for each data point. The local reachability density gives us insight into how “isolated” a data point is. In this third and final blog post, I will compare each point’s local reachability density to those of its neighbors to compute the local outlier factor score for each point. Based on that score, we can make judgements about whether a particular point is an outlier.
Local Outlier Factor
Now that we have covered all the intermediate steps, we can finally move on to discussing the Local Outlier Factor (LOF). In the previous step, we computed the local reachability density for each data point.
The intuition behind the Local Outlier Factor calculation is simple: For a data point A, how does its local reachability density (LRD) compare to those of its nearest neighbors? Mathematically, the Local Outlier Factor value for point A can be expressed as:
LOF(A) = (Average LRD of K-Neighbors of A) / (LRD of A)
Thus, the LOF is a ratio that shows the relative density of a point compared to the average of its neighbors. For most points, the LOF should be close to 1, meaning the local density around point A is approximately that of its closest neighbors. A LOF well above 1 means the point sits in a sparser region than its neighbors do, so the higher the LOF for a given point is, the more likely that point is an outlier.
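To make the ratio concrete, here is a minimal Python sketch. It assumes the LRD of every point and the list of each point's nearest neighbors have already been computed, as in Part 2. The function name, the point labels, and the LRD values are all hypothetical, chosen only for illustration; they are not the values from this series' example.

```python
from statistics import mean


def local_outlier_factor(lrd, neighbors, point):
    """Mean LRD of the point's nearest neighbors divided by the point's own LRD."""
    neighbor_lrds = [lrd[n] for n in neighbors[point]]
    return mean(neighbor_lrds) / lrd[point]


# Hypothetical LRD values and 2-nearest-neighbor lists (illustration only).
lrd = {"A": 0.25, "B": 0.70, "C": 0.65, "D": 0.75}
neighbors = {"A": ["B", "C"], "B": ["C", "D"]}

lof_a = local_outlier_factor(lrd, neighbors, "A")  # about 2.7: A is much sparser
lof_b = local_outlier_factor(lrd, neighbors, "B")  # about 1.0: B matches its peers
print(round(lof_a, 2), round(lof_b, 2))
```

With these made-up numbers, point A's neighbors are on average about 2.7 times denser than A itself, while point B is roughly as dense as its neighbors.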
Let’s return to our example. Here I have colored and sized each data point based on its Local Outlier Factor value: a higher LOF value corresponds to a larger point with a more intense color. After plotting the points, we see our outlier clearly stands out from the others. For a more quantitative result, the LOF score for our outlier point is 2.796, the highest LOF score of any data point. What’s the human interpretation of a LOF score of 2.796? The average local density of this point’s closest neighbors is almost three times its own local density! Thus the neighborhood around our outlier is noticeably sparser than the neighborhoods of its peers.
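The whole pipeline from this series can be sketched end to end in a few lines of plain Python. This is only an illustrative sketch, not the code behind the plot above: the six points are a made-up toy dataset, the function name `lof_scores` is hypothetical, ties between equally distant neighbors are broken by index, and the code assumes there are no duplicate points (which would make a density infinite).

```python
import math


def lof_scores(points, k):
    """Return the LOF score of every point, using its k nearest neighbors."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbors of each point, excluding the point itself.
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance from a point to its k-th nearest neighbor.
    k_distance = [dist[i][neighbors[i][-1]] for i in range(n)]

    def reach_dist(a, b):
        # Reachability distance from a to b.
        return max(k_distance[b], dist[a][b])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]

    # LOF: mean LRD of the neighbors divided by the point's own LRD.
    return [sum(lrd[j] for j in neighbors[i]) / k / lrd[i] for i in range(n)]


# Toy data: a tight cluster of five points plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (4, 4)]
scores = lof_scores(points, k=2)
most_anomalous = max(range(len(points)), key=lambda i: scores[i])
print(points[most_anomalous], round(scores[most_anomalous], 2))
```

On this toy data the far-away point receives by far the largest score, while the clustered points all score close to 1, which mirrors what the plot shows for our example.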
We’ll use our IT Admin and the PIM vault example one last time here. One day, the IT Admin checks out privileged credentials from the PIM like she’s done so many times before. This time, however, she authenticates across 30 machines instead of the normal one or two. Nothing in her account’s history comes close to this behavior. While many service accounts stored in PIM routinely authenticate to 30 or more machines, her admin account never has. Taken in isolation, a privileged account authenticating to 30 machines is not cause for alarm. A rudimentary rule that flagged every such event as suspicious would be a recipe for thousands of false positives, a circumstance well-designed UBA systems minimize. Because Local Outlier Factor analysis compares this activity against the account’s own neighborhood of past behavior, however, it stands out as a clear example of lateral (horizontal) movement for this admin account, and StealthDEFEND can confidently alert that suspicious behavior has been detected.
Because there is no clear-cut definition of “anomaly”, we do not label points as “anomaly” or “normal” at the end of this algorithm. Rather, we use the score assigned to each data point to estimate the probability that the point is an outlier. StealthDEFEND users can then focus on the “suspicious” users and, using the appropriate context, determine whether those users’ behavior constitutes a threat to their system.
Manojit Nandi is a data scientist at STEALTHbits Technologies. He holds a Bachelor of Science in Decision Sciences, focused on machine learning and mathematical algorithms, from Carnegie Mellon University. He has given talks on machine learning and data science at tech conferences such as SIGKDD and PyData.