# Local Outlier Factor – Part 1

## User Behavior Analytics

Finding Anomalous Users through the Local Outlier Factor Algorithm

## Part 1 – Motivation and K-Distance

How do you identify users who are behaving anomalously? One way to tackle this problem is to define a set of rules that all user activity should conform to. If a user’s behavior breaks one or several of these rules, then we flag that user as behaving anomalously.

There are several problems with this approach:

1. You need a good, formal definition of anomalous behavior. For most organizations, anomalous user behavior falls under the “I know it when I see it” category.
2. A rule-based approach ignores what other people in the organization are doing. If every person in the company breaks one of these rules, should we classify all of them as anomalous?

For our newest product, StealthDEFEND, we forego a rule-based approach to anomaly detection and instead opt for a flexible, data-driven approach based on outlier detection.

The intuition behind our approach is that outliers should be more isolated compared to normal user activity. In more technical terms, we expect normal points to exist in denser regions, whereas outliers will have only a few data points surrounding them.

Consider the graph below:

Which points in this graph, if any, seem like outliers? The one that jumps out at you is probably the point way at the top. Other points may look suspicious too, but for now, let’s focus on that one point at the top.

What makes this red point suspicious? There are two things that intuitively jump out at us:

- The point is far away from the “center” of the data points.
- The point is relatively more isolated than the other data points.

Thus, our density-based definition of outliers aligns well with our intuition of an anomaly: something that is “different” from everything else.

In this blog post, I will detail the Local Outlier Factor algorithm that is at the heart of StealthDEFEND’s anomaly detection. The Local Outlier Factor algorithm is a method to quantify how “isolated” a data point is relative to other points in the dataset.

The Local Outlier Factor algorithm can be broken down into four parts:

- K-Distance and K-Neighbors
- Reachability-Distance
- Local Reachability Density
- Local Outlier Factor calculation

**K-Distance and K-Neighbors**

Our end goal is to quantify the local density of each data point, so the first step is creating a framework to model how “isolated” each data point is. There are two possible ways to do this:

- We can choose a radius r and, for an arbitrary data point p, count the number of neighboring data points within distance r of p.
- We can choose the number k of neighboring points we want to consider and, for an arbitrary data point p, find the radius r necessary to have k points within distance r of p.

Local Outlier Factor uses the second approach of fixing the number of neighboring points k and finding the necessary radius r to have k points in our neighborhood.
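To make the two approaches concrete, here is a minimal sketch in Python (using NumPy, with made-up random data) that computes both quantities for a single point p:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))  # 100 two-dimensional points (illustrative data)

p = points[0]
dists = np.linalg.norm(points - p, axis=1)  # distance from p to every point

# Approach 1: fix a radius r, then count the neighbors within it.
r = 0.5
neighbors_within_r = int(np.sum(dists <= r)) - 1  # subtract 1 to exclude p itself

# Approach 2: fix k, then find the radius needed to enclose k neighbors.
k = 5
k_distance = np.sort(dists)[k]  # index 0 is p itself (distance 0), so index k
                                # is the distance to the k-th nearest neighbor
```

Local Outlier Factor takes the second route: `k_distance` above is exactly the K-Distance discussed in the rest of this post.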

Now, you may be wondering how to choose k, the required number of neighboring points. We could fix k to a particular number, say k=50, and look at the 50th closest neighbor for each data point. Or we can let k represent a proportion of the data, allowing the method to scale to datasets of different sizes.

For example, if we set k=0.05 on a dataset of 100 data points, we would find the 5th-closest neighbor for each point (100 × 0.05 = 5). If we had a larger dataset of 250,000 data points, we would find the 12,500th-closest neighbor (250,000 × 0.05) for each data point.
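A small helper capturing this convention might look like the following sketch (the function name and the fraction-vs-count rule are my own illustration, not part of any particular library):

```python
def neighbor_count(n_points, k):
    """Resolve k to an absolute number of neighbors.

    If k is a fraction in (0, 1), treat it as a proportion of the
    dataset size; otherwise use it directly as a fixed count.
    """
    if 0 < k < 1:
        return max(1, int(round(n_points * k)))  # at least 1 neighbor
    return int(k)
```

With this rule, `neighbor_count(100, 0.05)` gives 5 and `neighbor_count(250000, 0.05)` gives 12,500, matching the examples above.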

I’m sure you’re wondering how this is useful. The K-Distance provides a meaningful heuristic to reason about isolated data points. The more isolated a data point is, the farther we will have to search to find k neighboring points. On the other hand, a “normal” data point is surrounded by other “normal” data points, so we don’t have to search too far to find k neighbors.

Let’s go back to our original example. Here, I have chosen k=5, so we will look at the 5-Distance (the distance to the 5th-closest neighbor) for our outlier and for a normal data point. I have colored the five closest neighbors of the outlier and of the normal point cyan and yellow, respectively, to show that there are indeed five points within the 5-Distance.

As you can see, the 5-Distance for the outlier is noticeably larger than that of the normal data point. Now, we have established the K-Distance as a framework to quantify how “distant” a data point is from its neighbors.
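The contrast above can be reproduced with a short sketch: compute the 5-Distance of every point in a dense synthetic cluster plus one far-away point, and compare. The data here is made up purely for illustration, and the pairwise-distance approach is only practical for small datasets (real implementations typically use a spatial index such as a k-d tree):

```python
import numpy as np

def k_distance(points, k):
    """Distance from each point to its k-th nearest neighbor."""
    # Full pairwise Euclidean distance matrix (fine for small n).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    return np.sort(dists, axis=1)[:, k]

rng = np.random.default_rng(42)
cluster = rng.normal(0.0, 0.5, size=(50, 2))  # dense "normal" cluster
outlier = np.array([[0.0, 5.0]])              # one isolated point far above
data = np.vstack([cluster, outlier])

kd = k_distance(data, k=5)
# kd[-1] (the outlier's 5-Distance) is much larger than any cluster point's.
```

The outlier has to reach all the way back to the cluster to collect five neighbors, so its 5-Distance dwarfs that of every normal point, which is exactly the heuristic described above.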

Check back soon for Part 2 where I will take you through the steps to calculate the local density of each data point.

**Real-World Example**

Let’s think about this in the context of IT operations. For each employee – for every action they do, or set of actions – identifying a handful of similar examples should be easy. If the employee is really behaving normally, it should be easy to pick out 5 or 10 examples of similar patterns of behavior. The k-distance is a mathematical measurement of how far and wide the algorithm has to search to find these other examples.

For example, let’s take the potential abuse of admin privileges. An IT Admin has an admin account she uses to check a password out of PIM to make certain, specific changes. Like she’s done a hundred times before, she logs into one machine to restart a service, disable or delete a retired employee’s account, etc. There are many examples of her checking out this account, logging into one or two machines, and then, a few minutes later, PIM resetting the password. The K-Distance of this behavior may vary slightly from episode to episode – some days she may use the checked-out password for 10 minutes instead of 5 – but the variations relative to her historical behavior are not significant, and therefore not suspicious.

**Don’t miss a post! Subscribe to The Insider Threat Security Blog here:**

Manojit Nandi is a data scientist at STEALTHbits Technologies. He holds a Bachelor of Science in Decision Sciences, focused on machine learning and mathematical algorithms, from Carnegie Mellon University. He has given talks on machine learning and data science at tech conferences such as SIGKDD and PyData.