Protecting people from fraud is one of my favorite activities, but it can be complicated. Occam’s razor states that the simplest explanation is often the best one. As a data practitioner, I’ve always had to keep that in mind. Complexity usually comes with expense, errors, and unexplainable models. Whether you work in data engineering, software, or analytics, the simplest solution is usually the best.
There are a lot of complex algorithms and models out there to detect fraud. I’ll cover many of them later in this blog, but before I do that we have to hit some of the ones that I think are pretty amazing given their simplicity. These are actual models I’ve used to detect and expose fraud in my career. I’ve used all three of these incredibly simple models to clean data and, in some cases, to find certain types of potential fraud.
The good old z-score. The grandfather of outlier detection. A z-score is the number of standard deviations a value sits from the mean. So, if you have a value with a z-score of 2.2, it’s 2.2 standard deviations above the mean. The magic value we look for in fraud detection is 3. Generally, any value whose z-score is greater than 3 in absolute value (above 3 or below -3) is a potential outlier. There are some challenges, and it won’t work on small data sets, but it’s a great first step on any numerical data set. If your data isn’t roughly normal, then z-scores are not going to be very helpful.
The steps in the figure show how to run a quick example in Python. Grab it on GitHub here.
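Here is a minimal sketch of the same idea, not necessarily the code in the figure. It uses only Python’s standard library, and the function name, the sample amounts, and the default threshold of 3 are my own illustrative assumptions.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return (value, z_score) pairs whose absolute z-score exceeds threshold.

    Hypothetical helper for illustration; assumes roughly normal data.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    scored = [(x, (x - mu) / sigma) for x in values]
    return [(x, z) for x, z in scored if abs(z) > threshold]


# Made-up transaction amounts: nineteen ordinary values and one suspicious one.
amounts = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
           50, 48, 52, 50, 49, 51, 50, 47, 53, 500]

flagged = zscore_outliers(amounts)
for value, z in flagged:
    print(f"value={value} z={z:.2f}")
```

In this made-up sample only the 500 entry crosses the threshold. The small data set caveat shows up directly in the math: with n values, no z-score computed this way can exceed (n - 1) / sqrt(n), so a sample of ten or fewer can never produce a z-score above 3.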
Grubbs’ test formalizes outlier checking on top of the z-score. It is simple, which makes it fast enough to run against massive datasets. The test assumes that the data follow a normal (Gaussian) distribution, so check that before relying on it. It examines a single suspect value at a time, the one farthest from the mean, using a hypothesis test in which H0 is that there are no outliers and H1 is that there is exactly one outlier in the data. To look for more than one outlier, the test can be run iteratively: remove the flagged value, retest the remaining data, and stop when no value is flagged.
You can go here to find a quick example on GitHub using the outlier_utils library.
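If you want to see what is happening under the hood, here is a rough sketch of the iterative two-sided Grubbs’ test written directly with NumPy and SciPy rather than with outlier_utils. The function name, the sample data, and the alpha of 0.05 are illustrative assumptions; the critical value uses the standard formula based on the t-distribution.

```python
import numpy as np
from scipy import stats


def grubbs_remove_outliers(data, alpha=0.05):
    """Iteratively apply the two-sided Grubbs' test.

    Hypothetical helper for illustration; assumes approximately normal data.
    Returns (cleaned_values, removed_values).
    """
    values = [float(v) for v in data]
    removed = []
    while len(values) >= 3:  # the test is undefined for fewer than 3 points
        arr = np.asarray(values)
        n = arr.size
        sd = arr.std(ddof=1)  # sample standard deviation
        if sd == 0:
            break
        deviations = np.abs(arr - arr.mean())
        idx = int(np.argmax(deviations))  # the single most extreme value
        g = deviations[idx] / sd  # Grubbs statistic (largest absolute z-score)
        # Critical value from the t-distribution with n - 2 degrees of freedom.
        t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
        if g <= g_crit:
            break  # fail to reject H0: no outlier detected
        removed.append(values.pop(idx))  # reject H0, drop the value, retest
    return values, removed


# Made-up transaction amounts: nineteen ordinary values and one suspicious one.
amounts = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
           50, 48, 52, 50, 49, 51, 50, 47, 53, 500]

cleaned, removed = grubbs_remove_outliers(amounts)
print("removed:", removed)
print("remaining count:", len(cleaned))
```

The loop stops as soon as the most extreme remaining value fails to exceed the critical value, which is what keeps the test from stripping out legitimate data once the real outliers are gone.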
Be warned, this one has brain-melting potential. When I first saw this in grad school I thought it was a fluke, until I actually used it to hunt down some fraud issues with credit cards. Benford’s law basically states that the frequency distribution of the first digit of a set of data will follow a right-skewed distribution. What’s that mean? Well, it means that you’ll have more 1’s than 2’s, more 2’s than 3’s, more 3’s than 4’s and so on. Specifically, the expected share of values with leading digit d is log10(1 + 1/d), which works out to about 30.1% for 1 and only about 4.6% for 9. The graphic depicts what a Benford’s law frequency model looks like. The black dots are the expected frequencies under Benford’s law, and the red are a theoretical use case. Benford’s law usually needs a lot of data to appear, something like 1000 data points. To validate that a data set is in line with Benford’s law you can use a Chi-square test, which compares the observed first-digit counts with the counts the law predicts. I’m going to skip the visualizations and chi-square test for now and save them for a later blog.
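In the meantime, here is a small sketch of the first step: computing observed first-digit frequencies and putting them next to the Benford expectation of log10(1 + 1/d). The helper names are my own, and the powers of two are just a convenient stand-in for real transaction data because they are known to follow Benford’s law closely.

```python
import math
from collections import Counter


def first_digit(value):
    """Return the leading non-zero digit of a number, or None if it has none."""
    if not math.isfinite(value) or value == 0:
        return None
    # In scientific notation the first character is the leading digit.
    return int(f"{abs(value):.15e}"[0])


def first_digit_frequencies(values):
    """Observed share of each leading digit 1-9 (hypothetical helper)."""
    digits = [d for d in (first_digit(v) for v in values) if d is not None]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


# Expected shares under Benford's law.
benford_expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Stand-in data set: the first 1000 powers of two.
sample = [2**k for k in range(1, 1001)]
observed = first_digit_frequencies(sample)

for d in range(1, 10):
    print(f"digit {d}: observed {observed[d]:.3f}  expected {benford_expected[d]:.3f}")
```

On real data, digits whose observed share sits far from the expected share are the ones worth a closer look, and the Chi-square test is the formal way to decide whether the gap is bigger than chance.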
Benford, F. (1938). The law of anomalous numbers. Proceedings of the American Philosophical Society, 551-572.
Image from Benfords law illustrated by world’s countries population.png - Wikimedia Commons
There are a lot of different algorithms out there that can be used to detect fraud, but the simple ones are often the best place to start. I’ve known data practitioners who tried to jump straight into the newest and most advanced algorithms. You do need to learn those, but make sure you also pay attention to the simple ones!
Links and References
How to Calculate Z-Scores in Python - Statology
Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection (Wiley and SAS Business Series) - Baesens, Bart; Van Vlasselaer, Veronique; Verbeke, Wouter - Amazon.com
outlier_utils · PyPI
Benford’s Law (The First Digit Law): Simple Definition, Examples - Statistics How To