In my career, I can't count the number of times someone has said, "garbage in, garbage out." It's true of any activity that takes an input and transforms it into an output. Frankly, garbage data is a buzzkill when you're working with visualizations, machine learning, or any number of data topics. You may be in the middle of your latest breakthrough on customer attrition, only to realize your data has so many issues that the algorithm's output can't be trusted. That's why data cleansing, modeling, and initial forms of analysis center on understanding the data and assessing its quality. Ultimately, there are two types of garbage data: known and unknown. This post is particularly interested in the unknown, and specifically with bias introduced while collecting data.
Two of my favorite authors, Morewedge and Kahneman, describe biases as processes in the brain that create predictable errors. These errors result from the brain's energy-saving shortcuts and are a perfectly normal part of human behavior. Unfortunately, biases can also create problems for our data collection processes. For example, say you're collecting data on people's knowledge of the solar system, and you ask, "Given the billions of moons likely in the cosmos, how many moons are there in our solar system?" That seems like a harmless question, but you've introduced a bias in the very wording of the inquiry: an anchoring effect. Anchoring bias occurs when you give an individual a starting value and then ask them to estimate a quantity. The individual adjusts from that initial amount rather than drawing on their own knowledge of the solar system. This matters in data science because we often work with data sets whose governance and collection history we know little about. If an anchoring bias was in effect while a data set you've been asked to analyze was collected, what consequences will that have on your results?
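To make that concrete, here is a small Python sketch with entirely synthetic data. The "true" answer of 290 and the planted anchor of 10,000 are hypothetical numbers chosen for illustration, and the insufficient-adjustment model is a simplification of the effect Morewedge and Kahneman describe:

```python
import random

random.seed(0)

TRUE_VALUE = 290   # hypothetical "true" answer respondents roughly know
ANCHOR = 10_000    # starting value planted by the question's wording

def respond(anchored: bool) -> float:
    """One respondent's estimate: noisy knowledge, optionally pulled toward the anchor."""
    knowledge = random.gauss(TRUE_VALUE, 50)
    if not anchored:
        return knowledge
    # Insufficient adjustment: the answer lands partway between the anchor
    # and the respondent's actual knowledge.
    pull = random.uniform(0.2, 0.5)   # fraction of the distance left un-adjusted
    return knowledge + pull * (ANCHOR - knowledge)

control = [respond(False) for _ in range(1_000)]
anchored = [respond(True) for _ in range(1_000)]

print(f"control mean:  {sum(control) / len(control):8.1f}")
print(f"anchored mean: {sum(anchored) / len(anchored):8.1f}")
```

The anchored group's mean lands thousands of units above the control group's, even though both groups "know" the same thing. If you inherited only the anchored column, nothing in the data itself would warn you that the collection instrument inflated every value.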
Let's discuss another example, one that combines confirmation bias and selection bias. Suppose you're collecting data on innovation because you want to prove that individuals who are good at math are good at innovation. You collect data, but you pick out only people you know are good at math and give them a test that measures creative output. They score high, so you conclude that good math people are great innovators. You may be correct, but your collection process selected only people you believed were good at math, and you only sought to confirm the belief rather than disprove it, which is bad science!
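A quick simulation shows why the missing control group matters. In this sketch the population is hypothetical and, by construction, math skill and creativity are statistically independent, so any real analysis should find no link between them:

```python
import random

random.seed(1)

# Hypothetical population: math skill and creativity are drawn independently,
# so in truth math skill tells us nothing about innovation.
population = [
    {"math": random.gauss(100, 15), "creativity": random.gauss(100, 15)}
    for _ in range(10_000)
]

# Biased design: test only the people we already believe are good at math.
math_whizzes = [p for p in population if p["math"] > 120]
sample_mean = sum(p["creativity"] for p in math_whizzes) / len(math_whizzes)

# The step the biased design skips: compare against everyone else.
others = [p for p in population if p["math"] <= 120]
control_mean = sum(p["creativity"] for p in others) / len(others)

print(f"selected group creativity: {sample_mean:.1f}")
print(f"everyone else:             {control_mean:.1f}")
```

The two means come out nearly identical: the selected group's "high" creativity scores simply reflect the whole population, not a math-to-innovation link. Without the comparison the design never makes, the scores look like confirmation.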
In machine learning, software engineering, and artificial intelligence (we can debate the meaning of that term later), we find a society that continues to use algorithms to drive decisions. Where collecting knowledge about the solar system might be harmless, many applications are not. Decisions to grant an individual a loan, show a particular ad, train chatbots, detect fraud, or prescribe medical treatment can profoundly affect people. It's exciting to think about using algorithms to find "the signal in the noise," but when we think about data, we have to be careful of "the noise in the signal."
Akter, S., McCarthy, G., Sajib, S., Michael, K., Dwivedi, Y. K., D'Ambra, J., & Shen, K. N. (2021). Algorithmic bias in data-driven innovation in the age of AI. International Journal of Information Management, 60, 102387. https://doi.org/10.1016/j.ijinfomgt.2021.102387
Morewedge, C. K., & Kahneman, D. (2010). Associative processes in intuitive judgment. Trends in Cognitive Sciences, 14(10), 435–440.