From popular Reddit threads to the discussions that happen in my own computer science classes, “big data” (the use of large-scale datasets to extract patterns) has become one of the hottest buzzwords in tech. Big data holds incredible potential, enabling insights at unprecedented scale, but it can also pose potent risks if we are not careful about its shortcomings. Until we learn to acknowledge the flaws inherent to big data, and take them seriously, it cannot be the magical cure-all that we want it to be.
Big data is used to empower new insights in artificial intelligence and machine learning models. But if these models are trained on biased datasets, they could easily amplify or overlook the distortions already present in the data.
For example, one study showed how a machine learning model used to automate credit decisions in the United States was deeply entangled with systemic biases against historically marginalized groups. This wasn’t a problem with the size of the dataset, but rather with the social context that shaped the data. Years of discrimination have deprived these marginalized groups of the assets and credentials used to determine credit approval. Instead of correcting for past discrimination, the machine learning model simply perpetuated it, disproportionately denying racial minorities credit. Even when researchers removed sensitive information like race and ethnicity from their model, racial bias was not reduced. After all, race was closely correlated with other factors that remained in the data, such as education level, property type and neighborhood, so the model could still infer it indirectly.
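To make the last point concrete, here is a minimal sketch of how a correlated feature can stand in for a removed one. It is not the study's model or data: the groups, the neighborhoods, the 90 percent overlap between them and the historical approval rates are all invented for illustration, and the "model" is just a majority vote per neighborhood.

```python
import random


def make_applicants(n, seed=0):
    """Generate synthetic applicants. Every number here is an illustrative assumption."""
    rng = random.Random(seed)
    applicants = []
    for _ in range(n):
        group = rng.choice(["A", "B"])
        # Assumed proxy: neighborhood lines up with group 90% of the time.
        if rng.random() < 0.9:
            neighborhood = "north" if group == "A" else "south"
        else:
            neighborhood = "south" if group == "A" else "north"
        # Assumed biased history: group A was approved far more often than group B.
        approved = rng.random() < (0.7 if group == "A" else 0.3)
        applicants.append(
            {"group": group, "neighborhood": neighborhood, "approved": approved}
        )
    return applicants


def train_blind_model(applicants):
    """Learn an approve/deny rule per neighborhood. The group field is never read."""
    counts = {}
    for a in applicants:
        total, yes = counts.get(a["neighborhood"], (0, 0))
        counts[a["neighborhood"]] = (total + 1, yes + a["approved"])
    # Approve a neighborhood if most of its past applicants were approved.
    return {hood: yes / total >= 0.5 for hood, (total, yes) in counts.items()}


def approval_rate_by_group(applicants, model):
    """Apply the group-blind rule, then measure outcomes separately for each group."""
    rates = {}
    for group in ("A", "B"):
        members = [a for a in applicants if a["group"] == group]
        approvals = sum(model[a["neighborhood"]] for a in members)
        rates[group] = approvals / len(members)
    return rates


if __name__ == "__main__":
    data = make_applicants(10_000)
    model = train_blind_model(data)
    print("Learned rule:", model)
    print("Approval rate by group:", approval_rate_by_group(data, model))
```

Under these assumptions the rule never sees the group label, yet it approves one group far more often than the other, because neighborhood carries nearly the same information.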
Today, data bias is an issue that developers are widely aware of. In fact, it is even a part of the computer science curricula at many institutions of higher education, including Brown.
Many technologists have proposed solutions to data bias. Some simply entail collecting more data, expanding datasets in order to make them more representative or changing the design behind data collection to minimize data biases. A growing body of algorithmic bias research has also produced a number of more sophisticated approaches to fixing data bias. But even if these strategies are promising, it is questionable whether big tech companies genuinely want to prioritize equity over profit. For example, although Google funds a department dedicated to investigating algorithmic bias, in December 2020 it fired AI ethics researcher Timnit Gebru. Many speculated that it did so because Gebru’s advocacy and attempts to publish a paper about the harms of data bias in language models challenged Google’s leadership. Gebru cared about the dangers of big data; Google cared only when that concern didn’t interfere with its regular operations.
In our obsession with big data, we also forget a more fundamental question: Is big data the right approach? Should we be using it at all? After all, in many domains, big data will always be biased, a problem that isn’t taken seriously enough. The hype over the advancements big data brings often overshadows any concerns about its dangers.
Computers are often seen as more objective than humans. But how can we expect computational models to act more fairly than we do when the data we’ve trained them on contain all the trappings of biased human behavior? It makes no sense to model our goals for technology on existing societal conditions when these conditions are so often flawed. It doesn’t matter much whether an algorithm or a human is perpetuating our biases; in each case, harm results.
None of this is to deny that big data has its uses. It has given us technological opportunities and insights that could not have existed even a decade ago. However, big data is no holy grail. Its benefits are realized only when developers care to understand the context of the data they use, rather than feeding it straight into a computational model. As Brown continues to educate the next generation of software developers, we need individuals who are not only educated about data bias, but also committed to fixing it.
Anika Bahl ’24 can be reached at email@example.com. Please send responses to this opinion to firstname.lastname@example.org and other op-eds to email@example.com.