Published in: Wired Magazine
View here: http://www.wired.com/2013/05/more-data-more-problems-is-big-data-always-right/
In an intriguing study from Rutgers University, scientists set out to understand people’s decision-making related to Hurricane Sandy. From October 27th to November 1st, over 20 million tweets pertaining to the superstorm were recorded. Tweets concerning preparedness peaked the night before the storm, and tweets about partying peaked after it subsided.
The majority of the tweets originated from Manhattan, largely because of its high concentration of smartphone and Twitter users. Because of widespread power outages and dying cell phone batteries, very few tweets came from the hardest-hit areas, such as Seaside Heights and Midland Beach. From the data, one could infer that Manhattan bore the brunt of the storm; however, we know that wasn’t the case. There was actually a lot going on in those outlying areas. Given the way the data was presented, there was a huge data gap from communities underrepresented in the Twittersphere.
What would it be like if our government decided to handle the recovery from Hurricane Sandy based on large data sets like this one from Twitter?
In order to understand how to combat big data’s faults, we must first understand its nature. We can define big data as a blend of three things: technology, analysis, and myth. First, we employ extreme computational power to gather, link, and analyze large data sets. Then we identify patterns in them to make claims about society, economics, finance, technology, and more. Lastly, there is the myth that more data will grant us greater acumen and the power to generate, with exactitude, insights that were previously impossible.
More often than not, we are too trusting of statistics and fail to examine the data with a critical eye. We are often quick to conclude that the data presented to us is factual, which is a risky assumption in the context of big data.
The first problem with big data is that it’s so vast and unorganized that preparing it for analysis is no easy task. Dan Ness, principal research analyst at MetaFacts, states, “A lot of big data today is biased and missing context, as it’s based on convenience samples or subsets.” The so-called experts have given people the illusion that, since they’ve come up with an algorithm, the answers it produces from the data you plug in will always be correct. But that only works if the assumptions that went into the algorithm are correct. As a result, people have a false sense of confidence in the data, especially as the data sets get larger and larger. If the “experts” had run an algorithm on the Twitter data that showed Manhattan as the center of the disaster, they would have come to the wrong conclusion.
Which leads us to our second problem: the sheer amount of data! No wonder we are more prone to “signal error” and “confirmation bias.” Signal error occurs when analysts overlook large gaps in the data. If places like Coney Island and Rockaway had been overlooked in the response to Hurricane Sandy, as they were in the Twitter study, we could be looking at a higher death toll today. Confirmation bias is the tendency of people to search the data for evidence that confirms their preexisting viewpoint and to disregard data that goes against it. In other words, you will find what you seek out. What if FEMA had looked at the Twitter data with a preexisting belief that the worst-hit part of the tri-state area was Manhattan? It might have allocated its resources to places that didn’t need them most.
The third problem is best described by Marcia Richards Suelzer, senior analyst at Wolters Kluwer. She says, “We can now make catastrophic miscalculations in nanoseconds and broadcast them universally. We have lost the balance in ‘lag time.’” Simply put, when we botch the facts, our ability to create damage is greatly magnified because of our enhanced technology, global interconnectivity, and huge data sizes.
And for those of you who think big data is something reserved for big corporations and government offices, think again. It’s picking up speed in the digital marketing space, and it’s coming to a web browser near you. Big data’s algorithms may draw the wrong conclusions about who you are and what you do. Digital marketers are getting to know you better than you might think. Imagine walking into Walmart and having the greeter hand you a personalized set of coupons for your favorite brands of toothpaste and shampoo.
This may seem great at first, but what happens when they give you coupons for Rogaine and Vagisil? How would you feel if you realized that they know something about you? You have hit upon 67 of their data points, and there’s an 81% chance that you’re balding and have gonorrhea. Firstly, if it’s true, it may make you uncomfortable that marketers and businesses know private information about you.
And what if they’ve decided they know something about you that isn’t actually true? It might really piss you off! The wife of a friend of mine, who happens to be a doctor, is confronted daily by search engines flashing ads to treat her nonexistent venereal diseases.
How can we determine what “good data” is and avoid falling into the traps above?
When you set out to determine what is good data within the vast amounts we currently have, you must keep three things in mind. First, you will find what you set out to find. Having more data does not necessarily give us a greater capacity for better and swifter decision-making. We must come up with better data models and better data analysis techniques. Second, we must keep in mind that there are two types of data: quantitative and qualitative. Quantitative data answers the “what” questions, and qualitative data answers the “why” and “how.” Unless you use qualitative analysis, you can’t explain things with quantitative data. Third, the context of your data is key. Big data sets come prepackaged by the collectors, who choose what to include and what to leave out. This can lead to lost context and data contamination. For every data set, we must ask: which people and places were excluded?
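The question of who was left out can be turned into a simple check. The following is a minimal sketch, not anything taken from the Rutgers study: the function name `find_coverage_gaps`, the `min_count` threshold, and every tweet count below are hypothetical and made up for illustration. It assumes you already have, from a source outside the data set, a list of the places that should be represented.

```python
# Hypothetical illustration: none of these numbers come from the Rutgers study.

def find_coverage_gaps(expected_areas, observed_counts, min_count=1):
    """Return the expected areas with fewer than min_count records, sorted by name."""
    return sorted(
        area for area in expected_areas
        if observed_counts.get(area, 0) < min_count
    )

# Places we know were affected. This list must come from outside the data set,
# because the data set cannot tell us what it is missing.
expected_areas = {
    "Manhattan", "Seaside Heights", "Midland Beach", "Coney Island", "Rockaway",
}

# Made-up tweet counts per area, for illustration only.
observed_counts = {"Manhattan": 5000, "Coney Island": 12, "Rockaway": 3}

# Flag every expected area with fewer than 10 tweets.
gaps = find_coverage_gaps(expected_areas, observed_counts, min_count=10)
print(gaps)
```

An area that appears in the flagged list is not necessarily quiet; it may simply be unable to speak through this particular channel, which is a reason to go looking for other sources before drawing conclusions.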
How do we fight the problems of big data? First, we need to approach every data set with skepticism. You have to assume that the data has inherent flaws, and that just because something seems statistically right doesn’t mean it is. Second, you need to realize that data is a tool, not a course of action. Would you ask your hammer how to build a house? Of course not! You can’t let the data do the thinking for you, and you can never sacrifice common sense. And third, having a lot of data is good, but what we really need are the means to analyze and interpret it for use.
In a day and age when everyone has heralded the coming of Big Data as a blessing for marketers and the end of ignorance, keep in mind that Big Data is just a diamond in the rough.