Inside the wacky world of weird data: What's getting crunched
"Big data" is the term industry insiders use to describe a transformational change in computer analytics and business management. It means the slicing and dicing of enormous data sets to discover new—and often surprising—insights into the way the world works.
It's a red-hot field right now—because of twin revolutions going on in the amount of computer data available to study and the dramatic evolution of algorithms and analytics used to study that information.
Where computer scientists were once limited to mere gigabytes or terabytes of information, they're now studying petabytes and even exabytes of information. You don't need to know the math to know that's a colossal amount.
By one claim, all the words ever spoken by human beings could be stored in about 5 exabytes of data. And the amount of data in the world is growing exponentially: by another claim, 90 percent of the information in the world has been created in just the last two years.
At the same time, the tools to sift all that data are getting better as computer scientists refine the algorithms they use to extract meaning from the deluge. Google, for example, recently unveiled a new version of its search algorithm, called "Hummingbird," to provide ever more direct answers to complex questions.
All of that has cleared the way for a fast-growing group of start-up companies staffed with computer scientists and coders who are using this new microscope to examine the world. With names like "Kaggle," "Evolv" and "HaystaqDNA," they're trying to extract insights about the world around us in ways that have never been possible before.
Think of it as "Moneyball," but not just for baseball, for everything.
These new companies, side by side with researchers at universities like Harvard, MIT and Stanford, are taking the concept of what "data" is well beyond old-fashioned Excel spreadsheets. Forget tabs and columns. How about every pixel in every image of the world on Google Earth? That's data.
How about every post by every person on every Chinese social media site? That's data, too. Or what happens when you sort millions of job applicants not by the information on their resume, but simply by which Web browser they used to send in the application? That's also data.
And a lot of what these analysts are discovering about the world is, in some cases, deeply weird. Call it "weird data."
The computer scientists are finding patterns we never knew existed. Slicing Americans into demographic groups that no one has ever thought of. Finding signals in what has long been assumed to be just noise.
Want to know how many Republicans have solar panels on the roofs of their homes? You can do that. Do people with reverse-gender names—think of boys named "Sue"—shop differently than everyone else? You can find out. The companies are convinced there's money to be made in mining for these kinds of strange-but-true insights.
These firms are going beyond just looking at the past. With the growing field of "predictive analytics," they are learning how to tell the future, too. One start-up founder said he wants to be able to predict when a woman will have a baby even before she starts trying to get pregnant. There's an entire industry of baby-product companies that want to reach expectant mothers before their competitors do.
In the wake of Edward Snowden's NSA leaks, a lot of people are worried about just how much the government knows about them. But corporations and researchers are able to learn enormous amounts, too. And that raises a question: Should there be any limits on how much information private firms can gather about you?
Ironically, it's the Obama administration—which has stalwartly defended the NSA's metadata telephone collection program and other information gathering—that's conducting a study on exactly that. Called "Big Data and the Future of Privacy," the White House study is expected to be completed this spring.
"We are undergoing a revolution in the way that information about our purchases, our conversations, our social networks, our movements, and even our physical identities are collected, stored, analyzed and used," wrote White House counselor John Podesta in a late-January blog post. "The immense volume, diversity and potential value of data will have profound implications for privacy, the economy, and public policy."
What follows are just a few of the ways the big data debate is playing out in the real world.
Weird human resources
Start with some of the strangest findings about how we work. The big data start-up Evolv sifts through millions of pieces of human resources data from clients, looking for insights about who applies for jobs and who succeeds at work. The idea is to help companies decide whom to hire and how to manage their teams. Evolv CEO Max Simkoff said his firm has found a series of surprising insights about corporate hiring, based on companies' own data.
"We've got data on over 3 million employees in a variety of industries and job types," Simkoff said. "Companies have finally honed in on the monthly, weekly and in some cases hourly metrics that measure those employees' productivity."
One of the most surprising findings is just how easy it can be to tell a good applicant from a bad one with Internet-based job applications. Evolv contends that the simple distinction of which Web browser an applicant is using when he or she sends in a job application can show who's going to be a star employee and who may not be.
Evolv says a willingness to adopt new technology by choosing "nonstandard browsers" like Firefox or Chrome is a powerful predictor of performance. Employees who use them perform better across the board than those who use standard browsers that come with most computers, like Internet Explorer and Safari.
Not only do these employees stay on the job longer, miss less work and adhere better to company protocols, the company says, they provide higher customer satisfaction and close more sales.
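To make the browser comparison concrete, here is a minimal Python sketch of how such a grouping could be computed. It is not Evolv's method, which the article does not describe. The field names (`browser`, `tenure_days`), the helper `mean_metric_by_browser_group` and the sample records are all invented for illustration.

```python
from statistics import mean

# Browsers the article cites as "nonstandard"; everything else is treated as a default browser.
NONSTANDARD_BROWSERS = {"Firefox", "Chrome"}


def mean_metric_by_browser_group(applicants, metric):
    """Average a numeric metric separately for nonstandard- and default-browser applicants."""
    groups = {"nonstandard": [], "default": []}
    for applicant in applicants:
        if applicant["browser"] in NONSTANDARD_BROWSERS:
            groups["nonstandard"].append(applicant[metric])
        else:
            groups["default"].append(applicant[metric])
    # Skip empty groups so mean() is never called on an empty list.
    return {name: mean(values) for name, values in groups.items() if values}


# Made-up records, for illustration only (hypothetical field names and values).
sample_applicants = [
    {"browser": "Firefox", "tenure_days": 400},
    {"browser": "Chrome", "tenure_days": 200},
    {"browser": "Internet Explorer", "tenure_days": 100},
    {"browser": "Safari", "tenure_days": 300},
]

tenure_by_group = mean_metric_by_browser_group(sample_applicants, "tenure_days")
print(tenure_by_group)
```

A gap between the two averages is only a correlation. A real analysis would need far more records and some way to rule out other explanations before treating browser choice as a hiring signal.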
Mary Murcott is chief strategy officer at a call center firm called NOVO 1, which hired Evolv to help it retain its workers and hire new people who would stay.
The company found several trends in play: Workers over the age of 30 or 40 had about half the attrition rate of younger workers. The more rank-and-file employees switched among managers, the longer they seemed to stay with the firm. And although the firm had long been open to rehiring workers who had left, Evolv concluded that rehires left the company 44 percent faster than new hires, which made Murcott rethink her openness to ex-employees.
"That has helped us understand where we got our top people and where we got people that were not great," Murcott said. "And so we've moved our recruiting dollars and our job-sourcing dollars to other avenues."
What's more, Murcott said her firm has been able to focus its hiring on an unusual pool of applicants who are usually overlooked by corporate hiring managers: the long-term unemployed.
Working with Evolv, the company has developed hiring tests that de-emphasize traditional measures that have not proved effective, such as education and work experience, and emphasize personality and skills-based measures.
"We had a 55-year-old woman come in the other day. She hadn't been able to find a job, she hadn't been in the workforce, she hadn't had any job experience, and we hired her as a customer service representative," Murcott said. "So here's a woman that couldn't get a job, that everybody's denying, and we were able to put her on the job. She's doing great."
But here's the weirdest thing Evolv said it has discovered: Criminals can make better employees than anyone else. Evolv calculates that employees with criminal backgrounds are 1 to 1.5 percent more productive on the job than people without criminal records, and the firm said that difference in a large company "could result in tens of millions in profit and loss gain."
Evolv CEO Simkoff says he's not sure why that's the case, but he guesses it's because such employees feel a sense of loyalty to the companies that took the risk to hire them.
But won't companies be reluctant to hire criminals? Simkoff said companies sometimes don't want to hear advice that they should be hiring more criminals. "But I tell them their own data is showing this—if they want to save $10 million a year, they should make the change. But what they do with the data is ultimately up to them."
Veterans of the Obama presidential campaigns—which created new ways of drilling down into databases and social media to find and motivate potential voters—have seeded a slew of new big-data analytics firms hoping to do the same for paying clients. One of them is Michael Simon, who worked on the Obama '08 campaign and is now a co-founder of the data analytics firm HaystaqDNA.
Simon and his firm are looking at new ways to isolate demographic groups—slices of American society so small that they might not have been detected by traditional polling, but can now be identified and reached using technology. It's the kind of thing that can be as useful to corporate marketing folks looking for customers as it is for politicians looking for votes.
"We're trying to understand how people who otherwise look alike are actually different, and you really need to dive beneath the surface," Simon said.
In one recent project, Haystaq used Google Earth images of California to find homes with solar panels on the roofs. After teaching their computers to distinguish between solar panels and other things that look similar from above, like skylights and pools, they are cross-referencing the solar panel homes with publicly available name and address information for those houses.
As a result, the firm will have a list of everyone in California, and eventually the whole country, who owns a residential solar panel.
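The cross-referencing step can be pictured as a simple join between image classifications and address records. The Python sketch below is an assumption about how that might look, not Haystaq's actual pipeline: the `RoofDetection` type, the `parcel_id` key, the 0.9 confidence cutoff and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoofDetection:
    parcel_id: str  # hypothetical key linking an image tile to a property
    label: str      # classifier output, e.g. "solar_panel", "skylight", "pool"
    score: float    # classifier confidence between 0 and 1


def solar_owner_list(detections, address_records, min_score=0.9):
    """Return sorted (name, address) pairs for parcels confidently labeled as solar."""
    solar_parcels = {
        d.parcel_id
        for d in detections
        if d.label == "solar_panel" and d.score >= min_score
    }
    return sorted(
        (record["name"], record["address"])
        for parcel_id, record in address_records.items()
        if parcel_id in solar_parcels
    )


# Made-up detections and records, for illustration only.
detections = [
    RoofDetection("P1", "solar_panel", 0.97),
    RoofDetection("P2", "skylight", 0.97),     # looks similar from above, not solar
    RoofDetection("P3", "solar_panel", 0.55),  # below the cutoff, so dropped
    RoofDetection("P4", "pool", 0.99),
]
address_records = {
    "P1": {"name": "Resident A", "address": "1 Example St"},
    "P2": {"name": "Resident B", "address": "2 Example St"},
    "P3": {"name": "Resident C", "address": "3 Example St"},
    "P4": {"name": "Resident D", "address": "4 Example St"},
}

owners = solar_owner_list(detections, address_records)
print(owners)
```

The confidence cutoff stands in for the harder problem the article mentions: telling solar panels apart from skylights and pools. In practice that classification step, not the join, is where most of the work would lie.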
The fact that someone would buy a solar panel tells you a lot about the type of person they are: likely environmentally conscious and with enough disposable income to spend on an expensive piece of technology. Solar panel owners make up their own tiny demographic—and by scouring satellite images, marketers can now find them individually.
"You can actually dive into a neighborhood from outer space and figure out people's attitudes and behaviors," Simon said. The firm found the highest density is in Orange County, where 2.96 percent of the homes have solar panels.
Statewide, the higher the income bracket, the more likely people are to have solar panels: a full 4.2 percent of households with an average annual income of more than $200,000 have them. Solar panels are most popular in the 92694 ZIP code, a planned community called Ladera Ranch, Calif., where more than 12.5 percent of the homes have solar panels.
Haystaq has discovered another tiny-but-useful demographic: People who have first names typically associated with the opposite gender. Think here about boys named "Sue," or girls named "Bobbie." Haystaq said people with opposite-gender names have left-leaning political views: they skew Democratic by a 44.7 to 26.4 percent margin. Just by looking at a potential customer's first name, Haystaq can already make some guesses about how they think.
People with less common names, by the way, skew Democratic as well. Those with the least common names are over 40 percent Democratic and just over 20 percent Republican, Haystaq found. But those with the most common names split almost evenly between the two parties at just under 40 percent for each. It seems that the weirder your name, the more likely you are to be a Democrat.
Social media provides big data analysts with a huge trove of information to mine for new discoveries—and not just here in the United States. The findings can be as insignificant as what's hot on Facebook, or they can be as profound as pulling the curtain back on the ways an authoritarian regime censors the Internet.
Take a series of recent papers by the team at Harvard's Institute for Quantitative Social Science, which copied millions of postings on more than 1,400 Chinese social media sites in real time. They captured the data before the Chinese censorship bureaucracy had a chance to remove offending posts.
By comparing their database to what was allowed to stay up on the Web, the Harvard team was able to compile a database of thousands of social media posts that have been censored by the Chinese government.
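In outline, that comparison is a set difference: anything captured at posting time but missing on a later check is a candidate for having been censored. The Python sketch below illustrates the idea only. It is not the Harvard team's code, and the function names, the `topic` field and the sample posts are invented.

```python
from collections import Counter


def find_removed_posts(snapshot, still_online_ids):
    """Return sorted IDs of posts captured earlier that are no longer online."""
    return sorted(post_id for post_id in snapshot if post_id not in still_online_ids)


def removal_rate_by_topic(snapshot, still_online_ids):
    """Return the fraction of captured posts per topic that later disappeared."""
    total = Counter()
    removed = Counter()
    for post_id, post in snapshot.items():
        total[post["topic"]] += 1
        if post_id not in still_online_ids:
            removed[post["topic"]] += 1
    return {topic: removed[topic] / total[topic] for topic in total}


# Made-up snapshot taken at posting time (hypothetical IDs and topic labels).
snapshot = {
    "p1": {"topic": "topic_a", "text": "..."},
    "p2": {"topic": "topic_a", "text": "..."},
    "p3": {"topic": "topic_b", "text": "..."},
    "p4": {"topic": "topic_b", "text": "..."},
}
# IDs still visible when the sites are checked again later.
still_online_ids = {"p2", "p3", "p4"}

removed_ids = find_removed_posts(snapshot, still_online_ids)
rates = removal_rate_by_topic(snapshot, still_online_ids)
print(removed_ids, rates)
```

A missing post is not proof of censorship, since authors can delete their own posts, so a real study would need a way to separate the two cases. Comparing removal rates across topics is what lets patterns emerge.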
Harvard professor Gary King and his colleagues wrote that the Chinese effort represents "what may be the most extensive effort to selectively censor human expression ever implemented." And that has profound implications for Western companies seeking to do business, on the Internet or otherwise, in China.
"This program, designed to limit freedom of speech of the Chinese people, paradoxically also exposes an extraordinarily rich source of information about the Chinese government's interests, intentions, and goals," the Harvard team wrote.
Applying analytics to the data, the team discovered some unexpected things about Chinese censors.
For one thing, Chinese censors aren't targeting Web posts simply because they are critical of the government, as many people assume. Instead, they're censoring any comments that encourage "collective action," or the mobilizing of large groups for a common cause.
Among the most highly censored topics, the team found, was a series of protests in Inner Mongolia, which had a high potential for collective action against the government.
Pure criticism of the government, though, doesn't seem to trigger the censors.
The Harvard team said these tough words for government officials were not removed by the censors: "This is a city government [Yulin City, Shaanxi] that treats life with contempt, this is government officials run amuck, a city government without justice, a city government that delights in that which is vulgar, a place where officials all have mistresses." The censors, apparently, had no problem with those comments.
And here's the weird part: Chinese censors remove such comments even when the comments agree with the government. For example, the Harvard team found that posts on a Wenzhou website supporting environmental activist Chen Fei were censored, even though the government supports his work.
By CNBC's Eamon Javers.