Impermium Announces Results of Competition on Kaggle to Clean Up Offensive Language on the Web Through Crowdsourcing

World’s Top Data Scientists Build Algorithms to Better Identify Social Spam, Aggressive Language, and Bullying

SAN FRANCISCO--(BUSINESS WIRE)-- Impermium, the leader in user-generated content protection, and Kaggle, a platform for predictive modeling competitions, announced today the results of a contest to take on the crudest, rudest, and meanest trolls on the Internet. In the competition, which ran from August 7th to September 17th, Impermium used Kaggle’s community of 53,000 data scientists to develop algorithms capable of automatically detecting insults, foul language, and verbal abuse. The goal of the contest was to identify new ways to defend against malicious language and social spam online, and help clean up the web by scrubbing away unwanted obscenities from user-generated content.

“This partnership with Kaggle was instrumental to achieving our goal of finding novel, innovative solutions to the problem of social spam,” said Mark Risher, co-founder and chief executive officer, Impermium. “Social spam is a growing and complex issue plaguing every online community, putting users at risk and damaging brand reputations. This competition allowed us to harness the brainpower of leading data scientists in order to successfully detect hostile language and threats, enabling our system to better protect the web.”

As with other forms of malware, there are two traditional approaches to dealing with social spam. One is to employ individuals to manually review every comment on every platform, which is costly, tedious, and does not scale. The other is to run automated software that uses a complex set of rules and configurations to recognize undesirable content. This, too, can be expensive to deploy and slow to adapt to new trends. It is also notorious for turning up false positives while allowing unwanted traffic to slip through the net.

Impermium uses a fundamentally different, data-science-based approach to tackle social spam. Instead of analyzing each piece of content in isolation as it flows through a pipeline, Impermium takes a network approach, monitoring the traffic of more than 300,000 sites across the globe simultaneously. This gives the software a bird’s-eye view, enabling machine learning algorithms to analyze the context as well as the content. The result is the rapid detection of new types of spam, malware, hate speech, and cross-site scripting.

“One of the main challenges of this competition was distinguishing between phrases containing curse words that were acceptable within the context of an online discussion versus those that are personal insults,” said Anthony Goldbloom, founder and chief executive officer, Kaggle. “This is far more difficult than just searching for rogue words. It requires a grammatical understanding of sentences, a grasp of the context, and the ability to cope with misspellings and common online abbreviations.”
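
The misspellings and abbreviations mentioned above are commonly handled with a normalization pass before any classification takes place. The following is a minimal sketch of that idea, not a description of any contestant's entry. The `ABBREVIATIONS` table and the `normalize` function are hypothetical names introduced here for illustration, and a real system would need a far larger table.

```python
import re
from typing import List

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"u": "you", "r": "are", "ur": "your", "b4": "before"}

def normalize(comment: str) -> List[str]:
    """Lowercase a comment, shorten stretched words, and expand abbreviations."""
    text = comment.lower()
    # Collapse runs of three or more identical characters down to two,
    # so "sooooo" becomes "soo" while ordinary doubles like "too" survive.
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Keep letters, digits, and apostrophes; punctuation is dropped.
    tokens = re.findall(r"[a-z0-9']+", text)
    return [ABBREVIATIONS.get(token, token) for token in tokens]

print(normalize("U R sooooo dumb!!!"))  # ['you', 'are', 'soo', 'dumb']
```

Collapsing to two characters rather than one is a deliberate choice in this sketch: it avoids damaging legitimate double letters, at the cost of leaving some stretched words unresolved.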

Impermium’s competition attracted 1,235 entries from 154 participants. In addition to a $10,000 prize pool, top finishers were offered job interviews at Impermium. Serving as an innovative recruiting tool, the competition enabled Impermium to interview top specialists in the field, while giving contestants the opportunity to work for a company at the forefront of machine learning and natural language processing.

The winning algorithm was developed by Vivek Sharma, currently ranked as the second-best data scientist on Kaggle. His approach used support vector classification, a high-dimensional modeling technique that enabled him to separate insults from non-insults based on many different types of features within a sentence, including vocabulary, tone, and grammar. Sharma credits his win to his algorithm’s additional reliance on grammatical structures like “You are a ****” to distinguish insults from non-insulting phrases that contain curse words.
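
To illustrate why a grammatical-structure feature helps, here is a toy sketch of a linear classifier trained with the hinge loss, the loss function behind linear support vector machines. It is not Sharma's model: the word list, the regular expression, the four training comments, and all function names are assumptions made for this example, and regularization and kernels are left out for brevity.

```python
import re

# Hypothetical word list and pattern, chosen only for illustration.
FLAGGED = {"idiot", "moron", "damn"}
SECOND_PERSON = re.compile(r"\byou(?:'re| are) an? \w+")

def features(comment):
    """Map a comment to (second-person pattern present, flagged word present)."""
    text = comment.lower()
    words = set(re.findall(r"[a-z']+", text))
    return (
        1.0 if SECOND_PERSON.search(text) else 0.0,
        1.0 if words & FLAGGED else 0.0,
    )

def train(examples, epochs=50, lr=0.5):
    """Subgradient descent on the hinge loss max(0, 1 - y * (w.x + b))."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for comment, y in examples:
            x = features(comment)
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1:  # example is inside the margin: take a step
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def is_insult(comment, w, b):
    """Classify by the sign of the linear decision function."""
    x = features(comment)
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

# Invented toy data: label +1 means insult, -1 means not an insult.
TOY_DATA = [
    ("You are a moron", 1),
    ("That was a damn good movie", -1),
    ("You are a great help", -1),
    ("Thanks for the link", -1),
]

w, b = train(TOY_DATA)
print(is_insult("You're an idiot", w, b))              # True
print(is_insult("This damn printer is broken", w, b))  # False
```

Neither feature is enough alone in this toy setup: a flagged word without the second-person structure is treated as an intensifier, and the structure without a flagged word is treated as harmless. Only the combination pushes the score above zero.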

Competition Results

Competition participants were asked to detect when a comment from a conversation would be considered insulting to another participant in the conversation. Samples could be drawn from conversation streams on news sites, magazine comments, message boards, blogs, text messages, and more. In addition, competition participants were asked to include a visualization element to highlight the most interesting, informative, and thought-provoking infographics, diagrams, or plots.

Words and phrases that are most often found in insulting comments range from curse words (f*ck, b*tch, sh*t), to explicit language describing body parts, to insults of intelligence (moron, idiot, ignorant). One characteristic that distinguished the more successful algorithms was the ability to detect from context whether a word was being used as an intensifier rather than as an insult. In addition, the competition found that the word “mom” is often present in insulting comments. The competition also revealed that people tend to be most abusive between 9:00 p.m. and 10:00 p.m. Insulting comments are relatively rare in the mornings, begin rising in the early afternoon, and continue to rise throughout the evening.

About Kaggle

Kaggle is the global leader in running predictive modeling competitions. The company has run approximately 100 competitions with major enterprise, government, and academic customers, including Allstate Insurance, Boehringer Ingelheim, Dunnhumby, Ford, Heritage Health Foundation, Microsoft, NASA, Stanford, and Wikipedia. Over 53,000 data scientists worldwide have contributed to competitions that tackled the toughest predictive problems in the marketing, life sciences, insurance, financial services, travel, and science verticals. Kaggle’s investors include Index Ventures and Khosla Ventures. It was founded in 2010 and is based in San Francisco, Calif.

About Impermium

Impermium provides social content cleaning for web sites and social networks, defending them against social spam, fake registrations, racist and inappropriate language, and other forms of abuse. Our system combines advanced technology and broad, Internet-scale threat information to provide cost-effective, real-time protection for more than 300,000 sites across the globe.

Founded in 2010 and launched in 2011, Impermium is backed by Accel Partners, Charles River Ventures, Greylock Partners, Highland Capital Partners, and the Social+Capital Partnership.

Cutline Communications (for Kaggle)
Paige Schoknecht, 415-348-2708
pschoknecht@cutline.com
or
SutherlandGold Group (for Impermium)
Liz Clinkenbeard, 415-848-7167
liz@sutherlandgold.com

Source: Kaggle