Using a Neural Network to Detect Hate Speech

This article was originally posted to LinkedIn by Amy Hemmeter and Kirsty Ward.
We have included it in Data Column to show an example of the kinds of outside collaborative projects that students complete.
At the beginning of the year, we realized that we shared an interest in text analytics, natural language processing, and social good. We wanted to tackle a text analytics project with implications for social good, so we agreed to work on automated detection methods for hate speech on social media. Companies like Twitter and Facebook have recently come under fire for allowing hate speech and harassment to occur on their platforms. However, moderating this content by hand is costly and impractical, creating a need for automated models to detect hate speech and harassment. In this project, we created a neural network model to classify tweets based on the hashtags and text used. You can see the Python code for this project on GitHub here.


We started by reading a survey paper on natural language processing techniques used to detect hate speech. This gave us an idea not only of the types of techniques we could use, but also of the available data sources and of some people who had already worked on the problem. A paper by Zeerak Waseem on automatic detection of hate speech caught our attention because it provided a dataset of over 16,000 tweets annotated for hate speech. The tweets in this dataset are annotated as “racist,” “sexist,” or “other” – a variable we refer to as “class.”
Next, we queried the Twitter API to get the tweet text, as well as associated metadata such as follower counts, favorite counts, retweets, hashtags, and author bio. Combining the data queried from Twitter’s API with the categorical labels created by Zeerak Waseem gave us a starting point for our project.

Preparation and Cleaning

We then added variables based on various aspects of each tweet’s text. First, we ran sentiment analysis using the VADER sentiment analysis package in Python, which is specifically designed for social media data. For each tweet we recorded the compound score and the negative score. The compound score is positive if the tweet is positive in sentiment and negative if the tweet is negative in sentiment. The negative score measures how negative the tweet is.
Next, we transformed the tweet text into usable information. First, we tokenized the text (broke it into individual words) and got rid of information in the tweet itself that was already represented in the metadata, such as hashtags and user mentions. Continuing with cleaning the data, we removed punctuation, made all words lowercase, and removed both the standard stop words (“meaningless” words like “the” and “of” that don’t add much to our analysis) and Twitter-specific stop words, like “RT.” We then used a Porter stemmer to reduce the words to their roots. This cuts down on redundancy: “run” and “runs” both shorten to “run,” so this one stem is counted twice rather than “run” and “runs” being counted as separate words.
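The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code from the project: the stop word list is a tiny made-up sample, and `crude_stem` is a toy suffix stripper standing in for a real Porter stemmer (in practice one would likely swap in something like NLTK's `PorterStemmer`). The function names `crude_stem` and `clean_tweet` are hypothetical.

```python
import re
import string

# Tiny illustrative stop word list (an assumption; the article's lists are not given).
# "rt" plays the role of a Twitter-specific stop word.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "is", "are", "rt"}


def crude_stem(word):
    """Toy stand-in for a Porter stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_tweet(text):
    """Return a list of cleaned, stemmed tokens for one tweet."""
    # Drop hashtags and user mentions, which are already in the metadata.
    text = re.sub(r"[#@]\w+", " ", text)
    # Lowercase, then strip ASCII punctuation.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize on whitespace, remove stop words, and stem what is left.
    return [crude_stem(tok) for tok in text.split() if tok not in STOP_WORDS]


print(clean_tweet("RT @user: She runs the kitchen! #example"))
# → ['she', 'run', 'kitchen']
```

Note that “runs” and “run” both map to the same token here, which is the redundancy reduction that stemming is meant to provide.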
After these preliminary steps, we did a Term Frequency-Inverse Document Frequency (TF-IDF) analysis on the words, treating each class – racist, sexist, and other tweets – as a separate document, and taking the 100 words with the highest TF-IDF values in each class. TF-IDF looks at the frequency of a word in a document, along with whether that word is present in other documents, to find the words that most define that document. In other words, these were the words more likely to show up in one class than in the others. For example, the word “chick” was much more likely to show up in the sexist tweets than in either of the other classes. As one might expect, slurs were a part of this list, but not the majority of the list. Other words were also indicative: in the case of the sexist tweets, we saw words like “sportscenter,” “espn,” and “kitchen” in the top 100 TF-IDF terms, indicating that sexist tweets were often about women’s relation to certain other topics. Similarly, the racist category (which may more accurately be described as “islamophobic”; see the Future Directions section below) had slurs as well, but also included words like “behead” and “jihad,” indicating that racist tweets about Muslims associated certain topics with them more often than non-racist tweets did.
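To make the class-as-document idea concrete, here is a small sketch of TF-IDF computed over one token list per class. It is an assumption-laden illustration rather than the project's code: it uses the plain textbook weighting (term frequency times log of N over document frequency), the article does not say which TF-IDF variant was used, and the function name `top_tfidf_terms` and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def top_tfidf_terms(class_tokens, k=100):
    """Map each class label to its k highest-scoring TF-IDF terms.

    class_tokens maps a class label to the list of tokens pooled from all
    tweets in that class, so each class is treated as a single document.
    """
    n_docs = len(class_tokens)
    counts = {label: Counter(tokens) for label, tokens in class_tokens.items()}

    # Document frequency: in how many class documents does each term appear?
    doc_freq = Counter()
    for counter in counts.values():
        doc_freq.update(counter.keys())

    top = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counter.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        top[label] = ranked[:k]
    return top


# Toy data for illustration only.
toy = {
    "sexist": ["chick", "kitchen", "chick", "game"],
    "racist": ["jihad", "behead", "game"],
    "other": ["game", "weather", "game"],
}
print(top_tfidf_terms(toy, k=2))
```

With this weighting, a word such as “game” that appears in every class gets a score of zero, so it never outranks a word that is specific to one class.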
Categorical variables were created based on the top 100 TF-IDF values for each document, and used as inputs for our model. In addition, we created categorical variables based on each of the hashtags used. This processing resulted in a final dataset consisting of 1576 variables.
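One plausible way to build such variables is to give every top term and every hashtag its own 0/1 indicator column. The sketch below assumes that reading (the article does not spell out the exact encoding), and the function names, the `term_`/`tag_` prefixes, and the sample hashtags are all hypothetical.

```python
def build_feature_names(top_terms, hashtags):
    """Create one column name per distinct top term and per distinct hashtag."""
    term_names = sorted({term for terms in top_terms.values() for term in terms})
    tag_names = sorted({tag.lower() for tag in hashtags})
    return ["term_" + t for t in term_names] + ["tag_" + h for h in tag_names]


def encode_tweet(tokens, tweet_hashtags, feature_names):
    """Return a 0/1 vector marking which terms and hashtags the tweet contains."""
    present = {"term_" + tok for tok in tokens}
    present |= {"tag_" + tag.lower() for tag in tweet_hashtags}
    return [1 if name in present else 0 for name in feature_names]


# Toy inputs for illustration only.
top_terms = {"sexist": ["chick", "kitchen"], "racist": ["behead", "jihad"]}
names = build_feature_names(top_terms, ["TagOne", "tagtwo"])
print(names)
print(encode_tweet(["she", "run", "kitchen"], ["TagOne"], names))
# → [0, 0, 0, 1, 1, 0]
```

Because every distinct hashtag becomes its own column, the number of variables grows quickly, which is consistent with a final dataset that has well over a thousand columns.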


Using Keras, a Python deep learning library, with a TensorFlow backend, we created a neural network to classify the tweets into one of the three categories – sexism, racism, or none. As predictors, we used the negative and compound sentiment scores, the top ~100 TF-IDF terms for each class, and the hashtags present in the tweet. The final structure of our neural network had 7 hidden layers, with 100 nodes in the first hidden layer and 1000 in each of the following six, plus 3 output nodes, one for each level of the target variable. Our final neural network model correctly predicted the class of tweets 79% of the time.
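To get a feel for how large this architecture is, the layer sizes can be turned into a rough parameter count. The sketch below assumes plain fully connected layers with one bias per node, which the article does not state explicitly, and the input width of 1500 is a hypothetical number of predictors rather than a figure from the project.

```python
def layer_sizes(n_inputs):
    """Input width, 7 hidden layers (100, then six of 1000), and 3 outputs."""
    return [n_inputs, 100] + [1000] * 6 + [3]


def count_dense_parameters(sizes):
    """Weights plus one bias per node for each pair of adjacent layers."""
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


# 1500 is a hypothetical number of predictors, not a figure from the article.
sizes = layer_sizes(1500)
print(count_dense_parameters(sizes))
# → 5259103
```

Under these assumptions, most of the roughly five million parameters sit in the connections between the 1000-node layers rather than in the input layer, which helps explain why computing capacity was a concern.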

Future Directions

Completing this project gave us new insight into how neural networks can be used to effectively explore text data. In the future, there are more methods and directions we would like to explore. In this iteration of the project, we used TF-IDF as a method of dimension reduction because of our limited computing capacity. With increased resources, we would consider using all inputs in a bag-of-words approach or an n-gram approach. We could also consider using more sophisticated deep learning methods, like an RNN or LSTM model.
The nature of the data used also makes this project difficult to generalize. The researcher collected many tokens of “racist” tweets, but these might be more precisely labeled as “Islamophobic” tweets. In a future iteration of the project, we could choose a greater variety of tweets that could be considered hate speech. In addition, many of the tweets were from the same users, or from the same thread. This has advantages and disadvantages. One advantage may be that this makes the system more fine-tuned: if two people are discussing the same topic, what differentiates the one using “hate speech” from the one who is not? On the other hand, many of the inputs were repeated and did not have much bearing on whether or not the tweet was problematic. For example, one word that showed up in our TF-IDF terms for sexism was “Colin.” We do not believe that the incidence of the name “Colin” is meaningful in determining whether a tweet is sexist or not. During data cleaning, we removed names and obviously meaningless tokens such as these from our analysis, but fully automating a system like this using this methodology would be tricky. If we were to continue this project, we would like to have a larger, more representative dataset of tweets, perhaps those flagged as offensive by users themselves (and double-checked, of course), and to use more sophisticated deep learning methodologies for natural language processing.
Overall, collaborating on this project gave us a chance to further develop our skills in text analytics, Python, and machine learning. We believe that this project shows that machine learning has the potential to categorize even nebulous, ill-defined topics like hate speech.
Columnists: Kirsty Ward and Amy Hemmeter