What insights can NLP give us about Reddit posts?

J Kelman
7 min read · Jul 8, 2020

With so much of today’s communication happening online, natural language processing (NLP) can play a crucial role in helping us gather information from online posts. While learning about NLP, I used Reddit’s API to collect online submission data and see how it can be used to derive insights for a fictitious company. For the purpose of this article, I will focus on the data cleaning and exploratory data analysis (EDA).

The Data

We gathered submissions from 2 subreddits: the OCD (Obsessive Compulsive Disorder) subreddit and the autism (ASD) subreddit. The OCD and ASD subreddits were specifically chosen because they are two disorders that share symptoms (such as narrowed interests or difficulty deviating from a routine). As a result, it is interesting to see what insights NLP can give us about two separate populations that share characteristics. Will NLP be able to pick up on the similarities and differences of those 2 user groups?

Our dataframe contained 3,862 submissions posted by individual users between November 2019 and March 2020. It looked a little something like this:
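As a rough illustration (the rows below are hypothetical, but the column names match those used throughout this article):

```python
import pandas as pd

# Hypothetical rows for illustration only; the real data came from Reddit's API
df = pd.DataFrame({
    'subreddit': ['OCD', 'autism'],
    'title': ['Constant checking', 'Question about routines'],
    'selftext': ['I keep going back to check the stove...', None],
    'created_utc': [1574000000, 1583000000],
})
```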

Data Cleaning

Before any analysis can be done, the data needs to be cleaned.

Step 1: Removing unneeded variables

One of the variables, created_utc (the Coordinated Universal Time at which the submission was originally posted), was used to create the timestamp feature (the date the submission was posted on). Since the timestamp feature is more appropriate for our analysis, we can delete the created_utc variable.
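In pandas, this step might look something like the following (assuming the dataframe is named df):

```python
# Convert the Unix epoch in created_utc to a date, then drop the original column
df['timestamp'] = pd.to_datetime(df['created_utc'], unit='s').dt.date
df = df.drop(columns=['created_utc'])
```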

Step 2: Reformatting our target variable

Our target variable, subreddit, only takes two possible values: OCD and ASD. As a result, we can reformat this feature as a binary variable, with ASD as 1 and OCD as 0. This is useful because most models require numeric inputs and cannot work with string labels directly.
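Assuming the subreddit names appear as “autism” and “OCD” in the data, a one-line sketch of this encoding:

```python
# Encode the target as binary: ASD (autism) = 1, OCD = 0
df['subreddit'] = df['subreddit'].map({'autism': 1, 'OCD': 0})
```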

Step 3: Handling null values

We first need to check if our dataset contains any missing values.
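One way to check, sketched in pandas:

```python
# Count the missing values in each column
print(df.isnull().sum())
```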

We see that the selftext feature (the main content of the submission) contains 177 missing values. In this case, we cannot reasonably impute or infer what the missing text might be, and deleting those data points would sacrifice otherwise useful information. However, for our analysis, we are interested in any text shared by users, whether it comes from the main body or the title. In addition, looking at the data, it seems that some users wrote the entirety of their post in the “title” section. Combining selftext and title into one text feature would therefore resolve the missing-value issue without compromising the data.

However, looking at the data, we see that some text (such as “[removed]” or “[deleted]”) was inserted by Reddit as a placeholder. As our analysis focuses on text submitted by users, we need to remove these Reddit-inserted placeholders before we combine selftext and title.
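Assuming the placeholders appear verbatim in selftext, this might look like:

```python
# Replace Reddit's placeholders for removed/deleted posts with empty strings
df['selftext'] = df['selftext'].replace(['[removed]', '[deleted]'], '')
```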

Since a null value cannot be concatenated with a string, we first need to fill the null values with an empty string.

We can now combine selftext and title into one main text feature.
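Both steps, sketched in pandas:

```python
# Fill missing selftext with an empty string, then merge title and body text
df['selftext'] = df['selftext'].fillna('')
df['text'] = (df['title'] + ' ' + df['selftext']).str.strip()
```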

Let’s confirm that we no longer have any null values in our data.
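For example:

```python
# The total count of missing values across all columns should now be 0
print(df.isnull().sum().sum())
```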

Our data now looks like this:

Step 4: Cleaning the text

Now that the general data cleaning and formatting are done, we can focus on NLP-specific cleaning. A function was created to make this process more efficient.
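A minimal sketch of such a function, assuming NLTK’s Porter stemmer and WordNet lemmatizer:

```python
import re
import string

from bs4 import BeautifulSoup
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Requires nltk.download('wordnet') the first time
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(raw_text):
    # Remove HTML formatting
    text = BeautifulSoup(raw_text, 'html.parser').get_text()
    # One instance of each punctuation character (a set holds no duplicates)
    exclude = set(string.punctuation)
    # Replace newline characters and digits with empty strings
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\d', '', text)
    # Remove links by matching common URL patterns
    text = re.sub(r'(https?://\S+|www\.\S+)', ' ', text)
    # Remove punctuation: keep only the characters not in `exclude`
    text = ''.join(ch for ch in text if ch not in exclude)
    # Standardize white space: collapse any run of whitespace to one space
    text = re.sub(r'\s+', ' ', text)
    # Lowercase for uniformity
    text = text.lower()
    # Stem, then lemmatize, each word
    words = [lemmatizer.lemmatize(stemmer.stem(word)) for word in text.split()]
    return ' '.join(words)
```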

Let’s take a closer look at what this function does:

  • The first step removes the HTML formatting using Beautiful Soup’s .get_text().
  • Next, a variable called “exclude” is created, holding one instance of each punctuation character (one instance because it is a set).
  • Newline characters (“\n”) and digits (“\d”) are then replaced with empty strings.
  • The fourth step removes links from the submission. A pattern matching common URL formats (such as strings starting with “https”) is defined, and any matching string is replaced with a blank space.
  • The fifth step removes punctuation by looping through the text and keeping only the characters that are not in the exclude variable created earlier. Since that variable contains all the punctuation, this strips the punctuation and keeps everything else.
  • The next step standardizes white space. This is done by replacing r’\s+’ (\s+ matching any run of whitespace characters, with the r prefix marking a raw string so that backslashes are not interpreted as escape sequences) with a single white space “ ”.
  • To make the text more uniform, capitalization is removed using .lower().
  • Finally, the text is stemmed and lemmatized. Stemming reduces a word to its stem or base; lemmatizing maps a word to its dictionary form (lemma). This ensures that different forms of the same word, such as “eat”, “eats”, and “eating”, are treated as one.

Now that we’ve created the function, it’s time to use it and clean our text:
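Assuming the sketch above, this is a single map over the text column:

```python
# Apply the cleaning function to every submission
df['text'] = df['text'].map(clean_text)
```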

Our data is clean and ready for analysis!

EDA Findings

A lot of insights can be derived from this data. Here we will focus on an analysis of the most common words. To do that, we will use CountVectorizer, which transforms text data by turning each word into a variable and counting how many times it appears in each observation. One more thing we need to consider in this analysis is stopwords: very common words (such as “the” or “a”) that are often removed because they add little information.
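As a sketch of how this could be done with scikit-learn (top_words is a hypothetical helper, not a library function):

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_words(docs, n=20, remove_stopwords=False):
    """Return the n most frequent words across a collection of documents."""
    vectorizer = CountVectorizer(stop_words='english' if remove_stopwords else None)
    counts = vectorizer.fit_transform(docs)
    totals = counts.sum(axis=0).A1  # total count of each word over all docs
    vocab = vectorizer.get_feature_names_out()
    return sorted(zip(vocab, totals), key=lambda pair: pair[1], reverse=True)[:n]

# Top 20 words in the ASD subreddit (encoded as 1), with stopwords removed
print(top_words(df.loc[df['subreddit'] == 1, 'text'], remove_stopwords=True))
```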

20 Most common words in the autism subreddit

Visualizing the 20 most common words with and without stopwords in the autism subreddit gives us the following graphs:

Those graphs show us 3 things:

  • When stopwords are included, a majority of the 20 most frequent words are those very common words. This makes sense given how frequent these words are in English, but it does not bring much value to our analysis.
  • When the stopwords are removed, more words relating to autism emerge as “top words”.
  • Words such as “help”, “think”, “feel”, and “people” give us a glimpse of the types of messages people post on this subreddit. It seems that people use this page to share their experiences with the disorder, express their feelings, and ask for help.

20 Most common words in the OCD subreddit

Visualizing the 20 most common words with and without stopwords in the OCD subreddit gives us the following graphs:

Once again, when stopwords are included, a majority of the 20 most frequent words are those very common words, which adds little value to our analysis. When the stopwords are removed, more words relating to OCD (such as “thought”) emerge as “top words”.

Now let’s look at both subreddits together:

This graph shows us the most common words in our entire dataset and labels each one as a top word in the OCD subreddit only, the ASD subreddit only, or both.

We see that most words are found in both subreddits, revealing the similarities between the two groups and suggesting that both use Reddit to share their experiences and feelings. However, 4 words emerge as subreddit specific: “ocd” and “thought” for the OCD page, and “people” and “help” for the autism subreddit. Those words are particularly interesting because they highlight core differences between the two disorders: OCD is defined by intrusive thoughts, while ASD is characterized by difficulties with social interaction. They also suggest that users posting on the autism page may be asking for help more often than individuals active on the OCD subreddit.

Overall, we saw that NLP was a useful tool to quickly identify potential similarities and differences between two user groups. In addition, some insights found during our analysis (such as the prevalence of stopwords) may be very useful to inform modeling.

To see the full project and how those insights were used to create a classification model, click here.

