Extracting (or Removing) Mentions and Hashtags in Tweets using Python

Image by Oldiefan from Pixabay

One of the most common problems when dealing with Twitter data (such as tweets) is knowing how to extract hashtags and mentions, or in some cases how to remove them from a tweet. If, like me, you deal with social media data (particularly Twitter) at work a lot, you have probably faced this problem too. It’s an important step when you are extracting or cleaning your data.

A tweet is limited to 280 characters. People can write almost anything in one: mention other users, add hashtags, share a URL, and so on. If you are a fellow Twitter user, you probably already know this, so I am going to go straight to showing you how to extract and remove hashtags and mentions.

Let’s pretend we have the tweet below. To start, I am going to import re, Python’s built-in module for regular expressions (regex), because we will need it throughout.

import re

tweet = "@__therealanna @heyitsrose let's have a zoom meeting tonite! #quarantinelife #girlsnight #onlinehangout"

The tweet above contains two mentions and three hashtags.

Extracting Mentions and Hashtags from A Tweet in Python

Extracting mentions and hashtags means taking them out of a tweet and storing them in one or more variables for later use.

Extracting Mentions

# Capture the 1-50 username characters that follow each @ sign.
mentions = re.findall(r"@([a-zA-Z0-9_]{1,50})", tweet)
mentions

If you print the mentions above, your output should be a list of the usernames mentioned in the tweet.

['__therealanna', 'heyitsrose']

The code above works by capturing every allowed character after the @ sign: a-z for lowercase letters, A-Z for capital letters, 0-9 for digits, and an underscore because Twitter allows underscores in usernames. It stops as soon as it meets any character that is not listed in the brackets, for example a whitespace. I cap the match at fifty characters, which is a generous bound, since Twitter usernames are limited to 15 characters.
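To see the stopping behaviour in practice, here is a small sketch run against a made-up sample string (the sample text and the names MENTION_RE and STRICT_MENTION_RE are my own, for illustration only). It also shows one caveat: because the pattern only looks at what follows the @ sign, it picks up the domain part of an email address. The stricter variant adds a negative lookbehind so that the @ may not directly follow a username character. That is one possible workaround, not an official Twitter rule.

```python
import re

# Same pattern as above, compiled once for reuse.
MENTION_RE = re.compile(r"@([A-Za-z0-9_]{1,50})")

# Hypothetical variant: reject an @ that directly follows a letter, digit or underscore.
STRICT_MENTION_RE = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,50})")

sample = "thanks @heyitsrose, mail me at anna@example.com (cc @__therealanna)!"

loose = MENTION_RE.findall(sample)
strict = STRICT_MENTION_RE.findall(sample)

print(loose)   # ['heyitsrose', 'example', '__therealanna']
print(strict)  # ['heyitsrose', '__therealanna']
```

The comma and the closing parenthesis are not in the character class, so each match ends right before them.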

Extracting Hashtags

The same logic used in extracting usernames can also be applied to extracting hashtags.

# Capture the 1-50 allowed characters that follow each # sign.
hashtags = re.findall(r"#([a-zA-Z0-9_]{1,50})", tweet)
hashtags

The output should be a list of hashtags like this ['quarantinelife', 'girlsnight', 'onlinehangout'].
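One limitation worth knowing: the character class [a-zA-Z0-9_] only covers ASCII letters, so hashtags written with accented or non-Latin characters are cut short or missed entirely. A minimal sketch of an alternative, assuming you want to keep such hashtags, swaps the class for \w, which in Python 3 matches Unicode word characters by default when the pattern is a str. The sample tweet below is made up for illustration.

```python
import re

# "\u00e9" is the letter é, written as an escape so the example does not
# depend on how the source file encodes it.
sample = "bonjour #caf\u00e9 #日本語 #girlsnight"

# ASCII-only class, as used in the article.
ascii_tags = re.findall(r"#([a-zA-Z0-9_]{1,50})", sample)

# \w also accepts letters and digits from other scripts (plus underscore).
unicode_tags = re.findall(r"#(\w{1,50})", sample)

print(ascii_tags)    # ['caf', 'girlsnight']
print(unicode_tags)  # ['café', '日本語', 'girlsnight']
```

Which version is right depends on your data: if all your tweets are in English, the ASCII pattern is probably enough.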

Removing Mentions and Hashtags from A Tweet in Python

By removing mentions and hashtags from a tweet, we aim to have a ‘cleaner’ tweet that won’t contain these things. Again, regex helps here: re.sub replaces every match of a pattern with a replacement string, in this case an empty one.

clean_tweet = re.sub(r"@[A-Za-z0-9_]+", "", tweet)
clean_tweet = re.sub(r"#[A-Za-z0-9_]+", "", clean_tweet)

clean_tweet

"  let's have a zoom meeting tonite!   "

Your output should look like the text above. The mentions and hashtags are gone, although the spaces that surrounded them are still there.
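The substitutions above leave behind the spaces that sat around each removed mention or hashtag, so the result starts and ends with whitespace. If you want a tidy string, one way to do it (a sketch, with clean_tweet_text being a helper name I made up) is to remove both kinds of tokens in a single pass and then collapse the leftover whitespace.

```python
import re


def clean_tweet_text(text):
    """Remove @mentions and #hashtags, then collapse leftover whitespace."""
    # One pass handles both prefixes: "@" or "#" followed by the allowed characters.
    without_tags = re.sub(r"[@#][A-Za-z0-9_]+", "", text)
    # split() with no arguments drops leading, trailing and repeated whitespace.
    return " ".join(without_tags.split())


tweet = "@__therealanna @heyitsrose let's have a zoom meeting tonite! #quarantinelife #girlsnight #onlinehangout"
cleaned = clean_tweet_text(tweet)
print(repr(cleaned))  # "let's have a zoom meeting tonite!"
```

Note that this also turns newlines and tabs inside the tweet into single spaces, which is usually fine for text cleaning but worth knowing if you need to keep line breaks.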

Thanks for reading!
