One of the most common problems when dealing with Twitter data (such as tweets) is knowing how to extract hashtags and mentions, or in some cases how to remove them from a tweet. If you, like me, deal with social media data (particularly Twitter) a lot at work, you probably face this problem too. It’s an important step when you are extracting or cleaning your data.
Now a tweet is limited to 280 characters. People can write almost anything in a tweet: mention other users, add a hashtag, share a URL, and so on. If you are a fellow Twitter user, you probably already know this. So I am just going to show you how to extract and remove hashtags and mentions.
Let’s pretend we have the tweet below. To start, I am going to import re, Python’s built-in regular expression (regex) module, because we will need it later.
import re
tweet = "@__therealanna @heyitsrose let's have a zoom meeting tonite! #quarantinelife #girlsnight #onlinehangout"
As you can see, the tweet above contains two mentions and three hashtags.
Extracting Mentions and Hashtags from a Tweet in Python
Extracting mentions and hashtags means pulling them out of a tweet and storing them in one or more variables for later use.
Extracting Mentions
mentions = re.findall("@([a-zA-Z0-9_]{1,50})", tweet)
mentions
If you print mentions, the output should be a list of the usernames mentioned in the tweet.
['__therealanna', 'heyitsrose']
The code above works by capturing every allowed character after the @ sign: a-z for lowercase letters, A-Z for uppercase letters, 0-9 for digits, and an underscore, because Twitter allows underscores in usernames. Matching stops at the first character that is not listed in the brackets, for example a whitespace. I cap the match at fifty characters, which is a generous upper bound, since Twitter limits usernames to 15 characters.
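To make the stopping behaviour concrete, here is a small sketch that reuses the same pattern on two made-up strings. The name MENTION_PATTERN and both sample strings are my own, purely for illustration.

```python
import re

# Same pattern as above, compiled once so it can be reused.
MENTION_PATTERN = re.compile(r"@([a-zA-Z0-9_]{1,50})")

# Punctuation is not in the character class, so it ends the match.
sample = "thanks @heyitsrose, see you at 8! cc @__therealanna."
print(MENTION_PATTERN.findall(sample))  # ['heyitsrose', '__therealanna']

# Caveat: the pattern does not check what comes before the @ sign,
# so the domain part of an email address is picked up as well.
email_text = "email me at anna@example.com"
print(MENTION_PATTERN.findall(email_text))  # ['example']
```

If your data may contain email addresses, you will probably want to filter those out before or after extracting mentions.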
Extracting Hashtags
The same logic used in extracting usernames can also be applied to extracting hashtags.
hashtags = re.findall("#([a-zA-Z0-9_]{1,50})", tweet)
hashtags
The output should be a list of hashtags like this:
['quarantinelife', 'girlsnight', 'onlinehangout']
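If you need both lists at once, the two calls can be wrapped in a small helper. This is only a sketch: extract_entities is a hypothetical name of my own, not something provided by re or by Twitter.

```python
import re

def extract_entities(text):
    """Return (mentions, hashtags) found in text, without the @ or # prefix."""
    mentions = re.findall(r"@([a-zA-Z0-9_]{1,50})", text)
    hashtags = re.findall(r"#([a-zA-Z0-9_]{1,50})", text)
    return mentions, hashtags

tweet = "@__therealanna @heyitsrose let's have a zoom meeting tonite! #quarantinelife #girlsnight #onlinehangout"
mentions, hashtags = extract_entities(tweet)
print(mentions)  # ['__therealanna', 'heyitsrose']
print(hashtags)  # ['quarantinelife', 'girlsnight', 'onlinehangout']
```

Keep in mind that the character class only covers ASCII letters, digits, and underscores, so a hashtag containing other characters would be cut off at the first one that is not listed.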
Removing Mentions and Hashtags from a Tweet in Python
By removing mentions and hashtags from a tweet, we aim to have a ‘cleaner’ tweet that no longer contains them. Again, regex helps here: re.sub replaces every match of a pattern with the string you give it, in this case an empty string.
clean_tweet = re.sub("@[A-Za-z0-9_]+","", tweet)
clean_tweet = re.sub("#[A-Za-z0-9_]+","", clean_tweet)
clean_tweet
"  let's have a zoom meeting tonite!   "
Your output should look like the text above. The mentions and hashtags are gone, although the spaces that surrounded them are still there.
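Removing the mentions and hashtags leaves behind the spaces that surrounded them. A minimal sketch of one way to tidy that up, assuming you are happy to collapse every run of whitespace into a single space (remove_entities is a hypothetical helper name of my own):

```python
import re

def remove_entities(text):
    """Strip @mentions and #hashtags, then normalise the leftover whitespace."""
    # One character class handles both prefixes in a single pass.
    stripped = re.sub(r"[@#][A-Za-z0-9_]+", "", text)
    # split() with no arguments drops leading/trailing whitespace and
    # splits on any run of whitespace, so joining gives single spaces.
    return " ".join(stripped.split())

tweet = "@__therealanna @heyitsrose let's have a zoom meeting tonite! #quarantinelife #girlsnight #onlinehangout"
print(remove_entities(tweet))  # let's have a zoom meeting tonite!
```

Note that this also collapses line breaks inside a tweet, which may or may not be what you want.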
Thanks for reading!