Creating Word Clouds in Python

This article walks through word cloud generation in Python. Word clouds are a great way to summarize large pieces of text visually. They are typically used to depict metadata, such as keywords or tags, on websites. The larger a keyword's font size, the higher its significance on the website. In this article, we will write a program that generates custom word clouds.

Installation

We'll install the packages required for this tutorial in a virtual environment created with conda. For installation instructions, refer to the Anaconda Package Manager website.

Once Anaconda is installed, create a new virtual environment by running the following command in the terminal:

conda create -n wordcloud python=3.6

This will create a virtual environment with Python 3.6.

We will be installing the following packages:

matplotlib
nltk
wordcloud

Activate the virtual environment using the command conda activate wordcloud. Once it is active, the packages are installed locally inside it, so you must activate the wordcloud environment every time before using them. You may also choose a different name for the virtual environment; just replace wordcloud with the name of your choice.

To install the packages, we will use the following commands:

matplotlib: pip3 install matplotlib
nltk: pip3 install nltk
wordcloud: pip3 install wordcloud

Note: If you get an error during installation, install version 1.19.3 of numpy using the command pip3 install numpy==1.19.3. For more information on the error, refer to this discussion.

NLTK Installation: Depending on the installer used, you may or may not need to run the following commands. If you get an error related to the stopwords or punkt package while running the code, run the following commands:

>>> import nltk
>>> nltk.download('popular')

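This downloads the popular collection, which includes the stopwords and punkt packages. Calling nltk.download() with no arguments opens a graphical downloader instead. If you are unsure of what to download, refer to this question on Stack Overflow.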
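If you would rather fetch only what this article needs, the following is a minimal sketch that downloads the two resources only when they are missing. The helper name ensure_nltk_resources and its find and download parameters are hypothetical and not part of NLTK. The resource paths assume the NLTK 3.5 release used in this article; newer releases may look for a differently named tokenizer resource.

```python
# Hypothetical helper (not part of NLTK): fetch only the resources that are missing.
# The paths below are assumptions based on the NLTK 3.5 release used in this article.
REQUIRED_RESOURCES = {
    "corpora/stopwords": "stopwords",
    "tokenizers/punkt": "punkt",
}


def ensure_nltk_resources(resources=REQUIRED_RESOURCES, find=None, download=None):
    """Download each missing resource and return the list of packages fetched."""
    if find is None or download is None:
        import nltk  # imported lazily so the lookup functions can be swapped out

        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for path, package in resources.items():
        try:
            find(path)  # raises LookupError when the resource is absent
        except LookupError:
            download(package)
            fetched.append(package)
    return fetched
```

Calling ensure_nltk_resources() once, before running the word cloud code, should be enough. The find and download parameters exist only so the helper can be exercised without network access.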

Once installed, check that the packages are installed correctly. Run the following code in a Python shell (started by running python3) in your terminal; each print statement should output a version number.

>>> import nltk
>>> print(nltk.__version__)
>>> import matplotlib
>>> print(matplotlib.__version__)
>>> import wordcloud
>>> print(wordcloud.__version__)

If you get valid output, you have successfully installed the packages and can proceed with the rest of the article. The output should look similar to the following, although your version numbers may differ:

>>> import matplotlib
>>> matplotlib.__version__
'3.3.3'
>>> import nltk
>>> nltk.__version__
'3.5'
>>> import wordcloud
>>> wordcloud.__version__
'1.8.1'
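The same check can also be scripted. The following is a rough sketch using only the standard library; the function name installed_versions is an illustrative choice, and it assumes each package exposes a __version__ attribute, as the three packages above do.

```python
import importlib


def installed_versions(package_names):
    """Map each package name to its __version__, or None if it cannot be imported."""
    versions = {}
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Package is not installed in the active environment.
            versions[name] = None
        else:
            # Some packages do not define __version__; report None for those too.
            versions[name] = getattr(module, "__version__", None)
    return versions
```

Calling installed_versions(["matplotlib", "nltk", "wordcloud"]) inside the activated environment returns a dictionary of version strings; a None value points to a package that still needs to be installed.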

Word Cloud Generation

Let's now look at the code to generate word clouds. The input to the program will be a paragraph copied from any website of your choice. With the paragraph as input, we'll pre-process it and send it to the wordcloud package.

Let's begin.

As mentioned above, we use the following libraries:

matplotlib: A visualization and plotting tool used extensively in Python.
nltk.corpus.stopwords: The Natural Language Toolkit, known as nltk, is a library built for performing various Natural Language Processing (NLP) tasks. It's a vast library with tools for pre-processing, data cleaning, data visualization, data modeling, etc. We'll use its list of English stopwords. Stopwords are very common words, such as "the", "is", and "and", that don't add significant meaning to the data.
nltk.tokenize.word_tokenize: Tokenization is the process of breaking down the text into smaller units called tokens. The tokens can be words, sub-words, or phrases. We'll use the tokenizer available in nltk.
wordcloud: It's a library that takes in text and outputs a word cloud image. Developed by Andreas Mueller, it's quite extensible and flexible with respect to its features.
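To make tokenization and stopword removal concrete, here is a small standard-library sketch of the two ideas. It is not how nltk implements them: the regular expression is a simplified stand-in for word_tokenize, and SAMPLE_STOPWORDS is a tiny hypothetical list, far shorter than nltk's English stopword list.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch only).
SAMPLE_STOPWORDS = {"the", "of", "are", "an", "and", "in", "for", "to", "a", "is"}


def simple_tokenize(text):
    """Lowercase the text and split it into word tokens.

    A token is a run of letters or digits, optionally joined by hyphens or apostrophes.
    """
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())


def remove_stopwords(tokens, stop_words=SAMPLE_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stop_words]


tokens = simple_tokenize("Datasets are an integral part of the field of machine learning.")
print(tokens)
# ['datasets', 'are', 'an', 'integral', 'part', 'of', 'the', 'field', 'of', 'machine', 'learning']
print(remove_stopwords(tokens))
# ['datasets', 'integral', 'part', 'field', 'machine', 'learning']
```

Only the words that carry meaning survive the filter, and those are the words we want the word cloud to display.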

Code

We define a class called WordCloudGeneration and define the following methods in the class:

preprocessing: This method converts the input data to lower case, joins it into a single paragraph, and passes it through the tokenizer. Tokenization results in a list of words. This list is then filtered: a word is kept only if it is not a stopword. The remaining words are joined back into a single string, preprocessed_data.
create_word_cloud: This method takes in the preprocessed text and creates a WordCloud object. The generate method of the WordCloud class builds the word cloud from the text. Using the library matplotlib, we plot the resulting image.

import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud

class WordCloudGeneration:
    def preprocessing(self, data):
        # convert all words to lowercase
        data = [item.lower() for item in data]
        # load the stop_words of english
        stop_words = set(stopwords.words('english'))
        # concatenate all the data with spaces.
        paragraph = ' '.join(data)
        # tokenize the paragraph using the inbuilt tokenizer
        word_tokens = word_tokenize(paragraph)
        # filter words present in stopwords list
        preprocessed_data = ' '.join([word for word in word_tokens if word not in stop_words])
        print("\n Preprocessed Data: ", preprocessed_data)
        return preprocessed_data

    def create_word_cloud(self, final_data):
        # initiate WordCloud object with parameters width, height, maximum font size and background color
        # call the generate method of WordCloud class to generate an image
        wordcloud = WordCloud(width=1600, height=800, max_font_size=200, background_color="black").generate(final_data)
        # plot the image generated by WordCloud class
        plt.figure(figsize=(12,10))
        plt.imshow(wordcloud)
        plt.axis("off")
        plt.show()

wordcloud_generator = WordCloudGeneration()
# you may uncomment the following line to use custom input
# input_text = input("Enter the text here: ")
input_text = 'These datasets are used for machine-learning research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less-intuitively, the availability of high-quality training datasets.[1] High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.'
input_text = input_text.split('.')
clean_data = wordcloud_generator.preprocessing(input_text)
wordcloud_generator.create_word_cloud(clean_data)

The output of the code opens in a separate window. It should look like the image shown below.
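Because the font size of a word reflects how often it occurs, one way to sanity-check the picture is to count the words in the preprocessed text yourself. The sketch below uses only the standard library; the function name top_words is an illustrative choice. It keeps purely alphabetic tokens, so hyphenated words and punctuation are ignored, and the wordcloud library applies its own tokenization and by default also groups frequent word pairs, so the ranking will only roughly match the image.

```python
from collections import Counter


def top_words(preprocessed_text, n=5):
    """Return the n most frequent alphabetic tokens as (word, count) pairs."""
    words = [word for word in preprocessed_text.split() if word.isalpha()]
    return Counter(words).most_common(n)


sample = "datasets machine learning datasets training datasets learning"
print(top_words(sample, n=2))
# [('datasets', 3), ('learning', 2)]
```

Passing clean_data from the program above to top_words should list words such as datasets and learning near the top, which are the words that appear largest in the word cloud.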

Conclusion

This was a fun experiment we coded in Python. We went over how to create a virtual environment using Anaconda and how to install the packages needed to generate word clouds. I encourage you to test the program with various inputs and experiment with the code. Happy learning.

Peer Review Contributions by: Adrian Murage

Author

Lalithnarayan C

Lalithnarayan C is an ambitious and creative engineer pursuing his Masters in Artificial Intelligence at Defense Institute of Advanced Technology, DRDO, Pune. He is passionate about building tech products that inspire and make space for human creativity to flourish. He is on a quest to understand the infinite intelligence through technology, philosophy, and meditation.
