Introduction to Data Analysis Using Pandas
Data Science and Data Analytics are some of the hottest topics in the Computer Science industry. The ability to analyze and make predictions based on data is nothing short of extraordinary. Python is one of the most popular languages in the data science community. This is due to its ease of use and rich collection of libraries built to work with data. Pandas is a library that makes handling data easy and efficient. In this tutorial, we are going to understand how Pandas can be used to explore and draw insights from data.
Prerequisites
- A basic understanding of programming in Python
- Basic knowledge of data analytics
What is Pandas?
According to the official documentation, Pandas is a fast, powerful, flexible, and easy to use open-source data analysis and manipulation tool. It is built on top of the Python programming language. Pandas is usually used in conjunction with Jupyter notebooks, making it more powerful and efficient for exploratory data analysis.
If you are new to Jupyter notebooks, this article walks you through the installation and basics of Jupyter notebooks.
Pandas provides a robust collection of functions that make it easy to read and process data. In this tutorial, we are going to explore some useful functions and techniques that are an integral part of a data scientist's toolset. You can install Pandas by using Python's package manager, pip.
Enter the following command on the terminal:
pip3 install pandas
Alternatively, if you want to install Pandas using a different method, this tutorial walks you through the various ways in which you can install Pandas.
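To confirm that the installation worked, a quick check like the one below can be run in a Python shell or notebook cell. This is only a minimal sketch; the variable name installed_version is chosen for illustration, and the version printed depends on your environment.

```python
# Verify that pandas can be imported and report the installed version
import pandas as pd

installed_version = pd.__version__
print("pandas version:", installed_version)
```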
Analyzing data using Pandas
Now that we have Pandas installed on our system, we can delve into data exploration and analysis. For this, I will be using the "wine dataset". Navigate to this link to download the dataset from Kaggle.
The "wine" dataset is a beginner-friendly dataset that provides information on various factors that affect the quality of the wine. It has 12 columns describing different factors such as pH, the acidity of the wine, etc. I will be using Jupyter notebooks to execute Python code in this tutorial. However, you can execute the code in a different text editor or IDE of your choice. Jupyter notebooks make it easier to view and explore the data.
Start Jupyter by running the following command on the terminal:
jupyter notebook
It will open a browser window and display the Jupyter notebook UI.
Code
# Import necessary Libraries
import pandas as pd
# read the data using pandas
data = pd.read_csv("winequality-red.csv")
The first line imports the Pandas library and gives it an alias called pd. Therefore, every time we use pd, we will be referring to pandas. The read_csv function reads a CSV (comma-separated values) file and returns its contents, which we store in a variable called data.
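If the file is not downloaded yet, read_csv can still be tried on an in-memory sample. The sketch below assumes a tiny three-row extract with only three of the dataset's columns; the values are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical sample standing in for winequality-red.csv (values are made up)
csv_text = """pH,alcohol,quality
3.51,9.4,5
3.20,9.8,6
3.26,10.5,7
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (rows, columns)
```

The first line of the text is treated as the header row, so the sample ends up with three rows and three named columns.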
Pandas stores the data it reads in a data structure called a DataFrame. According to the official documentation, a data frame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure that also contains labeled axes (rows and columns).
It is similar to a 2D array, except that its rows and columns carry labels. By "size-mutable", we mean that rows and columns can be added or removed after the data frame is created. By "heterogeneous", we mean that different columns can hold different data types.
In simple terms, a data frame is like a table that contains named columns and rows of data similar to a table in a database. A data frame is powerful and has a lot of built-in functions that allow us to manipulate data. We are going to look at some of these functions. In the example above, the wine dataset is read and stored in a data frame called data.
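The two properties described above can be seen on a small, hand-built data frame. In this sketch the column names mirror the wine dataset, but the values and the extra "label" column are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the values are made up for illustration
frame = pd.DataFrame({
    "pH": [3.51, 3.20, 3.26],
    "quality": [5, 6, 7],
})

# Heterogeneous: each column keeps its own data type
print(frame.dtypes)

# Size-mutable: adding a column changes the shape from (3, 2) to (3, 3)
frame["label"] = ["a", "b", "c"]
print(frame.shape)
```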
data.head()
data.columns
data.info()
The head() function returns the first 5 rows of the dataset by default. If a number n is passed as an argument, it returns the first n rows.
The data.columns attribute holds the labels of all the columns in the data.
The info() function prints a summary of the data, including the number of rows, the number of columns, the name of each column, its data type, and its count of non-null values.
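These three calls can be tried on any data frame. The sketch below uses a hypothetical seven-row frame with made-up values so that the difference between head() and head(2) is visible.

```python
import pandas as pd

# Hypothetical seven-row frame with made-up values
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 10.0],
    "quality": [5, 5, 5, 6, 5, 5, 7],
})

first_default = sample.head()   # first 5 rows
first_two = sample.head(2)      # first 2 rows
column_names = list(sample.columns)

print(len(first_default), len(first_two))
print(column_names)
sample.info()  # prints the summary directly; it returns None
```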
# Finding the min and max quality of the wine
print("Wine with maximum quality:", data.quality.max())
print("Wine with minimum quality:", data.quality.min())
data.quality.head(10)
data.quality.tail(5)
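In a data frame, we can access individual columns by using the dot (.) operator.
For instance, in the example above, we access the 'quality' column in the data and print its minimum and maximum values. The head(10) and tail(5) calls then show the first 10 and the last 5 values of that column. Similarly, we can access the 'pH' column by typing data.pH.
Another way to access an individual column is with square brackets: data['quality']. The bracket form is required when a column name contains spaces, as in data['fixed acidity'].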
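The two access styles can be compared side by side. The frame below is a hypothetical sample: "fixed acidity" is a column name from the wine dataset, but the values are made up.

```python
import pandas as pd

# Hypothetical sample; the values are made up for illustration
sample = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2],
    "quality": [5, 5, 6],
})

# Dot and bracket access return the same column
same_column = sample.quality.equals(sample["quality"])
print(same_column)

# A name containing a space can only be reached with brackets
max_acidity = sample["fixed acidity"].max()
print(max_acidity)
```

Bracket access is also the safer habit when a column name collides with an existing data frame attribute or method.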
data.describe()
data['pH'] = data['pH'].astype(int)
data.head()
The describe() function provides summary statistics for each numeric column, such as the count, mean, standard deviation, minimum, quartiles (including the median, labeled 50%), and maximum.
The astype() function converts the data from its original type to the one specified in the argument. In the example above, we convert the 'pH' column, which holds float values, to integers by specifying int as the argument. The conversion discards the fractional part, so a pH of 3.51 becomes 3.
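Because astype(int) truncates rather than rounds, it can help to compare both behaviors on a few values. The sketch below uses a made-up series of pH readings; rounding first is shown only as one possible alternative, not as what the tutorial's code does.

```python
import pandas as pd

# Made-up pH values for illustration
ph = pd.Series([3.51, 3.20, 3.99], name="pH")

summary = ph.describe()
print(summary["50%"])  # the median is reported under the "50%" label

# astype(int) drops the fractional part, while round() goes to the nearest integer
truncated = ph.astype(int)
rounded = ph.round().astype(int)
print(truncated.tolist(), rounded.tolist())
```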
data['good_wine'] = data['quality'] > 5
data['bad_wine'] = data['quality'] <= 5
data.head()
data = data.sort_values('alcohol', ascending=False)
data.head(10)
We created two new columns, 'good_wine' and 'bad_wine', as shown in the example above. The 'good_wine' column holds True wherever the 'quality' of the wine is greater than 5 and False otherwise. The 'bad_wine' column holds True wherever the 'quality' is less than or equal to 5.
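A comparison on a column produces one boolean per row, which is what makes these new columns possible. The sketch below uses made-up quality scores and also shows two related idioms that are not part of the example above: negating a boolean column with ~ and using it to filter rows.

```python
import pandas as pd

# Made-up quality scores for illustration
sample = pd.DataFrame({"quality": [4, 5, 6, 7]})

# Comparing a column with a number yields one boolean per row
sample["good_wine"] = sample["quality"] > 5

# ~ negates a boolean column, so this matches quality <= 5
sample["bad_wine"] = ~sample["good_wine"]

# A boolean column can also be used to keep only the matching rows
good_only = sample[sample["good_wine"]]
print(good_only["quality"].tolist())
```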
The sort_values() function sorts the data frame by the specified column. In the example above, we specify the 'alcohol' column, so the rows are ordered by alcohol content. Passing ascending=False tells pandas to sort in descending order. Set it to True (the default) if you want the data sorted in ascending order.
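One detail worth seeing in practice is that sorting keeps each row's original index label. The sketch below uses made-up values; the call to reset_index is an optional extra step, not something the example above does.

```python
import pandas as pd

# Made-up values for illustration
sample = pd.DataFrame({
    "alcohol": [9.4, 12.8, 10.5],
    "quality": [5, 7, 6],
})

# sort_values returns a new frame; the original index labels travel with the rows
by_alcohol = sample.sort_values("alcohol", ascending=False)
print(by_alcohol.index.tolist())

# reset_index(drop=True) renumbers the rows starting from 0
renumbered = by_alcohol.reset_index(drop=True)
print(renumbered["alcohol"].tolist())
```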
data = data.drop(columns=['good_wine', 'bad_wine'])
data.head()
The drop() function can be used to get rid of unwanted columns in the dataset. You can pass a list of column names through the columns argument, and Pandas returns a data frame without those columns. After running the example above, the 'good_wine' and 'bad_wine' columns no longer appear in the output of data.head().
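Note that drop() does not modify the data frame it is called on, which is why the example assigns the result back to data. A minimal sketch with a made-up frame shows the difference:

```python
import pandas as pd

# Made-up frame with two helper columns
sample = pd.DataFrame({
    "quality": [5, 6],
    "good_wine": [False, True],
    "bad_wine": [True, False],
})

# drop returns a new frame and leaves `sample` unchanged
trimmed = sample.drop(columns=["good_wine", "bad_wine"])
print(list(trimmed.columns))
print(list(sample.columns))
```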
Conclusion and further reading
In conclusion, whether you are a data scientist, data engineer, or a software developer, Pandas is an indispensable part of your toolkit. In this tutorial, we looked at how we can explore the wine dataset and how we can draw insights from it using Pandas and its built-in functions.
Now that you have a better understanding of the basics of data analysis, you can go to Kaggle, download any dataset of your choice, and use Pandas to read, explore, and gain insights from it.
You can also go one step further and apply machine learning algorithms to classify and predict values from the dataset.
Peer Review Contributions by Saiharsha Balasubramaniam