Data Mining using CRISP-DM methodology
This article covers how the CRISP-DM methodology can be used to build successful data science projects. We will work through a case study to see how the methodology helps data scientists structure a project. As a prerequisite, you should have a beginner-level understanding of how data science projects are built.
Introduction
As data mining spreads across industries, a standard framework is needed to keep projects on track toward their objectives.
The use of a standard framework helps us in:
- Recording experience so that it can be reused in similar projects.
- Improving project planning and management.
- Encouraging best practices that lead to better results.
As projects grow in complexity, following a standard framework helps us achieve goals faster.
What is CRISP-DM?
According to this article, CRISP-DM stands for CRoss-Industry Standard Process for Data Mining and was developed in 1996.
CRISP-DM is one of the most widely used methodologies for building data mining projects. Polls conducted in 2007 and 2014 found it to be the most commonly used methodology among respondents, as shown in the image below:
According to Wikipedia, CRISP-DM is "a process model that describes commonly used approaches that data mining experts use to tackle problems... it was the leading methodology used by industry data miners."
CRISP-DM is a six-step process:
- Understanding the problem statement.
- Understanding the data.
- Preparing the data.
- Performing data analysis.
- Validating the data.
- Presenting/Visualizing the data.
Problem statement
To understand CRISP-DM methodology, let's look at a simple case study.
A utility company wants to predict the next day's demand for electricity so that it can allocate the necessary resources for power generation.
Now that we understand the business problem, let's devise a solution to predict the next day's electricity consumption.
Steps in CRISP-DM
Understanding the problem statement
This step focuses on understanding the objectives of the project and requirements from the perspective of the business.
Questions to ask are:
- What is the problem?
- What are the objectives?
- How is the success of the project measured?
- Who are the stakeholders?
Now, let's understand this from the perspective of our problem statement.
Determining business objectives
This step helps us define the criteria by which the business will consider the project a success.
For our problem, the objective is to "predict the approximate electricity consumption for the next day, to allocate necessary resources".
Assess the situation
This step helps us analyze the project's current situation by identifying the resources and the stakeholders of the project.
For our problem,
- We must find out which factors drive electricity consumption. One major factor is temperature.
Determine data mining goals
This step helps us translate the business goals into data mining goals and choose a suitable way to assess them.
For our problem,
- We must use data mining techniques to find what other factors affect consumption.
- We must determine what type of problem this is: classification, prediction, or clustering.
Produce a project plan
This final step helps us create an initial process plan and estimate the effort and resources needed to achieve our goals.
For our problem,
- We must estimate the resources needed when generating electricity.
- We must devise a series of steps to analyze consumption.
- We must decide how the project is to be evaluated.
- We must decide on the selection of tools and techniques.
Understanding the data
According to this article, the data understanding phase starts with an initial data collection and proceeds with activities to get familiar with the data, identify data quality problems, discover first insights into the data, or detect interesting subsets to form hypotheses for hidden information.
Questions to ask are:
- What information is required?
- What information is available?
- How do we collect the required information?
- What is the underlying pattern of the data?
For our problem,
- We start from the assumption that date, time, and temperature are three major factors affecting electricity consumption.
- Before proceeding, we must perform exploratory data analysis (EDA) to verify this assumption.
- We must identify the type of data that will be used to solve the problem, for example discrete, continuous, time-series, or seasonal data.
- We must analyze the data statistically and find the relationships between the variables.
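One way to check the temperature assumption during exploratory analysis is to compute a correlation coefficient. The following is a minimal sketch, not the method used in the case study: the `readings` values are made up for illustration, and `pearson` is a hypothetical helper written here so the example needs only the standard library.

```python
from math import sqrt

# Hypothetical readings: (temperature in °C, consumption in MWh).
# These numbers are invented purely for illustration.
readings = [
    (18.0, 310.0),
    (21.5, 342.0),
    (25.0, 390.0),
    (28.5, 431.0),
    (31.0, 470.0),
    (34.5, 512.0),
]


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


temperatures = [t for t, _ in readings]
consumption = [c for _, c in readings]

r = pearson(temperatures, consumption)
print(f"Correlation between temperature and consumption: {r:.3f}")
```

A coefficient close to 1 or -1 would support the assumption, while a value near 0 would suggest looking at other factors such as date and time.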
Preparing the data
According to this article, the data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks can be iterative and don't need to follow any sequence. Tasks include formatting, transforming, and cleaning of data.
In this step, we follow five common steps, as mentioned in this article:
- Gathering: Collecting data from multiple verified sources.
- Cleansing: Data can be missing, noisy, or sometimes incorrect. Cleaning the data is one of the most important tasks.
- Formatting: Data must suit the use case, which may require transforming or augmenting it.
- Blending: Data can be integrated and blended from multiple sources to achieve the desired objective.
- Sampling: Working with large data is cumbersome, so selecting and focusing on the important data reduces wasted resources.
These steps apply to any dataset we choose, irrespective of the problem.
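The cleansing, formatting, blending, and sampling steps above can be sketched in a few lines for our case study. This is only an illustration under stated assumptions: the records, the field names (`timestamp`, `mwh`, `temperature_c`), and the `prepare` function are hypothetical, and dropping incomplete rows is just one of several possible cleansing strategies.

```python
from datetime import datetime

# Hypothetical raw records from two sources; None marks a missing reading.
raw_consumption = [
    {"timestamp": "2021-01-01 00:00", "mwh": "310.5"},
    {"timestamp": "2021-01-01 01:00", "mwh": None},
    {"timestamp": "2021-01-01 02:00", "mwh": "298.0"},
    {"timestamp": "2021-01-01 03:00", "mwh": "305.2"},
]
raw_temperature = {
    "2021-01-01 00:00": 12.1,
    "2021-01-01 01:00": 11.8,
    "2021-01-01 02:00": 11.5,
    # The 03:00 reading is missing from the weather source.
}


def prepare(consumption_rows, temperature_by_time):
    """Clean, format, and blend the two hypothetical sources into one dataset."""
    prepared = []
    for row in consumption_rows:
        # Cleansing: skip rows with a missing consumption value.
        if row["mwh"] is None:
            continue
        # Blending: keep only rows that have a matching temperature reading.
        temperature = temperature_by_time.get(row["timestamp"])
        if temperature is None:
            continue
        # Formatting: convert strings into typed values.
        prepared.append({
            "timestamp": datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M"),
            "mwh": float(row["mwh"]),
            "temperature_c": temperature,
        })
    return prepared


dataset = prepare(raw_consumption, raw_temperature)

# Sampling: keep every second record as a simple systematic sample.
sample = dataset[::2]

print(f"{len(dataset)} clean rows, {len(sample)} sampled")
```

In a real project, missing values might be interpolated rather than dropped, depending on how much data is affected.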
Perform data analysis
According to this article, various modeling techniques are selected and applied in this phase, and their parameters are calibrated to optimal values. Each dataset has its own requirements, and understanding them sometimes requires revisiting earlier phases.
Things to consider:
- Determine which technique can be used to solve the problem.
- Determine the data requirement needed to solve the problem.
- Design a prototype of the model.
- Validate the model, and redesign the model.
In our problem, we find that there is a high correlation between temperature and electricity consumption. This can be found by following a series of steps, as mentioned in this article:
- Build a predictive model: predict the next day's temperature based on historical data.
- Validate the model: use the predicted temperature to predict the next day's electricity consumption and check whether the correlation holds.
- Repeat the process: repeat the two steps above until we are confident in the correlation.
- Perform analysis: once we are confident, analyze the results to support resource allocation.
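To make the modeling step concrete, here is a simplified sketch that fits a straight line relating temperature to consumption and uses it to predict the next day's demand. It is not the model from the case study: the history values are invented, `fit_line` is a hypothetical helper implementing ordinary least squares, and the sketch assumes that a temperature forecast for the next day is already available.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical daily history: average temperature (°C) and total consumption (MWh).
history_temp = [20.0, 22.0, 24.0, 26.0, 28.0, 30.0]
history_mwh = [400.0, 421.0, 439.0, 462.0, 480.0, 499.0]

slope, intercept = fit_line(history_temp, history_mwh)

# Assumed forecast for tomorrow's average temperature.
forecast_temp = 27.0
predicted_mwh = slope * forecast_temp + intercept

print(f"Each extra degree adds about {slope:.1f} MWh")
print(f"Predicted consumption for tomorrow: {predicted_mwh:.0f} MWh")
```

A single-variable line is the simplest possible choice. If date and time also turn out to matter, a model with more features or a time-series technique would be a natural next iteration.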
Validating the data
According to this article, at this stage in the project, you have built a model (or models) that appear to be of high quality from a data analysis perspective. Before proceeding to the model's final deployment, it is important to more thoroughly evaluate the model and review the steps executed to construct the model to be certain it properly achieves the business objectives.
Things to consider:
- Ensure that the results match the expectations.
- Decide whether to proceed to the next step or return to a previous phase.
- Make note of the important factors that could result in failure.
- Perform various tests with the end users.
In our problem, we have been working based on the assumption that temperature is one of the key factors affecting consumption. While validating, if we find that temperature does not correlate with electricity consumption, we must roll back to the previous step and further revise our model.
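The decision to proceed or to go back can be tied to a measurable criterion. The sketch below compares predictions against held-out actual values using the mean absolute percentage error (MAPE). The data, the 5% threshold in `MAX_ACCEPTABLE_MAPE`, and the decision labels are all assumptions for illustration; in practice the threshold would come from the business objectives agreed in the first phase.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Return the MAPE in percent; assumes no actual value is zero."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical held-out days: actual vs. predicted consumption in MWh.
actual_mwh = [455.0, 470.0, 490.0, 510.0]
predicted_mwh = [450.0, 476.0, 481.0, 515.0]

# Assumed business threshold: predictions must be within 5% on average.
MAX_ACCEPTABLE_MAPE = 5.0

mape = mean_absolute_percentage_error(actual_mwh, predicted_mwh)

if mape <= MAX_ACCEPTABLE_MAPE:
    decision = "proceed to presentation"
else:
    decision = "return to modeling"

print(f"MAPE: {mape:.2f}% -> {decision}")
```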
Presenting/visualizing the data
According to this article, the creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a useful way to the customer.
Things to consider:
- Determine the best method of presenting insights based on the analysis and audience.
- Stories speak more than facts.
- Ensure every decision is backed up by proper research.
- Enable the end user to have a visual workflow of the solution.
Although the analysts may have analyzed the problem thoroughly, visualizing the insights and presenting them to clients and business managers is the key step.
Visualization can be done by plotting graphs, performing statistical analysis, or predicting the next possible outcome. To learn more about visualization, please refer to this article on Matplotlib visualizations and this article on Pandas visualization.
Conclusion
In conclusion, we have learned the steps of the CRISP-DM methodology and understood them by analyzing a case study. The best way to understand it further is to apply it in your next data science project.
Happy Coding!
To summarize:
- We learned the need for a framework to build data science projects.
- We learned about the CRISP-DM methodology in detail.
- We also analyzed a case study to understand this methodology.
Further reading
- Course on Udacity
- PPT on CRISP-DM
- Article by KDNuggets
- Article by AnalyticsIndiaMag
- A Detailed explanation of the CRISP-DM methodology
- Article by Datascience-pm
- Article by Big data path
- Article on Business Analytics
- Article on analyzing the problem
- Notes from the course on Udacity
- Flash cards on understanding CRISP-DM
Peer Review Contributions by: Lalithnarayan C