The Data Science Process

Lesson 3/20 | Study Time: 10 Min

Course: Introduction to Python Basics for Data Science


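There is much more to data science than ChatGPT and selecting, applying and tuning Machine Learning algorithms. A data science project will often include the following stages: Business Understanding, Data Mining, Data Cleaning, Data Exploration, Feature Engineering, Predictive Modelling and Data Visualization.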

In this section, you will go through each of these stages and see what is involved.

Business Understanding / Domain Knowledge

Before trying to solve a data-related problem, it is important that a Data Scientist/Analyst has a clear understanding of the problem domain and the kinds of questions that their analysis needs to answer.

Some of the questions that the Data Scientist might be asked include:

How much or how many? E.g. Estimating the number of new customers likely to join your company in the next quarter. (Regression analysis)
Which category? E.g. Assigning a document to a given category for a document management system. (Classification analysis)
Which group? E.g. Creating a number of groups (segments) of your customers based on their monetary value. (Clustering analysis)
Is this weird? E.g. A credit card company detecting suspicious customer activity to identify potential fraud. (Anomaly detection)
Which items would a user prefer? E.g. Recommending new products (such as movies, books or music) to existing customers. (Recommendation systems)

Data Mining

After identifying the objective for your analysis and agreeing on analytical question(s) that need to be answered, the next step is to identify and gather the required data.

In this course, data mining refers to the process of identifying and collecting data of interest from different sources: databases, text files, APIs, the Internet, and even printed documents. Some of the questions that you may ask yourself at this stage are:

What data do I need in order to answer my analytical question?
Where can I find this data?
How can I obtain the data from the data source?
How do I sample from this data?
Are there any privacy/legal issues that I must consider prior to using this data?
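
To make the questions above a little more concrete, here is a minimal sketch of obtaining data and drawing a random sample from it, using only Python's standard library. The column names, the values in RAW_CSV and the helper functions load_rows and sample_rows are made up for illustration; in a real project the text would come from a file, a database or an API.

```python
import csv
import io
import random

# Hypothetical raw data; in practice this would come from a file, database or API.
RAW_CSV = """customer_id,age,monthly_spend
1,34,120.50
2,45,80.00
3,29,200.25
4,52,45.10
5,41,99.99
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def sample_rows(rows, k, seed=0):
    """Return k rows chosen at random without replacement (seeded so it is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


rows = load_rows(RAW_CSV)
subset = sample_rows(rows, k=3)
print(len(rows), "rows loaded;", len(subset), "rows sampled")
```

Note that csv.DictReader reads every value as text, so numeric columns such as age still need converting before analysis.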

Data Cleaning

Data cleaning is usually the most time-consuming stage of the Data Science process. It is often estimated to take 50-80% of a Data Scientist's time, as there are a vast number of possible problems that make the data "dirty" and unsuitable for analysis. Some of the problems you may see in data are:

Inconsistencies in data
Misspelled text data
Outliers
Imbalanced data
Invalid/outdated data
Missing data

This stage requires a careful strategy for dealing with these issues. Such a strategy may vary substantially between analyses, depending on the nature of the problem being solved.
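
As a rough illustration of what such a strategy might look like, the sketch below tidies up a few made-up records in plain Python. The records, the clean_record helper and the assumed valid age range of 0 to 120 are all hypothetical, and dropping incomplete rows is only one of several possible ways to handle missing data.

```python
# Hypothetical "dirty" records: inconsistent casing and spacing,
# a missing age, and an invalid (negative) age.
raw_records = [
    {"name": " Alice ", "city": "london", "age": "34"},
    {"name": "Bob", "city": "London", "age": ""},
    {"name": "carol", "city": "LONDON ", "age": "-5"},
    {"name": "Dan", "city": "Paris", "age": "41"},
]


def clean_record(record):
    """Return a cleaned copy of one record; age becomes None when missing or invalid."""
    name = record["name"].strip().title()
    city = record["city"].strip().title()
    try:
        age = int(record["age"])
    except ValueError:
        age = None  # empty or non-numeric text counts as missing
    if age is not None and not (0 <= age <= 120):
        age = None  # assumed valid range; out-of-range values count as invalid
    return {"name": name, "city": city, "age": age}


cleaned = [clean_record(r) for r in raw_records]

# One possible strategy for missing data: keep only the complete rows.
complete = [r for r in cleaned if r["age"] is not None]

for row in cleaned:
    print(row)
print(len(complete), "complete rows out of", len(cleaned))
```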

Data Exploration

Data exploration, or Exploratory Data Analysis (EDA), helps highlight the patterns and relationships in data. Exploratory analysis may involve the following activities:

Calculating basic descriptive statistics such as the mean, the median, and the mode
Creating a range of plots including histograms, scatter plots, and distribution curves to identify trends in the data
Building interactive visualizations to focus on specific segments of the data
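
The first two activities above can be sketched with Python's built-in statistics module. The ages list is invented for illustration, and the text histogram is only a stand-in for the plots you would normally draw with a plotting library.

```python
import statistics
from collections import Counter

# Hypothetical sample of customer ages.
ages = [23, 25, 25, 31, 38, 42, 25, 31]

mean_age = statistics.mean(ages)      # arithmetic average
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most frequent value

print("mean:", mean_age, "median:", median_age, "mode:", mode_age)

# A very rough text histogram: one '#' per occurrence of each age.
counts = Counter(ages)
for value in sorted(counts):
    print(f"{value:>3} | {'#' * counts[value]}")
```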

Feature Engineering

A "Feature" is a measurable attribute of the phenomenon being observed, for example the number of bedrooms in a house or the weight of a vehicle.

Based on the nature of the analytical question asked in the first step, a Data Scientist (future you) may have to engineer additional features not found in the original dataset.

Feature engineering is the process of using expert knowledge to transform raw data into meaningful features that directly address the problem you are trying to solve.

For example, you might combine weight and height to calculate the Body Mass Index (BMI) for the individuals in the dataset. This stage will substantially influence the accuracy of the predictive model you construct in the next stage.
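
The Body Mass Index example can be sketched in a few lines of Python, using the standard formula of weight in kilograms divided by the square of height in metres. The people list, the field names and the add_bmi helper are hypothetical.

```python
# Hypothetical raw data: weight in kilograms, height in metres.
people = [
    {"name": "A", "weight_kg": 70.0, "height_m": 1.75},
    {"name": "B", "weight_kg": 90.0, "height_m": 1.80},
]


def add_bmi(person):
    """Return a copy of the record with an engineered 'bmi' feature added."""
    bmi = person["weight_kg"] / person["height_m"] ** 2
    return {**person, "bmi": round(bmi, 1)}


enriched = [add_bmi(p) for p in people]

for person in enriched:
    print(person["name"], person["bmi"])
```

The original records are left untouched; each new record simply carries one extra feature that was not in the raw data.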

Predictive Modelling

Modelling is the stage where you use mathematical and/or statistical approaches to answer your analytical question.

Predictive Modelling refers to the process of using probabilistic statistical methods to try to predict the outcome of an event. For example, based on employee data, an organization can develop a predictive model to estimate its employee attrition rate and use the results to develop better retention strategies.

Choosing the "right" model is often a challenging decision, as there is rarely a single right answer. Selecting a model involves balancing the accuracy of the results against the computational cost of the analysis.

For example, some recent approaches in predictive modelling, such as deep learning, have been shown to offer vastly improved accuracy on some tasks, but at a very high computational cost.
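
At the cheap end of that trade-off, a "How much or how many?" question can sometimes be approached with a straight-line fit. The sketch below implements ordinary least squares by hand on made-up quarterly figures. The data and the fit_line helper are hypothetical, and a real project would normally use a dedicated library and check the model against held-out data.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: new customers gained in each of the last five quarters.
quarters = [1, 2, 3, 4, 5]
new_customers = [110, 125, 135, 150, 165]

slope, intercept = fit_line(quarters, new_customers)

# Use the fitted line to forecast the next (sixth) quarter.
forecast_q6 = slope * 6 + intercept
print("forecast for quarter 6:", round(forecast_q6, 1))
```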

Data Visualization

After deriving the required results from a statistical model, visualizations are normally used to summarize and present the findings of the analysis process in a form which is easily understandable by non-technical decision makers.

Data visualization could be thought of as an evolution of visual communication techniques as it deals with the visual representation of data.

There is a wide range of data visualization techniques, from bar graphs, line graphs and scatter plots to alluvial diagrams and spatio-temporal visualizations, each of which works better for presenting certain types of information.

Summary

In this lesson, we looked at the end-to-end Data Science process to give a sense of the activities that Data Scientists engage in.

I promise you it's not as tough as it sounds ... Now, get ready for a quick quiz to test your understanding 😀
