Learn everything you need to know about exploratory data analysis, a critical process used to discover trends and patterns and summarize data sets with the help of statistical summaries and graphical representations.
Like any project, a data science project is a long process that requires time, good organization, and scrupulous respect for several steps. Exploratory data analysis (EDA) is one of the most important steps in this process.
Therefore, in this article, we will briefly look into what exploratory data analysis is and how you can perform it with R!
What is Exploratory Data Analysis?
Exploratory data analysis examines and studies the characteristics of a data set before it is submitted to an application, whether exclusively business, statistical, or machine learning.
This summary of the nature of the information and its main particularities is usually done by visual methods, such as graphical representations and tables. The practice is carried out in advance precisely to assess the potential of these data, which will receive a more complex treatment in the future.
EDA therefore allows you to:
Formulate hypotheses for the use of this information;
Explore hidden details in the data structure;
Identify missing values, outliers, or abnormal behaviors;
Discover trends and relevant variables as a whole;
Discard irrelevant variables or variables correlated with others;
Determine the formal modeling to be used.
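A few of these checks can be sketched in base R. This is only an illustration: the built-in airquality dataset and the choice of the Wind column are assumptions made for the example and are not part of the tutorial later in this article.

```r
# airquality ships with base R (datasets package); it is used here only for illustration
df <- datasets::airquality

# Missing values: count the NA entries in each column
na_counts <- colSums(is.na(df))

# Outliers: values of Wind that fall beyond the boxplot whiskers (1.5 * IQR rule)
wind_outliers <- boxplot.stats(df$Wind)$out

# Correlated variables: pairwise correlations, ignoring rows with missing values
cor_matrix <- cor(df, use = "pairwise.complete.obs")

print(na_counts)
print(wind_outliers)
print(round(cor_matrix, 2))
```

Columns with many missing values, points flagged by the whisker rule, and pairs of strongly correlated variables are the kind of findings that feed the decisions listed above.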
What is the Difference Between Descriptive and Exploratory Data Analysis?
Descriptive analysis and exploratory data analysis are two types of data analysis that go hand in hand despite having different goals.
The first focuses on describing the behavior of variables through statistics such as the mean, median, and mode, while exploratory analysis aims to identify relationships between variables, extract preliminary insights, and direct the modeling toward the most common machine learning paradigms: classification, regression, and clustering.
Both may use graphical representations; however, only exploratory analysis seeks to bring actionable insights, that is, insights that provoke action by the decision-maker.
Finally, while exploratory data analysis seeks to solve problems and bring solutions that will guide the modeling steps, descriptive analysis, as its name implies, only aims to produce a detailed description of the dataset in question.
Descriptive analysis:
Provides a summary
Organizes data in tables and graphs
Does not have significant explanatory power
Exploratory data analysis:
Analyzes behavior and relationships
Leads to specification and actions
Organizes data in tables and graphs
Has significant explanatory power
Some Practical Use Cases of EDA
#1. Digital Marketing
Digital Marketing has evolved from a creative process to a data-driven process. Marketing organizations use exploratory data analysis to determine the results of campaigns or efforts and to guide consumer investment and targeting decisions.
Demographic studies, customer segmentation, and other techniques allow marketers to use large amounts of consumer purchase, survey, and panel data to understand and communicate marketing strategy.
Web exploratory analytics allows marketers to collect session-level information about interactions on a website. Google Analytics is an example of a free and popular analytics tool marketers use for this purpose.
Exploratory techniques frequently used in marketing include marketing mix modeling, pricing and promotion analyses, sales optimization, and exploratory customer analysis, e.g., segmentation.
#2. Exploratory Portfolio Analysis
A common application of exploratory data analysis is exploratory portfolio analysis. A bank or lending agency has a collection of accounts of varying value and risk.
Accounts may differ depending on the holder’s social status (rich, middle class, poor, etc.), geographic location, net worth, and many other factors. The lender must balance the return on the loan with the risk of default for each loan. The question then becomes how to value the portfolio as a whole.
The lowest-risk loans may go to very wealthy people, but there are only a limited number of them. On the other hand, there are many poor people to lend to, but at greater risk.
An exploratory data analysis approach can combine time series analysis with other techniques to decide when to lend money to these different segments of borrowers and at what rate. The interest charged to members of a portfolio segment has to cover the losses among members of that segment.
#3. Exploratory Risk Analysis
Predictive models in banking are being developed to provide certainty about risk scores for individual customers. Credit scores are designed to predict an individual’s delinquent behavior and are widely used to assess each applicant’s creditworthiness.
In addition, risk analysis is carried out in the scientific world and the insurance industry. It is also widely used in financial institutions such as online payment gateway companies to analyze whether a transaction is genuine or fraudulent.
For this purpose, they use the customer’s transaction history. This is most common with credit card purchases: when there is a sudden spike in a client's transaction volume, the client receives a call to confirm whether they initiated the transaction. This also helps to reduce losses from fraud.
Exploratory Data Analysis with R
To perform EDA with R, first download base R and RStudio (an IDE), then install and load the following packages:
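The package listing itself does not appear in this text. As a minimal sketch, the block below assumes the three packages that the rest of the tutorial refers to by name (ggplot2, skimr, and forecast); the original list may have included others.

```r
# Assumed package list: only the packages named later in this tutorial
pkgs <- c("ggplot2", "skimr", "forecast")

# Install whichever of them are not available yet
to_install <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(to_install) > 0) install.packages(to_install)

# Load them into the session
invisible(lapply(pkgs, library, character.only = TRUE))
```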
For this tutorial, we will use the economics dataset that ships with the ggplot2 package and provides monthly economic indicators for the US economy, renaming it econ for simplicity:
econ <- ggplot2::economics
To perform the descriptive analysis, we will use the skimr package, which calculates these statistics in a simple and well-presented way:
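The call is not shown in this text. A minimal sketch, assuming the skimr package is installed, would be:

```r
econ <- ggplot2::economics

# skim() reports the row and column counts, missing values, and summary
# statistics (mean, sd, quantiles) for each variable
skim_result <- skimr::skim(econ)
print(skim_result)
```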
You can also use the summary function for descriptive analysis:
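Again as a sketch, since the call is not shown here, the base R equivalent would look like this:

```r
econ <- ggplot2::economics

# summary() gives the minimum, quartiles, mean, and maximum of each column
econ_summary <- summary(econ)
print(econ_summary)
```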
Here the descriptive analysis shows 574 rows and 6 columns in the dataset. The date variable ranges from 1967-07-01 to 2015-04-01. The skimr output also shows the mean and the standard deviation of each numeric variable.
Now you have a basic idea of what is inside the econ dataset. Let’s plot a histogram of the variable uempmed, the median duration of unemployment in weeks, to get a better look at the data:
#Histogram of Unemployment
ggplot2::ggplot(econ, ggplot2::aes(x = uempmed)) +
  ggplot2::geom_histogram(bins = 30) +
  ggplot2::labs(x = "Median duration of unemployment (weeks)", title = "Median Duration of Unemployment in the US, 1967 to 2015")
The distribution of the histogram shows that it has an elongated tail on the right; that is, there are possibly a few observations of this variable with more “extreme” values. The question arises: in what period did these values take place, and what is the trend of the variable?
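One quick way to start answering the first question is to list the dates of the largest values. The sketch below treats the top 5% of observations as "extreme"; that cutoff is an arbitrary assumption made for illustration.

```r
econ <- ggplot2::economics

# Assumption: "extreme" means above the 95th percentile of uempmed
threshold <- stats::quantile(econ$uempmed, probs = 0.95)
extreme <- econ[econ$uempmed > threshold, c("date", "uempmed")]

# Period covered by the extreme observations and how many there are
print(range(extreme$date))
print(nrow(extreme))
```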
The most direct way to identify the trend of a variable is through a line graph. Below we generate a line graph and add a smoothing line:
#Line Graph of Unemployment
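The plotting code is not shown in this text. A minimal sketch with ggplot2 follows; the choice of a loess smoother and the axis labels are assumptions.

```r
library(ggplot2)

econ <- ggplot2::economics

# Line graph of uempmed over time with a smoothing line on top
p_line <- ggplot(econ, aes(x = date, y = uempmed)) +
  geom_line() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(x = "Date",
       y = "Median duration of unemployment (weeks)",
       title = "Median Duration of Unemployment in the US, 1967 to 2015")

print(p_line)
```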
Using this graph, we can see that in the most recent period, around 2010, the duration of unemployment rises sharply, surpassing the levels observed in previous decades.
Another important point, especially in econometric modeling contexts, is the stationarity of the series; that is, are the mean and variance constant over time?
When these assumptions are not true in a variable, we say that the series has a unit root (non-stationary) so that the shocks that the variable suffers generate a permanent effect.
This seems to be the case for the variable in question, the duration of unemployment. We have seen that its fluctuations have changed considerably over time, which has strong implications for economic theories that deal with cycles. But, theory aside, how do we check in practice whether the variable is stationary?
The forecast package has an excellent function, ndiffs, that applies tests such as ADF, KPSS, and others, and returns the number of differences needed to make the series stationary:
#Using ADF test for checking stationarity
forecast::ndiffs(
  x = econ$uempmed,
  test = "adf")
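Here, a p-value greater than 0.05 indicates that the series is non-stationary: under the ADF test the null hypothesis is that the series has a unit root, and with such a p-value it cannot be rejected.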
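A function that reports the number of differences does not print a p-value by itself. One way to obtain an ADF p-value directly is the adf.test function from the tseries package; note that tseries is not mentioned in this article, so using it here is an assumption, and its default lag order may differ from what the original analysis used.

```r
econ <- ggplot2::economics

# Augmented Dickey-Fuller test; the null hypothesis is that the series has a unit root
adf_result <- tseries::adf.test(econ$uempmed)

print(adf_result)
print(adf_result$p.value)
```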
Another important issue in time series is the identification of possible correlations (linear relationships) between the lagged values of the series. The ACF and PACF correlograms help to identify them.
As the series does not have seasonality but has a certain trend, the initial autocorrelations tend to be large and positive because the observations close in time are also close in value.
Thus, the autocorrelation function (ACF) of a trended time series tends to have positive values that slowly decrease as the lags increase.
#Residuals of Unemployment
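The code for this step is not shown in this text. As a minimal sketch, the correlograms can be produced with the acf and pacf functions from base R's stats package; the choice of 60 lags is an assumption.

```r
econ <- ggplot2::economics

# Compute the autocorrelation and partial autocorrelation up to 60 lags
acf_vals <- stats::acf(econ$uempmed, lag.max = 60, plot = FALSE)
pacf_vals <- stats::pacf(econ$uempmed, lag.max = 60, plot = FALSE)

# Draw the two correlograms
plot(acf_vals, main = "ACF of uempmed")
plot(pacf_vals, main = "PACF of uempmed")
```

For a trended series such as this one, the ACF plot should show positive bars that shrink slowly as the lag grows.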
When we get our hands on data that is already more or less clean, we are immediately tempted to dive into the model construction stage to draw the first results. You have to resist this temptation and start with exploratory data analysis, which is simple yet helps us draw powerful insights from the data.