Correlation analysis is a statistical technique for measuring the strength and direction of the relationship between two or more variables. It is a crucial step in Exploratory Data Analysis (EDA) and helps to identify patterns, trends, and relationships in data. The outcome of correlation analysis is a correlation coefficient, which can be positive, negative, or zero; the most widely used one, Pearson’s r, ranges from -1 to 1.
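The text does not say which correlation coefficient it has in mind. Assuming the common choice, Pearson’s r, the following is a minimal sketch of how the coefficient is computed from its definition (the sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations). The function name pearson_r and the sample values are illustrative, not part of the original article.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator terms: sums of squared deviations.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when either variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # points on a rising line
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # points on a falling line
print(r_up, r_down)
```

Points that lie exactly on a rising line give r = 1, and points on a falling line give r = -1. Real data usually falls somewhere in between.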
Positive Correlation
Positive correlation occurs when two variables increase or decrease together. In other words, as the value of one variable increases, the value of the other variable also increases, and vice versa. For example, there is a positive correlation between the number of hours of study and the marks obtained in an exam. The more hours a student studies, the higher the marks they are likely to score.
Let’s use Python to demonstrate positive correlation using a sample data set of student study hours and exam scores.
import matplotlib.pyplot as plt
import numpy as np
hours = [2, 4, 6, 8, 10]
scores = [50, 70, 80, 90, 95]
plt.scatter(hours, scores)
plt.xlabel("Hours of Study")
plt.ylabel("Exam Scores")
plt.title("Positive Correlation between Study Hours and Exam Scores")
plt.show()
In this code, we first import the matplotlib.pyplot and numpy libraries. Then, we create two lists, hours and scores, to represent the sample data set. Finally, we draw a scatter plot using the scatter function from matplotlib.pyplot and display it using the show function. The resulting scatter plot shows a clear positive correlation between the number of hours a student studies and the marks they obtain in an exam.
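A scatter plot shows the trend visually, while a correlation coefficient puts a number on it. As a minimal sketch, assuming Pearson’s r is the coefficient of interest, NumPy’s np.corrcoef can compute it for the same sample data:

```python
import numpy as np

hours = [2, 4, 6, 8, 10]
scores = [50, 70, 80, 90, 95]

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry [0, 1] is Pearson's r between hours and scores.
r = np.corrcoef(hours, scores)[0, 1]
print(f"Pearson's r = {r:.3f}")  # approximately 0.972
```

A value this close to 1 indicates a strong positive linear relationship in this small sample.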
Negative Correlation
Negative correlation occurs when two variables move in opposite directions: as the value of one variable increases, the value of the other tends to decrease, and vice versa. For example, screen time and sleep quality are often negatively correlated. The more hours a person spends in front of a screen, the lower the quality of their sleep is likely to be.
Let’s use Python to demonstrate negative correlation using a sample data set of screen time and sleep quality.
import matplotlib.pyplot as plt
import numpy as np
screen_time = [2, 4, 6, 8, 10]
sleep_quality = [8, 7, 6, 5, 4]
plt.scatter(screen_time, sleep_quality)
plt.xlabel("Screen Time (hours)")
plt.ylabel("Sleep Quality (out of 10)")
plt.title("Negative Correlation between Screen Time and Sleep Quality")
plt.show()
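In the resulting scatter plot, the points slope downward from left to right, which is the visual signature of a negative correlation. To quantify it, here is a minimal sketch that assumes Pearson’s r and uses NumPy’s np.corrcoef on the same sample data:

```python
import numpy as np

screen_time = [2, 4, 6, 8, 10]
sleep_quality = [8, 7, 6, 5, 4]

# The off-diagonal entry of the 2x2 correlation matrix is Pearson's r.
r = np.corrcoef(screen_time, sleep_quality)[0, 1]
print(f"Pearson's r = {r:.3f}")
```

Because these sample points lie exactly on a falling straight line, r comes out at -1 (up to floating-point rounding). Real measurements would be noisier and give a value between -1 and 0.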
No Correlation
No correlation occurs when there is no linear relationship between two variables. In other words, knowing how one variable changes tells you nothing about how the other changes. For example, we would not expect any correlation between a person’s favorite color and their intelligence level.
To illustrate, let’s consider a person’s height and their intelligence level. We would not expect a relationship between the two: taller people are not systematically more or less intelligent. We can draw a scatter plot to check this visually.
Here’s example Python code that draws the scatter plot:
import matplotlib.pyplot as plt
import numpy as np
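# Illustrative, made-up values with no upward or downward trend
# (Pearson's r is essentially 0 for these points)
height = np.array([5.5, 5.7, 5.9, 6.1, 6.3, 6.5, 6.7])
intelligence = np.array([110, 95, 120, 100, 120, 95, 110])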
plt.scatter(height, intelligence)
plt.xlabel("Height (ft)")
plt.ylabel("Intelligence (IQ)")
plt.title("Scatter Plot of Height vs Intelligence")
plt.show()
In this code, we first import the matplotlib.pyplot and numpy libraries. Then, we create two arrays height and intelligence to represent the sample data set. Finally, we plot a scatter plot using the scatter function from the matplotlib.pyplot library and display the plot using the show function. The resulting scatter plot demonstrates a lack of relationship between the height of a person and their intelligence level.
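With only a handful of hand-picked points, it is hard to see what "no correlation" looks like in practice. The following sketch uses synthetic data instead: two variables drawn independently from a normal distribution, so any correlation between them is due to chance alone. The variable names, sample size, and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Two variables drawn independently of each other
# (synthetic values, not real measurements).
x = rng.normal(loc=0.0, scale=1.0, size=1000)
y = rng.normal(loc=0.0, scale=1.0, size=1000)

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson's r = {r:.3f}")  # close to 0, but rarely exactly 0
```

In real data, "no correlation" usually means a coefficient near zero rather than exactly zero, because random variation alone produces small nonzero values.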
In conclusion, correlation analysis is a powerful tool for Exploratory Data Analysis. Understanding the types of correlation – positive, negative, and no correlation – is essential for interpreting the results of correlation analysis and for making informed decisions based on the data. It is also important to keep in mind that correlation does not imply causation, and it is crucial to consider other factors before drawing conclusions from the data.
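To make the point about causation concrete, here is a small sketch with synthetic data in which a hidden third variable drives two others. The variables hidden, a, and b and the noise levels are hypothetical, chosen only for illustration: a and b never influence each other, yet they end up strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 1000

# A hidden common cause (synthetic) that influences both observed variables.
hidden = rng.normal(size=n)
a = hidden + 0.5 * rng.normal(size=n)  # depends on hidden, not on b
b = hidden + 0.5 * rng.normal(size=n)  # depends on hidden, not on a

r_ab = np.corrcoef(a, b)[0, 1]

# Subtracting the hidden factor leaves only independent noise,
# and the correlation largely disappears.
r_residual = np.corrcoef(a - hidden, b - hidden)[0, 1]

print(f"r(a, b) = {r_ab:.2f}")
print(f"r after removing the hidden factor = {r_residual:.2f}")
```

The first coefficient is large even though neither variable causes the other, which is why a strong correlation alone is not evidence of a causal link.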