Notebook - Visualisation in EDA
Visualisation
The simple graph has brought more information to the data analyst’s mind than any other device.
—John Tukey
# Import statements
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import r2_score
# set mpl fonts
plt.rcParams['font.sans-serif'] = ['Arial', 'Helvetica', 'Lucida Grande', 'Verdana']
# Helper functions for plotting

def get_stats(y: pd.Series) -> tuple[float, float]:
    """Return the mean and standard deviation of a series."""
    return y.mean(), y.std()


def add_stats(ax, x, y):
    """Annotate an axis with the mean and stdev of y and the x-y correlation."""
    mu, sigma = get_stats(y)
    r = np.corrcoef(x, y)[0][1]
    stats_text = (f'Mean y $\\mu$ = {mu:.2f}\n'
                  f'Stdev. y $\\sigma$ = {sigma:.2f}\n'
                  f'Correlation $r$ = {r:.2f}')
    ax.text(0.95, 0.05, stats_text, fontsize=9,
            transform=ax.transAxes, horizontalalignment='right')


def get_linear_fit(x, y):
    """Fit y = mx + c and return the gradient, intercept and r-squared."""
    model = np.polyfit(x, y, deg=1)  # y = mx + c (m = gradient, c = intercept)
    m, c = model
    predict = np.poly1d(model)
    r2 = r2_score(y, predict(x))
    return m, c, r2


def add_fit(ax, x, y):
    """Draw the best-fit line on an axis and annotate it with the fit parameters."""
    m, c, r2 = get_linear_fit(x, y)
    ax.axline(xy1=(0, c), slope=m, color="red", linewidth=1)  # xy1 is a point the line passes through
    fit_text = (f"Fit\nGradient: {m:.2f}\nIntercept: {c:.2f}\n"
                f"y = {m:.2f}x + {c:.2f}\n"
                f"$r^2$ = {r2:.2f}")
    ax.text(0.05, 0.95, fit_text, fontsize=9,
            transform=ax.transAxes, horizontalalignment='left', verticalalignment="top")
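A quick sanity check of the fit helper on synthetic data with a known slope and intercept (the values here are arbitrary, chosen just for the demo):
# Noisy data around y = 2x + 1; the recovered fit should be close to that
rng = np.random.default_rng(0)
x_demo = np.arange(0, 10, 0.5)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(scale=0.5, size=x_demo.size)
m, c, r2 = get_linear_fit(x_demo, y_demo)
print(f"y = {m:.2f}x + {c:.2f}, r^2 = {r2:.2f}")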
Anscombe’s quartet
Anscombe’s quartet (please don’t click the link until after you have worked through the notebook) consists of four small datasets, each containing 11 pairs of x and y values.
Based on the mean, the standard deviation and the correlation coefficient between the two variables, the four datasets look close to identical. The quartet highlights the importance of visualisation in EDA and of going beyond simple summary statistics.
# The csv contains pairs of (x, y) data: (x1, y1) forms the first dataset, (x2, y2) the second, etc.
anscombe_df = pd.read_csv("data/anscombe_quartet.csv")
anscombe_df
| | x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
| 1 | 8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
| 2 | 13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
| 3 | 9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
| 4 | 11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
| 5 | 14 | 9.96 | 14 | 8.10 | 14 | 8.84 | 8 | 7.04 |
| 6 | 6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
| 7 | 4 | 4.26 | 4 | 3.10 | 4 | 5.39 | 19 | 12.50 |
| 8 | 12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 8 | 5.56 |
| 9 | 7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 |
| 10 | 5 | 5.68 | 5 | 4.74 | 5 | 5.73 | 8 | 6.89 |
We can compare some simple summary statistics for the x and y columns of each dataset.
The mean and standard deviation give a measure of the centre of the data and its spread, so we can look at those. Pearson’s correlation coefficient, \(r\), measures the linear correlation between the x and y variables.
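For paired samples \((x_i, y_i)\) with means \(\bar{x}\) and \(\bar{y}\), it is defined as

\[
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}
\]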
y_cols = ["y1", "y2", "y3", "y4"]
x_cols = ["x1", "x2", "x3", "x4"]

print("Summary statistics for x data:")
display(anscombe_df[x_cols].describe().loc[["count", "mean", "std"]])
print("Summary statistics for y data:")
display(anscombe_df[y_cols].describe().loc[["count", "mean", "std"]])

for idx, (x_col, y_col) in enumerate(zip(x_cols, y_cols), start=1):
    print(f"Dataset {idx}")
    pearson_r = np.corrcoef(anscombe_df[x_col], anscombe_df[y_col])[0][1]
    print(f"Pearson correlation coefficient: {pearson_r:.3f}\n")
Summary statistics for x data:
| | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| count | 11.000000 | 11.000000 | 11.000000 | 11.000000 |
| mean | 9.000000 | 9.000000 | 9.000000 | 9.000000 |
| std | 3.316625 | 3.316625 | 3.316625 | 3.316625 |
Summary statistics for y data:
| | y1 | y2 | y3 | y4 |
|---|---|---|---|---|
| count | 11.000000 | 11.000000 | 11.000000 | 11.000000 |
| mean | 7.500909 | 7.500909 | 7.500000 | 7.500909 |
| std | 2.031568 | 2.031657 | 2.030424 | 2.030579 |
Dataset 1
Pearson correlation coefficient: 0.816
Dataset 2
Pearson correlation coefficient: 0.816
Dataset 3
Pearson correlation coefficient: 0.816
Dataset 4
Pearson correlation coefficient: 0.817
The summary statistics for each of the x columns and y columns are the same to two decimal places (the same precision as the data itself). The correlation coefficients also agree to two decimal places.
On these measures, then, the four datasets look very similar. To check this, we can visualise each dataset as a scatter plot.
# Plotting function
def anscombe_plots(ans_df=anscombe_df, show_stats=False, show_fit=False):
    """Draw a 2x2 grid showing plots of the Anscombe quartet data"""
    fig, axs = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 8))
    for i, ax in enumerate(axs.flat):
        x = ans_df.iloc[:, 2 * i]      # x columns sit at even positions
        y = ans_df.iloc[:, 2 * i + 1]  # each y column follows its x column
        ax.scatter(x, y)
        ax.set_xlabel("x")
        ax.set_ylabel("y")
        ax.text(0.95, 0.3, f"Data {i + 1}", fontsize=16, transform=ax.transAxes,
                horizontalalignment="right", verticalalignment="top")
        if show_stats:
            add_stats(ax, x, y)
        if show_fit:
            add_fit(ax, x, y)
    plt.xlim(0, 20)
    plt.ylim(0, 14)
    plt.xticks([5, 10, 15, 20])
    plt.tight_layout()
    plt.show()
anscombe_plots(anscombe_df, show_stats=True, show_fit=False)
[Figure: 2x2 grid of scatter plots, one per dataset, each annotated with the mean and standard deviation of y and the correlation r]
From the scatter plots we can immediately see that the datasets are quite distinct. If we fit a linear regression line to each dataset, do the coefficients of the equation help to distinguish the data numerically?
anscombe_plots(anscombe_df, show_stats=True, show_fit=True)
[Figure: the same 2x2 grid of scatter plots, now with the red best-fit line and its parameters drawn on each panel]
The best-fit line is also essentially identical for every dataset, and the \(r^2\) value indicates that, numerically at least, it fits each of them equally well (or badly).
However, we can see from the visualisations that the regression line is only really appropriate for the first dataset.
If we look at some of the other descriptive statistics, particularly those describing the distribution of the data (min/max, percentiles), these do show differences between the datasets. But the simple scatter plot visualisations make the contrasts unmissable.
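For example, extending the earlier describe() call to the distribution rows shows that the y columns are not identical:
# The quartile and extreme rows of describe() differ clearly between datasets
anscombe_df[y_cols].describe().loc[["min", "25%", "50%", "75%", "max"]].round(2)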
In the case of Anscombe’s quartet, the scatter plots are sufficient to highlight the differences. For more complex data, there are other visualisations that highlight the differences between the distributions of the datasets.
Key points summary
Visualisation is an essential part of exploratory data analysis.
Always explore your datasets visually before applying models or drawing conclusions.
“The first step in analyzing data is to look at it.”
—John Tukey
Things to try/consider
See also
Another dataset, the Datasaurus Dozen, has been compiled to highlight how valuable visualisation is when assessing data. Data is available here
For more complex data, other visualisations can highlight the differences between the distributions of the datasets. The seaborn visualisation library makes it particularly straightforward to look at distributions of multiple variables and correlations between variables.
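As a quick sketch of that idea: seaborn ships its own long-format copy of Anscombe's quartet, so a faceted regression plot of all four datasets takes only a couple of lines.
# seaborn's built-in copy of the quartet, with columns: dataset, x, y
anscombe_long = sns.load_dataset("anscombe")
# One regression plot per dataset, on shared axes
sns.lmplot(data=anscombe_long, x="x", y="y", col="dataset", col_wrap=2,
           ci=None, height=3)
plt.show()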
What kinds of plot might help to highlight the differences between the datasets, or identify issues that might affect how the data is modelled? For example, dataset 3 contains a single isolated point: how does this affect the best-fit linear model, and how might it be identified?
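One way to probe that last question (a sketch, reusing the get_linear_fit helper defined above) is to find the point in dataset 3 with the largest residual and refit without it:
# Fit dataset 3, locate the point with the largest residual, then refit without it
x3, y3 = anscombe_df["x3"], anscombe_df["y3"]
m, c, r2 = get_linear_fit(x3, y3)
residuals = y3 - (m * x3 + c)
outlier = residuals.abs().idxmax()  # index label of the most isolated point
print(f"With all points:    y = {m:.2f}x + {c:.2f}, r^2 = {r2:.2f}")
m2, c2, r2_2 = get_linear_fit(x3.drop(outlier), y3.drop(outlier))
print(f"Without point {outlier}: y = {m2:.2f}x + {c2:.2f}, r^2 = {r2_2:.2f}")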