Scatter Plot in Python using Seaborn

The scatter plot is one of the handiest visualization tools for making quick inferences about the relationship between two variables. We'll build one with Seaborn, using the Boston housing data set from the scikit-learn library.

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb
from sklearn import datasets

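sb.set(font_scale=1.2, style="ticks")  # set styling preferences

# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call needs an older scikit-learn release.
dataset = datasets.load_boston()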

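# convert to a pandas data frame
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target

# give the target column a descriptive name
df = df.rename(columns={'target': 'median_value'})

# round the distance variable to whole numbers
df['DIS'] = df['DIS'].round(0)

# last expression in the cell, so the notebook displays it
df.head()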
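
If load_boston is not available in your scikit-learn version, the same preparation steps can still be tried out on any data frame. Here is a minimal sketch using a small synthetic stand-in. The name toy, the random seed, and the linear relationship between rooms and price are all assumptions made up for illustration, not values from the Boston data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: every value here is invented for illustration,
# none of it comes from the Boston housing data set.
rng = np.random.default_rng(0)
rooms = rng.normal(6.3, 0.7, size=200)
toy = pd.DataFrame({
    "RM": rooms,
    "DIS": rng.uniform(1, 12, size=200),
    "target": 9 * rooms - 34 + rng.normal(0, 5, size=200),
})

# Same clean-up as for the real data: rename the target column, round DIS
toy = toy.rename(columns={"target": "median_value"})
toy["DIS"] = toy["DIS"].round(0)

# Summary statistics, rounded to one decimal place
summary = toy.describe().round(1)
```

Passing a key to rename that does not match any column is silently ignored by default, so a typo in the old name leaves the frame unchanged rather than raising an error.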

Describe the data

df.describe().round(1)

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	median_value
count	506.0	506.0	506.0	506.0	506.0	506.0	506.0	506.0	506.0	506.0	506.0	506.0	506.0	506.0
mean	3.6	11.4	11.1	0.1	0.6	6.3	68.6	3.8	9.5	408.2	18.5	356.7	12.7	22.5
std	8.6	23.3	6.9	0.3	0.1	0.7	28.1	2.1	8.7	168.5	2.2	91.3	7.1	9.2
min	0.0	0.0	0.5	0.0	0.4	3.6	2.9	1.0	1.0	187.0	12.6	0.3	1.7	5.0
25%	0.1	0.0	5.2	0.0	0.4	5.9	45.0	2.0	4.0	279.0	17.4	375.4	7.0	17.0
50%	0.3	0.0	9.7	0.0	0.5	6.2	77.5	3.0	5.0	330.0	19.0	391.4	11.4	21.2
75%	3.6	12.5	18.1	0.0	0.6	6.6	94.1	5.0	24.0	666.0	20.2	396.2	17.0	25.0
max	89.0	100.0	27.7	1.0	0.9	8.8	100.0	12.0	24.0	711.0	22.0	396.9	38.0	50.0

Variable Key

Variable	Name
CRIM	per capita crime rate by town
ZN	proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS	proportion of non-retail business acres per town
CHAS	Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX	nitric oxides concentration (parts per 10 million)
RM	average number of rooms per dwelling
AGE	proportion of owner-occupied units built prior to 1940
DIS	weighted distances to five Boston employment centres
RAD	index of accessibility to radial highways
TAX	full-value property-tax rate per $10,000
PTRATIO	pupil-teacher ratio by town
B	1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT	% lower status of the population
median_value	median value of owner-occupied homes in $1000s

via UCI

Barebones scatter plot

plot = sb.lmplot(x="RM", y="median_value", data=df)

[Figure scatter1: RM against median_value, with a fitted regression line]
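
Note that lmplot fits and draws a regression line on top of the points. If you want only the points, seaborn's scatterplot function is one option. The sketch below assumes a small synthetic data frame (the name toy and all of its values are invented for illustration) so that it runs on its own.

```python
import numpy as np
import pandas as pd
import seaborn as sb

# Synthetic stand-in data (invented values, not the Boston data set)
rng = np.random.default_rng(0)
rooms = rng.normal(6.3, 0.7, size=200)
toy = pd.DataFrame({
    "RM": rooms,
    "median_value": 9 * rooms - 34 + rng.normal(0, 5, size=200),
})

# scatterplot draws only the points and returns a matplotlib Axes
ax = sb.scatterplot(x="RM", y="median_value", data=toy)
```

Because scatterplot returns a plain matplotlib Axes, while lmplot returns a FacetGrid, the result can be customized directly with the usual Axes methods.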

Add some color and re-label

points = plt.scatter(df["RM"], df["median_value"],
                     c=df["median_value"], s=20, cmap="Spectral") #set style options

#add a color bar
plt.colorbar(points)

#set limits
plt.xlim(3, 9)
plt.ylim(0, 50)

#build the plot (recent seaborn versions require x and y as keyword arguments)
plot = sb.regplot(x="RM", y="median_value", data=df, scatter=False, color=".1")

#add labels
plot.set(ylabel='Median Home Price ($1000s)', xlabel='Mean Number of Rooms')

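[Figure scatter2: RM against median_value, points colored by median_value, with color bar and regression line]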
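
The colored plot relies on matplotlib's implicit "current axes", which is what lets plt.scatter and sb.regplot end up in the same figure. A sketch of the same idea with an explicit Figure and Axes is shown below. It again assumes a synthetic data frame named toy with invented values, and the color bar label is an addition not present in the original plot.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

# Synthetic stand-in data (invented values, not the Boston data set)
rng = np.random.default_rng(0)
rooms = rng.normal(6.3, 0.7, size=200)
toy = pd.DataFrame({
    "RM": rooms,
    "median_value": 9 * rooms - 34 + rng.normal(0, 5, size=200),
})

fig, ax = plt.subplots()

# Points colored by the y value, drawn on an explicit Axes
points = ax.scatter(toy["RM"], toy["median_value"],
                    c=toy["median_value"], s=20, cmap="Spectral")
fig.colorbar(points, ax=ax, label="median_value")

# Regression line only (scatter=False), drawn on the same Axes
sb.regplot(x="RM", y="median_value", data=toy,
           scatter=False, color=".1", ax=ax)

ax.set(xlabel="Mean Number of Rooms", ylabel="Median Home Price ($1000s)")
```

Passing ax explicitly makes it clear which plot each call draws on, which helps once a notebook contains more than one figure.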