Statistical Sampling with Scikit-learn

Photo by Patrick Perkins on Unsplash

Setting a test set is one of the early stages of developing a machine learning model. Thus creating a representative sample will determine how the model will perform in production.
The following example demonstrates the difference between stratified sampling and random sampling.

0. Step Nada, imports:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
  1. First we create a population to work on:
np.random.seed(42) # (1)p = 0.8
a = ["Circle", "Square"]
size = 100
data = np.random.choice(a=a, size=size, p=[p, 1-p]) # (2)
  1. a seed, to make random numbers reproducible
  2. the population of circles and squares, the sample size is 100 with a population of 0.8 and 0.2

2. Then, we convert the data to a data frame and process it:

df = pd.DataFrame(data=data, columns=['X']) # (1)df.replace({'Circle': 0, 'Square': 1}, inplace=True) # (2)df['y'] = df.apply(lambda row: int(row["X"] == 0), axis=1) # (3)
  1. converting the data to a data frame
  2. encoding the data
  3. adding a label column

3. The random sampling step:

X_train, X_test, y_train, y_test = train_test_split(df['X'], df['y'], test_size=0.2) # (1)p_random = X_test.value_counts() / len(X_test) # (2)
  1. sample size of 20%
  2. p_random result is 0.85 and 0.15, a skewed sample

4. The stratified sampling step:

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2) # (1)for train_index, test_index in split.split(df, df['X']):
strat_train = df.loc[train_index]
strat_test = df.loc[test_index]
p_strat = strat_test['X'].value_counts() / len(strat_test) # (2)
  1. sample size of 20%
  2. p_strat result is 0.8 and 0.2, just like the original population.

In conclusion, the result is as expected. The stratified sampling was more representative than random sampling; this is only because the play was staged to demonstrates the differences. Increase the size of the population, and see what happens.

  • The overall population is: 0.8, 0.2
  • The random sample population is: 0.85, 0.15
  • The stratified sample populations is: 0.8, 0.2

The stratified sampling function (StratifiedShuffleSplit) will return a sample from each stratum (in this case, our strata was column ‘X’).

Happy cross-validating!

The complete code is here




Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Trap DS Projects: Beware of “Easy” Segmentation Projects

Search Autocomplete Personalisation

Visualization with Seaborn

Let’s Fix Jakarta’s Traffic

Google Cloud Professional Data Engineer Certification — My personal road map and thoughts in 2020

Bandits can trick you about your outcomes

Frequent pattern mining, Association, and Correlations

Ratio Analysis — The Business Tool for Analyzing Financial Performance of A Business

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Mansoor Aldosari

Mansoor Aldosari

More from Medium

What Could Go Wrong: Linear Regression

Logistic Regression- The history, the theory and the maths

Learning Ensemble methods

what is a regression in machine learning?