Statistical Sampling with Scikit-learn

Photo by Patrick Perkins on Unsplash

Setting aside a test set is one of the early stages of developing a machine learning model. How representative that sample is affects how well the test results reflect the model's performance in production.
The following example demonstrates the difference between stratified sampling and random sampling.

0. Step Nada, imports:
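The import cell is not reproduced in the text, so the following is only a plausible set of imports for the steps described in this post. The exact names the notebook imports are an assumption.

```python
# Assumed imports for the steps in this walkthrough;
# the original notebook may import more or fewer names.
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
from sklearn.preprocessing import LabelEncoder
```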

1. First, we create a population to work on:

  1. a seed, to make random numbers reproducible
  2. the population of circles and squares: 100 items, with proportions of 0.8 and 0.2
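A minimal sketch of this step could look like the following. The seed value (42) and the choice to build exact counts of 80 circles and 20 squares, rather than drawing each item at random, are assumptions made here so that the proportions come out at exactly 0.8 and 0.2.

```python
import numpy as np

SEED = 42  # assumed value; any fixed integer makes the run reproducible
rng = np.random.default_rng(SEED)

# Build exactly 80 circles and 20 squares, then shuffle their order.
# Exact counts are an assumption of this sketch, not necessarily the notebook's approach.
population = rng.permutation(np.repeat(["circle", "square"], [80, 20]))

# Check the share of each shape in the population.
shapes, counts = np.unique(population, return_counts=True)
proportions = dict(zip(shapes.tolist(), (counts / counts.sum()).tolist()))
print(proportions)
```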

2. Then, we convert the data to a data frame and process it:

  1. converting the data to a data frame
  2. encoding the data
  3. adding a label column
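The three sub-steps above might be sketched as follows. The column name 'X' comes from the post, but the label column name `y` and its contents (a random binary placeholder) are hypothetical, since the text does not say what the label holds.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(42)  # assumed seed
population = rng.permutation(np.repeat(["circle", "square"], [80, 20]))

# 1. Convert the data to a data frame; 'X' is the column later used as the strata.
df = pd.DataFrame({"X": population})

# 2. Encode the shape names as integers.
#    LabelEncoder sorts the classes, so circle -> 0 and square -> 1.
df["X"] = LabelEncoder().fit_transform(df["X"])

# 3. Add a label column. 'y' is a hypothetical random binary target,
#    used only as a placeholder in this sketch.
df["y"] = rng.integers(0, 2, size=len(df))

print(df.head())
```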

3. The random sampling step:

  1. sample size of 20%
  2. p_random result is 0.85 and 0.15, a skewed sample
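One way to take the random sample is scikit-learn's `train_test_split`, sketched below. Using this function, the seed, and the `random_state` value are assumptions, so the proportions printed here need not match the 0.85 and 0.15 reported in the post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Encoded population as in the earlier steps: 0 = circle (80 rows), 1 = square (20 rows).
rng = np.random.default_rng(42)  # assumed seed
df = pd.DataFrame({"X": rng.permutation(np.repeat([0, 1], [80, 20]))})

# Purely random split: 20% of the rows go to the test set, ignoring 'X'.
_, random_test = train_test_split(df, test_size=0.2, random_state=42)

# Share of each shape in the random sample; it can drift away from 0.8 / 0.2.
p_random = random_test["X"].value_counts(normalize=True).sort_index()
print(p_random)
```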

4. The stratified sampling step:

  1. sample size of 20%
  2. p_strat result is 0.8 and 0.2, just like the original population.
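The stratified step can be sketched with `StratifiedShuffleSplit`, passing column 'X' as the strata. The seed and `random_state` are assumed values. With exactly 80 circles and 20 squares, a 20% stratified sample contains 16 circles and 4 squares, which reproduces the population proportions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Encoded population as in the earlier steps: 0 = circle (80 rows), 1 = square (20 rows).
rng = np.random.default_rng(42)  # assumed seed
df = pd.DataFrame({"X": rng.permutation(np.repeat([0, 1], [80, 20]))})

# A single stratified split that holds out 20% of the rows.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# split() yields (train indices, test indices); 'X' defines the strata.
train_idx, test_idx = next(splitter.split(df, df["X"]))
strat_test = df.iloc[test_idx]

# Share of each shape in the stratified sample.
p_strat = strat_test["X"].value_counts(normalize=True).sort_index()
print(p_strat)
```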

In conclusion, the result is as expected. The stratified sample was more representative than the random sample, but only because the example was staged to demonstrate the difference. Increase the size of the population and see what happens.

  • The overall population is: 0.8, 0.2
  • The random sample population is: 0.85, 0.15
  • The stratified sample population is: 0.8, 0.2
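To try the suggestion of increasing the population size, a small experiment like the one below could be used. The helper `random_sample_square_share`, the population sizes, and the seed are all hypothetical choices for illustration. The random sample's share of squares typically moves closer to 0.2 as the population grows, although any single run can still deviate.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def random_sample_square_share(n, seed=42):
    """Share of squares in a 20% random sample from a population of size n.

    Hypothetical helper: the population holds exactly 20% squares.
    """
    n_squares = n // 5
    x = np.repeat([0, 1], [n - n_squares, n_squares])  # 0 = circle, 1 = square
    _, sample = train_test_split(x, test_size=0.2, random_state=seed)
    return float(sample.mean())


# Compare the random sample's square share across growing populations.
results = {n: random_sample_square_share(n) for n in (100, 1_000, 10_000)}
for n, share in results.items():
    print(f"population={n:>6}  square share in random sample={share:.3f}")
```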

The stratified sampling class (StratifiedShuffleSplit) draws from each stratum in proportion to its share of the data, so the class percentages are preserved in the sample (in this case, the strata were defined by column ‘X’).

Happy cross-validating!

The complete code is here: https://github.com/booletic/medium/blob/main/strata.ipynb
