Splitting a Pandas DataFrame into training and testing sets
To split a DataFrame into training and test sets, use scikit-learn's train_test_split(~) function.
Example
Basic usage
Suppose we wanted to split the following DataFrame into training and testing sets:
df
   A  B   C
0  3  6  10
1  4  7  11
2  5  8  12
3  6  9  13
We first need to divide df into two parts: a DataFrame of features, X, and a Series of targets, y. In the indexing expression used for this, the : before the , means that we want to fetch all rows, and whatever comes after the , specifies the columns to fetch.
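The step described above can be sketched as follows. This is only a minimal sketch: it assumes that columns A and B are the features and column C is the target (which is consistent with the outputs shown later), and it uses df.loc for the selection, although df.iloc would work equally well.

```python
import pandas as pd

# Rebuild the example DataFrame shown above.
df = pd.DataFrame({"A": [3, 4, 5, 6], "B": [6, 7, 8, 9], "C": [10, 11, 12, 13]})

# Assumption: columns A and B are the features and column C is the target.
# The ":" before the "," selects all rows; what follows selects the columns.
X = df.loc[:, ["A", "B"]]  # DataFrame of features
y = df.loc[:, "C"]         # Series of targets
```

Passing a list of column labels returns a DataFrame, while passing a single label returns a Series.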
We then import train_test_split(~) and use it to split X and y into training and testing sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Here, note the following:
The splitting process involves random shuffling. You can turn this off by setting shuffle=False.
The split is 75% training and 25% testing by default.
Setting random_state=1 is needed for reproducibility; despite the random nature of the split, you end up with the same split every time you use the same random_state.
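The first and last points above can be checked with a short sketch. As before, it assumes that A and B are the feature columns and C is the target; the variable names with the _ordered suffix are made up for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"A": [3, 4, 5, 6], "B": [6, 7, 8, 9], "C": [10, 11, 12, 13]})
X = df[["A", "B"]]  # assumed feature columns
y = df["C"]         # assumed target column

# With shuffle=False the rows keep their original order:
# the first 75% go to training and the last 25% go to testing.
X_train_ordered, X_test_ordered, y_train_ordered, y_test_ordered = train_test_split(
    X, y, shuffle=False
)
print(list(X_train_ordered.index))  # [0, 1, 2]
print(list(X_test_ordered.index))   # [3]

# Two calls with the same random_state produce identical splits.
first = train_test_split(X, y, random_state=1)
second = train_test_split(X, y, random_state=1)
same_split = all(a.equals(b) for a, b in zip(first, second))
print(same_split)  # True
```

Disabling the shuffle is mainly useful when row order carries meaning, for example in time-ordered data.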
Just for your reference, here's X_train:
X_train # DataFrame
   A  B
2  5  8
0  3  6
1  4  7
Here's y_test:
y_test # Series
3    13
Name: C, dtype: int64
Changing training and test size
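By default, the split is 75% training and 25% testing. We can change this by specifying the parameters train_size and/or test_size. When given as floats, both must be between 0 and 1 and are treated as proportions of the dataset; when given as integers, they are treated as absolute numbers of rows. As you would expect, you only need to specify one of these, and the other is set to its complement.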
To do a 50:50 split:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=1)
X_train
   A  B
0  3  6
1  4  7
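As a further sketch of the size parameters: passing test_size=0.5 should give the same 50:50 proportions as train_size=0.5, and train_test_split also accepts an integer for either parameter, which is read as an absolute number of rows rather than a fraction. The column roles (A and B as features, C as the target) and the _n variable names are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"A": [3, 4, 5, 6], "B": [6, 7, 8, 9], "C": [10, 11, 12, 13]})
X = df[["A", "B"]]  # assumed feature columns
y = df["C"]         # assumed target column

# Specifying only test_size: the remaining half is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)
print(len(X_train), len(X_test))  # 2 2

# An integer is interpreted as a row count: exactly one row for testing.
X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(
    X, y, test_size=1, random_state=1
)
print(len(X_train_n), len(X_test_n))  # 3 1
```

Integer sizes are convenient when you need a fixed number of held-out rows regardless of how large the DataFrame is.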