# Splitting a Pandas DataFrame into training and testing sets

Last updated: Aug 12, 2023

To split a DataFrame into training and testing sets, use Scikit-learn's `train_test_split(~)` function.

# Example

## Basic usage

Suppose we wanted to split the following DataFrame into training and testing sets:

```
import pandas as pd

df = pd.DataFrame({"A":[3,4,5,6],"B":[6,7,8,9],"C":[10,11,12,13]})
df
   A  B   C
0  3  6  10
1  4  7  11
2  5  8  12
3  6  9  13
```

We first need to divide `df` into two parts: a DataFrame of features (`X`) and a Series of targets (`y`):

```
X = df.loc[:,["A","B"]]
y = df.loc[:,"C"]
```

Here, the `:` before the `,` indicates that we want to fetch all rows, and whatever comes after the `,` specifies the columns to fetch.
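As a small illustration of the selection above, the following sketch checks what `.loc` returns in each case. The comparison against plain bracket indexing (`df[[...]]` and `df[...]`) is added here for reference only and is not part of the original walkthrough.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 4, 5, 6], "B": [6, 7, 8, 9], "C": [10, 11, 12, 13]})

# A list of column labels after the comma returns a DataFrame
X = df.loc[:, ["A", "B"]]
# A single column label after the comma returns a Series
y = df.loc[:, "C"]

print(type(X).__name__)   # DataFrame
print(type(y).__name__)   # Series

# For selecting whole columns, bracket indexing gives the same values
print(X.equals(df[["A", "B"]]))
print(y.equals(df["C"]))
```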

We then import the `train_test_split(~)` function and use it to split `X` and `y` into training and testing sets:

```
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
```

Here, note the following:

• the splitting process involves random shuffling. You can turn this off by setting `shuffle=False`.

• the split is 75% training and 25% testing by default.

• `random_state=1` makes the split reproducible: although the rows are shuffled randomly, using the same `random_state` yields the same split every time.
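The points above can be checked with a short sketch. It rebuilds the same `df`, calls `train_test_split(~)` twice with the same `random_state` to confirm that the splits match, and then sets `shuffle=False` to show that the rows keep their original order. The variable names with `_a`, `_b` and `_ns` suffixes are made up for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"A": [3, 4, 5, 6], "B": [6, 7, 8, 9], "C": [10, 11, 12, 13]})
X = df.loc[:, ["A", "B"]]
y = df.loc[:, "C"]

# Two calls with the same random_state produce identical splits
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, random_state=1)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, random_state=1)
same_split = X_train_a.equals(X_train_b) and y_test_a.equals(y_test_b)
print(same_split)

# With shuffle=False the rows are not shuffled: the leading rows go to
# the training set and the trailing rows go to the testing set
X_train_ns, X_test_ns, y_train_ns, y_test_ns = train_test_split(X, y, shuffle=False)
train_index = list(X_train_ns.index)
test_index = list(X_test_ns.index)
print(train_index)   # [0, 1, 2]
print(test_index)    # [3]
```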

Just for your reference, here's `X_train`:

```
X_train      # DataFrame
   A  B
2  5  8
0  3  6
1  4  7
```

Here's `y_test`:

```
y_test      # Series
3    13
Name: C, dtype: int64
```
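As a quick sanity check, a sketch like the following confirms the sizes of the four returned objects and that features and targets stay aligned, meaning `X_train` and `y_train` share the same row labels. The names `sizes` and `aligned` are introduced here purely for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"A": [3, 4, 5, 6], "B": [6, 7, 8, 9], "C": [10, 11, 12, 13]})
X = df.loc[:, ["A", "B"]]
y = df.loc[:, "C"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Number of rows in each piece: 3 training rows and 1 testing row
sizes = (len(X_train), len(X_test), len(y_train), len(y_test))
print(sizes)   # (3, 1, 3, 1)

# Features and targets are split on the same rows, so their indexes match
aligned = X_train.index.equals(y_train.index) and X_test.index.equals(y_test.index)
print(aligned)
```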

## Changing training and test size

By default, the split is 75% training and 25% testing. We can change this by specifying the parameters `train_size` and/or `test_size`. Each can be given as a float between 0 and 1 (the proportion of rows) or as an integer (the absolute number of rows). You only need to specify one of them, since the other is set to the complement.

To do a 50:50 split:

```
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=1)
X_train
   A  B
0  3  6
1  4  7
```
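For comparison, here is a sketch of two other ways to set the sizes. The first uses `test_size=0.5`, which should give the same 50:50 split as `train_size=0.5` when the `random_state` is the same. The second passes an integer, which scikit-learn interprets as an absolute number of rows rather than a proportion. The variable names with `_half` and `_one` suffixes are made up for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"A": [3, 4, 5, 6], "B": [6, 7, 8, 9], "C": [10, 11, 12, 13]})
X = df.loc[:, ["A", "B"]]
y = df.loc[:, "C"]

# 50:50 split specified via train_size
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=1
)

# The same 50:50 split specified via test_size instead
X_train_half, X_test_half, y_train_half, y_test_half = train_test_split(
    X, y, test_size=0.5, random_state=1
)
print(X_train.equals(X_train_half))

# An integer test_size means "this many rows" rather than a fraction:
# here exactly 1 row goes to the testing set and the other 3 to training
X_train_one, X_test_one, y_train_one, y_test_one = train_test_split(
    X, y, test_size=1, random_state=1
)
print(len(X_train_one), len(X_test_one))   # 3 1
```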
