Splitting a Pandas DataFrame into training and testing sets

Last updated: Aug 12, 2023
Tags: Python

To split a DataFrame into training and testing sets, use scikit-learn's train_test_split(~) function.

Example

Basic usage

Suppose we wanted to split the following DataFrame into training and testing sets:

import pandas as pd

df = pd.DataFrame({"A":[3,4,5,6],"B":[6,7,8,9],"C":[10,11,12,13]})
df
   A  B   C
0  3  6  10
1  4  7  11
2  5  8  12
3  6  9  13

We first need to separate df into two objects: a DataFrame holding the features (columns A and B), and a Series holding the target (column C):

X = df.loc[:,["A","B"]]
y = df.loc[:,"C"]

Here, the : before the comma means that we want all rows, and whatever comes after the comma specifies the columns to fetch. Passing a list of labels returns a DataFrame, while passing a single label returns a Series.
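As a quick check of this selection, the sketch below rebuilds the same df and compares .loc against plain bracket indexing. The names X_alt and y_alt are introduced here purely for illustration and do not appear elsewhere in this guide.

```python
import pandas as pd

df = pd.DataFrame({"A":[3,4,5,6],"B":[6,7,8,9],"C":[10,11,12,13]})

X = df.loc[:, ["A","B"]]   # list of labels -> DataFrame
y = df.loc[:, "C"]         # single label   -> Series

# Plain bracket indexing selects the same columns
# (X_alt and y_alt are illustrative names only)
X_alt = df[["A","B"]]
y_alt = df["C"]

print(type(X).__name__, type(y).__name__)   # DataFrame Series
print(X.equals(X_alt), y.equals(y_alt))     # True True
```

Either form works for selecting whole columns; .loc simply makes the "all rows" part explicit.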

We then import train_test_split(~) and use it to split X and y into training and testing sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

Here, note the following:

  • the rows are shuffled randomly before the split. You can turn this off by setting shuffle=False.

  • the split is 75% training and 25% testing by default.

  • random_state=1 makes the split reproducible: although the shuffle is random, passing the same random_state produces the same split every time.
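The points above can be checked with a small sketch. It rebuilds the same df, calls train_test_split(~) twice with the same random_state, and then once more with shuffle=False. The names first, second, same_split and the _ns suffix ("no shuffle") are illustrative choices, not part of scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"A":[3,4,5,6],"B":[6,7,8,9],"C":[10,11,12,13]})
X = df.loc[:, ["A","B"]]
y = df.loc[:, "C"]

# Same random_state -> the four returned objects match on every call
first = train_test_split(X, y, random_state=1)
second = train_test_split(X, y, random_state=1)
same_split = all(a.equals(b) for a, b in zip(first, second))

# shuffle=False -> rows keep their original order, and the last rows form the test set
X_train_ns, X_test_ns, y_train_ns, y_test_ns = train_test_split(X, y, shuffle=False)

print(same_split)                # True
print(list(X_train_ns.index))    # [0, 1, 2]
print(list(X_test_ns.index))     # [3]
```

With four rows, the default 25% test share works out to a single test row, which is why the test set here contains only row 3.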

Just for your reference, here's X_train:

X_train      # DataFrame
   A  B
2  5  8
0  3  6
1  4  7

Here's y_test:

y_test      # Series
3    13
Name: C, dtype: int64

Changing training and test size

By default, the split is 75% training and 25% testing. We can change this by specifying the parameters train_size and/or test_size. A float between 0 and 1 is treated as a proportion of the rows, while an integer is treated as an absolute number of rows. You only need to specify one of the two, since the other is set to its complement.

To do a 50:50 split:

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=1)
X_train
   A  B
0  3  6
1  4  7
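As a rough sketch of the other ways to set the split size, the example below uses test_size instead of train_size, first as a proportion and then as an integer row count. The _a and _b suffixes are illustrative names only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"A":[3,4,5,6],"B":[6,7,8,9],"C":[10,11,12,13]})
X = df.loc[:, ["A","B"]]
y = df.loc[:, "C"]

# test_size=0.5 leaves the other half of the rows for training
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.5, random_state=1
)

# An integer is read as a number of rows rather than a proportion
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=1, random_state=1
)

print(len(X_train_a), len(X_test_a))   # 2 2
print(len(X_train_b), len(X_test_b))   # 3 1
```

Note that test_size=1 (an integer) means one row, whereas a float such as 0.5 means half of the rows.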
Published by Isshin Inada