How do I create test and train samples from one dataframe with pandas?

Solution 1:

scikit-learn's train_test_split is a good choice. It splits both NumPy arrays and DataFrames.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
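
If you need the same split on every run, you can pass random_state. A minimal sketch, where the toy DataFrame, the column names and the seed 42 are only assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 columns
df = pd.DataFrame(np.random.randn(100, 2), columns=["a", "b"])

# random_state fixes the shuffle, so the same rows land in each part every run
train, test = train_test_split(df, test_size=0.2, random_state=42)

print(len(train), len(test))
```

With 100 rows and test_size=0.2 this gives 80 training rows and 20 test rows, and no row appears in both.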

Solution 2:

I would just use numpy's rand to build a random boolean mask:

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

And just to see this has worked (the split is only approximately 80/20, because each row is assigned independently):

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79
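
The mask also changes on every run. A rough sketch of a repeatable variant, assuming NumPy's Generator API (np.random.default_rng) and an arbitrary seed of 0, neither of which is part of the answer above:

```python
import numpy as np
import pandas as pd

# Seeded generator so the mask is identical on every run (seed 0 is arbitrary)
rng = np.random.default_rng(0)

df = pd.DataFrame(rng.standard_normal((100, 2)))

# Each row independently goes to train with probability 0.8
msk = rng.random(len(df)) < 0.8
train = df[msk]
test = df[~msk]

print(len(train), len(test))
```

The two parts always add up to the full DataFrame, but their exact sizes depend on the random draw.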

Solution 3:

pandas' DataFrame.sample will also work:

train = df.sample(frac=0.8, random_state=200)  # random_state is a seed value
test = df.drop(train.index)
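
One caveat: df.drop(train.index) drops rows by label, so it assumes the index labels are unique. A small sketch with a made-up DataFrame that has duplicate labels, resetting the index before splitting:

```python
import pandas as pd

# Hypothetical data with duplicate index labels
df = pd.DataFrame({"x": range(10)}, index=[0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

# Give every row a unique label so that drop() removes only the sampled rows
df = df.reset_index(drop=True)

train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)

print(len(train), len(test))
```

Without the reset_index call, dropping a duplicated label would remove every row that shares it, and the test set would come out too small.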

Solution 4:

I would use scikit-learn's own train_test_split, and generate the split from the index:

from sklearn.model_selection import train_test_split


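y = df.pop('output')  # removes the 'output' column from df and returns it
X = df

# Split the index labels rather than the rows themselves
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
X.loc[X_train]  # returns the training DataFrame; .loc because X_train holds index labels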
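
Since the split returns index labels, label-based lookup with .loc is the safe way to get the rows back. Position-based .iloc only matches on a default 0..n-1 index. A minimal sketch with a made-up DataFrame whose labels start at 100. The names idx_train and idx_test are my own, and the index is converted with to_numpy() as a conservative choice:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with a non-default index (labels 100..109)
df = pd.DataFrame(
    {"feature": np.arange(10.0), "output": np.arange(10) % 2},
    index=np.arange(100, 110),
)

y = df.pop("output")  # target column, removed from df in place
X = df

# Split the labels, not the rows
idx_train, idx_test, y_train, y_test = train_test_split(
    X.index.to_numpy(), y, test_size=0.2, random_state=0
)

X_train = X.loc[idx_train]  # label-based lookup
X_test = X.loc[idx_test]

print(X_train.shape, X_test.shape)
```

Here X.iloc[idx_train] would raise an IndexError, because labels like 105 are out of range as positions in a 10-row frame.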