Sample datasets in Pandas
Since I originally wrote this answer, I have updated it with the many ways that are now available for accessing sample data sets in Python. Personally, I tend to stick with whatever package I am already using (usually seaborn or pandas). If you need offline access, installing the data set with Quilt seems to be the only option.
Seaborn
The brilliant plotting package seaborn has several built-in sample data sets.
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
Pandas
If you do not want to import seaborn but still want to access its sample data sets, you can use @andrewwowens's approach for the seaborn sample data:
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
Note that sns.load_dataset() modifies the column type of sample data sets that contain categorical columns, so its result might differ from what you get by reading the URL directly. The iris and tips sample data sets are also available in the pandas GitHub repo.
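To illustrate the column-type difference mentioned above, here is a minimal sketch that reads a small CSV and then converts a text column to a categorical one by hand. The inline CSV text and its column names are made up for illustration, and the category order is an assumption; which columns `sns.load_dataset()` actually converts depends on the data set.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded CSV file (not the real tips data).
csv_text = """total_bill,day
16.99,Sun
10.34,Sat
21.01,Sun
"""

tips = pd.read_csv(io.StringIO(csv_text))
# Straight from read_csv, `day` is a plain text column, not a categorical one.
dtype_before = str(tips["day"].dtype)

# Convert manually, giving the categories an explicit (assumed) order.
tips["day"] = pd.Categorical(tips["day"], categories=["Sat", "Sun"])
dtype_after = str(tips["day"].dtype)

print(dtype_before, "->", dtype_after)
```

If you rely on categorical behavior (for example, a fixed ordering in plots), a conversion along these lines is probably needed after reading the CSV directly.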
R sample datasets
Since any dataset can be read via pd.read_csv(), it is possible to access all of R's sample data sets by copying the URLs from this R data set repository.
Additional ways of loading the R sample data sets include statsmodels
import statsmodels.api as sm
iris = sm.datasets.get_rdataset('iris').data
and PyDataset
from pydataset import data
iris = data('iris')
scikit-learn
scikit-learn returns sample data as numpy arrays rather than a pandas data frame.
from sklearn.datasets import load_iris
iris = load_iris()
# `iris.data` holds the numerical values
# `iris.feature_names` holds the numerical column names
# `iris.target` holds the categorical (species) values (as ints)
# `iris.target_names` holds the unique categorical names
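If you would rather have a single data frame, the four attributes listed above can be combined. The sketch below is one possible approach, not scikit-learn's own API: `arrays_to_frame` is a hypothetical helper, the `species` column name is my choice, and the tiny arrays are stand-ins so that the example runs without scikit-learn installed.

```python
import numpy as np
import pandas as pd


def arrays_to_frame(data, feature_names, target, target_names,
                    target_column="species"):
    """Combine scikit-learn style arrays into one DataFrame (hypothetical helper)."""
    df = pd.DataFrame(data, columns=feature_names)
    # Map the integer codes in `target` to their names as a categorical column.
    df[target_column] = pd.Categorical.from_codes(target, categories=target_names)
    return df


# Tiny stand-in arrays shaped like the attributes described above.
data = np.array([[5.1, 3.5], [7.0, 3.2], [6.3, 3.3]])
target = np.array([0, 1, 2])
frame = arrays_to_frame(
    data,
    ["sepal length (cm)", "sepal width (cm)"],
    target,
    ["setosa", "versicolor", "virginica"],
)
print(frame)
```

With the `iris` object loaded above, the call would be `arrays_to_frame(iris.data, iris.feature_names, iris.target, iris.target_names)`.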
Quilt
Quilt is a dataset manager that includes many common sample datasets, such as several from the uciml sample repository. The quick start page shows how to install and import the iris data set:
# In your terminal
$ pip install quilt
$ quilt install uciml/iris
After installing a dataset, it is accessible locally, so this is the best option if you want to work with the data offline.
import quilt.data.uciml.iris as ir
iris = ir.tables.iris()
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
Quilt also supports dataset versioning and includes a short description of each dataset.
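The built-in pandas testing DataFrames are very convenient. Note that pandas.util.testing has been deprecated since pandas 1.0, so these helpers may be missing from recent releases.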
makeMixedDataFrame():
In [22]: import pandas as pd
In [23]: pd.util.testing.makeMixedDataFrame()
Out[23]:
A B C D
0 0.0 0.0 foo1 2009-01-01
1 1.0 1.0 foo2 2009-01-02
2 2.0 0.0 foo3 2009-01-05
3 3.0 1.0 foo4 2009-01-06
4 4.0 0.0 foo5 2009-01-07
Other testing DataFrame options:
makeDataFrame():
In [24]: pd.util.testing.makeDataFrame().head()
Out[24]:
A B C D
acKoIvMLwE 0.121895 -0.781388 0.416125 -0.105779
jc6UQeOO1K -0.542400 2.210908 -0.536521 -1.316355
GlzjJESv7a 0.921131 -0.927859 0.995377 0.005149
CMhwowHXdW 1.724349 0.604531 -1.453514 -0.289416
ATr2ww0ctj 0.156038 0.597015 0.977537 -1.498532
makeMissingDataframe():
In [27]: pd.util.testing.makeMissingDataframe().head()
Out[27]:
A B C D
qyXLpmp1Zg -1.034246 1.050093 NaN NaN
v7eFDnbQko 0.581576 1.334046 -0.576104 -0.579940
fGiibeTEjx -1.166468 -1.146750 -0.711950 -0.205822
Q8ETSRa6uY 0.461845 -2.112087 0.167380 -0.466719
7XBSChaOyL -1.159962 -1.079996 1.585406 -1.411159
makeTimeDataFrame():
In [28]: pd.util.testing.makeTimeDataFrame().head()
Out[28]:
A B C D
2000-01-03 -0.641226 0.912964 0.308781 0.551329
2000-01-04 0.364452 -0.722959 0.322865 0.426233
2000-01-05 1.042171 0.005285 0.156562 0.978620
2000-01-06 0.749606 -0.128987 -0.312927 0.481170
2000-01-07 0.945844 -0.854273 0.935350 1.165401
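In case the `pd.util.testing` helpers are not available in your installed pandas version, similar frames can be put together from public pandas and numpy calls. This is only a rough sketch: `make_mixed_frame` and `make_time_frame` are hypothetical names, the values are chosen to resemble the outputs shown above, and the 30-row default is an assumption.

```python
import numpy as np
import pandas as pd


def make_mixed_frame():
    """Small frame with float, string, and datetime columns."""
    return pd.DataFrame({
        "A": [0.0, 1.0, 2.0, 3.0, 4.0],
        "B": [0.0, 1.0, 0.0, 1.0, 0.0],
        "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],
        # Business days only, so the weekend of 2009-01-03/04 is skipped.
        "D": pd.bdate_range("2009-01-01", periods=5),
    })


def make_time_frame(nrows=30, seed=0):
    """Random floats in columns A-D, indexed by consecutive business days."""
    rng = np.random.default_rng(seed)
    index = pd.bdate_range("2000-01-03", periods=nrows)
    return pd.DataFrame(rng.standard_normal((nrows, 4)),
                        index=index, columns=list("ABCD"))


mixed = make_mixed_frame()
ts = make_time_frame()
print(mixed)
print(ts.head())
```

Passing a fixed `seed` keeps the random values reproducible between runs, which helps when the frame is used in a question or a test.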