Pandas convert dataframe to array of tuples
How about:
subset = data_set[['data_date', 'data_1', 'data_2']]
tuples = [tuple(x) for x in subset.to_numpy()]
for pandas < 0.24 use
tuples = [tuple(x) for x in subset.values]
list(data_set.itertuples(index=False))
As of 17.1, the above will return a list of namedtuples.
If you want a list of ordinary tuples, pass name=None
as an argument:
list(data_set.itertuples(index=False, name=None))
A generic way:
[tuple(x) for x in data_set.to_records(index=False)]
Motivation
Many data sets are large enough that we need to concern ourselves with speed/efficiency. So I offer this solution in that spirit. It happens to also be succinct.
For the sake of comparison, let's drop the index
column
df = data_set.drop('index', 1)
Solution
I'll propose the use of zip
and map
list(zip(*map(df.get, df)))
[('2012-02-17', 24.75, 25.03),
('2012-02-16', 25.0, 25.07),
('2012-02-15', 24.99, 25.15),
('2012-02-14', 24.68, 25.05),
('2012-02-13', 24.62, 24.77),
('2012-02-10', 24.38, 24.61)]
It happens to also be flexible if we wanted to deal with a specific subset of columns. We'll assume the columns we've already displayed are the subset we want.
list(zip(*map(df.get, ['data_date', 'data_1', 'data_2'])))
[('2012-02-17', 24.75, 25.03),
('2012-02-16', 25.0, 25.07),
('2012-02-15', 24.99, 25.15),
('2012-02-14', 24.68, 25.05),
('2012-02-13', 24.62, 24.77),
('2012-02-10', 24.38, 24.61)]
What is Quicker?
Turn's out records
is quickest followed by asymptotically converging zipmap
and iter_tuples
I'll use a library simple_benchmarks
that I got from this post
from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()
import pandas as pd
import numpy as np
def tuple_comp(df): return [tuple(x) for x in df.to_numpy()]
def iter_namedtuples(df): return list(df.itertuples(index=False))
def iter_tuples(df): return list(df.itertuples(index=False, name=None))
def records(df): return df.to_records(index=False).tolist()
def zipmap(df): return list(zip(*map(df.get, df)))
funcs = [tuple_comp, iter_namedtuples, iter_tuples, records, zipmap]
for func in funcs:
b.add_function()(func)
def creator(n):
return pd.DataFrame({"A": random.randint(n, size=n), "B": random.randint(n, size=n)})
@b.add_arguments('Rows in DataFrame')
def argument_provider():
for n in (10 ** (np.arange(4, 11) / 2)).astype(int):
yield n, creator(n)
r = b.run()
Check the results
r.to_pandas_dataframe().pipe(lambda d: d.div(d.min(1), 0))
tuple_comp iter_namedtuples iter_tuples records zipmap
100 2.905662 6.626308 3.450741 1.469471 1.000000
316 4.612692 4.814433 2.375874 1.096352 1.000000
1000 6.513121 4.106426 1.958293 1.000000 1.316303
3162 8.446138 4.082161 1.808339 1.000000 1.533605
10000 8.424483 3.621461 1.651831 1.000000 1.558592
31622 7.813803 3.386592 1.586483 1.000000 1.515478
100000 7.050572 3.162426 1.499977 1.000000 1.480131
r.plot()
Here's a vectorized approach (assuming the dataframe, data_set
to be defined as df
instead) that returns a list
of tuples
as shown:
>>> df.set_index(['data_date'])[['data_1', 'data_2']].to_records().tolist()
produces:
[(datetime.datetime(2012, 2, 17, 0, 0), 24.75, 25.03),
(datetime.datetime(2012, 2, 16, 0, 0), 25.0, 25.07),
(datetime.datetime(2012, 2, 15, 0, 0), 24.99, 25.15),
(datetime.datetime(2012, 2, 14, 0, 0), 24.68, 25.05),
(datetime.datetime(2012, 2, 13, 0, 0), 24.62, 24.77),
(datetime.datetime(2012, 2, 10, 0, 0), 24.38, 24.61)]
The idea of setting datetime column as the index axis is to aid in the conversion of the Timestamp
value to it's corresponding datetime.datetime
format equivalent by making use of the convert_datetime64
argument in DF.to_records
which does so for a DateTimeIndex
dataframe.
This returns a recarray
which could be then made to return a list
using .tolist
More generalized solution depending on the use case would be:
df.to_records().tolist() # Supply index=False to exclude index