How to append rows in a pandas dataframe in a for loop?
I have the following for loop:
for i in links:
data = urllib2.urlopen(str(i)).read()
data = json.loads(data)
data = pd.DataFrame(data.items())
data = data.transpose()
data.columns = data.iloc[0]
data = data.drop(data.index[[0]])
Each dataframe so created has most columns in common with the others but not all of them. Moreover, they all have just one row. What I need to to is to add to the dataframe all the distinct columns and each row from each dataframe produced by the for loop
I tried pandas concatenate or similar but nothing seemed to work. Any idea? Thanks.
Suppose your data looks like this:
import pandas as pd
import numpy as np
np.random.seed(2015)
df = pd.DataFrame([])
for i in range(5):
data = dict(zip(np.random.choice(10, replace=False, size=5),
np.random.randint(10, size=5)))
data = pd.DataFrame(data.items())
data = data.transpose()
data.columns = data.iloc[0]
data = data.drop(data.index[[0]])
df = df.append(data)
print('{}\n'.format(df))
# 0 0 1 2 3 4 5 6 7 8 9
# 1 6 NaN NaN 8 5 NaN NaN 7 0 NaN
# 1 NaN 9 6 NaN 2 NaN 1 NaN NaN 2
# 1 NaN 2 2 1 2 NaN 1 NaN NaN NaN
# 1 6 NaN 6 NaN 4 4 0 NaN NaN NaN
# 1 NaN 9 NaN 9 NaN 7 1 9 NaN NaN
Then it could be replaced with
np.random.seed(2015)
data = []
for i in range(5):
data.append(dict(zip(np.random.choice(10, replace=False, size=5),
np.random.randint(10, size=5))))
df = pd.DataFrame(data)
print(df)
In other words, do not form a new DataFrame for each row. Instead, collect all the data in a list of dicts, and then call df = pd.DataFrame(data)
once at the end, outside the loop.
Each call to df.append
requires allocating space for a new DataFrame with one extra row, copying all the data from the original DataFrame into the new DataFrame, and then copying data into the new row. All that allocation and copying makes calling df.append
in a loop very inefficient. The time cost of copying grows quadratically with the number of rows. Not only is the call-DataFrame-once code easier to write, its performance will be much better -- the time cost of copying grows linearly with the number of rows.
There are 2 reasons you may append rows in a loop, 1. add to an existing df, and 2. create a new df.
to create a new df, I think its well documented that you should either create your data as a list and then create the data frame:
cols = ['c1', 'c2', 'c3']
lst = []
for a in range(2):
lst.append([1, 2, 3])
df1 = pd.DataFrame(lst, columns=cols)
df1
Out[3]:
c1 c2 c3
0 1 2 3
1 1 2 3
OR, Create the dataframe with an index and then add to it
cols = ['c1', 'c2', 'c3']
df2 = pd.DataFrame(columns=cols, index=range(2))
for a in range(2):
df2.loc[a].c1 = 4
df2.loc[a].c2 = 5
df2.loc[a].c3 = 6
df2
Out[4]:
c1 c2 c3
0 4 5 6
1 4 5 6
If you want to add to an existing dataframe, you could use either method above and then append the df's together (with or without the index):
df3 = df2.append(df1, ignore_index=True)
df3
Out[6]:
c1 c2 c3
0 4 5 6
1 4 5 6
2 1 2 3
3 1 2 3
Or, you can also create a list of dictionary entries and append those as in the answer above.
lst_dict = []
for a in range(2):
lst_dict.append({'c1':2, 'c2':2, 'c3': 3})
df4 = df1.append(lst_dict)
df4
Out[7]:
c1 c2 c3
0 1 2 3
1 1 2 3
0 2 2 3
1 2 2 3
Using the dict(zip(cols, vals)))
lst_dict = []
for a in range(2):
vals = [7, 8, 9]
lst_dict.append(dict(zip(cols, vals)))
df5 = df1.append(lst_dict)
Including the idea from the comment below:
It turns out Pandas does have an effective way to append to a dataframe:
df.loc( len(df) ) = [new, row, of, data]
(this) will "append" to the end of a dataframe in-place. – Demis Mar 22 at 15:32