Pandas version of rbind
In R, you can combine two dataframes by stacking the rows of one underneath the rows of the other using rbind. In pandas, how do you accomplish the same thing? It seems bizarrely difficult.
EDIT: I was creating the DataFrames in a stupid way, which was causing the issues. To all intents and purposes, append is rbind. See the answer below.
Using append results in a horrible mess including NaNs and things for reasons I don't understand. I'm just trying to "rbind" two identical frames that look like this:
0 1 2 3 4 5 6 7
0 ADN.L 20130220 437.4 442.37 436.5000 441.9000 2775364 2013-02-20 18:47:42
1 ADM.L 20130220 1279.0 1300.00 1272.0000 1285.0000 967730 2013-02-20 18:47:42
2 AGK.L 20130220 1717.0 1749.00 1709.0000 1739.0000 834534 2013-02-20 18:47:43
3 AMEC.L 20130220 1030.0 1040.00 1024.0000 1035.0000 1972517 2013-02-20 18:47:43
4 AAL.L 20130220 1998.0 2014.50 1942.4999 1951.0000 3666033 2013-02-20 18:47:44
5 ANTO.L 20130220 1093.0 1097.00 1064.7899 1068.0000 2183931 2013-02-20 18:47:44
6 ARM.L 20130220 941.5 965.10 939.4250 951.5001 2994652 2013-02-20 18:47:45
But I'm getting something horrible a la this:
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
0 NaN NaN NaN NaN NaN NaN NaN NaN ADN.L 20130220 437.4 442.37 436.5000 441.9000 2775364 2013-02-20 18:47:42
1 NaN NaN NaN NaN NaN NaN NaN NaN ADM.L 20130220 1279.0 1300.00 1272.0000 1285.0000 967730 2013-02-20 18:47:42
2 NaN NaN NaN NaN NaN NaN NaN NaN AGK.L 20130220 1717.0 1749.00 1709.0000 1739.0000 834534 2013-02-20 18:47:43
3 NaN NaN NaN NaN NaN NaN NaN NaN AMEC.L 20130220 1030.0 1040.00 1024.0000 1035.0000 1972517 2013-02-20 18:47:43
4 NaN NaN NaN NaN NaN NaN NaN NaN AAL.L 20130220 1998.0 2014.50 1942.4999 1951.0000 3666033 2013-02-20 18:47:44
5 NaN NaN NaN NaN NaN NaN NaN NaN ANTO.L 20130220 1093.0 1097.00 1064.7899 1068.0000 2183931 2013-02-20 18:47:44
6 NaN NaN NaN NaN NaN NaN NaN NaN ARM.L 20130220 941.5 965.10 939.4250 951.5001 2994652 2013-02-20 18:47:45
0 NaN NaN NaN NaN NaN NaN NaN NaN ADN.L 20130220 437.4 442.37 436.5000 441.9000 2775364 2013-02-20 18:47:42
1 NaN NaN NaN NaN NaN NaN NaN NaN ADM.L 20130220 1279.0 1300.00 1272.0000 1285.0000 967730 2013-02-20 18:47:42
2 NaN NaN NaN NaN NaN NaN NaN NaN AGK.L 20130220 1717.0 1749.00 1709.0000 1739.0000 834534 2013-02-20 18:47:43
3 NaN NaN NaN NaN NaN NaN NaN NaN
And I don't understand why. I'm starting to miss R :(
Ah, this is to do with how I created the DataFrame, not with how I was combining them. The long and the short of it is, if you are creating a frame using a loop and a statement that looks like this:
Frame = Frame.append(pandas.DataFrame(data = SomeNewLineOfData))
you must ignore the index:
Frame = Frame.append(pandas.DataFrame(data = SomeNewLineOfData), ignore_index=True)
or you will have issues later when combining data.
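For what it's worth, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same loop is usually written by collecting the pieces in a list and calling pd.concat once. A minimal sketch, assuming each new line of data arrives as a plain list (the `rows` values are placeholders borrowed from the question, and `pieces` and `frame` are names made up for illustration):

```python
import pandas as pd

# Hypothetical input: each element stands in for one "SomeNewLineOfData".
rows = [
    ["ADN.L", 20130220, 437.4],
    ["ADM.L", 20130220, 1279.0],
    ["AGK.L", 20130220, 1717.0],
]

# Build one single-row frame per line, then combine them in one call.
pieces = [pd.DataFrame([row]) for row in rows]

# ignore_index=True discards each piece's own index (all 0 here)
# and numbers the combined rows 0, 1, 2, ...
frame = pd.concat(pieces, ignore_index=True)

print(frame)
print(frame.index.tolist())
```

Calling pd.concat once at the end also avoids copying the growing frame on every iteration, which is what repeated appends do.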
pd.concat will serve the purpose of rbind in R.
import pandas as pd
df1 = pd.DataFrame({'col1': [1,2], 'col2':[3,4]})
df2 = pd.DataFrame({'col1': [5,6], 'col2':[7,8]})
print(df1)
print(df2)
print(pd.concat([df1, df2]))
The output looks like this:
col1 col2
0 1 3
1 2 4
col1 col2
0 5 7
1 6 8
col1 col2
0 1 3
1 2 4
0 5 7
1 6 8
If you read the documentation carefully, it also explains other operations such as cbind.
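As a rough illustration of the cbind case, pd.concat accepts axis=1 to place frames side by side instead of stacking them. This is only a sketch; `df3` and `wide` are names invented for the example:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df3 = pd.DataFrame({'col3': [5, 6], 'col4': [7, 8]})

# axis=1 puts the frames next to each other, matching rows by index label.
wide = pd.concat([df1, df3], axis=1)
print(wide)
```

Note that rows are matched by index label rather than by position, so frames with different indexes will produce NaNs where a label is missing from one side.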
This worked for me:
import numpy as np
import pandas as pd
dates = np.asarray(pd.date_range('1/1/2000', periods=8))
df1 = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df2 = df1.copy()
df = df1.append(df2)
Yields:
A B C D
2000-01-01 -0.327208 0.552500 0.862529 0.493109
2000-01-02 1.039844 -2.141089 -0.781609 1.307600
2000-01-03 -0.462831 0.066505 -1.698346 1.123174
2000-01-04 -0.321971 -0.544599 -0.486099 -0.283791
2000-01-05 0.693749 0.544329 -1.606851 0.527733
2000-01-06 -2.461177 -0.339378 -0.236275 0.155569
2000-01-07 -0.597156 0.904511 0.369865 0.862504
2000-01-08 -0.958300 -0.583621 -2.068273 0.539434
2000-01-01 -0.327208 0.552500 0.862529 0.493109
2000-01-02 1.039844 -2.141089 -0.781609 1.307600
2000-01-03 -0.462831 0.066505 -1.698346 1.123174
2000-01-04 -0.321971 -0.544599 -0.486099 -0.283791
2000-01-05 0.693749 0.544329 -1.606851 0.527733
2000-01-06 -2.461177 -0.339378 -0.236275 0.155569
2000-01-07 -0.597156 0.904511 0.369865 0.862504
2000-01-08 -0.958300 -0.583621 -2.068273 0.539434
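On pandas 2.0 and later, where DataFrame.append no longer exists, the same stacking can be written with pd.concat. A small sketch under that assumption, which also shows that the duplicated date labels are kept as they are:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=8)
df1 = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df2 = df1.copy()

# Same result as df1.append(df2) on older pandas versions.
df = pd.concat([df1, df2])

print(df.shape)            # 16 rows, 4 columns
print(df.index.is_unique)  # False: every date appears twice

# Selecting a duplicated label returns every row that carries it.
print(df.loc[pd.Timestamp('2000-01-01')])
```

If the repeated labels are not wanted, passing ignore_index=True to pd.concat replaces them with a fresh 0..15 range.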
If you aren't already using the latest version of pandas, I highly recommend upgrading. It is now possible to work with DataFrames that contain duplicate indices.
import pandas as pd
import numpy as np
If you have a DataFrame like this:
array = np.random.randint(0, 10, size=(2, 4))
df = pd.DataFrame(array, columns=['A', 'B', 'C', 'D'],
                  index=['10aa', '20bb'])  # some crazy indexes
df
A B C D
10aa 4 2 4 6
20bb 5 1 0 2
And you want to add a new row that is a list (or another iterable object):
List = [i**3 for i in range(df.shape[1]) ]
List
[0, 1, 8, 27]
You should transform the list into a dictionary whose keys are the DataFrame's columns, using the zip() function:
Dict = dict( zip(df.columns, List) )
Dict
{'A': 0, 'B': 1, 'C': 8, 'D': 27}
Then you can use the append() method to add the new dictionary:
df = df.append(Dict, ignore_index=True)
df
A B C D
0 4 2 4 6
1 5 1 0 2
2 0 1 8 27
N.B. the original index labels ('10aa', '20bb') are dropped because of ignore_index=True.
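If the existing labels need to survive, one possible approach (a sketch, not the only way) is to wrap the list in a one-row DataFrame with its own label and stack it with pd.concat. This also works on pandas 2.0+, where append() has been removed. The label '30cc' and the names `new_row` and `row_df` are made up for the example, and the starting values are fixed rather than random so the result is reproducible:

```python
import pandas as pd

df = pd.DataFrame([[4, 2, 4, 6], [5, 1, 0, 2]],
                  columns=['A', 'B', 'C', 'D'],
                  index=['10aa', '20bb'])

new_row = [i**3 for i in range(df.shape[1])]  # [0, 1, 8, 27]

# One-row frame with matching columns; '30cc' is a hypothetical label.
row_df = pd.DataFrame([new_row], columns=df.columns, index=['30cc'])

# Stack it under the original frame; existing labels are left untouched.
df = pd.concat([df, row_df])
print(df)
```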
And yeah, it's not as simple as rbind() in R :(