Constructing 3D Pandas DataFrame
I'm having difficulty constructing a 3D DataFrame in Pandas. I want something like this
A B C
start end start end start end ...
7 20 42 52 90 101
11 21 213 34
56 74 9 45
45 12
Where A
, B
, etc are the top-level descriptors and start
and end
are subdescriptors. The numbers that follow are in pairs and there aren't the same number of pairs for A
, B
etc. Observe that A
has four such pairs, B
has only 1, and C
has 3.
I'm not sure how to proceed in constructing this DataFrame. Modifying this example didn't give me the designed output:
import numpy as np
import pandas as pd
A = np.array(['one', 'one', 'two', 'two', 'three', 'three'])
B = np.array(['start', 'end']*3)
C = [np.random.randint(10, 99, 6)]*6
df = pd.DataFrame(zip(A, B, C), columns=['A', 'B', 'C'])
df.set_index(['A', 'B'], inplace=True)
df
yielded:
C
A B
one start [22, 19, 16, 20, 63, 54]
end [22, 19, 16, 20, 63, 54]
two start [22, 19, 16, 20, 63, 54]
end [22, 19, 16, 20, 63, 54]
three start [22, 19, 16, 20, 63, 54]
end [22, 19, 16, 20, 63, 54]
Is there any way of breaking up the lists in C into their own columns?
EDIT: The structure of my C
is important. It looks like the following:
C = [[7,11,56,45], [20,21,74,12], [42], [52], [90,213,9], [101, 34, 45]]
And the desired output is the one at the top. It represents the starting and ending points of subsequences within a certain sequence (A
, B
. C
are the different sequences). Depending on the sequence itself, there are a differing number of subsequences that satisfy a given condition I'm looking for. As a result, there are a differing number of start:end pairs for A
, B
, etc
Solution 1:
First, I think you need to fill C to represent missing values
In [341]: max_len = max(len(sublist) for sublist in C)
In [344]: for sublist in C:
...: sublist.extend([np.nan] * (max_len - len(sublist)))
In [345]: C
Out[345]:
[[7, 11, 56, 45],
[20, 21, 74, 12],
[42, nan, nan, nan],
[52, nan, nan, nan],
[90, 213, 9, nan],
[101, 34, 45, nan]]
Then, convert to a numpy array, transpose, and pass to the DataFrame constructor along with the columns.
In [288]: C = np.array(C)
In [289]: df = pd.DataFrame(data=C.T, columns=pd.MultiIndex.from_tuples(zip(A,B)))
In [349]: df
Out[349]:
one two three
start end start end start end
0 7 20 42 52 90 101
1 11 21 NaN NaN 213 34
2 56 74 NaN NaN 9 45
3 45 12 NaN NaN NaN NaN
Solution 2:
As @Aaron mentioned in a comment above, panels have been deprecated. Also, @tlnagy mentioned his dataset would be likely to expand to more than 3 dimensions in the future.
This sounds like a good use-case for the xarray package, which provides semantically labelled arrays of arbitrarily many dimensions. Pandas and xarray have strong conversion support, and panels have been deprecated in favour of using xarray.
Initial setup of the problem.
import numpy as np
A = np.array([[7,11,56,45], [20,21,74,12]]).T
B = np.array([[42], [52]]).T
C = np.array([[90,213,9], [101, 34, 45]]).T
You can then create a three dimensional xarray.DataArray object like so:
import xarray
output_as_dataarray = xarray.concat(
[
xarray.DataArray(
X,
dims=["record", "edge"],
coords={"record": range(X.shape[0]), "edge": ["start", "end"]},
)
for X in (A, B, C)
],
dim="descriptor",
).assign_coords(descriptor=["A", "B", "C"])
We turn our three 2D numpy arrays into xarray.DataArray objects, and then concatenate them together along a new dimension.
Our output looks like so:
<xarray.DataArray (descriptor: 3, record: 4, edge: 2)>
array([[[ 7., 20.],
[ 11., 21.],
[ 56., 74.],
[ 45., 12.]],
[[ 42., 52.],
[ nan, nan],
[ nan, nan],
[ nan, nan]],
[[ 90., 101.],
[213., 34.],
[ 9., 45.],
[ nan, nan]]])
Coordinates:
* record (record) int64 0 1 2 3
* edge (edge) <U5 'start' 'end'
* descriptor (descriptor) <U1 'A' 'B' 'C'