Converting a 2D numpy array to a structured array
I'm trying to convert a two-dimensional array into a structured array with named fields. I want each row in the 2D array to be a new record in the structured array. Unfortunately, nothing I've tried is working the way I expect.
I'm starting with:
>>> myarray = numpy.array([("Hello",2.5,3),("World",3.6,2)])
>>> print myarray
[['Hello' '2.5' '3']
['World' '3.6' '2']]
I want to convert to something that looks like this:
>>> newarray = numpy.array([("Hello",2.5,3),("World",3.6,2)], dtype=[("Col1","S8"),("Col2","f8"),("Col3","i8")])
>>> print newarray
[('Hello', 2.5, 3L) ('World', 3.6000000000000001, 2L)]
What I've tried:
>>> newarray = myarray.astype([("Col1","S8"),("Col2","f8"),("Col3","i8")])
>>> print newarray
[[('Hello', 0.0, 0L) ('2.5', 0.0, 0L) ('3', 0.0, 0L)]
[('World', 0.0, 0L) ('3.6', 0.0, 0L) ('2', 0.0, 0L)]]
>>> newarray = numpy.array(myarray, dtype=[("Col1","S8"),("Col2","f8"),("Col3","i8")])
>>> print newarray
[[('Hello', 0.0, 0L) ('2.5', 0.0, 0L) ('3', 0.0, 0L)]
[('World', 0.0, 0L) ('3.6', 0.0, 0L) ('2', 0.0, 0L)]]
Both of these approaches attempt to convert each entry in myarray into a record with the given dtype, so the extra zeros are inserted. I can't figure out how to get it to convert each row into a record.
Another attempt:
>>> newarray = myarray.copy()
>>> newarray.dtype = [("Col1","S8"),("Col2","f8"),("Col3","i8")]
>>> print newarray
[[('Hello', 1.7219343871178711e-317, 51L)]
[('World', 1.7543139673493688e-317, 50L)]]
This time no actual conversion is performed. The existing data in memory is just re-interpreted as the new data type.
The array that I'm starting with is being read in from a text file. The data types are not known ahead of time, so I can't set the dtype at the time of creation. I need a high-performance and elegant solution that will work well for general cases since I will be doing this type of conversion many, many times for a large variety of applications.
Thanks!
Solution 1:
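You can "create a record array from a (flat) list of arrays" using numpy.core.records.fromarrays (also exposed under the public name numpy.rec.fromarrays, which is the spelling to prefer on recent NumPy releases) as follows: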
>>> import numpy as np
>>> myarray = np.array([("Hello",2.5,3),("World",3.6,2)])
>>> print myarray
[['Hello' '2.5' '3']
['World' '3.6' '2']]
>>> newrecarray = np.core.records.fromarrays(myarray.transpose(),
...                                          names='col1, col2, col3',
...                                          formats='S8, f8, i8')
>>> print newrecarray
[('Hello', 2.5, 3) ('World', 3.5999999046325684, 2)]
I was trying to do something similar. I found that when numpy creates a structured array from an existing 2D array (using np.core.records.fromarrays), it treats each row of the 2D array as one field (column) of the result rather than as one record, so you have to transpose the array first. This does not seem very intuitive at first, but it follows from what the function expects: a list of arrays, one per field.
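For readers on Python 3, the same idea can be sketched with the np.rec.fromarrays spelling. This is only an illustration: the input is written out directly as a 2D string array (an assumption standing in for the data read from the text file), and the field names and formats are the ones used above.

```python
import numpy as np

# 2D array of strings: one row per record, one column per field.
arr = np.array([["Hello", "2.5", "3"],
                ["World", "3.6", "2"]])

# fromarrays expects one array per FIELD, so pass the transpose:
# arr.T[0] holds the names, arr.T[1] the floats, arr.T[2] the ints.
rec = np.rec.fromarrays(arr.T,
                        names="col1, col2, col3",
                        formats="S8, f8, i8")

print(rec.dtype.names)   # field names taken from `names`
print(rec["col2"])       # the second column, now stored as float64
```

Each string column is cast to the field's type when it is copied in, so the text "3.6" ends up as a float and "2" as an integer.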
Solution 2:
If the data starts as a list of tuples, then creating a structured array is straightforward:
In [228]: alist = [("Hello",2.5,3),("World",3.6,2)]
In [229]: dt = [("Col1","S8"),("Col2","f8"),("Col3","i8")]
In [230]: np.array(alist, dtype=dt)
Out[230]:
array([(b'Hello', 2.5, 3), (b'World', 3.6, 2)],
dtype=[('Col1', 'S8'), ('Col2', '<f8'), ('Col3', '<i8')])
The complication here is that the list of tuples has been turned into a 2d string array:
In [231]: arr = np.array(alist)
In [232]: arr
Out[232]:
array([['Hello', '2.5', '3'],
['World', '3.6', '2']],
dtype='<U5')
We could use the well-known zip(*...) approach to 'transposing' this array. Since it is applied to arr.T, it is actually a double transpose, which gives back the original rows as tuples:
In [234]: list(zip(*arr.T))
Out[234]: [('Hello', '2.5', '3'), ('World', '3.6', '2')]
zip has conveniently given us a list of tuples. Now we can recreate the array with the desired dtype:
In [235]: np.array(_, dtype=dt)
Out[235]:
array([(b'Hello', 2.5, 3), (b'World', 3.6, 2)],
dtype=[('Col1', 'S8'), ('Col2', '<f8'), ('Col3', '<i8')])
The accepted answer uses fromarrays:
In [236]: np.rec.fromarrays(arr.T, dtype=dt)
Out[236]:
rec.array([(b'Hello', 2.5, 3), (b'World', 3.6, 2)],
dtype=[('Col1', 'S8'), ('Col2', '<f8'), ('Col3', '<i8')])
Internally, fromarrays takes a common recfunctions approach: create the target array, then copy values by field name. Effectively it does:
In [237]: newarr = np.empty(arr.shape[0], dtype=dt)
In [238]: for n, v in zip(newarr.dtype.names, arr.T):
     ...:     newarr[n] = v
     ...:
In [239]: newarr
Out[239]:
array([(b'Hello', 2.5, 3), (b'World', 3.6, 2)],
dtype=[('Col1', 'S8'), ('Col2', '<f8'), ('Col3', '<i8')])
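Since the question asks for something reusable, the copy-by-field-name loop can be wrapped in a small helper. This is a minimal sketch, not part of NumPy: the function name rows_to_structured and its shape check are hypothetical, and it assumes the dtype has exactly one field per column of the input.

```python
import numpy as np

def rows_to_structured(arr2d, dtype):
    """Return a 1D structured array with one record per row of arr2d.

    Hypothetical helper: assumes one field in `dtype` per column of arr2d.
    """
    dtype = np.dtype(dtype)
    arr2d = np.asarray(arr2d)
    if arr2d.ndim != 2 or arr2d.shape[1] != len(dtype.names):
        raise ValueError("expected a 2D array with one column per field")
    out = np.empty(arr2d.shape[0], dtype=dtype)
    # arr2d.T iterates over columns; each one is cast on assignment
    # to the type of the field it is copied into.
    for name, column in zip(dtype.names, arr2d.T):
        out[name] = column
    return out

arr = np.array([["Hello", "2.5", "3"],
                ["World", "3.6", "2"]])
dt = [("Col1", "S8"), ("Col2", "f8"), ("Col3", "i8")]
newarr = rows_to_structured(arr, dt)
print(newarr)
```

The loop runs once per field rather than once per row, so the per-element work stays inside NumPy's own assignment code.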
Solution 3:
I think np.core.records.fromrecords is what you want:
new_array = np.core.records.fromrecords([("Hello",2.5,3),("World",3.6,2)],
                                        names='Col1,Col2,Col3',
                                        formats='S8,f8,i8')
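If the data is already a 2D string array, as in the question, the rows can be turned into tuples first and then passed to fromrecords. The following is a sketch under that assumption, using the np.rec.fromrecords spelling; the array contents are copied from the question.

```python
import numpy as np

# 2D string array, as produced by np.array() on mixed-type tuples.
arr = np.array([["Hello", "2.5", "3"],
                ["World", "3.6", "2"]])

# fromrecords wants one tuple per record, so convert each row.
rows = [tuple(row) for row in arr]

rec = np.rec.fromrecords(rows,
                         names="Col1,Col2,Col3",
                         formats="S8,f8,i8")

print(rec.Col2)   # record arrays also allow attribute access to fields
```

Unlike fromarrays, no transpose is needed here, because fromrecords takes the data record by record instead of field by field.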