Transform a SciPy sparse CSR matrix to pandas?
If A is a csr_matrix, you can use .toarray() (there's also .todense(), which produces a numpy matrix and also works with the DataFrame constructor):
df = pd.DataFrame(A.toarray())
You can then use this with pd.concat().
A = csr_matrix([[1, 0, 2], [0, 3, 0]])
(0, 0) 1
(0, 2) 2
(1, 1) 3
<class 'scipy.sparse.csr.csr_matrix'>
pd.DataFrame(A.todense())
   0  1  2
0  1  0  2
1  0  3  0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0 2 non-null int64
1 2 non-null int64
2 2 non-null int64
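As a minimal sketch of the pd.concat() step mentioned above: the second frame (`other`) and the column names `f0`..`f2` are made up for illustration and are not part of the original question.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical inputs: a small sparse matrix and an unrelated dense frame.
A = csr_matrix([[1, 0, 2], [0, 3, 0]])
other = pd.DataFrame({"label": ["x", "y"]})

# Densify the sparse matrix; the column names are illustrative only.
dense_part = pd.DataFrame(A.toarray(), columns=["f0", "f1", "f2"])

# Both frames carry the default RangeIndex, so axis=1 aligns them row by row.
combined = pd.concat([other, dense_part], axis=1)
print(combined)
```

Note that .toarray() materializes every zero, so this only makes sense when the dense result fits in memory.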
As of version 0.20, pandas could build its sparse data structures, including the SparseDataFrame, directly from SciPy sparse matrices.
In pandas 1.0, SparseDataFrame was removed:
In older versions of pandas, the SparseSeries and SparseDataFrame classes were the preferred way to work with sparse data. With the advent of extension arrays, these subclasses are no longer needed. Their purpose is better served by using a regular Series or DataFrame with sparse values instead.
The migration guide shows how to use these new data structures.
For instance, to create a DataFrame from a sparse matrix:
import pandas as pd
from scipy.sparse import csr_matrix
A = csr_matrix([[1, 0, 2], [0, 3, 0]])
df = pd.DataFrame.sparse.from_spmatrix(A, columns=['A', 'B', 'C'])
df
   A  B  C
0  1  0  2
1  0  3  0
df.dtypes
A    Sparse[int64, 0]
B    Sparse[int64, 0]
C    Sparse[int64, 0]
dtype: object
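To get a rough feel for why the sparse-backed DataFrame is worth using, the sketch below compares its memory footprint with a dense one. The matrix size and the number of non-zero entries are arbitrary choices for illustration, and the exact byte counts will vary by platform and pandas version.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Build a mostly-zero 1000 x 100 matrix with at most 500 non-zero entries.
rng = np.random.default_rng(0)
dense = np.zeros((1000, 100))
rows = rng.integers(0, 1000, size=500)
cols = rng.integers(0, 100, size=500)
dense[rows, cols] = 1.0
A = csr_matrix(dense)

sparse_df = pd.DataFrame.sparse.from_spmatrix(A)
dense_df = pd.DataFrame(A.toarray())

# memory_usage() returns bytes per column (plus the index); sum for a total.
sparse_bytes = sparse_df.memory_usage().sum()
dense_bytes = dense_df.memory_usage().sum()
print(sparse_bytes, dense_bytes)
```

Only the non-fill values and their positions are stored per column, so the saving grows with the share of zeros.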
Alternatively, you can pass sparse matrices to sklearn to avoid running out of memory when converting back to pandas. Just convert your other data to sparse format by passing a numpy array to the scipy.sparse.csr_matrix constructor and use scipy.sparse.hstack to combine (see docs).
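A minimal sketch of that approach, assuming the encoded features are already a csr_matrix and the remaining features live in a numpy array. The names `encoded` and `other` and their values are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-in for encoder output that is already sparse.
encoded = csr_matrix([[1, 0, 2], [0, 3, 0]])

# Hypothetical remaining features held in a dense numpy array.
other = np.array([[10.0, 0.5], [20.0, 1.5]])

# Convert the dense block to CSR, then stack the two blocks column-wise.
other_sparse = csr_matrix(other)
combined = hstack([encoded, other_sparse], format="csr")
print(combined.shape)
```

The result stays sparse throughout, so it can be handed to estimators that accept SciPy sparse input without ever building the full dense array.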
UPDATE for Pandas 1.0+
Per the Pandas Sparse data structures documentation, SparseDataFrame and SparseSeries have been removed.
Sparse Pandas Dataframes
Previous Way
pd.SparseDataFrame({"A": [0, 1]})
New Way
pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})
Working with SciPy sparse csr_matrix
Previous Way
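from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd
matrix = csr_matrix((3, 4), dtype=np.int8)
# Only runs on pandas < 1.0; one column name per matrix column
df = pd.SparseDataFrame(matrix, columns=['A', 'B', 'C', 'D'])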
New Way
from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd
matrix = csr_matrix((3, 4), dtype=np.int8)
df = pd.DataFrame.sparse.from_spmatrix(matrix, columns=['A', 'B', 'C', 'D'])
df.dtypes
Output:
A    Sparse[int8, 0]
B    Sparse[int8, 0]
C    Sparse[int8, 0]
D    Sparse[int8, 0]
dtype: object
Conversion from Sparse to Dense
df.sparse.to_dense()
Output:
   A  B  C  D
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0
Sparse Properties
df.sparse.density
Output:
0.0
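The empty matrix above makes the density trivially zero. As a small sketch with a non-empty matrix (the one from the earlier answers), the snippet below also converts the DataFrame back to SciPy with the .sparse.to_coo() accessor method.

```python
import pandas as pd
from scipy.sparse import csr_matrix

A = csr_matrix([[1, 0, 2], [0, 3, 0]])
df = pd.DataFrame.sparse.from_spmatrix(A, columns=["A", "B", "C"])

# Share of stored (non-fill) values: 3 of the 6 cells here.
density = df.sparse.density

# Back to SciPy: to_coo() gives a COO matrix, which converts cheaply to CSR.
back = df.sparse.to_coo().tocsr()
print(density, back.shape)
```

to_coo() expects every column to be sparse, so it is meant for frames built this way rather than for mixed dense and sparse frames.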
You could also avoid getting back a sparse matrix in the first place by setting the parameter sparse to False when creating the Encoder (scikit-learn 1.2 renamed this parameter to sparse_output).
The documentation of the OneHotEncoder states:
sparse : boolean, default=True
Will return sparse matrix if set True else will return an array.
Then you can again call the DataFrame constructor to transform the numpy array to a DataFrame.
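A hedged sketch of that route: the input frame and the `color_` column prefix are invented for illustration, and the try/except only exists to cope with the parameter rename between scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical input with a single categorical column.
data = pd.DataFrame({"color": ["red", "green", "red"]})

# scikit-learn >= 1.2 calls the keyword `sparse_output`; older releases use `sparse`.
try:
    enc = OneHotEncoder(sparse_output=False)
except TypeError:
    enc = OneHotEncoder(sparse=False)

# With dense output requested, fit_transform returns a plain numpy array.
encoded = enc.fit_transform(data[["color"]])

# Name the columns after the learned categories (sorted by the encoder).
columns = [f"color_{c}" for c in enc.categories_[0]]
df = pd.DataFrame(encoded, columns=columns)
print(df)
```

This trades memory for convenience, so it fits best when the number of categories is small.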