DASK - AttributeError: 'DataFrame' object has no attribute 'sort_values'
I am just trying to sort a Dask DataFrame by a specific column.
CODE 1 - If I display it, it shows that it is indeed a Dask DataFrame
my_ddf
OUTPUT 1
npartitions=1
headers .....
CODE 2
my_ddf.sort_values('id', ascending=False)
OUTPUT 2
AttributeError Traceback (most recent call last)
<ipython-input-374-35ce4bd06557> in <module>
----> 1 my_ddf.sort_values('id', ascending=False) #.head(20)
2 # df.sort_values(columns, ascending=True)
~/anaconda3/envs/rapids/lib/python3.7/site-packages/dask/dataframe/core.py in __getattr__(self, key)
3619 return self[key]
3620 else:
-> 3621 raise AttributeError("'DataFrame' object has no attribute %r" % key)
3622
3623 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute 'sort_values'
Tried Solutions
- This is an example from the official Dask documentation:
df.sort_values(columns, ascending=False).head(n)
- pandas only - DataFrame object has no attribute 'sort_values'
- pandas only - 'DataFrame' object has no attribute 'sort'
- DASK answer - https://stackoverflow.com/a/40378896/10270590
- I don't want to set it as the index, because I want to keep my current index values.
- The following answer is a bit strange, and I am not sure it would work when I have more partitions (currently I have 1 because of a previous groupby on the data), how to avoid an arbitrary big number like "1000000000", or how to make the order increase from top to bottom in the Dask DataFrame (see the sketch after this list).
my_ddf.nlargest(1000000000, 'id').compute()
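For what it's worth, one way around the hard-coded number is to compute the row count first (len() on a Dask DataFrame triggers a computation), and nsmallest is the ascending counterpart of nlargest. A minimal sketch, with made-up data standing in for my_ddf:

import dask.dataframe as dd
import pandas as pd

# Made-up stand-in for my_ddf above.
my_ddf = dd.from_pandas(pd.DataFrame({"id": [3, 1, 2]}), npartitions=1)

# The actual row count replaces the magic 1000000000.
n = len(my_ddf)
descending = my_ddf.nlargest(n, "id").compute()
ascending = my_ddf.nsmallest(n, "id").compute()  # increasing from top to bottom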
Solution 1:
AFAIK, a sort across partitions is not implemented (yet?). If the dataset is small enough to fit in memory, you can do ddf = ddf.compute()
and then run the sort on the resulting pandas DataFrame.
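A minimal sketch of this approach, again with made-up data standing in for the question's my_ddf:

import dask.dataframe as dd
import pandas as pd

# Made-up stand-in for the question's my_ddf.
my_ddf = dd.from_pandas(pd.DataFrame({"id": [3, 1, 2]}), npartitions=1)

# Materialize everything as a single in-memory pandas DataFrame...
df = my_ddf.compute()

# ...then sort with the regular pandas API.
df = df.sort_values("id", ascending=False)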
Solution 2:
Dask indexes are not global anyway (by default). If you want to retain the original within-partition index, move it into a regular column before re-indexing, something like
df = df.reset_index().rename(columns={"index": "old_index"})
df = df.set_index("colA")
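An end-to-end sketch of that idea with made-up data; note that set_index shuffles the data, so the result also comes back globally sorted (ascending) by the new index:

import dask.dataframe as dd
import pandas as pd

# Made-up data; index values 10-12 play the role of the original index.
pdf = pd.DataFrame({"id": [3, 1, 2], "val": ["a", "b", "c"]}, index=[10, 11, 12])
ddf = dd.from_pandas(pdf, npartitions=1)

# Keep the old index as a regular column, then re-index on the sort key.
ddf = ddf.reset_index().rename(columns={"index": "old_index"})
ddf = ddf.set_index("id")

print(ddf.compute())  # rows come back ordered by "id": 1, 2, 3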