Adding meta-information/metadata to pandas DataFrame

Solution 1:

Sure, like most Python objects, you can attach new attributes to a pandas.DataFrame:

import pandas as pd
df = pd.DataFrame([])
df.instrument_name = 'Binky'

Note, however, that while you can attach attributes to a DataFrame, operations performed on the DataFrame (such as groupby, pivot, join or loc to name just a few) may return a new DataFrame without the metadata attached. Pandas does not yet have a robust method of propagating metadata attached to DataFrames.
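To make the caveat concrete, here is a minimal sketch of the attribute being dropped. The column names and values are made up for illustration; only instrument_name comes from the example above.

```python
import pandas as pd

# Illustrative data; the column names are assumptions, not from the question.
df = pd.DataFrame({"sensor": ["a", "a", "b"], "reading": [1.0, 2.0, 3.0]})
df.instrument_name = "Binky"  # plain Python attribute on this one object

# groupby + sum builds a brand-new DataFrame object
totals = df.groupby("sensor").sum()

print(hasattr(df, "instrument_name"))      # the original still carries it
print(hasattr(totals, "instrument_name"))  # the result does not
```

The attribute lives on the original object only, so anything that constructs a new DataFrame starts without it.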

Preserving the metadata in a file is possible. You can find an example of how to store metadata in an HDF5 file here.

Solution 2:

As of pandas 1.0, there is a DataFrame.attrs property. It is experimental, but this is probably what you'll want in the future. For example:

import pandas as pd
df = pd.DataFrame([])
df.attrs['instrument_name'] = 'Binky'

Find it in the docs here.
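A quick way to check whether attrs follows your data is to look at the result of an operation. The sketch below only checks copy(); since attrs is experimental, which other operations carry it over may differ between pandas versions, so treat this as a starting point rather than a guarantee. The reading column is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"reading": [1.0, 2.0, 3.0]})  # illustrative data
df.attrs["instrument_name"] = "Binky"

# copy() returns a new DataFrame; inspect whether attrs came along
copied = df.copy()
print(copied.attrs)
```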

Trying this out with to_parquet and then read_parquet, it doesn't seem to persist, so be sure to check that against your use case.

Solution 3:

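Just ran into this issue myself. As of pandas 0.13, DataFrames have a _metadata attribute: a list of attribute names that pandas copies onto the new DataFrames returned by many methods. It is aimed mainly at subclasses of DataFrame, so a name has to be listed in _metadata before it will persist. Also seems to survive serialization just fine (I've only tried JSON, but I imagine HDF is covered as well).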
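For what it's worth, here is a minimal sketch of how _metadata is typically used, assuming the subclassing pattern described in the pandas extension docs: the attribute name is listed in _metadata on a subclass, and _constructor is overridden so that results keep the subclass type. InstrumentFrame and the column names are hypothetical, chosen only for this example.

```python
import pandas as pd


class InstrumentFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass that carries an instrument name."""

    # Attribute names pandas should copy onto derived frames
    _metadata = ["instrument_name"]

    @property
    def _constructor(self):
        # Make methods that build new frames return InstrumentFrame
        return InstrumentFrame


df = InstrumentFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.instrument_name = "Binky"

# Column selection returns a new object; the listed attribute is carried over
subset = df[["a"]]
print(type(subset).__name__, subset.instrument_name)
```

Not every pandas operation is guaranteed to propagate the attribute, so it is worth testing the specific methods you rely on.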

Solution 4:

Not really. Although you could add attributes containing metadata to the DataFrame class as @unutbu mentions, many DataFrame methods return a new DataFrame, so your metadata would be lost. If you need to manipulate your DataFrame, then the best option would be to wrap your metadata and DataFrame in another class. See this discussion on GitHub: https://github.com/pydata/pandas/issues/2485
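As a rough sketch of the wrapper idea (not the design discussed in the GitHub issue), something like the following keeps the metadata next to the DataFrame and re-attaches it after each transformation. AnnotatedFrame, its pipe method, and the sample data are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable

import pandas as pd


@dataclass
class AnnotatedFrame:
    """Hypothetical wrapper pairing a DataFrame with a metadata dict."""

    df: pd.DataFrame
    metadata: dict = field(default_factory=dict)

    def pipe(self, func: Callable[[pd.DataFrame], pd.DataFrame]) -> "AnnotatedFrame":
        # Apply func to the wrapped frame and carry a copy of the metadata along
        return AnnotatedFrame(func(self.df), dict(self.metadata))


wrapped = AnnotatedFrame(
    pd.DataFrame({"sensor": ["a", "a", "b"], "reading": [1.0, 2.0, 3.0]}),
    {"instrument_name": "Binky"},
)

# The groupby result is a new DataFrame, but the wrapper keeps the metadata
summed = wrapped.pipe(lambda d: d.groupby("sensor").sum())
print(summed.metadata)
```

The trade-off is that every operation has to go through the wrapper (or you unwrap, operate, and re-wrap by hand), which is why this suits cases where the set of transformations is small and known.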

There is currently an open pull request to add a MetaDataFrame object, which would support metadata better.