Why does pandas.DataFrame.skew() return 0 when the SD of a list of values is 0?

Background

Suppose there is a list of values that represents a person's activity over several hours. The person did not move at all during those hours, so all the values are 0.

What raised the question?

While searching on Google, I found the following formula for skewness. The same formula appears on several other sites. Its denominator includes the standard deviation (SD). For a list of identical non-zero values (e.g., [1, 1, 1]) and also for a list of zeros (i.e., [0, 0, 0]), the SD is 0. I therefore expected to get NaN (something divided by 0) for the skewness. Surprisingly, I get 0 when calling pandas.DataFrame.skew(). [Image: skewness formula]

My Question

Why does pandas.DataFrame.skew() return 0 when the SD of a list of values is 0?

Minimal Reproducible Example

import pandas as pd
ot_df = pd.DataFrame(data={'Day 1': [0, 0, 0, 0, 0, 0],
                           'Day 2': [0, 0, 0, 0, 0, 0],
                           'Day 3': [0, 0, 0, 0, 0, 0]})
print(ot_df.skew(axis=1))

Note: I have checked several Q&As on this site (e.g., "How does pandas calculate skew?") and elsewhere (e.g., an issue on GitHub), but I did not find an answer to my question.

Solution 1:

You can find the implementation here: https://github.com/pandas-dev/pandas/blob/main/pandas/core/nanops.py

The relevant part is:

    with np.errstate(invalid="ignore", divide="ignore"):
        result = (count * (count - 1) ** 0.5 / (count - 2)) * (m3 / m2 ** 1.5)

    dtype = values.dtype
    if is_float_dtype(dtype):
        result = result.astype(dtype)

    if isinstance(result, np.ndarray):
        result = np.where(m2 == 0, 0, result)
        result[count < 3] = np.nan
    else:
        result = 0 if m2 == 0 else result
        if count < 3:
            return np.nan

As you can see, if m2 (which is 0 whenever all the values are identical) is 0, the result is set to 0.
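To make the branch in the excerpt above easier to follow, here is a minimal pure-Python sketch of the same logic for a single list of numbers. It is a simplified illustration, not the pandas code: the function name `skew_sketch` is hypothetical, `m2` and `m3` are assumed to be the sums of squared and cubed deviations from the mean, and the `count < 3` check is done first because plain Python raises on division by zero instead of producing NaN.

```python
import math


def skew_sketch(values):
    """Simplified, hypothetical re-creation of the skew logic shown above."""
    count = len(values)
    # Fewer than 3 observations: the estimator is undefined, so return NaN.
    if count < 3:
        return math.nan

    mean = sum(values) / count
    # Assumed definitions: sums of squared and cubed deviations from the mean.
    m2 = sum((v - mean) ** 2 for v in values)
    m3 = sum((v - mean) ** 3 for v in values)

    # The special case from the excerpt: zero spread gives 0 instead of 0/0.
    if m2 == 0:
        return 0.0

    return (count * (count - 1) ** 0.5 / (count - 2)) * (m3 / m2 ** 1.5)


print(skew_sketch([0, 0, 0]))  # constant zeros
print(skew_sketch([1, 1, 1]))  # constant non-zero values
print(skew_sketch([1, 1, 4]))  # one large value on the right
```

Both constant lists take the `m2 == 0` branch and yield 0, which matches what the question observes. Without that branch, the last line would compute 0 / 0.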

If you are asking why it is implemented this way, I can only speculate. I suppose it was done for practical reasons: when you calculate skewness, you want to check whether the distribution of a variable is symmetrical, and you can argue that a constant variable indeed is (see https://stats.stackexchange.com/questions/114823/skewness-of-a-random-variable-that-have-zero-variance-and-zero-third-central-mom).

EDIT: This behaviour was introduced because of: https://github.com/pandas-dev/pandas/issues/11974 https://github.com/pandas-dev/pandas/pull/12121

You could probably open an issue requesting a flag that controls how this method behaves when a variable is constant. It should be easy to implement.
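Until such a flag exists, one possible workaround is to replace the skewness with NaN wherever all the values are identical. The sketch below is only an illustration: the helper name `skew_nan_if_constant` is hypothetical, and it detects constant rows with `DataFrame.nunique` rather than by comparing the SD to 0, since that comparison can be thrown off by floating-point rounding.

```python
import pandas as pd


def skew_nan_if_constant(df, axis=1):
    """Hypothetical helper: DataFrame.skew, but NaN where all values are identical."""
    skew = df.skew(axis=axis)
    # A row (or column) with at most one distinct value has zero spread.
    is_constant = df.nunique(axis=axis) <= 1
    # Series.mask replaces entries with NaN where the condition is True.
    return skew.mask(is_constant)


ot_df = pd.DataFrame(data={'Day 1': [0, 1],
                           'Day 2': [0, 1],
                           'Day 3': [0, 4]})
result = skew_nan_if_constant(ot_df)
print(result)
```

Here the first row ([0, 0, 0]) becomes NaN, while the second row ([1, 1, 4]) keeps the skewness computed by pandas.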
