Why does pandas.DataFrame.skew() return 0 when the SD of a list of values is 0?
Background
Suppose there is a list of values representing a person's activity over several hours. The person did not move at all during those hours, so all the values are 0.
What raised the question?
Searching on Google, I found a formula for skewness; the same formula appears on several other sites. Its denominator includes the standard deviation (SD). For a list of identical non-zero values (e.g., [1, 1, 1]), and also for all-zero values (i.e., [0, 0, 0]), the SD is 0. Therefore, I expected to get NaN (something divided by 0) for the skewness. Surprisingly, I get 0 when calling pandas.DataFrame.skew().
My Question
Why does pandas.DataFrame.skew() return 0 when the SD of a list of values is 0?
Minimal Reproducible Example
import pandas as pd
ot_df = pd.DataFrame(data={'Day 1': [0, 0, 0, 0, 0, 0],
                           'Day 2': [0, 0, 0, 0, 0, 0],
                           'Day 3': [0, 0, 0, 0, 0, 0]})
print(ot_df.skew(axis=1))
Note: I have checked several Q&As on this site (e.g., this one (How does pandas calculate skew?)) and elsewhere (e.g., this one on GitHub), but I did not find the answer to my question.
Solution 1:
You can find the implementation here: https://github.com/pandas-dev/pandas/blob/main/pandas/core/nanops.py
The relevant part is:
with np.errstate(invalid="ignore", divide="ignore"):
    result = (count * (count - 1) ** 0.5 / (count - 2)) * (m3 / m2 ** 1.5)
dtype = values.dtype
if is_float_dtype(dtype):
    result = result.astype(dtype)
if isinstance(result, np.ndarray):
    result = np.where(m2 == 0, 0, result)
    result[count < 3] = np.nan
else:
    result = 0 if m2 == 0 else result
    if count < 3:
        return np.nan
As the code shows, if m2 is 0 (which is the case whenever all the values are constant), the result is explicitly set to 0 instead of the NaN that the 0 / 0 division would otherwise produce.
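To make the effect of that guard concrete, here is a minimal sketch that recomputes the same expression with plain NumPy but without the `m2 == 0` check. The helper name `raw_skew` is made up for illustration and is not part of pandas; it assumes `m2` and `m3` are the sums of squared and cubed deviations from the mean, as in the snippet above.

```python
import numpy as np
import pandas as pd


def raw_skew(values):
    """Illustrative helper (not a pandas API): the same expression as in
    the snippet above, but WITHOUT the `m2 == 0` guard."""
    x = np.asarray(values, dtype=float)
    n = x.size
    deviations = x - x.mean()
    m2 = (deviations ** 2).sum()  # sum of squared deviations
    m3 = (deviations ** 3).sum()  # sum of cubed deviations
    # Silence the 0 / 0 warning, as the pandas snippet does.
    with np.errstate(invalid="ignore", divide="ignore"):
        return (n * (n - 1) ** 0.5 / (n - 2)) * (m3 / m2 ** 1.5)


constant = [1.0, 1.0, 1.0]
unguarded = raw_skew(constant)        # 0 / 0, so NaN
guarded = pd.Series(constant).skew()  # the guard turns this into 0

varied = [1.0, 2.0, 5.0, 10.0]
unguarded_varied = raw_skew(varied)
guarded_varied = pd.Series(varied).skew()

print(unguarded, guarded)
print(unguarded_varied, guarded_varied)
```

For non-constant data the two numbers should agree; only the zero-variance case differs, which is exactly what the question observed.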
If you are asking why it is implemented this way, I can only speculate. I suppose it was done for practical reasons: skewness is usually computed to check whether the distribution of a variable is symmetrical, and you can argue that a constant one indeed is (see https://stats.stackexchange.com/questions/114823/skewness-of-a-random-variable-that-have-zero-variance-and-zero-third-central-mom).
EDIT: It was done due to: https://github.com/pandas-dev/pandas/issues/11974 https://github.com/pandas-dev/pandas/pull/12121
You could probably open an issue requesting a flag that controls how this method behaves when a variable is constant. It should be easy to implement.
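Until such a flag exists, one possible workaround is to post-process the result yourself. The sketch below is only an assumption about what you might want: `skew_nan_if_constant` is a hypothetical helper, not a pandas function, and it treats a row or column as constant when it holds at most one distinct value.

```python
import pandas as pd


def skew_nan_if_constant(df, axis=0):
    """Hypothetical helper: like DataFrame.skew, but returns NaN wherever
    all values along `axis` are identical (zero variance)."""
    skew = df.skew(axis=axis)
    # At most one distinct value along the axis means the data is constant.
    is_constant = df.nunique(axis=axis) <= 1
    return skew.mask(is_constant)  # replace those entries with NaN


ot_df = pd.DataFrame(data={'Day 1': [0, 0, 0, 0, 0, 0],
                           'Day 2': [0, 0, 0, 0, 0, 0],
                           'Day 3': [0, 0, 0, 0, 0, 0]})
constant_result = skew_nan_if_constant(ot_df, axis=1)

varied_df = pd.DataFrame({'a': [1, 2, 5, 10]})
varied_result = skew_nan_if_constant(varied_df, axis=0)

print(constant_result)
print(varied_result)
```

Checking the number of distinct values rather than `std() == 0` avoids depending on floating-point rounding in the variance; non-constant data passes through unchanged.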