Pandas pivot table function values into wrong rows
I'm making a pivot table from a CSV (cl_total_data.csv) file using pandas pd.pivot_table() and need find a fix to values in the wrong rows.
[Original CSV File]
The error occurs when the year has 53 weeks(i.e. 53 values) instead of 52, the first value in the year with 53 weeks is set as the last value in the pivot table
[Pivot Table with wrong values top]
[Pivot Table with wrong values bottom]
[Original CSV 2021 w/ 53 values]
The last value for the pivot table 2021 row 53 (1123544) is the first value of the year for 2021-01-01 (1123544) in the original CSV table for 2021.
I figured out how to fix this in the pivot table after making it. I use
Find columns with 53 values:
cl_total_p.columns[~cl_total_p.isnull().any()]
Then take the values from the original CSV files to its corresponding year and replace the values in the pivot table
cl_total_p[2021] = cl_total_data.loc['2021'].Quantity.values
My problem is:
- I can't figure out what I'm coding wrong in the pivot table function that causes this misplacement of values. Is there a better way to code it?
- Using my manual solution takes a lot of time especially when I'm using multiple CSV files 10+ and having to fix every single misplacement in columns with 53 weeks. Is there a for loop I can code to loop through all columns with 53 weeks and replace them with their corresponding year?
I tried
import numpy
import pandas
year_range = np.arange(1982,2023)
week_range = np.arange(54)
for i in year_range:
for y in week_range:
cl_total_p[i] = cl_total_data.loc['y'].Quantity.values
But I get an error :( How can I fix the pivot table value misplacement? and/or find a for loop to take the original values and replace them in the pivot table?
Solution 1:
I can't figure out what I'm coding wrong in the pivot table function that causes this misplacement of values. Is there a better way to code it?
The problem here lies in the definition of the ISO week number. Let's look at this line of code:
cl_total_p = pd.pivot_table(cl_total_data, index = cl_total_data.index.isocalendar().week, columns = cl_total_data.index.year, values = 'Quantity')
This line uses the ISO week number to determine the row position, and the non-ISO year to determine the column position.
The ISO week number is defined as the number of weeks since the first week of the year with a majority of its days in that year. This means that it is possible for the first week of the year to not line up with the first day of the year. For that reason, the ISO week number is used alongside the ISO year number, which says that the part of the year before the first week belongs to the the previous year.
For that reason, January 1st, 2021 was not the first week of 2021 in the ISO system. It was the 53rd week of 2020. When you mix the ISO week with the non-ISO year, you get the result that it was the 53rd week of 2021, a date which is a year off.
Here's an example of how to show this with the linux program date
:
$ date -d "Jan 1 2021" "+%G-%V"
2020-53
You have a few options:
- Use both the ISO week and the ISO year for consistency. The isocalendar() function can provide both the ISO week and ISO year.
- If you don't want the ISO system, you can come up with your own definition of "week" which avoids having the year's first day belong to the previous year. One approach you could take is to take the day of year, divide by seven, and round down. Unfortunately, this does mean that the week will start on a different day each year.