Snowflake write_pandas is not inserting dates correctly
I have a pandas dataframe named "df" that I'm creating with the result of a SQL query against a Netezza database. I'm working in Jupyter notebook. The dataframe has two rows and two of the columns (CREATEDDATE and STAGEDATE) contain datetime values. When I run print(df), the result looks like this:
ID ISDELETED PARENTID CREATEDBYID \
0 017o000003tQRftAAG false a0no000000Hrv1IAAR 005o0000001w8wgAAA
1 017o000003jl52cAAA false a0no000000GszDUAAZ 005o0000001w2pTAAQ
CREATEDDATE FIELD OLDVALUE NEWVALUE STAGEDATE
0 2015-07-30 14:51:41 created None None 2016-06-06
1 2015-07-16 14:48:37 created None None 2016-06-06
If I run print(df.dtypes), the result is this:
ID object
ISDELETED object
PARENTID object
CREATEDBYID object
CREATEDDATE datetime64[ns]
FIELD object
OLDVALUE object
NEWVALUE object
STAGEDATE datetime64[ns]
dtype: object
So, as far as I can tell, my datetime columns are correctly formatted to write to Snowflake using write_pandas(). However, after I do that, the dates are wildly different in Snowflake:
The STAGEDATE value of 2016-06-06, for example, is now 48399-06-06. Does anyone know how to fix this? I'm doing my pull from Netezza using pyodbc and using df = cs.execute() to populate the dataframe. I am importing and using snowflake.connector for my connection to Snowflake and am running the following to get write_pandas:
from snowflake.connector.pandas_tools import write_pandas
Solution 1:
I found a solution with the code that pday wrote here https://github.com/snowflakedb/snowflake-connector-python/issues/319#issuecomment-764145625. This function automatically adds to the date type cols a timezone (default one in the function is UTC).
def fix_date_cols(df, tz = 'UTC'):
cols = df.select_dtypes(include=['datetime64[ns]']).columns
for col in cols:
df[col] = df[col].dt.tz_localize(tz)
So I would suggest using this function before pass the dataframe to write_pandas.
From what I get the issue is that the datetime object is misinterpreted, because it is not enough well defined. Adding the info of a timezone will force the interpretation of the elements as datetime.
But I really recommend you to read this amazing conversation where it is well explained https://github.com/snowflakedb/snowflake-connector-python/issues/319 it really helped me.
Solution 2:
SELECT '2016-06-06'::timestamp as a
,to_timestamp_ntz(date_part('epoch_second', a),0)::date as a_s
,to_timestamp_ntz(date_part('epoch_millisecond', a),0)::date as a_ms
,to_timestamp_ntz(date_part('epoch_millisecond', a),3)::date as b_ms;
gives
A A_S A_MS B_MS
2016-06-06 00:00:00.000 2016-06-06 48399-06-06 2016-06-06
which is to say, your date, turned into epoch millisconds and parse as seconds gives your date.
so you ether throw away your milliseconds, or alter how your dates are getting parsed.