Snowflake write_pandas is not inserting dates correctly

I have a pandas dataframe named "df" that I'm creating with the result of a SQL query against a Netezza database. I'm working in Jupyter notebook. The dataframe has two rows and two of the columns (CREATEDDATE and STAGEDATE) contain datetime values. When I run print(df), the result looks like this:

                   ID ISDELETED            PARENTID         CREATEDBYID  \
0  017o000003tQRftAAG     false  a0no000000Hrv1IAAR  005o0000001w8wgAAA   
1  017o000003jl52cAAA     false  a0no000000GszDUAAZ  005o0000001w2pTAAQ   

          CREATEDDATE    FIELD OLDVALUE NEWVALUE  STAGEDATE  
0 2015-07-30 14:51:41  created     None     None 2016-06-06  
1 2015-07-16 14:48:37  created     None     None 2016-06-06

If I run print(df.dtypes), the result is this:

ID                     object
ISDELETED              object
PARENTID               object
CREATEDBYID            object
CREATEDDATE    datetime64[ns]
FIELD                  object
OLDVALUE               object
NEWVALUE               object
STAGEDATE      datetime64[ns]
dtype: object

So, as far as I can tell, my datetime columns are correctly formatted to write to Snowflake using write_pandas(). However, after I do that, the dates are wildly different in Snowflake: enter image description here

The STAGEDATE value of 2016-06-06, for example, is now 48399-06-06. Does anyone know how to fix this? I'm doing my pull from Netezza using pyodbc and using df = cs.execute() to populate the dataframe. I am importing and using snowflake.connector for my connection to Snowflake and am running the following to get write_pandas:

from snowflake.connector.pandas_tools import write_pandas

Solution 1:

I found a solution with the code that pday wrote here https://github.com/snowflakedb/snowflake-connector-python/issues/319#issuecomment-764145625. This function automatically adds to the date type cols a timezone (default one in the function is UTC).

def fix_date_cols(df, tz = 'UTC'):
cols = df.select_dtypes(include=['datetime64[ns]']).columns
for col in cols:
    df[col] = df[col].dt.tz_localize(tz)

So I would suggest using this function before pass the dataframe to write_pandas.

From what I get the issue is that the datetime object is misinterpreted, because it is not enough well defined. Adding the info of a timezone will force the interpretation of the elements as datetime.

But I really recommend you to read this amazing conversation where it is well explained https://github.com/snowflakedb/snowflake-connector-python/issues/319 it really helped me.

Solution 2:

SELECT '2016-06-06'::timestamp as a
    ,to_timestamp_ntz(date_part('epoch_second', a),0)::date as a_s
    ,to_timestamp_ntz(date_part('epoch_millisecond', a),0)::date as a_ms
    ,to_timestamp_ntz(date_part('epoch_millisecond', a),3)::date as b_ms;

gives

A                        A_S          A_MS         B_MS
2016-06-06 00:00:00.000  2016-06-06   48399-06-06  2016-06-06

which is to say, your date, turned into epoch millisconds and parse as seconds gives your date.

so you ether throw away your milliseconds, or alter how your dates are getting parsed.

Snowflake write_pandas is not inserting dates correctly

Solution 1:

Solution 2:

Related

Recent Posts