Snowflake pandas pd_writer writes out tables with NULLs

Question

I have a Pandas dataframe that I'm writing out to Snowflake using a SQLAlchemy engine and the to_sql function. It works fine, but I have to use the chunksize option because of a Snowflake limit. This is also fine for smaller dataframes. However, some dataframes are 500k+ rows, and at 15k records per chunk, it takes forever to finish writing to Snowflake.
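
For reference, the slow path looks roughly like this (a minimal sketch: the connection URL below is a placeholder, and the real engine setup appears in the full example further down):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder URL for illustration only; the full example below builds
# the engine with snowflake.sqlalchemy.URL and environment variables.
engine = create_engine("snowflake://user:password@account/dummy_db/dummy_schema")

df = pd.DataFrame({"name": ["Mark", "Luke"], "balance": [10, 20]})

# Default to_sql path: rows go over as batched INSERT statements,
# which is why 500k+ rows at 15k per chunk take so long.
df.to_sql("dummy_demo_table", engine, if_exists="append", index=False, chunksize=15000)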

I did some research and came across the pd_writer method provided by Snowflake, which apparently loads the dataframe much faster. My Python script does complete faster and I see it creates a table with all the right columns and the right row count, but every single column's value in every single row is NULL.

I thought it was a NaN to NULL issue and tried everything possible to replace the NaNs with None, and while it does the replacement within the dataframe, by the time it gets to the table, everything becomes NULL.
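
For context, the usual NaN-to-None swap looks something like this (a sketch, since the exact replacement code isn't shown here):

import numpy as np
import pandas as pd

df = pd.DataFrame([("Mark", 10.0), ("Luke", np.nan)], columns=["name", "balance"])

# Two common idioms for turning NaN into None inside the dataframe:
df = df.astype(object).where(df.notnull(), None)
# or: df = df.replace({np.nan: None})

print(df)  # Luke's balance is now None, not NaN

Either way, every column (including the non-numeric ones) still arrived as NULL, which already hints the problem isn't NaN handling.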

How can I use pd_writer to get these huge dataframes written properly into Snowflake? Are there any viable alternatives?

Edit: Following Chris' answer, I decided to try the official example. Here's my code and the result set:

import os
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import write_pandas, pd_writer


def create_db_engine(db_name, schema_name):
    return create_engine(
        URL(
            account=os.environ.get("DB_ACCOUNT"),
            user=os.environ.get("DB_USERNAME"),
            password=os.environ.get("DB_PASSWORD"),
            database=db_name,
            schema=schema_name,
            warehouse=os.environ.get("DB_WAREHOUSE"),
            role=os.environ.get("DB_ROLE"),
        )
    )


def create_table(out_df, table_name, idx=False):
    engine = create_db_engine("dummy_db", "dummy_schema")
    connection = engine.connect()

    try:
        # method=pd_writer hands the load off to Snowflake's bulk loader
        # (stage + COPY) instead of row-by-row INSERTs
        out_df.to_sql(
            table_name, connection, if_exists="append", index=idx, method=pd_writer
        )

    except ConnectionError:
        print("Unable to connect to database!")

    finally:
        connection.close()
        engine.dispose()

    return True


df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])

print(df.head())

create_table(df, "dummy_demo_table")

The code runs fine with no hitches, but when I look at the table that gets created, it's all NULLs. Again.

Answer

Turns out the documentation (arguably Snowflake's weakest point) is out of sync with reality. This is the real issue: https://github.com/snowflakedb/snowflake-connector-python/issues/329. All it takes is for the column names to be upper case, and it works perfectly.

My workaround is to simply do: df.columns = map(str.upper, df.columns) before invoking to_sql.
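
Applied to the question's example, the whole workaround is one extra line before the write (a sketch; only the column-renaming line is new relative to the code above):

import pandas as pd

df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])

# Upper-case every column name so the identifiers pd_writer generates
# match the columns Snowflake actually stores.
df.columns = map(str.upper, df.columns)

create_table(df, "dummy_demo_table")  # same helper as defined in the question

With upper-case column names, the table comes out populated instead of all NULLs.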
