Snowflake pandas pd_writer writes out tables with NULLs

Question

I have a Pandas dataframe that I'm writing out to Snowflake using SQLAlchemy engine and the to_sql function. It works fine, but I have to use the chunksize option because of some Snowflake limit. This is also fine for smaller dataframes. However, some dataframes are 500k+ rows, and at a 15k records per chunk, it takes forever to complete writing to Snowflake.
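
For reference, here is a minimal sketch of that original approach, assuming a SQLAlchemy engine built with snowflake.sqlalchemy.URL (the connection parameters are placeholders, not real credentials):

import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine

engine = create_engine(URL(account="...", user="...", password="...",
                           database="...", schema="...", warehouse="..."))

df = pd.DataFrame({"name": ["Mark", "Luke"], "balance": [10, 20]})

# Plain to_sql: without method=pd_writer, each chunk is sent as ordinary
# INSERT statements, which is why 500k+ rows take so long.
df.to_sql("dummy_demo_table", engine, if_exists="append",
          index=False, chunksize=15000)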

I did some research and came across the pd_writer method provided by Snowflake, which apparently loads the dataframe much faster. My Python script does complete faster and I see it creates a table with all the right columns and the right row count, but every single column's value in every single row is NULL.

I thought it was a NaN to NULL issue and tried everything possible to replace the NaNs with None, and while it does the replacement within the dataframe, by the time it gets to the table, everything becomes NULL.
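
For context, a typical way to attempt that replacement looks like the sketch below (as the answer shows, NaN handling turned out not to be the root cause here):

import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Mark", None], "balance": [10, np.nan]})

# Cast to object first so the frame can hold Python None, then replace
# every NaN with None so the driver sends SQL NULLs instead of NaN.
df = df.astype(object).where(pd.notnull(df), None)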

How can I use pd_writer to get these huge dataframes written properly into Snowflake? Are there any viable alternatives?

Following Chris' answer, I decided to try with the official example. Here's my code and the result set:

import os
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import write_pandas, pd_writer


def create_db_engine(db_name, schema_name):
    return create_engine(
        URL(
            account=os.environ.get("DB_ACCOUNT"),
            user=os.environ.get("DB_USERNAME"),
            password=os.environ.get("DB_PASSWORD"),
            database=db_name,
            schema=schema_name,
            warehouse=os.environ.get("DB_WAREHOUSE"),
            role=os.environ.get("DB_ROLE"),
        )
    )


def create_table(out_df, table_name, idx=False):
    engine = create_db_engine("dummy_db", "dummy_schema")
    connection = engine.connect()

    try:
        out_df.to_sql(
            table_name, connection, if_exists="append", index=idx, method=pd_writer
        )

    except ConnectionError:
        print("Unable to connect to database!")

    finally:
        connection.close()
        engine.dispose()

    return True


df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])

print(df.head())

create_table(df, "dummy_demo_table")

The code runs without a hitch, but when I look at the table that gets created, it's all NULLs. Again.

Answer

Turns out, the documentation (arguably, Snowflake's weakest point) is out of sync with reality. This is the real issue: https://github.com/snowflakedb/snowflake-connector-python/issues/329. All it needs is a single character in the column name to be upper case and it works perfectly.

My workaround is to simply do: df.columns = map(str.upper, df.columns) before invoking to_sql.
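
Applied to the example above, that is one extra line before the write (a sketch reusing the question's create_table helper and example dataframe):

df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])

# Upper-case the column names; with lowercase names the SQL that pd_writer
# generates does not match the columns of the table to_sql creates, so every
# value lands as NULL (see the linked GitHub issue).
df.columns = map(str.upper, df.columns)

create_table(df, "dummy_demo_table")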
