Snowflake pandas pd_writer writes out tables with NULLs
Question
I have a Pandas dataframe that I'm writing out to Snowflake using a SQLAlchemy engine and the to_sql function. It works fine, but I have to use the chunksize option because of a Snowflake limit. This is also fine for smaller dataframes. However, some dataframes are 500k+ rows, and at 15k records per chunk, it takes forever to finish writing to Snowflake.
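For reference, this is the shape of the call I'm using. The sketch below swaps in a local SQLite engine as a hypothetical stand-in for the Snowflake one (the table and dataframe are illustrative); the to_sql signature is the same:

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in engine: SQLite in memory instead of a Snowflake URL.
engine = create_engine("sqlite:///:memory:")

df = pd.DataFrame({"name": ["Mark", "Luke"], "balance": [10, 20]})

# chunksize caps how many rows are sent per INSERT round trip.
df.to_sql("demo_table", engine, if_exists="replace", index=False, chunksize=15000)

# Read back to confirm the rows landed.
written = pd.read_sql("SELECT COUNT(*) AS n FROM demo_table", engine)["n"].iloc[0]
```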
I did some research and came across the pd_writer method provided by Snowflake, which apparently loads the dataframe much faster. My Python script does complete faster, and I can see that it creates a table with all the right columns and the right row count, but every single column's value in every single row is NULL.
I thought it was a NaN-to-NULL issue and tried everything possible to replace the NaNs with None, and while the replacement does happen within the dataframe, by the time the data gets to the table, everything becomes NULL.
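A NaN-to-None replacement along these lines does work inside the dataframe (a minimal sketch; the dataframe here is illustrative), yet the written table still comes out all NULL:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Mark", None], "balance": [10.0, np.nan]})

# Cast to object first, then blank out the NaN positions with None;
# otherwise the float column silently coerces None back to NaN.
cleaned = df.astype(object).where(pd.notnull(df), None)
```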
How can I use pd_writer to get these huge dataframes written properly into Snowflake? Are there any viable alternatives?
Following Chris' answer, I decided to try with the official example. Here's my code and the result set:
import os
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import write_pandas, pd_writer


def create_db_engine(db_name, schema_name):
    return create_engine(
        URL(
            account=os.environ.get("DB_ACCOUNT"),
            user=os.environ.get("DB_USERNAME"),
            password=os.environ.get("DB_PASSWORD"),
            database=db_name,
            schema=schema_name,
            warehouse=os.environ.get("DB_WAREHOUSE"),
            role=os.environ.get("DB_ROLE"),
        )
    )


def create_table(out_df, table_name, idx=False):
    engine = create_db_engine("dummy_db", "dummy_schema")
    connection = engine.connect()
    try:
        out_df.to_sql(
            table_name, connection, if_exists="append", index=idx, method=pd_writer
        )
    except ConnectionError:
        print("Unable to connect to database!")
    finally:
        connection.close()
        engine.dispose()
    return True


df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])
print(df.head())
create_table(df, "dummy_demo_table")
The code runs fine with no hitches, but when I look at the table that gets created, it's all NULLs. Again.
Answer
Turns out the documentation (arguably Snowflake's weakest point) is out of sync with reality. This is the real issue: https://github.com/snowflakedb/snowflake-connector-python/issues/329. All it needs is for the column names to be upper case, and it works perfectly.
My workaround is to simply do df.columns = map(str.upper, df.columns) before invoking to_sql.
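A minimal sketch of that workaround (the dataframe is illustrative; only the last line is the actual fix applied before calling to_sql):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Mark", "Luke"], "balance": [10, 20]})

# pd_writer quotes column names verbatim in the statement it generates,
# so lowercase names fail to match the created table's upper-case
# identifiers and every column lands as NULL. Upper-casing the names
# before to_sql sidesteps the mismatch.
df.columns = map(str.upper, df.columns)
```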