Merge dataframe with dask and convert it to pandas
Question
I have two dataframes:
dataframe1:
>df_case = dd.read_csv('s3://../.../df_case.csv')
>df_case.head(1)
sacc_id$ id$ creation_date
0 001A000000hwvV0IAI 5001200000ZnfUgAAJ 2016-06-07 14:38:02
dataframe2:
>df_limdata = dd.read_csv('s3://../.../df_limdata.csv')
>df_limdata.head(1)
sacc_id$ opp_line_id$ oppline_creation_date
0 001A000000hAUn8IAG a0W1200000G0i3UEAR 2015-06-10
First, I merged the two dataframes:
> case = dd.merge(df_limdata, df_case, left_on='sacc_id$',right_on='sacc_id$')
>case
Dask DataFrame Structure:
Unnamed: 0_x sacc_id$ opp_line_id$_x oppline_creation_date_x Unnamed: 0_y opp_line_id$_y oppline_creation_date_y
npartitions=5
int64 object object object int64 object object
... ... ... ... ... ... ...
... ... ... ... ... ... ... ...
... ... ... ... ... ... ...
... ... ... ... ... ... ...
Dask Name: hash-join, 78 tasks
Then I tried to convert the dask dataframe `case` to a pandas dataframe:
> # conversion to pandas
df = case.compute()
I got this error:
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+------------+---------+----------+
| Column | Found | Expected |
+------------+---------+----------+
| Unnamed: 0 | float64 | int64 |
+------------+---------+----------+
Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'Unnamed: 0': 'float64'}
to the call to `read_csv`/`read_table`.
Alternatively, provide `assume_missing=True` to interpret
all unspecified integer columns as floats.
Can you help me resolve this problem, please?
Thanks.
Recommended answer
When reading the file, dask inferred from a sample at the start of the data that the column "Unnamed: 0" has dtype int64, but later, while computing, it found float64 values in that column.
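To make the mismatch concrete, here is a minimal standard-library sketch of sample-based dtype inference. It is a simplified model for illustration, not dask's actual implementation: `infer_dtype`, `check_blocks`, the row-based sampling, and the sample data are all hypothetical. It infers a dtype from the first rows only and then checks each later block against that guess.

```python
import csv
import io


def infer_dtype(values):
    """Return 'int64' if every value parses as an integer, else 'float64'.

    Simplification: the column is assumed to be numeric, so anything that
    is not an integer (including a missing value) is treated as float64.
    """
    for v in values:
        # An empty field is a missing value, which an int64 column cannot hold.
        if v == "":
            return "float64"
        try:
            int(v)
        except ValueError:
            return "float64"
    return "int64"


def check_blocks(text, column, sample_rows, block_rows):
    """Infer a dtype from the first rows, then compare every block against it."""
    rows = list(csv.DictReader(io.StringIO(text)))
    values = [row[column] for row in rows]
    expected = infer_dtype(values[:sample_rows])
    mismatches = []
    for start in range(0, len(values), block_rows):
        found = infer_dtype(values[start:start + block_rows])
        if found != expected:
            mismatches.append((start, found, expected))
    return expected, mismatches


# Hypothetical data: the last row has no value in "Unnamed: 0".
CSV_TEXT = "Unnamed: 0,sacc_id$\n0,a\n1,b\n2,c\n,d\n"

expected, mismatches = check_blocks(
    CSV_TEXT, "Unnamed: 0", sample_rows=2, block_rows=2
)
print(expected)    # int64
print(mismatches)  # [(2, 'float64', 'int64')]
```

The sample contains only integers, so the guess is int64. The block holding the missing value can only be represented as float64, which mirrors the Found/Expected table in the error message.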
Hence you need to specify the dtype when reading the file:
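import dask.dataframe as dd
# Note: the keyword is `dtype`; declare the column as float64 up front.
df_case = dd.read_csv('s3://../.../df_case.csv', dtype={'Unnamed: 0': 'float64'})
df_limdata = dd.read_csv('s3://../.../df_limdata.csv', dtype={'Unnamed: 0': 'float64'})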
case = dd.merge(df_limdata, df_case, left_on='sacc_id$',right_on='sacc_id$')
# conversion to pandas
df = case.compute()