Pandas Join on String Datatype

Question

I am trying to join two pandas DataFrames on an id field that holds string UUIDs, and I get a ValueError:

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

The code is below. Following the question "Trying to merge 2 dataframes but get ValueError", I tried converting the id fields to strings, but the error remains. Note that pdf comes from a Spark DataFrame via toPandas(), while outputsPdf is created from a dictionary.

pdf.id = pdf.id.apply(str)
outputsPdf.id = outputsPdf.id.apply(str)
inOutPdf = pdf.join(outputsPdf, on='id', how='left', rsuffix='fs')

pdf.dtypes
id         object
time      float64
height    float32
dtype: object

outputsPdf.dtypes
id         object
labels    float64
dtype: object

How can I debug this? Full traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-13-deb429dde9ad> in <module>()
     61 pdf['id'] = pdf['id'].astype(str)
     62 outputsPdf['id'] = outputsPdf['id'].astype(str)
---> 63 inOutPdf = pdf.join(outputsPdf, on=['id'], how='left', rsuffix='fs')
     64 
     65 # idSparkDf = spark.createDataFrame(idPandasDf, schema=StructType([StructField('id', StringType(), True),

~/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py in join(self, other, on, how, lsuffix, rsuffix, sort)
   6334         # For SparseDataFrame's benefit
   6335         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 6336                                  rsuffix=rsuffix, sort=sort)
   6337 
   6338     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',

~/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   6349             return merge(self, other, left_on=on, how=how,
   6350                          left_index=on is None, right_index=True,
-> 6351                          suffixes=(lsuffix, rsuffix), sort=sort)
   6352         else:
   6353             if on is not None:

~/miniconda3/lib/python3.6/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     59                          right_index=right_index, sort=sort, suffixes=suffixes,
     60                          copy=copy, indicator=indicator,
---> 61                          validate=validate)
     62     return op.get_result()
     63 

~/miniconda3/lib/python3.6/site-packages/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    553         # validate the merge keys dtypes. We may need to coerce
    554         # to avoid incompat dtypes
--> 555         self._maybe_coerce_merge_keys()
    556 
    557         # If argument passed to validate,

~/miniconda3/lib/python3.6/site-packages/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self)
    984             elif (not is_numeric_dtype(lk)
    985                     and (is_numeric_dtype(rk) and not is_bool_dtype(rk))):
--> 986                 raise ValueError(msg)
    987             elif is_datetimelike(lk) and not is_datetimelike(rk):
    988                 raise ValueError(msg)

Recommended answer

The on parameter of join applies only to the calling DataFrame:

on: Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index.

Although you specify on='id', join takes 'id' from pdf, whose dtype is object, and tries to match it against the index of outputsPdf, which holds integer values.
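
One way to confirm this is to look at the two things join actually compares: the key column of the caller and the index of the other frame. The sketch below uses small made-up stand-ins for pdf and outputsPdf (the values and the default integer index are assumptions, not the asker's real data).

```python
import pandas as pd

# Hypothetical stand-ins for the asker's frames: string ids in a column,
# and the default integer index on both frames.
pdf = pd.DataFrame({"id": ["a1", "b2", "c3"], "time": [0.1, 0.2, 0.3]})
outputsPdf = pd.DataFrame({"id": ["a1", "b2", "c3"], "labels": [1.0, 0.0, 1.0]})

# pdf.join(outputsPdf, on='id') matches pdf['id'] against outputsPdf.index,
# not against outputsPdf['id'], so these are the two dtypes that matter.
left_key_dtype = pdf["id"].dtype
right_index_dtype = outputsPdf.index.dtype

print("left key dtype:   ", left_key_dtype)
print("right index dtype:", right_index_dtype)
```

If the first is a string-like dtype and the second is an integer dtype, converting outputsPdf['id'] to str cannot help, because that column never takes part in the join.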

If you need to join on non-index columns in both DataFrames, either set those columns as the index of each, or use merge, because the on parameter of pd.merge applies to both DataFrames.

import pandas as pd

df1 = pd.DataFrame({'id': ['1', 'True', '4'], 'vals': [10, 11, 12]})
df2 = df1.copy()

df1.join(df2, on='id', how='left', rsuffix='_fs')

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

On the other hand, these work:

df1.set_index('id').join(df2.set_index('id'), how='left', rsuffix='_fs').reset_index()
#     id  vals  vals_fs
#0     1    10       10
#1  True    11       11
#2     4    12       12

df1.merge(df2, on='id', how='left', suffixes=['', '_fs'])
#     id  vals  vals_fs
#0     1    10       10
#1  True    11       11
#2     4    12       12
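
Applied to frames shaped like the ones in the question, this might look like the sketch below. The data is invented for illustration. The second variant is an addition to the two shown above: it moves 'id' into the index of the right-hand frame only, which is enough because on='id' is looked up in the caller and matched against the index of other.

```python
import pandas as pd

# Hypothetical stand-ins for pdf and outputsPdf; 'c3' has no match on purpose.
pdf = pd.DataFrame({"id": ["a1", "b2", "c3"], "time": [0.1, 0.2, 0.3]})
outputsPdf = pd.DataFrame({"id": ["b2", "a1"], "labels": [1.0, 0.0]})

# Variant 1: merge matches the 'id' column of both frames.
merged = pdf.merge(outputsPdf, on="id", how="left")

# Variant 2: keep join, but make 'id' the index of the right-hand frame,
# so that pdf['id'] is compared with string labels instead of integers.
joined = pdf.join(outputsPdf.set_index("id"), on="id", how="left")

print(merged)
print(joined)
```

Both calls keep every row of pdf and leave labels as NaN where outputsPdf has no matching id.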
