pandas - Merging on string columns not working (bug?)

Question

I'm trying to do a simple merge between two dataframes. They come from two different SQL tables, and the joining keys are strings:

>>> df1.col1.dtype
dtype('O')
>>> df2.col2.dtype
dtype('O')

I try to merge them using this:

>>> merge_res = pd.merge(df1, df2, left_on='col1', right_on='col2')

The result of the inner join is empty, which at first suggested to me that there might not be any entries in the intersection:

>>> merge_res.shape
(0, 19)

But when I try to match a single element, I see this really odd behavior:

# Pick random element in second dataframe
>>> df2.iloc[5,:].col2
'95498208100000'

# Manually look for it in the first dataframe
>>> df1[df1.col1 == '95498208100000']
0 rows × 19 columns
# Empty, which makes sense given the above merge result

# Now look for the same value as an integer
>>> df1[df1.col1 == 95498208100000]
1 rows × 19 columns
# FINDS THE ELEMENT!?!

So, the columns are defined with the 'object' dtype. Searching for a value as a string doesn't yield any results. Searching for it as an integer does return a result, and I think this is the reason why the merge above doesn't work.

Any ideas?

It's almost as though pandas converts df1.col1 to an integer just because it can, even though it should be treated as a string while matching.

(I tried to replicate this using sample dataframes, but for small examples I don't see this behavior. Any suggestions on how I can build a more descriptive example would be appreciated as well.)

Recommended answer

The issue was that the object dtype is misleading. I thought it meant that all items were strings. But apparently, while reading the file, pandas was converting some elements to ints and leaving the rest as strings.
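One way to confirm this diagnosis is to count the Python type of each element instead of trusting the column dtype. The following is a minimal sketch on made-up data: the frames, the extra columns `a` and `b`, and the values other than the key from the question are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: both key columns are object dtype, but col1 holds
# a Python int alongside strings.
df1 = pd.DataFrame({
    "col1": pd.Series([95498208100000, "12345", "abc"], dtype=object),
    "a": [1, 2, 3],
})
df2 = pd.DataFrame({
    "col2": pd.Series(["95498208100000", "12345", "xyz"], dtype=object),
    "b": [10, 20, 30],
})

# Count the Python type of each element rather than relying on the dtype.
type_counts = df1["col1"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# The int 95498208100000 never compares equal to the str "95498208100000".
as_str = df1[df1["col1"] == "95498208100000"]
as_int = df1[df1["col1"] == 95498208100000]
print(len(as_str), len(as_int))

# The inner join only matches keys that are strings on both sides.
merge_res = pd.merge(df1, df2, left_on="col1", right_on="col2")
print(merge_res)
```

If `type_counts` lists more than one type, the column is mixed, and keys stored as ints will silently fail to match their string counterparts.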

The solution was to make sure that every field is a string:

>>> df1.col1 = df1.col1.astype(str)
>>> df2.col2 = df2.col2.astype(str)

Then the merge works as expected.
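To check that the conversion really fixed the join, an outer merge with `indicator=True` shows which keys matched on both sides and which exist on only one. This is a sketch on the same kind of hypothetical data, not the original tables; the variable names `check` and `matched` are made up here.

```python
import pandas as pd

# Hypothetical frames: col1 mixes an int with strings, col2 is all strings.
df1 = pd.DataFrame(
    {"col1": pd.Series([95498208100000, "12345", "abc"], dtype=object)}
)
df2 = pd.DataFrame(
    {"col2": pd.Series(["95498208100000", "12345", "xyz"], dtype=object)}
)

# Normalize both key columns to strings before merging.
df1["col1"] = df1["col1"].astype(str)
df2["col2"] = df2["col2"].astype(str)

# how="outer" keeps unmatched rows; indicator=True adds a "_merge" column
# with the values "both", "left_only" or "right_only".
check = pd.merge(
    df1, df2, left_on="col1", right_on="col2", how="outer", indicator=True
)
matched = check[check["_merge"] == "both"]
print(check)
```

One caveat: `astype(str)` converts every value, so floats come out as text like `'1.0'`, and depending on the pandas version missing values may become the literal string `'nan'`. It is worth inspecting the keys after conversion.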

(I wish there was a way of specifying a dtype of str...)
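For data read from a text file, `pd.read_csv` does accept a `dtype` argument that can force a column to be read as strings. The sketch below assumes a CSV source with made-up contents. The question mentions SQL tables, so whether the same approach applies depends on how the data was actually loaded.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = "col1,value\n95498208100000,1\n00123,2\nabc,3\n"

# dtype={"col1": str} keeps every key as text, including leading zeros.
df = pd.read_csv(io.StringIO(csv_text), dtype={"col1": str})
print(df["col1"].tolist())
```

Declaring the dtype at load time avoids a separate conversion step and keeps keys such as `00123` intact, which a round trip through an integer would lose.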
