pandas - Merging on string columns not working (bug?)


Problem description

I'm trying to do a simple merge between two dataframes. These come from two different SQL tables, where the joining keys are strings:

>>> df1.col1.dtype
dtype('O')
>>> df2.col2.dtype
dtype('O')

I try to merge them using this:

>>> merge_res = pd.merge(df1, df2, left_on='col1', right_on='col2')

The result of the inner join is empty, which at first suggested to me that there might be no entries in the intersection:

>>> merge_res.shape
(0, 19)

But when I try to match a single element, I see this really odd behavior.

# Pick random element in second dataframe
>>> df2.iloc[5,:].col2
'95498208100000'

# Manually look for it in the first dataframe
>>> df1[df1.col1 == '95498208100000']
0 rows × 19 columns
# Empty, which makes sense given the above merge result

# Now look for the same value as an integer
>>> df1[df1.col1 == 95498208100000]
1 rows × 19 columns
# FINDS THE ELEMENT!?!

So, the columns are defined with the 'object' dtype. Searching for the values as strings doesn't yield any results. Searching for them as integers does return a result, and I think this is the reason why the merge above doesn't work.

Any ideas what's going on?

It's almost as though pandas converts df1.col1 to an integer just because it can, even though it should be treated as a string while matching.

(I tried to replicate this using sample dataframes, but for small examples, I don't see this behavior. Any suggestions on how I can find a more descriptive example would be appreciated as well.)

Solution

The issue was that the object dtype is misleading. I thought it meant that all items were strings. But apparently, while reading the file, pandas was converting some elements to ints and leaving the rest as strings.
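One way to check for this is to count the Python types actually stored in the column, since an object column can hold anything. This is only a sketch: the small df1 below is a made-up stand-in for the real table, and the values in it are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df1: an object column holding both int and str values.
df1 = pd.DataFrame(
    {"col1": pd.Series([95498208100000, "12345", "abc"], dtype=object)}
)

print(df1["col1"].dtype)  # object, even though not every value is a str

# Count the Python type of each stored value.
type_counts = df1["col1"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Pull out the rows whose key is not a str, to see what is mixed in.
is_str = df1["col1"].map(lambda v: isinstance(v, str))
non_str_rows = df1[~is_str]
print(non_str_rows)
```

If more than one type name shows up in the counts, equality checks and merges on that column will treat 95498208100000 and '95498208100000' as different keys.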

The solution was to make sure that every field is a string:

>>> df1.col1 = df1.col1.astype(str)
>>> df2.col2 = df2.col2.astype(str)

Then the merge works as expected.
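For anyone wanting to reproduce this on a small example, the behavior shows up once the mixed types are created on purpose. The frames below are invented for illustration (the extra columns a and b and the non-matching keys are assumptions); they are not the original data.

```python
import pandas as pd

# Hypothetical data: df1 holds the key as a Python int, df2 holds it as a str.
df1 = pd.DataFrame(
    {"col1": pd.Series([95498208100000, "abc"], dtype=object), "a": [1, 2]}
)
df2 = pd.DataFrame(
    {"col2": pd.Series(["95498208100000", "xyz"], dtype=object), "b": [10, 20]}
)

# Filtering shows the same asymmetry as in the question.
str_hits = int((df1["col1"] == "95498208100000").sum())  # 0 matches
int_hits = int((df1["col1"] == 95498208100000).sum())    # 1 match

# The int key and the str key are different objects, so the inner join is empty.
before = pd.merge(df1, df2, left_on="col1", right_on="col2")
print(before.shape[0])  # 0

# Normalize both key columns to strings, then merge again.
df1["col1"] = df1["col1"].astype(str)
df2["col2"] = df2["col2"].astype(str)
after = pd.merge(df1, df2, left_on="col1", right_on="col2")
print(after.shape[0])  # 1
```

One caveat: astype(str) converts every value to text, so it is worth checking how missing values in the key column end up before relying on the merged result.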

(I wish there was a way of specifying a dtype of str...)
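If the data is loaded with pandas.read_csv, its dtype argument can request strings for a given column at read time. The sketch below assumes a CSV source, and the file content and the value column are invented; for data loaded some other way (for example from SQL), check whether the reader in your pandas version accepts a similar argument, and otherwise fall back to astype(str) as above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = "col1,value\n95498208100000,1\n00123,2\n"

# Default behavior: pandas infers a numeric type for col1.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["col1"].tolist())

# Ask for strings up front so the key is never parsed as a number.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"col1": str})
print(as_text["col1"].tolist())
```

Reading the key as text also preserves leading zeros, which would be lost if the column were parsed as integers.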
