Pandas merge result missing rows when joining on strings


Question

I have a data set that I've been cleaning and to clean it I needed to put it into a pivot table to summarize some of the data. I'm now putting it back into a dataframe so that I can merge it with some other dataframes. df1 looks something like this:

    Count   Region              Period      ACV  PRJ
    167     REMAINING US WEST   3/3/2018    5    57
    168     REMAINING US WEST   3/31/2018   10   83
    169     SAN FRANCISCO       1/13/2018   99   76
    170     SAN FRANCISCO       1/20/2018   34   21

df2 looks something like this:

Count  MKTcode  Region
11     RSMR0    REMAINING US SOUTH
12     RWMR0    REMAINING US WEST
13     SFR00    SAN FRANCISCO

I've tried merging them with this code:

df3 = pd.merge(df1, df2, on='Region', how='inner')

but for some reason pandas is not interpreting the Region columns as the same data and the merge is turning up NaN data in the MKTcode column and it seems to be appending df2 to df1, like this:

Count  Region            Period     ACV  PRJ  MKTcode
193    WASHINGTON, D.C.  3/3/2018   36   38   NaN
194    WASHINGTON, D.C.  3/31/2018  12   3    NaN
195    ATLANTA           NaN        NaN  NaN  ATMR0
196    BOSTON            NaN        NaN  NaN  B2MRN

I've tried inner and outer joins, but the real problem seems to be that pandas is interpreting the Region column of each dataframe as different elements. The MKTcode and Region columns in df2 have only 12 observations, each occurring only once, whereas df1 has several repeated values in the Region column (multiple rows for the same city). Is there a way to create a list of the 12 MKTcodes I need and perform a merge that matches each one to the region I designate, like a one-to-many match?

Thanks.

Recommended answer

When a merge isn't working as expected, the first thing to do is look at the offending columns.

The biggest culprit in most cases is trailing/leading whitespaces. These are usually introduced when DataFrames are incorrectly read from files.
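
One way to look at the offending columns is to print the key values with `repr`, which makes padding visible, and to list the keys that have no exact match on the other side. This is only a sketch: the two small frames below are made-up stand-ins for df1 and df2, and the trailing space is planted deliberately.

```python
import pandas as pd

# Toy stand-ins for df1/df2 (hypothetical data); note the trailing space.
df1 = pd.DataFrame({"Region": ["SAN FRANCISCO ", "REMAINING US WEST"]})
df2 = pd.DataFrame({"Region": ["SAN FRANCISCO", "REMAINING US WEST"]})

# repr() wraps each value in quotes, so stray whitespace becomes visible.
print(df1["Region"].map(repr).unique())

# Keys in df1 that have no exact string match in df2.
unmatched = sorted(set(df1["Region"]) - set(df2["Region"]))
print(unmatched)
```

If `unmatched` contains values that look identical to keys in the other frame, invisible characters are the likely cause.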

Try getting rid of extra whitespace characters by stripping them out. Assuming you need to join on the "Region" column, use

for df in (df1, df2):
    # Strip the column(s) you're planning to join with
    df['Region'] = df['Region'].str.strip()

Now, merging should work as expected,

pd.merge(df1, df2, on='Region', how='inner')

   Count_x             Region     Period  ACV  PRJ  Count_y MKTcode
0      167  REMAINING US WEST   3/3/2018    5   57       12   RWMR0
1      168  REMAINING US WEST  3/31/2018   10   83       12   RWMR0
2      169      SAN FRANCISCO  1/13/2018   99   76       13   SFR00
3      170      SAN FRANCISCO  1/20/2018   34   21       13   SFR00
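
To confirm which rows actually found a partner, and to enforce the one-to-many shape described in the question, `pd.merge` accepts the `indicator` and `validate` parameters. The following is a minimal sketch on made-up data (the ACV values and the padded 'BOSTON ' key are assumptions for illustration), not a drop-in replacement for the merge above.

```python
import pandas as pd

# Toy data (hypothetical): 'BOSTON ' carries a trailing space on purpose.
df1 = pd.DataFrame({
    "Region": ["REMAINING US WEST", "REMAINING US WEST", "SAN FRANCISCO", "BOSTON "],
    "ACV": [5, 10, 99, 7],
})
df2 = pd.DataFrame({
    "MKTcode": ["RWMR0", "SFR00", "B2MRN"],
    "Region": ["REMAINING US WEST", "SAN FRANCISCO", "BOSTON"],
})

# validate="many_to_one" raises MergeError if a Region repeats in df2;
# indicator=True adds a "_merge" column saying where each row came from.
checked = pd.merge(df1, df2, on="Region", how="left",
                   validate="many_to_one", indicator=True)

# Rows of df1 whose key found no partner in df2.
unmatched = checked.loc[checked["_merge"] == "left_only", "Region"].tolist()
print(unmatched)
```

A left join keeps every row of df1, so any key that still fails to match shows up as `left_only` rather than silently disappearing as it would with an inner join.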


Another possibility, if you're still getting NaNs, is a difference in the whitespace characters between words. For example, 'REMAINING US WEST' will not compare as equal with 'REMAINING US  WEST' (two spaces before WEST).

This time, the fix is to use str.replace:

for df in (df1, df2):
    # regex=True is required in recent pandas, where str.replace is literal by default
    df['Region'] = df['Region'].str.replace(r'\s+', ' ', regex=True)
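
Both fixes can be combined into one small helper applied to each frame before merging. This is a sketch under assumptions: `normalize_key` is a hypothetical name, the data is made up, and `regex=True` is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

def normalize_key(s: pd.Series) -> pd.Series:
    """Collapse runs of whitespace to one space, then trim both ends."""
    return s.str.replace(r"\s+", " ", regex=True).str.strip()

# Toy data (hypothetical): leading space, double space, and a trailing tab.
df1 = pd.DataFrame({"Region": [" REMAINING US  WEST", "SAN FRANCISCO\t"],
                    "ACV": [5, 99]})
df2 = pd.DataFrame({"MKTcode": ["RWMR0", "SFR00"],
                    "Region": ["REMAINING US WEST", "SAN FRANCISCO"]})

for df in (df1, df2):
    df["Region"] = normalize_key(df["Region"])

merged = pd.merge(df1, df2, on="Region", how="inner")
print(merged)
```

Normalizing both sides with the same function keeps the join keys comparable even when only one of the files was read with stray whitespace.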
