Python Dataframe: Remove duplicate words in the same cell within a column in Python

Problem description

Below is a column with the data I have and another column with the de-duplicated data I want.

I honestly don't even know how to start doing this in Python. I've read a couple of posts about doing it in R, but none for Python.

Recommended answer

If you only want to get rid of consecutive duplicates, this should suffice:

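# regex=True is needed on pandas 2.0+, where str.replace treats the pattern as a literal string by default
df['Desired'] = df['Current'].str.replace(r'\b(\w+)(\s+\1)+\b', r'\1', regex=True)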
df

           Current          Desired
0       Racoon Dog       Racoon Dog
1          Cat Cat              Cat
2  Dog Dog Dog Dog              Dog
3  Rat Fox Chicken  Rat Fox Chicken

Details

\b        # word boundary
(\w+)     # capture group 1: a single word
(         # start of group 2
\s+       # one or more whitespace characters
\1        # backreference to the word captured in group 1
)+        # group 2 repeated one or more times
\b        # word boundary
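
To see how the pattern behaves outside pandas, here is a minimal sketch using only the standard `re` module. The helper name `collapse_consecutive` is made up for illustration and is not part of the original answer.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
CONSECUTIVE_DUPES = re.compile(r'\b(\w+)(\s+\1)+\b')

def collapse_consecutive(text):
    """Collapse a run of the same word repeated back to back into one word."""
    return CONSECUTIVE_DUPES.sub(r'\1', text)

print(collapse_consecutive('Dog Dog Dog Dog'))  # Dog
print(collapse_consecutive('Cat Dog Cat'))      # Cat Dog Cat (repeat is not consecutive)
print(collapse_consecutive('Cat cat'))          # Cat cat (matching is case-sensitive)
```

Note that the backreference compares text exactly, so words that differ only in case are not treated as duplicates unless the pattern is compiled with `re.IGNORECASE`.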

The regex is taken from here.

To remove non-consecutive duplicates, I'd suggest a solution involving the OrderedDict data structure:

from collections import OrderedDict

df['Desired'] = (df['Current'].str.split()
                              .apply(lambda x: OrderedDict.fromkeys(x).keys())
                              .str.join(' '))
df

           Current          Desired
0       Racoon Dog       Racoon Dog
1          Cat Cat              Cat
2  Dog Dog Dog Dog              Dog
3  Rat Fox Chicken  Rat Fox Chicken
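
The sample data contains no non-consecutive duplicates, so both approaches print the same table. The sketch below shows a case where they would differ. It works on plain strings and assumes Python 3.7+, where a regular `dict` preserves insertion order and can stand in for `OrderedDict`. The helper name `drop_repeated_words` is hypothetical.

```python
def drop_repeated_words(text):
    """Keep the first occurrence of each word, preserving the original order."""
    # dict.fromkeys builds a dict whose keys are the unique words,
    # in the order they first appear (guaranteed from Python 3.7).
    return ' '.join(dict.fromkeys(text.split()))

print(drop_repeated_words('Cat Dog Cat'))      # Cat Dog
print(drop_repeated_words('Dog Dog Dog Dog'))  # Dog
```

A function like this could presumably be applied to the column with `df['Current'].apply(drop_repeated_words)`, which would avoid the intermediate split and join steps on the Series.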
