Pandas: combine columns without duplicates / find unique words after combining

Question

I have a dataframe in which I would like to concatenate certain columns.

My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates so that only the relevant information is retained.

For example, suppose I have a data frame such as:

animals = pd.read_csv("animal.csv")

  animal1         animal2        label  
1 cat dog         dolphin        19
2 dog cat         cat            72
3 pilchard 26     koala          26
4 newt bat 81     bat            81

I want to combine the columns but retain only the unique information from each of the strings.

You can see that in row 2, 'cat' appears in both the 'animal1' and 'animal2' columns. In row 3, the number 26 appears in both the 'animal1' and 'label' columns. In row 4, the information in the 'animal2' and 'label' columns is already contained, in order, in 'animal1'.

I combine the columns by doing the following:

animals["detail"] = animals["animal1"].map(str) + ' ' + animals["animal2"].map(str) + ' ' + animals["label"].map(str)

  animal1         animal2        label        detail  
1 cat dog         dolphin        19           cat dog dolphin 19
2 dog cat         cat            72           dog cat cat 72
3 pilchard 26     koala          26           pilchard 26 koala 26
4 newt bat 81     bat            81           newt bat 81 bat 81

Row 1 is fine, but the other rows contain the duplicates described above.

My desired output is:

  animal1         animal2        label        detail  
1 cat dog         dolphin        19           cat dog dolphin 19
2 dog cat         cat            72           dog cat 72
3 pilchard 26     koala          26           pilchard koala 26
4 newt bat 81     bat            81           newt bat 81

Alternatively, retaining only the first unique instance of each word/number per row in the detail column would also be suitable, i.e.:

  detail 
1 cat dog dolphin 19
2 dog cat 72
3 pilchard koala 26
4 newt bat 81

I've looked at how to do this for a single string in Python (e.g. "How can I remove duplicate words in a string with Python?", "How to get all the unique words in the data frame?", "show distinct column values in pyspark dataframe: python"), but I can't figure out how to apply this to individual rows within the detail column. I've tried splitting the text after combining the columns and then using apply with a lambda, but haven't got this to work yet. Or is there perhaps a way to do it while combining the columns?

I have a solution in R but want to recode it in Python.

I would greatly appreciate any help or advice. I'm currently using Spyder (Python 3.5).

Recommended answer

You can apply a custom function that first splits the combined string on whitespace, then gets the unique values with pandas.unique, and finally joins them back into a string:

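# Join the three columns with a space between them
# (the parentheses let the expression span several lines)
animals["detail"] = (animals["animal1"].map(str) + ' ' +
                     animals["animal2"].map(str) + ' ' +
                     animals["label"].map(str))

# Split each row on whitespace, drop repeated words (order is kept), re-join
animals["detail"] = animals["detail"].apply(lambda x: ' '.join(pd.unique(x.split())))
print(animals)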
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dog dolphin 19
2      dog cat      cat     72          dog cat 72
3  pilchard 26    koala     26   pilchard 26 koala
4  newt bat 81      bat     81         newt bat 81

The values can also be joined inside apply:

# Convert every column to str, then join and de-duplicate row by row
animals["detail"] = (animals.astype(str)
                            .apply(lambda x: ' '.join(pd.unique(' '.join(x).split())), axis=1))
print(animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dog dolphin 19
2      dog cat      cat     72          dog cat 72
3  pilchard 26    koala     26   pilchard 26 koala
4  newt bat 81      bat     81         newt bat 81
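
The same "keep the first occurrence of each word" step can also be sketched without pandas.unique, using dict.fromkeys from the standard library. This is only an illustration, not part of the original answer: the sample frame is reconstructed from the table in the question, and the helper name unique_words is hypothetical. It relies on dicts preserving insertion order, which is guaranteed from Python 3.7 onwards (so not on the Python 3.5 mentioned in the question). As far as I know, recent pandas releases also warn when pd.unique is given a plain list, which is one reason to consider this variant.

```python
import pandas as pd

# Sample data reconstructed from the question's table (index 1-4).
animals = pd.DataFrame(
    {
        "animal1": ["cat dog", "dog cat", "pilchard 26", "newt bat 81"],
        "animal2": ["dolphin", "cat", "koala", "bat"],
        "label": [19, 72, 26, 81],
    },
    index=[1, 2, 3, 4],
)


def unique_words(row):
    """Join a row's string values and keep the first occurrence of each word."""
    words = " ".join(row).split()
    # dict keys are unique and keep insertion order (Python 3.7+).
    return " ".join(dict.fromkeys(words))


# Restrict to the source columns so re-running does not feed "detail" back in.
cols = ["animal1", "animal2", "label"]
animals["detail"] = animals[cols].astype(str).apply(unique_words, axis=1)
print(animals)
```

Selecting the three source columns explicitly, rather than calling astype(str) on the whole frame, keeps the result the same if the cell is run a second time after the detail column already exists.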

A solution using set, although it does not preserve the order of the words:

# A set removes duplicates, but the order of the words is arbitrary
animals["detail"] = (animals.astype(str)
                            .apply(lambda x: ' '.join(set(' '.join(x).split())), axis=1))
print(animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dolphin 19 dog
2      dog cat      cat     72          cat dog 72
3  pilchard 26    koala     26   26 pilchard koala
4  newt bat 81      bat     81         bat 81 newt
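
Note that the desired output in the question shows "pilchard koala 26" for row 3, whereas keeping the first occurrence of each word gives "pilchard 26 koala". One possible reading of the question, and it is only an assumption, is that the last occurrence of each word should be kept instead. A minimal sketch of that variant follows; the sample frame is reconstructed from the question's table, the helper name keep_last_occurrence is hypothetical, and the ordering again depends on dicts preserving insertion order (Python 3.7+).

```python
import pandas as pd

# Sample data reconstructed from the question's table (index 1-4).
animals = pd.DataFrame(
    {
        "animal1": ["cat dog", "dog cat", "pilchard 26", "newt bat 81"],
        "animal2": ["dolphin", "cat", "koala", "bat"],
        "label": [19, 72, 26, 81],
    },
    index=[1, 2, 3, 4],
)


def keep_last_occurrence(text):
    """Drop repeated words, keeping the position of each word's last occurrence."""
    words = text.split()
    # De-duplicating the reversed list keeps each word's last occurrence;
    # reversing again restores the original left-to-right order.
    deduped = list(dict.fromkeys(reversed(words)))
    return " ".join(reversed(deduped))


# Combine the columns with single spaces, then de-duplicate each row.
combined = (animals["animal1"].astype(str) + " " +
            animals["animal2"].astype(str) + " " +
            animals["label"].astype(str))
animals["detail"] = combined.apply(keep_last_occurrence)
print(animals)
```

On this sample the detail column matches the four rows of the desired output listed in the question; whether that rule is what the asker intended for other data is not stated.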
