Pandas: combine columns without duplicates / find unique words after combining
Question
I have a dataframe where I would like to concatenate certain columns.
My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates in order to retain only the relevant information.
For example, if I had a data frame such as:
pd.read_csv("animal.csv")
animal1 animal2 label
1 cat dog dolphin 19
2 dog cat cat 72
3 pilchard 26 koala 26
4 newt bat 81 bat 81
I want to combine the columns but retain only unique information from each of the strings.
You can see that in row 2, 'cat' appears in both the 'animal1' and 'animal2' columns. In row 3, the number 26 appears in both the 'animal1' and 'label' columns. In row 4, the information in the 'animal2' and 'label' columns is already contained, in order, in 'animal1'.
I combine the columns as follows:
animals["detail"] = animals["animal1"].map(str) + ' ' + animals["animal2"].map(str) + ' ' + animals["label"].map(str)
animal1 animal2 label detail
1 cat dog dolphin 19 cat dog dolphin 19
2 dog cat cat 72 dog cat cat 72
3 pilchard 26 koala 26 pilchard 26 koala 26
4 newt bat 81 bat 81 newt bat 81 bat 81
Row 1 is fine, but the other rows, of course, contain duplicates as described above.
My desired output is:
animal1 animal2 label detail
1 cat dog dolphin 19 cat dog dolphin 19
2 dog cat cat 72 dog cat 72
3 pilchard 26 koala 26 pilchard koala 26
4 newt bat 81 bat 81 newt bat 81
Alternatively, keeping only the first instance of each word/number per row in the detail column would also be suitable, i.e.:
detail
1 cat dog dolphin 19
2 dog cat 72
3 pilchard koala 26
4 newt bat 81
I've looked at doing this for a single string in Python (e.g. "How can I remove duplicate words in a string with Python?", "How to get all the unique words in the data frame?", "show distinct column values in pyspark dataframe: python"), but I can't figure out how to apply this to the individual rows of the detail column. I've tried splitting the text after combining the columns and then using apply with a lambda, but haven't got this to work yet. Or is there perhaps a way to do it while combining the columns?
I have a solution in R but want to recode it in Python.
I would greatly appreciate any help or advice. I'm currently using Spyder (Python 3.5).
Recommended answer
You can apply a custom function that first splits each string on whitespace, then gets the unique values with pandas.unique, and finally joins them back into a string:
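# Join the three columns with single spaces so the words stay separable.
animals["detail"] = (animals["animal1"].map(str) + ' ' +
                     animals["animal2"].map(str) + ' ' +
                     animals["label"].map(str))
# Split each string into words, drop repeats (first occurrence wins), and rejoin.
animals["detail"] = animals["detail"].apply(lambda x: ' '.join(pd.unique(x.split())))
print(animals)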
animal1 animal2 label detail
1 cat dog dolphin 19 cat dog dolphin 19
2 dog cat cat 72 dog cat 72
3 pilchard 26 koala 26 pilchard 26 koala
4 newt bat 81 bat 81 newt bat 81
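The split / unique / join steps above can be tried end to end with the following minimal sketch. The data frame is rebuilt by hand from the table in the question instead of being read from animal.csv, and the helper name `unique_words` is hypothetical. The words are wrapped in a Series before calling `unique()` because, as far as I know, recent pandas releases deprecate passing a plain list to `pd.unique`; that wrapping is an assumption about the installed version, not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt by hand from the question (index 1..4 is assumed).
animals = pd.DataFrame(
    {
        "animal1": ["cat dog", "dog cat", "pilchard 26", "newt bat 81"],
        "animal2": ["dolphin", "cat", "koala", "bat"],
        "label": [19, 72, 26, 81],
    },
    index=[1, 2, 3, 4],
)


def unique_words(text):
    """Hypothetical helper: drop repeated words, keeping first occurrences in order."""
    words = pd.Series(text.split(), dtype="object")
    return " ".join(words.unique())


cols = ["animal1", "animal2", "label"]
# Join the chosen columns row by row, then deduplicate the words of each row.
joined = animals[cols].astype(str).apply(lambda row: " ".join(row), axis=1)
animals["detail"] = joined.apply(unique_words)
print(animals)
```

Selecting the columns explicitly through `cols` keeps an already existing detail column out of the join if the code is run twice.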
It is also possible to join the values inside apply:
# Convert every column to str, join each row, then keep the unique words in order.
animals["detail"] = (animals.astype(str)
                     .apply(lambda x: ' '.join(pd.unique(' '.join(x).split())), axis=1))
print(animals)
animal1 animal2 label detail
1 cat dog dolphin 19 cat dog dolphin 19
2 dog cat cat 72 dog cat 72
3 pilchard 26 koala 26 pilchard 26 koala
4 newt bat 81 bat 81 newt bat 81
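Both outputs above keep the first occurrence of each word, so row 3 reads "pilchard 26 koala" rather than the "pilchard koala 26" shown in the question's desired output. If that exact output matters, one option is to keep the last occurrence of each word instead. The sketch below is only an illustration of that idea using the standard library; the helper name `keep_last_words` is hypothetical and this variant is not part of the original answer.

```python
from collections import OrderedDict


def keep_last_words(text):
    """Hypothetical helper: drop repeated words, keeping the LAST occurrence of each."""
    words = text.split()
    # Deduplicating the reversed list keeps what was last in the original order.
    kept = list(OrderedDict.fromkeys(reversed(words)))
    # Reverse again to restore left-to-right reading order.
    return " ".join(reversed(kept))


# The four joined rows from the question, written out by hand.
rows = [
    "cat dog dolphin 19",
    "dog cat cat 72",
    "pilchard 26 koala 26",
    "newt bat 81 bat 81",
]
for row in rows:
    print(keep_last_words(row))
```

The function takes and returns a plain string, so it could be passed to `.apply` on the joined detail column in place of the lambda.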
A solution with set is also possible, but it changes the word order:
# set() removes duplicates but does not preserve the original word order.
animals["detail"] = (animals.astype(str)
                     .apply(lambda x: ' '.join(set(' '.join(x).split())), axis=1))
print(animals)
animal1 animal2 label detail
1 cat dog dolphin 19 cat dolphin 19 dog
2 dog cat cat 72 cat dog 72
3 pilchard 26 koala 26 26 pilchard koala
4 newt bat 81 bat 81 bat 81 newt
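Because set does not preserve order, the exact word order in the output above can differ between runs. A set-like alternative that keeps first-seen order without relying on pandas.unique is `OrderedDict.fromkeys` from the standard library, which should also behave the same on the Python 3.5 mentioned in the question. This is a minimal sketch, and the helper name `dedupe_words` is hypothetical.

```python
from collections import OrderedDict


def dedupe_words(text):
    """Hypothetical helper: drop repeated words, keeping first occurrences in order."""
    # OrderedDict keys are unique and remember insertion order.
    return " ".join(OrderedDict.fromkeys(text.split()))


# The four joined rows from the question, written out by hand.
rows = [
    "cat dog dolphin 19",
    "dog cat cat 72",
    "pilchard 26 koala 26",
    "newt bat 81 bat 81",
]
details = [dedupe_words(row) for row in rows]
print(details)
```

As with the other variants, `dedupe_words` works on a single string, so it can replace the `set(...)` expression inside the apply call.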