How to merge strings that have a certain number of substrings in common to produce groups in a data frame in Python

Question

I previously asked a simpler version of this question, which has been resolved: how to merge strings that have substrings in common to produce groups in a data frame in Python.

Here I have a more advanced version of that question:

I have some sample data:

import pandas as pd
a = pd.DataFrame({'ACTIVITY': ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']})

What I want to do is merge strings that have substrings in common. In this example, the strings 'b,c', 'a' and 'a,c,d,e' should be merged because they can be linked to each other: 'b,c' and 'a,c,d,e' share 'c', and 'a' and 'a,c,d,e' share 'a'. 'j,k,l' and 'k,l,m' should be in one group. In the end, I would like something like:

               group
'b,c',         0
'a',           0
'a,c,d,e',     0
'f,g,h,i',     1
'j,k,l',       2
'k,l,m'        2

This gives three groups, with no substrings shared between any two groups.

Right now I am building a similarity data frame in which 1 means that two strings have substrings in common. Here is my code:

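import numpy as np

commonWords = 1  # minimum number of shared substrings needed to link two strings

# add one zero-filled column per ACTIVITY string
for i in np.arange(a.shape[0]):
    a.loc[:, a.loc[i, 'ACTIVITY']] = 0

# set the cell to 1 when row string i and column string j share enough substrings
for i in a.loc[:, 'ACTIVITY']:
    il = i.split(',')
    for j in a.loc[:, 'ACTIVITY']:
        jl = j.split(',')
        shared = [x for x in jl if x in il]
        a.loc[a.loc[:, 'ACTIVITY'] == i, j] = 1 if len(shared) >= commonWords else 0

a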

The result is:

    ACTIVITY    b,c     a   a,c,d,e     f,g,h,i     j,k,l   k,l,m
0   b,c          1      0       1           0       0       0
1   a            0      1       1           0       0       0
2   a,c,d,e      1      1       1           0       0       0
3   f,g,h,i      0      0       0           1       0       0
4   j,k,l        0      0       0           0       1       1
5   k,l,m        0      0       0           0       1       1
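
The same matrix can also be derived from set intersections instead of repeated `.loc` assignments. The following is a minimal sketch, assuming the items are separated by commas with no surrounding whitespace; `shared_counts` and `similarity_matrix` are hypothetical helper names chosen for illustration, not pandas functions.

```python
def shared_counts(strings):
    """Return a square matrix whose [i][j] entry is the number of
    comma-separated items that strings[i] and strings[j] share."""
    item_sets = [set(s.split(',')) for s in strings]
    return [[len(x & y) for y in item_sets] for x in item_sets]


def similarity_matrix(strings, min_common=1):
    """Convert the counts to 0/1 flags: 1 when at least min_common items are shared."""
    return [[int(n >= min_common) for n in row] for row in shared_counts(strings)]


activities = ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']
for label, row in zip(activities, similarity_matrix(activities, min_common=1)):
    print(label, row)
```

If a labelled table is wanted, the nested list can be passed to `pd.DataFrame(matrix, index=activities, columns=activities)`. Note that with `min_common=2` the diagonal entry for 'a' becomes 0, because a single-item string shares only one item with itself; the loop-based code above behaves the same way.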

In this code, commonWords is the minimum number of substrings that two strings must have in common. For example, if commonWords=2, two strings are merged only if they share two or more substrings. When commonWords=2, the groups should be:

               group
'b,c',         0
'a',           1
'a,c,d,e',     2
'f,g,h,i',     3
'j,k,l',       4
'k,l,m'        4
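
The grouping described here amounts to finding connected components: two strings are linked when they share at least commonWords items, and links chain transitively. Below is a minimal sketch of one way to do that with a small union-find structure. It is not the approach used in the answer further down; `group_by_shared_items` and `min_common` are hypothetical names, and the pairwise comparison is quadratic in the number of strings, which is only reasonable for modest inputs.

```python
def group_by_shared_items(strings, min_common=1):
    """Label each string with a group number so that strings sharing at least
    min_common comma-separated items, directly or through a chain, get the same label."""
    item_sets = [set(s.split(',')) for s in strings]
    parent = list(range(len(strings)))  # union-find forest, one node per string

    def find(i):
        # follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # link every pair of strings that shares enough items
    for i in range(len(item_sets)):
        for j in range(i + 1, len(item_sets)):
            if len(item_sets[i] & item_sets[j]) >= min_common:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # number the groups in order of first appearance
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(strings))]


activities = ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']
print(group_by_shared_items(activities, min_common=1))  # [0, 0, 0, 1, 2, 2]
print(group_by_shared_items(activities, min_common=2))  # [0, 1, 2, 3, 4, 4]
```

With the sample data frame, the labels could be attached with `a['group'] = group_by_shared_items(a['ACTIVITY'], min_common=2)`.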

Recommended answer

Use:

import numpy as np
import pandas as pd
from itertools import combinations, chain
from collections import Counter

a = pd.DataFrame({'ACTIVITY': ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']})

# split values by "," into lists
splitted = a['ACTIVITY'].str.split(',')

commonWords = 2
# build every combination of commonWords items for each row
L2_nested = [list(combinations(l, commonWords)) for l in splitted]
L2 = list(chain.from_iterable(L2_nested))

# keep the combinations that occur in at least two rows, converted to sets
f1 = [set(k) for k, v in Counter(L2).items() if v >= 2]
f2 = [set(x) for x in splitted]

# create a new column for each shared combination
for val in f1:
    j = ','.join(sorted(val))
    a[j] = [j if len(val & x) == commonWords else np.nan for x in f2]
print(a)

# replace ACTIVITY by the row index and forward fill along each row, so the last
# column holds the matched combination (or the row index when nothing matched),
# then factorize that column into group numbers
new = pd.factorize(a.assign(ACTIVITY=a.index).ffill(axis=1).iloc[:, -1])[0]

a = a[['ACTIVITY']].assign(group=new)
print(a)
  ACTIVITY  group
0      b,c      0
1        a      1
2  a,c,d,e      2
3  f,g,h,i      3
4    j,k,l      4
5    k,l,m      4
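
The key step in this answer is counting item combinations of size commonWords across rows. The sketch below isolates that idea using only the standard library, assuming comma-separated items; `shared_combinations` is a hypothetical helper name. Items are sorted within each row so that the same combination always produces the same tuple regardless of the order in which the items were written.

```python
from collections import Counter
from itertools import combinations


def shared_combinations(strings, size):
    """Count every combination of `size` items per string and keep the
    combinations that occur in at least two strings."""
    counts = Counter()
    for s in strings:
        items = sorted(set(s.split(',')))  # sort so tuple order is stable
        counts.update(combinations(items, size))
    return {combo: n for combo, n in counts.items() if n >= 2}


activities = ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']
print(shared_combinations(activities, 2))  # {('k', 'l'): 2}
```

Only ('k', 'l') appears in more than one row, which is why 'j,k,l' and 'k,l,m' are the only strings that end up in the same group when commonWords=2. Labelling each row by a single matched combination does not by itself chain links that run through several different combinations, so for data where that matters a connected-components pass over the linked pairs may be needed.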
