How to merge strings that have substrings in common to produce some groups in a data frame in Python


Problem description

I have some sample data:

import pandas as pd
a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})

What I want to do is merge strings that have substrings in common. In this example, the strings 'b,c', 'a' and 'a,c,d,e' should be merged together because they can be linked to each other: 'b,c' shares c with 'a,c,d,e', which in turn shares a with 'a'. 'j,k,l' and 'k,l,m' should be in one group. In the end, I would like to have something like:

               group
'b,c',         0
'a',           0
'a,c,d,e',     0
'f,g,h,i',     1
'j,k,l',       2
'k,l,m'        2

So I end up with three groups, and no two groups have a substring in common.

Right now I am trying to build a similarity data frame, in which 1 means that two strings have substrings in common. Here is my code:

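import numpy as np

commonWords=1  # minimum number of shared items for two strings to count as linked

# add one zero-filled column per ACTIVITY string
for i in np.arange(a.shape[0]):
    a.loc[:,a.loc[i,'ACTIVITY']]=0

# set cell (i, j) to 1 when strings i and j share at least commonWords items
for i in a.loc[:,'ACTIVITY']:
    il=i.split(',')
    for j in a.loc[:,'ACTIVITY']:
        jl=j.split(',')
        c=[x in il for x in jl]
        c1=[x for x in c if x==True]
        a.loc[(a.loc[:,'ACTIVITY']==i),j]=1 if len(c1)>=commonWords else 0

a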

The result is:

    ACTIVITY    b,c     a   a,c,d,e     f,g,h,i     j,k,l   k,l,m
0   b,c          1      0       1           0       0       0
1   a            0      1       1           0       0       0
2   a,c,d,e      1      1       1           0       0       0
3   f,g,h,i      0      0       0           1       0       0
4   j,k,l        0      0       0           0       1       1
5   k,l,m        0      0       0           0       1       1
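The same 0/1 matrix can be sketched more compactly with set intersections. This is only an illustration in plain Python, not the asker's code: the helper name `shares_item` and its `min_common` parameter are made up here, and because it works on sets it ignores repeated items inside one string.

```python
def shares_item(s, t, min_common=1):
    """True if comma-separated strings s and t share at least min_common items."""
    return len(set(s.split(',')) & set(t.split(','))) >= min_common


activities = ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']

# sim[i][j] is 1 when strings i and j have an item in common, else 0
sim = [[int(shares_item(s, t)) for t in activities] for s in activities]

for name, row in zip(activities, sim):
    print(f'{name:8}', row)
```

For the sample data this gives the same matrix as the table above.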

From this you can see that wherever there is a 1, the corresponding row and column should be merged into the same group.
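One way to turn that observation into group labels is to treat the similarity matrix as an adjacency matrix and walk it breadth-first, so that rows linked only indirectly (such as 'b,c' and 'a') still end up together. The sketch below assumes the matrix is symmetric and held as a plain list of lists; `label_components` is a hypothetical helper name, not something from the question.

```python
from collections import deque


def label_components(sim):
    """Label the rows of a symmetric 0/1 matrix by connected component."""
    n = len(sim)
    labels = [None] * n
    next_id = 0
    for start in range(n):
        if labels[start] is not None:
            continue  # already reached from an earlier row
        labels[start] = next_id
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if sim[i][j] and labels[j] is None:
                    labels[j] = next_id
                    queue.append(j)
        next_id += 1
    return labels


# the similarity matrix from the question, row order as in the data frame
sim = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
groups = label_components(sim)
print(groups)  # [0, 0, 0, 1, 2, 2]
```

Building the full matrix costs quadratic time and memory in the number of rows, so this is mainly useful for small inputs.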

Recommended answer

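Use networkx and its connected_components function: treat each comma-separated item as a node, connect items that appear together, and give every row the id of the connected component that its first item belongs to.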


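import networkx as nx
from itertools import combinations

# NOTE: this excerpt never defines `splitted` or `L2`; the two definitions
# below are assumptions: per-row item lists, and all item pairs within a row
splitted = a['ACTIVITY'].str.split(',')
L2 = [pair for items in splitted for pair in combinations(items, 2)]

# create the graph from the edge list
G = nx.Graph()
G.add_edges_from(L2)
connected_comp = nx.connected_components(G)

# map each node (item) to the id of its connected component
node2id = {x: cid for cid, c in enumerate(connected_comp) for x in c}

# assign each row the component id of its first item
a['group'] = [node2id.get(x[0]) for x in splitted]
print(a)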
  ACTIVITY  group
0      b,c      0
1        a      0
2  a,c,d,e      0
3  f,g,h,i      1
4    j,k,l      2
5    k,l,m      2
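If adding networkx as a dependency is not desirable, the same grouping can be sketched with a small union-find (disjoint-set) structure. This is an alternative technique, not the answer's method, and `group_rows` is a hypothetical name. It also registers items from single-item rows, whereas a graph built only from edges has no node for an item that never shares a row, so `node2id.get` would return None for such a row.

```python
def group_rows(rows, sep=','):
    """Give rows the same group id when they are linked by shared items."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    split_rows = [r.split(sep) for r in rows]
    for items in split_rows:
        first = items[0]
        find(first)  # register the item even if the row has only one
        for other in items[1:]:
            union(first, other)

    # number the groups in order of first appearance
    ids = {}
    return [ids.setdefault(find(items[0]), len(ids)) for items in split_rows]


activities = ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']
print(group_rows(activities))  # [0, 0, 0, 1, 2, 2]
```

With the data frame from the question, something like `a['group'] = group_rows(a['ACTIVITY'])` should attach the labels, assuming every value in the column is a non-empty string.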
