Create bigrams from a column in a pandas DataFrame

Question

I have this test table in a pandas DataFrame:

   Leaf_category_id  session_id  product_id
0               111           1         987
3               111           4         987
4               111           1         741
1               222           2         654
2               333           3         321

This is an extension of my previous question, which was answered by @jazrael.

Suppose that, after that step, the product_id column holds values like the following (an assumption for illustration; it differs slightly from the output of my previous question):

   |product_id               |
   ---------------------------
   |111,987,741,34,12        |
   |987,1232                 |
   |654,12,324,465,342,324   |
   |321,741,987              |
   |324,654,862,467,243,754  |
   |6453,123,987,741,34,12   |

and so on. I want to create a new column in which the values in each row are paired into bigrams: each value with the next one, and the last value in the row with the first. For example:

   |product_id               |Bigram
   -------------------------------------------------------------------------
   |111,987,741,34,12        |(111,987),**(987,741)**,(741,34),(34,12),(12,111)
   |987,1232                 |(987,1232),(1232,987)
   |654,12,324,465,342,32    |(654,12),(12,324),(324,465),(465,342),(342,32),(32,654)
   |321,741,987              |(321,741),**(741,987)**,(987,321)
   |324,654,862              |(324,654),(654,862),(862,324)
   |123,987,741,34,12        |(123,987),(987,741),(34,12),(12,123)

Ignore the ** markers for now (I explain below why I starred those pairs).
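The pairing rule described above (each value with its successor, and the last value wrapping around to the first) can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `circular_bigrams` is hypothetical, and each row is assumed to be a comma-separated string of integer IDs.

```python
def circular_bigrams(ids):
    """Pair each item with its successor; the last item pairs with the first.

    `circular_bigrams` is a hypothetical helper name, not part of pandas.
    """
    ids = list(ids)
    if len(ids) < 2:
        return []  # a single value has nothing to pair with
    # ids[1:] + ids[:1] rotates the list left by one position
    return list(zip(ids, ids[1:] + ids[:1]))


row = "111,987,741,34,12"  # one cell of the product_id column, assumed to be a string
product_ids = [int(p) for p in row.split(",")]
print(circular_bigrams(product_ids))
# [(111, 987), (987, 741), (741, 34), (34, 12), (12, 111)]
```

If the column holds such strings, something like `df["product_id"].apply(lambda s: circular_bigrams(int(p) for p in s.split(",")))` would be one way to build the Bigram column.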

The code I use to build the bigrams is:

for i in df.Leaf_category_id.unique(): 
    print (df[df.Leaf_category_id == i].groupby('session_id')['product_id'].apply(lambda x: list(zip(x, x[1:]))).reset_index())

From this DataFrame, I want to take the Bigram column and add one more column named frequency, which gives the number of times each bigram occurs.

Note: (987,741) and (741,987) are to be treated as the same bigram, so one of the two duplicate entries should be removed and the frequency of (987,741) should be 2. Likewise, (34,12) occurs twice, so its frequency should be 2.

   |Bigram
   ---------------
   |(111,987),
   |**(987,741)**
   |(741,34)
   |(34,12)
   |(12,111)
   |**(741,987)**
   |(987,321)
   |(34,12)
   |(12,123)

The final result should be:

   |Bigram       | frequency |
   --------------------------
   |(111,987)    |  1 
   |(987,741)    |  2
   |(741,34)     |  1
   |(34,12)      |  2
   |(12,111)     |  1
   |(987,321)    |  1
   |(12,123)     |  1
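The counting rule in the note (order within a pair does not matter) can be illustrated with `collections.Counter`. This is a minimal sketch using the bigram list from the question; the variable names are made up for the example. Because each pair is sorted before counting, (987,741) is reported under the key (741, 987).

```python
from collections import Counter

# Bigrams as listed in the question, in their original order
bigrams = [
    (111, 987), (987, 741), (741, 34), (34, 12), (12, 111),
    (741, 987), (987, 321), (34, 12), (12, 123),
]

# Sorting each pair makes (987, 741) and (741, 987) the same dictionary key
frequency = Counter(tuple(sorted(pair)) for pair in bigrams)

for pair, count in frequency.items():
    print(pair, count)
```

Here `frequency[(741, 987)]` and `frequency[(12, 34)]` are both 2, and the other five pairs each have a count of 1.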

I am hoping to find an answer here. Please help; I have described the problem in as much detail as I can.

Recommended answer

Try this code for consecutive bigrams:

import pandas as pd

# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 replaces it
df = pd.read_csv("data.csv", index_col=0)

# consecutive: pair each product with the next one in its group;
# sorting each pair makes (a, b) and (b, a) count as the same bigram
grouped_consecutive_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in zip(x, x.iloc[1:])]).reset_index()

df1 = pd.DataFrame(grouped_consecutive_product_ids)
# spread each list of pairs into columns, then stack everything into one long Series
s = df1.product_id.apply(lambda x: pd.Series(x)).unstack()
df2 = pd.DataFrame(s.reset_index(level=0, drop=True)).dropna()
df2.rename(columns={0: 'Bigram'}, inplace=True)
# count how often each bigram occurs, then keep one row per bigram
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count')
bigram_frequency_consecutive = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index()
del bigram_frequency_consecutive["index"]

For combinations (all possible bigrams within a group):

from itertools import combinations
import pandas as pd

# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 replaces it
df = pd.read_csv("data.csv", index_col=0)

# combinations: every unordered pair of products within a group
grouped_combination_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in combinations(x, 2)]).reset_index()

df1 = pd.DataFrame(grouped_combination_product_ids)
# spread each list of pairs into columns, then stack everything into one long Series
s = df1.product_id.apply(lambda x: pd.Series(x)).unstack()
df2 = pd.DataFrame(s.reset_index(level=0, drop=True)).dropna()
df2.rename(columns={0: 'Bigram'}, inplace=True)
# count how often each bigram occurs, then keep one row per bigram
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count')
bigram_frequency_combinations = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index()
del bigram_frequency_combinations["index"]

where data.csv contains (the first, unnamed column is the row index):

Leaf_category_id,session_id,product_id
0,111,1,111
3,111,4,987
4,111,1,741
1,222,2,654
2,333,3,321
5,111,1,87
6,111,1,34
7,111,1,12
8,111,1,987
9,111,4,1232
10,222,2,12
11,222,2,324
12,222,2,465
13,222,2,342
14,222,2,32
15,333,3,321
16,333,3,741
17,333,3,987
18,333,3,324
19,333,3,654
20,333,3,862
21,222,1,123
22,222,1,987
23,222,1,741
24,222,1,34
25,222,1,12

The result bigram_frequency_consecutive will be:

         Bigram  freq
0      (12, 34)     2
1     (12, 324)     1
2     (12, 654)     1
3     (12, 987)     1
4     (32, 342)     1
5      (34, 87)     1
6     (34, 741)     1
7     (87, 741)     1
8    (111, 741)     1
9    (123, 987)     1
10   (321, 321)     1
11   (321, 741)     1
12   (324, 465)     1
13   (324, 654)     1
14   (324, 987)     1
15   (342, 465)     1
16   (654, 862)     1
17   (741, 987)     2
18  (987, 1232)     1

The result bigram_frequency_combinations will be:

           Bigram  freq
0      (12, 32)     1
1      (12, 34)     2
2      (12, 87)     1
3     (12, 111)     1
4     (12, 123)     1
5     (12, 324)     1
6     (12, 342)     1
7     (12, 465)     1
8     (12, 654)     1
9     (12, 741)     2
10    (12, 987)     2
11    (32, 324)     1
12    (32, 342)     1
13    (32, 465)     1
14    (32, 654)     1
15     (34, 87)     1
16    (34, 111)     1
17    (34, 123)     1
18    (34, 741)     2
19    (34, 987)     2
20    (87, 111)     1
21    (87, 741)     1
22    (87, 987)     1
23   (111, 741)     1
24   (111, 987)     1
25   (123, 741)     1
26   (123, 987)     1
27   (321, 321)     1
28   (321, 324)     2
29   (321, 654)     2
30   (321, 741)     2
31   (321, 862)     2
32   (321, 987)     2
33   (324, 342)     1
34   (324, 465)     1
35   (324, 654)     2
36   (324, 741)     1
37   (324, 862)     1
38   (324, 987)     1
39   (342, 465)     1
40   (342, 654)     1
41   (465, 654)     1
42   (654, 741)     1
43   (654, 862)     1
44   (654, 987)     1
45   (741, 862)     1
46   (741, 987)     3
47   (862, 987)     1
48  (987, 1232)     1

In both cases above, the data is grouped by both Leaf_category_id and session_id.
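As a possible alternative to the `apply(pd.Series).unstack()` step, the same grouped counting might be written with `Series.explode` and `value_counts`. This is a sketch, not the answerer's method. It assumes pandas 0.25 or later (where `explode` is available), uses a small inline subset of the data rather than data.csv, and `consecutive_pairs` is a hypothetical helper name.

```python
import pandas as pd

# Small inline subset of the sample data (assumed here instead of reading data.csv)
df = pd.DataFrame({
    "Leaf_category_id": [111, 111, 111, 111, 111, 111, 111, 111,
                         222, 222, 222, 222, 222, 333],
    "session_id":       [1, 1, 1, 1, 1, 1, 4, 4,
                         1, 1, 1, 1, 1, 3],
    "product_id":       [111, 741, 87, 34, 12, 987, 987, 1232,
                         123, 987, 741, 34, 12, 321],
})


def consecutive_pairs(ids):
    """Hypothetical helper: order-insensitive pairs of neighbouring items."""
    ids = list(ids)
    return [tuple(sorted(pair)) for pair in zip(ids, ids[1:])]


pairs = (
    df.groupby(["Leaf_category_id", "session_id"])["product_id"]
      .apply(consecutive_pairs)  # one list of pairs per (category, session) group
      .explode()                 # one pair per row; an empty list becomes NaN
      .dropna()                  # drop groups that had only a single product
)

bigram_frequency = (
    pairs.value_counts()
         .rename_axis("Bigram")
         .reset_index(name="freq")
)
print(bigram_frequency)
```

On this subset, only (12, 34) occurs twice, because it appears in both session 1 of category 111 and session 1 of category 222.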
