Generate N-Grams from strings with pandas


Problem description

I have a DataFrame df like this:

Pattern    String                                       
101        hi, how are you?
104        what are you doing?
108        Python is good to learn.

I want to create n-grams for the String column. I've created unigrams using split() and stack():

new = df.String.str.split(expand=True).stack()

However, I want to create higher-order n-grams (bigrams, trigrams, 4-grams, etc.).

Recommended answer

Do a little preprocessing on your text column, then a little shifting and concatenation:

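import pandas as pd

# generate unigrams: lowercase, strip everything except letters and
# whitespace, then split into one token per row
# (regex=True is required on pandas >= 2.0, where str.replace is literal by default)
unigrams = (
    df['String'].str.lower()
                .str.replace(r'[^a-z\s]', '', regex=True)
                .str.split(expand=True)
                .stack())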

# generate bigrams by concatenating unigram columns
bigrams = unigrams + ' ' + unigrams.shift(-1)
# generate trigrams by concatenating unigram and bigram columns
trigrams = bigrams + ' ' + unigrams.shift(-2)

# concatenate all series vertically, and remove NaNs
pd.concat([unigrams, bigrams, trigrams]).dropna().reset_index(drop=True)

0                   hi
1                  how
2                  are
3                  you
4                 what
5                  are
6                  you
7                doing
8               python
9                   is
10                good
11                  to
12               learn
13              hi how
14             how are
15             are you
16            you what
17            what are
18             are you
19           you doing
20        doing python
21           python is
22             is good
23             good to
24            to learn
25          hi how are
26         how are you
27        are you what
28        you what are
29        what are you
30       are you doing
31    you doing python
32     doing python is
33      python is good
34          is good to
35       good to learn
dtype: object
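
Note that the output above includes n-grams such as "you what" and "doing python", which span two different rows, because shift() runs over the whole stacked Series. If n-grams should stay within each row, and for an arbitrary n, one possible sketch is shown below. It swaps the shift-based concatenation for a per-row zip over token lists. The helper name ngrams_per_row and the sample DataFrame construction are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd


def ngrams_per_row(strings: pd.Series, n: int) -> pd.Series:
    """Hypothetical helper: n-grams that never cross a row boundary."""
    # Same preprocessing as the answer: lowercase, keep only letters and
    # whitespace, then split each string into a list of tokens.
    tokens = (
        strings.str.lower()
               .str.replace(r'[^a-z\s]', '', regex=True)
               .str.split()
               .dropna()
    )
    # Zipping n staggered slices of a token list yields its n-grams.
    grams = tokens.apply(
        lambda toks: [' '.join(g) for g in zip(*(toks[i:] for i in range(n)))]
    )
    # explode() gives one n-gram per row; rows with fewer than n tokens
    # produce an empty list, which explodes to NaN and is dropped.
    return grams.explode().dropna().reset_index(drop=True)


# Sample data rebuilt from the question (assumed construction).
df = pd.DataFrame({
    'Pattern': [101, 104, 108],
    'String': ['hi, how are you?',
               'what are you doing?',
               'Python is good to learn.'],
})

bigrams = ngrams_per_row(df['String'], 2)
fourgrams = ngrams_per_row(df['String'], 4)
```

With the three sample rows, bigrams holds 10 entries and no longer contains "you what" or "doing python". To get every order in one Series, the per-n results can be combined with pd.concat, as in the answer.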
