Run nltk sent_tokenize through Pandas dataframe


Problem Description

I have a dataframe that consists of two columns: ID and TEXT. Pretend data is below:

ID      TEXT
265     The farmer plants grain. The fisher catches tuna.
456     The sky is blue.
434     The sun is bright.
921     I own a phone. I own a book.

I know all nltk functions do not work on dataframes. How could sent_tokenize be applied to the above dataframe?

When I try:

df.TEXT.apply(nltk.sent_tokenize)  

The output is unchanged from the original dataframe. My desired output is:

TEXT
The farmer plants grain.
The fisher catches tuna.
The sky is blue.
The sun is bright.
I own a phone.
I own a book.

In addition, I would like to tie back this new (desired) dataframe to the original ID numbers like this (following further text cleansing):

ID    TEXT
265     'farmer', 'plants', 'grain'
265     'fisher', 'catches', 'tuna'
456     'sky', 'blue'
434     'sun', 'bright'
921     'I', 'own', 'phone'
921     'I', 'own', 'book'

This question is related to another of my questions here. Please let me know if I can provide anything to help clarify my question!

Answer

Edit: as a result of warranted prodding by @alexis, here is a better response.

Sentence Tokenization

This should get you a DataFrame with one row for each ID & sentence:

import pandas

sentences = []
for row in df.itertuples():
    # row[0] is the index, row[1] is ID, row[2] is TEXT
    for sentence in row[2].split('.'):
        if sentence != '':
            sentences.append((row[1], sentence))
new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])

Its output looks like this (note that split('.') drops the periods and leaves a leading space on sentences after the first):

ID      SENTENCE
265     The farmer plants grain
265      The fisher catches tuna
456     The sky is blue
434     The sun is bright
921     I own a phone
921      I own a book

split('.') will quickly break strings up into sentences if sentences are in fact separated by periods and periods are not being used for other things (e.g. denoting abbreviations), and will remove periods in the process. This will fail if there are multiple use cases for periods and/or not all sentence endings are denoted by periods. A slower but much more robust approach would be to use, as you had asked, sent_tokenize to split rows up by sentence:

from nltk.tokenize import sent_tokenize

sentences = []
for row in df.itertuples():
    # row[1] is ID, row[2] is TEXT
    for sentence in sent_tokenize(row[2]):
        sentences.append((row[1], sentence))
new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])

This produces the following output:

ID      SENTENCE
265     The farmer plants grain.
265     The fisher catches tuna.
456     The sky is blue.
434     The sun is bright.
921     I own a phone.
921     I own a book.
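As an aside, a quick comparison on a string containing an abbreviation illustrates the robustness difference described above. This is a minimal sketch, assuming nltk's punkt model has been downloaded; expected results are shown in the comments:

from nltk.tokenize import sent_tokenize

text = "Dr. Smith owns a phone. He owns a book."

# Naive splitting breaks on the abbreviation and leaves an empty string:
print(text.split('.'))
# ['Dr', ' Smith owns a phone', ' He owns a book', '']

# sent_tokenize recognizes 'Dr.' as an abbreviation:
print(sent_tokenize(text))
# expected: ['Dr. Smith owns a phone.', 'He owns a book.']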

If you want to quickly remove periods from the sent_tokenize output, you could do something like:

new_df['SENTENCE_noperiods'] = new_df.SENTENCE.apply(lambda x: x.strip('.'))

Which produces a SENTENCE_noperiods column with the trailing periods stripped.

You can also take the apply -> map approach (df is your original table):

df = df.join(df.TEXT.apply(sent_tokenize).rename('SENTENCES'))

Yielding a SENTENCES column in which each cell holds that row's list of sentences.

Continuing:

# Expand each row's list of sentences into its own set of columns
sentences = df.SENTENCES.apply(pandas.Series)
sentences.columns = ['sentence {}'.format(n + 1) for n in sentences.columns]

This produces columns sentence 1 and sentence 2, with NaN where a row has only one sentence.

As our indices have not changed, we can join this back into our original table:

df = df.join(sentences)
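As a side note, newer pandas versions (0.25+, which postdate this answer) provide Series.explode, which reaches the long one-row-per-sentence format more directly. A minimal sketch, assuming the same original df with ID and TEXT columns:

import pandas
from nltk.tokenize import sent_tokenize

# Tokenize each row into a list of sentences, then explode the lists
# so each sentence gets its own row; the ID index repeats automatically.
long_df = (df.set_index('ID')
             .TEXT.apply(sent_tokenize)
             .explode()
             .rename('SENTENCE')
             .reset_index())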

Word Tokenization

Continuing with df from above, we can extract the tokens in a given sentence as follows:

from nltk.tokenize import word_tokenize

df['sent_1_words'] = df['sentence 1'].apply(word_tokenize)
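Finally, to get close to the long-format token table shown in the question (one row per ID and sentence, with punctuation and filler words dropped), here is a minimal sketch; the tiny stopword set is illustrative only, so swap in nltk.corpus.stopwords for real text cleansing:

import pandas
from nltk.tokenize import sent_tokenize, word_tokenize

STOP = {'The', 'the', 'a', 'an'}  # illustrative stopword set, not a real list

rows = []
for row in df.itertuples():
    for sentence in sent_tokenize(row.TEXT):
        # Keep alphabetic tokens only (drops periods) and filter stopwords
        tokens = [t for t in word_tokenize(sentence)
                  if t.isalpha() and t not in STOP]
        rows.append((row.ID, tokens))
tokens_df = pandas.DataFrame(rows, columns=['ID', 'TOKENS'])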
