How to apply pos_tag_sents() to pandas dataframe efficiently


Problem description

In situations where you wish to POS tag a column of text stored in a pandas dataframe, with one sentence per row, the majority of implementations on SO use the apply method:

dfData['POSTags'] = dfData['SourceText'].apply(
    lambda row: pos_tag(word_tokenize(row)))

The NLTK documentation recommends using pos_tag_sents() for efficient tagging of more than one sentence.

Does that apply to this example, and if so, would the code be as simple as changing pos_tag to pos_tag_sents, or does NLTK mean text sources of paragraphs?

As mentioned in the comments, pos_tag_sents() aims to avoid reloading the perceptron tagger on every call, but the issue is how to do this and still produce a column in a pandas dataframe.

Link to sample dataset (20k rows)

Accepted answer

Input

$ cat test.csv 
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat

TL;DR

>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0    cozily married practical athletics Mr. Brown flat
1       active married expensive soccer Mr. Chang flat
2    healthy single expensive badminton Mrs. Green ...
3    cozily married practical soccer Mr. Brown hier...
4     cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]

>>> df['POS'] = tagged_texts
>>> df
   ID                 Task        label  \
0   1  Collect Information  no response   
1   2           New Credit  no response   
2   3  Collect Information     response   
3   4  Collect Information     response   
4   5  Collect Information     response   

                                                Text  \
0  cozily married practical athletics Mr. Brown flat   
1     active married expensive soccer Mr. Chang flat   
2  healthy single expensive badminton Mrs. Green ...   
3  cozily married practical soccer Mr. Brown hier...   
4   cozily single practical badminton Mr. Brown flat   

                                                 POS  
0  [(cozily, RB), (married, JJ), (practical, JJ),...  
1  [(active, JJ), (married, VBD), (expensive, JJ)...  
2  [(healthy, JJ), (single, JJ), (expensive, JJ),...  
3  [(cozily, RB), (married, JJ), (practical, JJ),...  
4  [(cozily, RB), (single, JJ), (practical, JJ), ... 


In detail:

First, you can extract the Text column into a list of strings:

texts = df['Text'].tolist()

Then you can apply the word_tokenize function to each text:

map(word_tokenize, texts)
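One detail worth noting as an aside (not part of the original answer): in Python 3, map() returns a lazy iterator, so materialize it with list() if you want to inspect or reuse the tokens. In this dependency-free sketch, str.split merely stands in for word_tokenize.

```python
# map() is lazy in Python 3: it yields tokenized sentences on demand.
texts = ['Mr. Brown is here.', 'Mrs. Green left.']
tokenized = list(map(str.split, texts))  # str.split stands in for word_tokenize
# tokenized is now a concrete list of token lists, reusable any number of times
```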

Note that @Boud's suggestion is almost the same, using df.apply:

df['Text'].apply(word_tokenize)

Then you dump the tokenized text into a list of lists of strings:

df['Text'].apply(word_tokenize).tolist()

Then you can use pos_tag_sents:

pos_tag_sents(df['Text'].apply(word_tokenize).tolist())

Then you add the column back to the DataFrame:

df['POS'] = pos_tag_sents(df['Text'].apply(word_tokenize).tolist())

