Python / Pandas / spaCy - iterate over a DataFrame and count the number of pos_ tags
Question
I have a Pandas DataFrame with some texts by one author, and I want to compute some statistics on the totals of the different word types.
The DataFrame (my data):
>>> data
        name                         style  text                                             year  year_dt
number
0001    Demetrius                    D      Demetrius an der russischen Grenze Er ist vo...  1805  1805-01-01
0002    Der versöhnte Menschenfeind  D      Der versöhnte Menschenfeind -Fragment Gegend...  1790  1790-01-01
0003    Die Braut von Messina        D      Die Braut von Messina oder die feindlichen B...  1803  1803-01-01
Some months ago I wrote a function that iterates over the DataFrame row by row, takes the name and the content of each "book", runs spaCy's POS tagging on it and, as a start, counts the nouns, adjectives and verbs. The counts are then stored in new columns.
My function:
import spacy
from spacy.lang.de import German
from collections import defaultdict

nlp = spacy.load('de')

def calculate_the_word_types(data):
    nouns = defaultdict(lambda: 0)
    verbs = defaultdict(lambda: 0)
    adjectives = defaultdict(lambda: 0)

    # count all tokens, but not the punctuations
    for i, row in data.iterrows():
        doc = nlp(row["name"] + " " + row["text"])
    data.set_value(i, "nr_token", len(list(map(lambda x: x.text,
                                               filter(lambda x: x.pos_ != 'PUNCT', doc)))))

    # count only the adjectives
    for a in map(lambda x: x.lemma_, filter(lambda x: x.pos_ == 'ADJ', doc)):
        adjectives[a] += 1
    data.set_value(i, "nr_adj", len(list(map(lambda x: x.text,
                                             filter(lambda x: x.pos_ == 'ADJ', doc)))))

    # count only the nouns
    for n in map(lambda x: x.lemma_, filter(lambda x: x.pos_ == 'NOUN', doc)):
        nouns[n] += 1
    data.set_value(i, "nr_noun", len(list(map(lambda x: x.text,
                                              filter(lambda x: x.pos_ == 'NOUN', doc)))))

    # count only the verbs
    for v in map(lambda x: x.lemma_, filter(lambda x: (x.pos_ == 'AUX') | (x.pos_ == 'VERB'), doc)):
        verbs[v] += 1
    data.set_value(i, "nr_verb", len(list(map(lambda x: x.text,
                                              filter(lambda x: (x.pos_ == 'AUX') | (x.pos_ == 'VERB'), doc)))))

    return data
Output:
>>> data
        name     style  text     year  year_dt     nr_token  nr_adj  nr_noun  nr_verb
number
0001    Deme...  D      Deme...  1805  1805-01-01  NaN       NaN     NaN      NaN
0002    Der ...  D      Der ...  1790  1790-01-01  NaN       NaN     NaN      NaN
0003    Die ...  D      Die ...  1803  1803-01-01  7127.0    584.0   1328.0   1286.0
I think this worked back then, but it does not now. The function's output is shown above. From testing I know that the counting itself works, but the numbers always end up in the last row only, so I think it overwrites itself.
Where is the mistake?
Recommended answer
Indent your setter calls so that they are inside the outer for loop.
    # count all tokens, but not the punctuations
    for i, row in data.iterrows():
        doc = nlp(row["name"] + " " + row["text"])
        data.set_value(i, "nr_token", len(list(map(lambda x: x.text,
                                                   filter(lambda x: x.pos_ != 'PUNCT', doc)))))

        # count only the adjectives
        for a in map(lambda x: x.lemma_, filter(lambda x: x.pos_ == 'ADJ', doc)):
            adjectives[a] += 1
        data.set_value(i, "nr_adj", len(list(map(lambda x: x.text,
                                                 filter(lambda x: x.pos_ == 'ADJ', doc)))))

        # count only the nouns
        for n in map(lambda x: x.lemma_, filter(lambda x: x.pos_ == 'NOUN', doc)):
            nouns[n] += 1
        data.set_value(i, "nr_noun", len(list(map(lambda x: x.text,
                                                  filter(lambda x: x.pos_ == 'NOUN', doc)))))

        # count only the verbs
        for v in map(lambda x: x.lemma_, filter(lambda x: (x.pos_ == 'AUX') | (x.pos_ == 'VERB'), doc)):
            verbs[v] += 1
        data.set_value(i, "nr_verb", len(list(map(lambda x: x.text,
                                                  filter(lambda x: (x.pos_ == 'AUX') | (x.pos_ == 'VERB'), doc)))))
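If the setters sit outside the loop, `i` and `doc` still hold the values from the final iteration once the loop ends, which is why only the last row was filled. As a side note, `DataFrame.set_value` was deprecated in pandas 0.21 and removed in 1.0; its direct replacement is `data.at[i, "nr_adj"] = value`. The sketch below is one possible rewrite, not the code from the question: it swaps the cell-by-cell setters for a different approach, collecting one dict of counts per row and attaching all columns at once. The function name `count_pos_tags` and the `nlp` parameter are assumptions made for illustration; `nlp` can be any callable that returns tokens with a `pos_` attribute, such as a loaded spaCy pipeline.

```python
from collections import Counter

import pandas as pd


def count_pos_tags(data, nlp):
    """Return a copy of `data` with per-row POS counts in new columns.

    `count_pos_tags` and `nlp` are illustrative names (assumptions).
    `nlp` is any callable mapping a string to an iterable of tokens
    that expose a `pos_` attribute, e.g. a loaded spaCy pipeline.
    """
    rows = []
    for _, row in data.iterrows():
        doc = nlp(row["name"] + " " + row["text"])
        # A single pass over the tokens yields the frequency of every POS tag.
        tags = Counter(token.pos_ for token in doc)
        # Built inside the loop, so every row gets its own counts.
        rows.append({
            "nr_token": sum(n for tag, n in tags.items() if tag != "PUNCT"),
            "nr_adj": tags["ADJ"],
            "nr_noun": tags["NOUN"],
            "nr_verb": tags["VERB"] + tags["AUX"],
        })
    # Align the counts with the original index and attach them as columns.
    counts = pd.DataFrame(rows, index=data.index)
    return data.join(counts)
```

With spaCy installed, the call would be `data = count_pos_tags(data, nlp)`, where `nlp` is the pipeline loaded in the question. On spaCy 3 and later the `'de'` shortcut no longer works, so the model has to be loaded by its full package name, for example `de_core_news_sm`. Because the counts are built as integers and joined in one step, the new columns do not pass through the `NaN` and float stage seen in the output above.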