Python / Pandas / spaCy - iterate over a DataFrame and count the number of pos_ tags

Views: 70 Published: 2020/10/17 0:39:55 Tags: python pandas dataframe spacy


Problem description

I have a Pandas DataFrame containing some texts by one author, and I want to compute some statistics on the totals of the different word types.

The DataFrame (my data):

>>> data
             name                   style                                              text     year       year_dt
number  
0001    Demetrius                       D   Demetrius an der russischen Grenze Er ist vo...     1805    1805-01-01
0002    Der versöhnte Menschenfeind     D   Der versöhnte Menschenfeind -Fragment Gegend...     1790    1790-01-01
0003    Die Braut von Messina           D   Die Braut von Messina oder die feindlichen B...     1803    1803-01-01

Some months ago I wrote a function that iterates over the DataFrame row by row, takes the name and the content of "the book", runs spaCy's POS tagging on it, and, as a first step, counts the number of nouns, adjectives and verbs. Each count is then stored in a new column.

My function:

import spacy
from spacy.lang.de import German
from collections import defaultdict
nlp = spacy.load('de')

def calculate_the_word_types(data):
    nouns = defaultdict(lambda: 0)
    verbs = defaultdict(lambda: 0)
    adjectives = defaultdict(lambda: 0)

    # count all tokens, but not the punctuations
    for i, row in data.iterrows():
        doc = nlp(row["name"] + " " + row["text"])
    data.set_value(i, "nr_token", len(list(map(lambda x: x.text, 
                                     filter(lambda x: x.pos_ != 'PUNCT', doc)))))

    # count only the adjectives
    for a in map(lambda x: x.lemma_, filter(lambda x: x.pos_ == 'ADJ', doc)):
        adjectives[a] += 1
    data.set_value(i, "nr_adj", len(list(map(lambda x: x.text, 
                                     filter(lambda x: x.pos_ == 'ADJ', doc)))))  

    # count only the nouns
    for n in map(lambda x: x.lemma_, filter(lambda x: x.pos_ == 'NOUN', doc)):
        nouns[n] +=1
    data.set_value(i, "nr_noun", len(list(map(lambda x: x.text, 
                                     filter(lambda x: x.pos_ == 'NOUN', doc)))))

    # count only the verbs
    for v in map(lambda x: x.lemma_, filter(lambda x: (x.pos_ == 'AUX') | (x.pos_ == 'VERB'), doc)):
        verbs[v] += 1
    data.set_value(i, "nr_verb", len(list(map(lambda x: x.text, 
                                     filter(lambda x: (x.pos_ == 'AUX') | (x.pos_ == 'VERB'), doc)))))  

    return data

Output:

>>> data
           name style      text     year       year_dt  nr_token  nr_adj   nr_noun   nr_verb
number  
0001    Deme...     D   Deme...     1805    1805-01-01       NaN     NaN       NaN       NaN
0002    Der ...     D   Der ...     1790    1790-01-01       NaN     NaN       NaN       NaN
0003    Die ...     D   Die ...     1803    1803-01-01    7127.0   584.0    1328.0    1286.0

I think this worked back then, but it does not now. The function produces the output above. Through testing I know that the counting itself works, but the numbers always end up in the last row only, so I think it overwrites itself.

Where is the mistake?

Recommended answer

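Indent your setters so that they are inside the outer for loop. As written, everything after the line that builds doc runs only once, after the loop has finished, so i and doc still refer to the last row and only that row receives values.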
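The effect can be reproduced without pandas or spaCy. The following is a minimal sketch in plain Python; the `rows` dictionary and the whitespace split are hypothetical stand-ins for the DataFrame and the tagger, not part of the original code.

```python
# Hypothetical stand-in for the DataFrame rows: label -> text
rows = {"0001": "a b c", "0002": "d e", "0003": "f"}

# Buggy shape: the assignment sits after the loop, so it runs once,
# with `key` and `tokens` still bound to the values of the final iteration.
buggy = {}
for key, text in rows.items():
    tokens = text.split()
buggy[key] = len(tokens)

# Fixed shape: the assignment is indented into the loop body and runs per row.
fixed = {}
for key, text in rows.items():
    tokens = text.split()
    fixed[key] = len(tokens)

print(buggy)  # {'0003': 1}
print(fixed)  # {'0001': 3, '0002': 2, '0003': 1}
```

Only the last key is filled in the buggy variant, which matches the NaN values in every row but the last.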

# count all tokens, but not the punctuations
for i, row in data.iterrows():
    doc = nlp(row["name"] + " " + row["text"])
    data.set_value(i, "nr_token", len(list(map(lambda x: x.text, 
                                 filter(lambda x: x.pos_ != 'PUNCT', doc)))))

    # count only the adjectives
    for a in map(lambda x: x.lemma_, filter(lambda x: x.pos_ == 'ADJ', doc)):
        adjectives[a] += 1
    data.set_value(i, "nr_adj", len(list(map(lambda x: x.text, 
                                 filter(lambda x: x.pos_ == 'ADJ', doc)))))  

    # count only the nouns
    for n in map(lambda x: x.lemma_, filter(lambda x: x.pos_ == 'NOUN', doc)):
        nouns[n] +=1
    data.set_value(i, "nr_noun", len(list(map(lambda x: x.text, 
                                 filter(lambda x: x.pos_ == 'NOUN', doc)))))

    # count only the verbs
    for v in map(lambda x: x.lemma_, filter(lambda x: (x.pos_ == 'AUX') | (x.pos_ == 'VERB'), doc)):
        verbs[v] += 1
    data.set_value(i, "nr_verb", len(list(map(lambda x: x.text, 
                                 filter(lambda x: (x.pos_ == 'AUX') | (x.pos_ == 'VERB'), doc)))))
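On current library versions the snippet above may need two further adjustments. `DataFrame.set_value` was deprecated and then removed in pandas 1.0, with the `.at` accessor as the suggested replacement, and spaCy 3 no longer accepts the `'de'` shortcut, so a model would have to be loaded by its package name (for example `de_core_news_sm`, assuming it is installed). The following is a sketch under those assumptions, not the original author's code: `Token`, `fake_nlp` and `count_word_types` are hypothetical names, and `fake_nlp` is a toy lookup tagger used only so the example runs without a language model.

```python
from collections import Counter, namedtuple

import pandas as pd

# Hypothetical stand-in for a spaCy token; only .text and .pos_ are used.
Token = namedtuple("Token", ["text", "pos_"])


def fake_nlp(text):
    """Toy tagger for illustration only: tags whitespace tokens by lookup."""
    lookup = {"Braut": "NOUN", "ist": "AUX", "schön": "ADJ", ".": "PUNCT"}
    return [Token(word, lookup.get(word, "X")) for word in text.split()]


def count_word_types(data, nlp):
    """Add per-row POS counts; `nlp` is any callable yielding tokens with .pos_."""
    # Create the columns up front so every row starts with a defined value.
    for col in ("nr_token", "nr_adj", "nr_noun", "nr_verb"):
        data[col] = 0
    for i, row in data.iterrows():
        doc = nlp(row["name"] + " " + row["text"])
        pos_counts = Counter(token.pos_ for token in doc)
        # .at sets one cell by row label and column name (replaces set_value).
        data.at[i, "nr_token"] = sum(n for pos, n in pos_counts.items() if pos != "PUNCT")
        data.at[i, "nr_adj"] = pos_counts["ADJ"]
        data.at[i, "nr_noun"] = pos_counts["NOUN"]
        data.at[i, "nr_verb"] = pos_counts["VERB"] + pos_counts["AUX"]
    return data


# Made-up two-row frame shaped like the one in the question.
data = pd.DataFrame(
    {"name": ["Braut", "Fragment"], "text": ["Die Braut ist schön .", "Er ist ."]},
    index=["0001", "0002"],
)
result = count_word_types(data, fake_nlp)
print(result[["nr_token", "nr_adj", "nr_noun", "nr_verb"]])
```

With a real pipeline, the loaded spaCy object would be passed in place of `fake_nlp`. Counting all tags once with a `Counter` also avoids filtering the document separately for each word type.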
