Lemmatization of all pandas cells


Problem description

I have a pandas dataframe. There is one column, let's name it 'col'. Each entry of this column is a list of words: ['word1', 'word2', etc.]

How can I efficiently compute the lemma of all of those words using the nltk library?

import nltk
nltk.stem.WordNetLemmatizer().lemmatize('word')

I want to be able to find a lemma for all words of all cells in one column of a pandas dataset.

My data looks like this:

import pandas as pd
data = [[['walked', 'am', 'stressed', 'Fruit']],
        [['going', 'gone', 'walking', 'riding', 'running']]]
df = pd.DataFrame(data, columns=['col'])

Answer

You can use apply from pandas with a function to lemmatize each word in the given string. Note that there are many ways to tokenize your text. You might have to remove symbols like . if you use a whitespace tokenizer.
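As one illustration of that note, a regular-expression tokenizer that keeps only word characters drops trailing punctuation that a whitespace tokenizer would leave attached (a minimal sketch using nltk's RegexpTokenizer):

```python
import nltk

# Keep only runs of word characters, so punctuation such as '.' is dropped.
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize('this was cheesy.')
# tokens == ['this', 'was', 'cheesy']  (no trailing '.')
```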

Below is an example of how to lemmatize a column of an example dataframe.

import nltk
import pandas as pd

# The WordNet data must be available; download it once with:
# nltk.download('wordnet')

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df = pd.DataFrame(['this was cheesy', 'she likes these books', 'wow this is great'], columns=['text'])
df['text_lemmatized'] = df.text.apply(lemmatize_text)
