Lemmatization Pandas (Python)


Problem Description


I am a beginner at Pandas and I am trying to figure out how to lemmatize a single column of my dataframe. Take the following example (this is some text after (un)common word removal which I'd like to lemmatize):

0 good needs changes virgils natural micro brewe...
1 new favorite given delightful surprise find fl...
2 red sauce favorite enjoy strong tannin ok pull...
3 quality fantastic 1800s 21st century try drink...
4 red first time trying love 100excellent blend ...

This is the code I use to do lemmatization (taken from here):

from textblob import Word  # assumption: Word here is TextBlob's Word class, as in the linked example

df['words'] = df['words'].apply(lambda x: "".join([Word(word).lemmatize() for word in x]))
df['words'].head()

But once this code is run the output doesn't change:

0 good need change virgil natural micro brewed r...
1 new favorite given delightful surprise find fl...
2 red sauce favorite enjoy strong tannin ok pull...
3 quality fantastic 1800s 21st century try drink...
4 red first time trying love 100excellent blend ...

Any help would be greatly appreciated :)

P.S.: words is a list of tokenized words.
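As an aside: because words holds lists of tokens, "".join(...) concatenates the lemmas with no spaces between them, and TextBlob's Word.lemmatize() treats each token as a noun unless a POS tag is passed. A minimal space-preserving sketch, assuming Word is TextBlob's Word class:

from textblob import Word  # assumption: Word is TextBlob's Word class

# Join the lemmas with spaces; without a POS argument, lemmatize() uses the noun POS,
# so verb forms such as "trying" are returned unchanged.
df['words'] = df['words'].apply(lambda tokens: " ".join(Word(w).lemmatize() for w in tokens))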

Recommended Answer

You probably don't need a solution any more, but if you want to lemmatize across many POS tags, you can do that with NLTK's WordNetLemmatizer.
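For example, a minimal sketch that tries several WordNet POS tags for each word (the helper name lemmatize_any_pos is only illustrative):

from nltk.stem import WordNetLemmatizer  # needs the WordNet data: nltk.download('wordnet')

wnl = WordNetLemmatizer()

def lemmatize_any_pos(word):
    # Try the verb, noun, adjective and adverb lemmas and keep the first form that changes.
    for pos in ("v", "n", "a", "r"):
        lemma = wnl.lemmatize(word, pos)
        if lemma != word:
            return lemma
    return word

lemmatize_any_pos("loving")  # -> 'love'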

If you want more, you can try the following code:

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

# One-time NLTK data downloads used below:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()


def nltk_tag_to_wordnet_tag(nltk_tag):
    # Map a Penn Treebank tag from nltk.pos_tag to the corresponding WordNet POS constant.
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(word_tokenize(sentence))
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)



# Lemmatizing
df['Lemmatize'] = df['word'].apply(lemmatize_sentence)
print(df.head())

Result of df:

         word                       |        Lemmatize
0  Best scores, good cats, it rocks | Best score , good cat , it rock
1          You received best scores |          You receive best score
2                         Good news |                       Good news
3                          Bad news |                        Bad news
4                    I am loving it |                    I be love it
5                    it rocks a lot |                   it rock a lot
6     it is still good to do better |     it be still good to do good
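If, as in the question, the column holds lists of tokens rather than whole strings, a small sketch (assuming the column is named words, as in the question) is to join the tokens back into a sentence before applying lemmatize_sentence:

# Assumes df['words'] contains lists of tokens, as stated in the question's P.S.
df['Lemmatize'] = df['words'].apply(lambda tokens: lemmatize_sentence(" ".join(tokens)))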
