How to pass part-of-speech in WordNetLemmatizer?


Question


I am preprocessing text data. However, I am facing an issue with lemmatization. Below is the sample text:


'An 18-year-old boy was referred to prosecutors Thursday for allegedly stealing about ¥15 million ($134,300) worth of cryptocurrency last year by hacking a digital currency storage website, police said.', 'The case is the first in Japan in which criminal charges have been pursued against a hacker over cryptocurrency losses, the police said.', '\n', 'The boy, from the city of Utsunomiya, Tochigi Prefecture, whose name is being withheld because he is a minor, allegedly stole the money after hacking Monappy, a website where users can keep the virtual currency monacoin, between Aug. 14 and Sept. 1 last year.', 'He used software called Tor that makes it difficult to identify who is accessing the system, but the police identified him by analyzing communication records left on the website’s server.', 'The police said the boy has admitted to the allegations, quoting him as saying, "I felt like I’d found a trick no one knows and did it as if I were playing a video game."', 'He took advantage of a weakness in a feature of the website that enables a user to transfer the currency to another user, knowing that the system would malfunction if transfers were repeated over a short period of time.', 'He repeatedly submitted currency transfer requests to himself, overwhelming the system and allowing him to register more money in his account.', 'About 7,700 users were affected and the operator will compensate them.', 'The boy later put the stolen monacoins in an account set up by a different cryptocurrency operator, received payouts in a different cryptocurrency and bought items such as a smartphone, the police said.', 'According to the operator of Monappy, the stolen monacoins were kept using a system with an always-on internet connection, and those kept offline were not stolen.'

My code is:

import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

df = pd.read_csv('All Articles.csv')
df['Articles'] = df['Articles'].str.lower()

stemming = PorterStemmer()
stops = set(stopwords.words('english'))
lemma = WordNetLemmatizer()

def identify_tokens(row):
    Articles = row['Articles']
    tokens = nltk.word_tokenize(Articles)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words


df['words'] = df.apply(identify_tokens, axis=1)


def stem_list(row):
    my_list = row['words']
    stemmed_list = [stemming.stem(word) for word in my_list]
    return (stemmed_list)


df['stemmed_words'] = df.apply(stem_list, axis=1)


def lemma_list(row):
    my_list = row['stemmed_words']
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return (lemma_list)


df['lemma_words'] = df.apply(lemma_list, axis=1)


def remove_stops(row):
    my_list = row['lemma_words']
    meaningful_words = [w for w in my_list if not w in stops]
    return (meaningful_words)


df['stem_meaningful'] = df.apply(remove_stops, axis=1)


def rejoin_words(row):
    my_list = row['stem_meaningful']
    joined_words = (" ".join(my_list))
    return joined_words


df['processed'] = df.apply(rejoin_words, axis=1)


As is clear from the code, I am using pandas; however, here I have given the sample text.

Where my problem lies is:

def lemma_list(row):
    my_list = row['stemmed_words']
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return (lemma_list)

df['lemma_words'] = df.apply(lemma_list, axis=1)


Though the code runs without any error, the lemma function is not working as expected.

Thanks in advance.

Answer


In your code above you are trying to lemmatize words that have already been stemmed. When the lemmatizer runs into a word that it doesn't recognize, it simply returns that word. For instance, stemming offline produces offlin, and when you run that through the lemmatizer it just gives back the same word, offlin.
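You can see the mismatch in isolation with a minimal sketch (assuming only that NLTK is installed; PorterStemmer is purely algorithmic and needs no corpus downloads):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The Porter algorithm strips the final 'e', leaving a string that is
# not an English word. WordNet has no entry for 'offlin', so
# lemmatize('offlin', pos='v') would come back unchanged.
print(stemmer.stem('offline'))  # offlin
print(stemmer.stem('playing'))  # play
```

This is why stemming and lemmatizing are usually alternatives rather than stages of the same pipeline: the stemmer's output is not guaranteed to be in the lemmatizer's dictionary.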


Your code should be modified to lemmatize words, like this...

def lemma_list(row):
    my_list = row['words']  # Note: line that is changed
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return (lemma_list)
df['lemma_words'] = df.apply(lemma_list, axis=1)
print('Words: ',  df.loc[0, 'words'])
print('Stems: ',  df.loc[0, 'stemmed_words'])
print('Lemmas: ', df.loc[0, 'lemma_words'])

This produces...

Words:  ['and', 'those', 'kept', 'offline', 'were', 'not', 'stolen']
Stems:  ['and', 'those', 'kept', 'offlin',  'were', 'not', 'stolen']
Lemmas: ['and', 'those', 'keep', 'offline', 'be',   'not', 'steal']

which is correct.
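Note that the fix above hard-codes pos='v', which is why were becomes be but nouns and adjectives are still treated as verbs. To actually pass the part of speech per word, as the title asks, the usual pattern is to run nltk.pos_tag over the tokens and map its Penn Treebank tags to the single-letter codes that lemmatize accepts. A sketch (the helper name get_wordnet_pos is mine; the letter codes correspond to wordnet.ADJ, wordnet.VERB, wordnet.ADV and wordnet.NOUN):

```python
def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (from nltk.pos_tag) to the single-letter
    POS code that WordNetLemmatizer.lemmatize accepts: 'a', 'v', 'r' or 'n'."""
    if treebank_tag.startswith('J'):
        return 'a'  # adjective (wordnet.ADJ)
    if treebank_tag.startswith('V'):
        return 'v'  # verb (wordnet.VERB)
    if treebank_tag.startswith('R'):
        return 'r'  # adverb (wordnet.ADV)
    return 'n'      # noun -- the lemmatizer's default (wordnet.NOUN)
```

With the question's DataFrame, lemma_list would then tag each row's tokens and look up the code per word, roughly: `[lemma.lemmatize(w, pos=get_wordnet_pos(t)) for w, t in nltk.pos_tag(row['words'])]` (this requires the averaged_perceptron_tagger data to be downloaded).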
