Python stemming (with pandas dataframe)

Problem description

I created a dataframe with sentences to be stemmed. I would like to use a SnowballStemmer to obtain higher accuracy with my classification algorithm. How can I achieve this?

import pandas as pd
from nltk.stem.snowball import SnowballStemmer

# Use English stemmer.
stemmer = SnowballStemmer("english")

# Sentences to be stemmed.
data = ["programers program with programing languages", "my code is working so there must be a bug in the interpreter"]

# Create the pandas DataFrame.
df = pd.DataFrame(data, columns = ['unstemmed']) 

# Split the sentences to lists of words.
df['unstemmed'] = df['unstemmed'].str.split()

# Make sure we see the full column.
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated since pandas 1.0.

# Print dataframe.
df 

+----+--------------------------------------------------------------+
|    | unstemmed                                                    |
|----+--------------------------------------------------------------|
|  0 | ['programers', 'program', 'with', 'programing', 'languages'] |
|  1 | ['my', 'code', 'is', 'working', 'so', 'there', 'must',       |   
|    |  'be', 'a', 'bug', 'in', 'the', 'interpreter']               |
+----+--------------------------------------------------------------+

Recommended answer

You have to apply the stemmer to each word in every list and store the result in a new "stemmed" column.

df['stemmed'] = df['unstemmed'].apply(lambda x: [stemmer.stem(y) for y in x]) # Stem every word.
df = df.drop(columns=['unstemmed']) # Get rid of the unstemmed column.
df # Print dataframe.

+----+--------------------------------------------------------------+
|    | stemmed                                                      |
|----+--------------------------------------------------------------|
|  0 | ['program', 'program', 'with', 'program', 'languag']         |
|  1 | ['my', 'code', 'is', 'work', 'so', 'there', 'must',          |   
|    |  'be', 'a', 'bug', 'in', 'the', 'interpret']                 |
+----+--------------------------------------------------------------+
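
If the stemmed text is going to be fed to a classifier, it may be more convenient to keep it as a single string per row, since many text vectorizers expect strings rather than lists of tokens. The following is a minimal sketch under that assumption, not part of the original answer; the helper `stem_sentence`, the dataframe name `sentences`, and the column name `stemmed_text` are hypothetical names chosen for illustration.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

# Same English Snowball stemmer as in the question.
stemmer = SnowballStemmer("english")


def stem_sentence(sentence: str) -> str:
    """Stem each whitespace-separated word and join the stems back into one string."""
    return " ".join(stemmer.stem(word) for word in sentence.split())


# Hypothetical example frame; the original sentence is kept next to the stemmed one.
sentences = pd.DataFrame(
    {"unstemmed": ["programers program with programing languages"]}
)
sentences["stemmed_text"] = sentences["unstemmed"].apply(stem_sentence)
```

Keeping the `unstemmed` column here instead of dropping it makes it easier to compare the two versions when checking whether stemming actually helps the classifier.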
