NLTK-based text processing with pandas

Problem description

Removing punctuation and digits, and lowercasing, are not working while using nltk.

My code

import string
import nltk
from nltk.tokenize import word_tokenize

stopwords = nltk.corpus.stopwords.words('english') + list(string.punctuation)
user_defined_stop_words = ['st', 'rd', 'hong', 'kong']
new_stop_words = stopwords + user_defined_stop_words

def preprocess(text):
    return [word for word in word_tokenize(text) if word.lower() not in new_stop_words and not word.isdigit()]

miss_data['Clean_addr'] = miss_data['Adj_Addr'].apply(preprocess)

Sample input

23FLOOR 9 DES VOEUX RD WEST     HONG KONG
PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL
C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG

Expected output

 floor des voeux west
 pag consulting flat aia central connaught central
 co city lost studios flat f hillier sheung

Answer

Your function is slow and incomplete. First, the issues -

  1. You're not lowercasing your data.
  2. You're not getting rid of digits and punctuation properly.
  3. You're not returning a string (you should join the list using str.join and return it).
  4. Furthermore, a list comprehension with text processing is a prime way to introduce readability issues, not to mention possible redundancies (you may call a function multiple times, once for each if condition it appears in).

Next, there are a couple of glaring inefficiencies with your function, especially with the stopword removal code.

  1. Your stopwords structure is a list, and in checks on lists are slow. The first thing to do would be to convert that to a set, making the not in check constant time (see the sketch just after this list).
  2. You're using nltk.word_tokenize, which is unnecessarily slow.
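
For intuition, here's a minimal micro-benchmark sketch of the list-versus-set membership cost (the word list and repeat counts here are made up for illustration) -

import timeit

words_list = ['st', 'rd', 'hong', 'kong'] * 250   # a hypothetical 1000-entry stopword list
words_set = set(words_list)

# 'floor' is absent, so the list check scans all 1000 entries;
# the set check is a single hash lookup
print(timeit.timeit("'floor' in words_list", globals=globals(), number=100_000))
print(timeit.timeit("'floor' in words_set", globals=globals(), number=100_000))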

Lastly, you shouldn't always rely on apply, even if you are working with NLTK, where there's rarely any vectorised solution available. There are almost always other ways to do the exact same thing. Oftentimes, even a python loop is faster. But this isn't set in stone.

First, create your enhanced stopwords as a set -

user_defined_stop_words = ['st','rd','hong','kong'] 

i = nltk.corpus.stopwords.words('english')
j = list(string.punctuation) + user_defined_stop_words

stopwords = set(i).union(j)

The next fix is to get rid of the list comprehension and convert this into a multi-line function. This makes things so much easier to work with. Each line of your function should be dedicated to solving a particular task (for example, getting rid of digits/punctuation, or getting rid of stopwords, or lowercasing) -

import re

def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())            # lowercase, then get rid of digits/punctuation
    x = [w for w in x.split() if w not in stopwords]  # remove stopwords (constant-time set lookup)
    return ' '.join(x)                                # join the list back into a string

As an example, this would then be applied to your column -

df['Clean_addr'] = df['Adj_Addr'].apply(preprocess)
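
As a quick check, running preprocess on the first sample address reproduces the expected output -

preprocess('23FLOOR 9 DES VOEUX RD WEST     HONG KONG')
'floor des voeux west'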


As an alternative, here's an approach that doesn't rely on apply. This should work well for small sentences.

Load your data into a series -

v = miss_data['Adj_Addr']
v

0            23FLOOR 9 DES VOEUX RD WEST     HONG KONG
1    PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT...
2    C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIE...
Name: Adj_Addr, dtype: object

Now for the heavy lifting.

  1. Lowercase with str.lower
  2. Remove noise using str.replace
  3. Split words into separate cells using str.split
  4. Apply stopword removal using pd.DataFrame.isin + pd.DataFrame.where
  5. Finally, join the dataframe using agg.

v = v.str.lower().str.replace(r'[^a-z\s]', '').str.split(expand=True)

v.where(~v.isin(stopwords) & v.notnull(), '')\
 .agg(' '.join, axis=1)\
 .str.replace(r'\s+', ' ')\
 .str.strip()

0                                 floor des voeux west
1    pag consulting flat aia central connaught central
2           co city lost studios flat f hillier sheung
dtype: object

To use this on multiple columns, place this code in a function preprocess2 and call apply -

def preprocess2(v):
    v = v.str.lower().str.replace(r'[^a-z\s]', '').str.split(expand=True)

    return v.where(~v.isin(stopwords) & v.notnull(), '')\
            .agg(' '.join, axis=1)\
            .str.replace(r'\s+', ' ')\
            .str.strip()

c = ['Col1', 'Col2', ...]  # columns to operate on
df[c] = df[c].apply(preprocess2, axis=0)

You'll still need an apply call, but with a small number of columns, it shouldn't scale too badly. If you dislike apply, then here's a loopy variant for you -

for _c in c:
    df[_c] = preprocess2(df[_c])
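
For instance, a minimal end-to-end sketch using the sample addresses from the question (the column names Col1 and Col2 are hypothetical) -

toy = pd.DataFrame({
    'Col1': ['23FLOOR 9 DES VOEUX RD WEST     HONG KONG'],
    'Col2': ['PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL'],
})

for _c in ['Col1', 'Col2']:
    toy[_c] = preprocess2(toy[_c])

# toy['Col1'][0] is now 'floor des voeux west'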


Let's see the difference between our non-loopy version and the original -

s = pd.concat([miss_data['Adj_Addr']] * 100000, ignore_index=True)

s.size
300000

First, a sanity check -

preprocess2(s).eq(s.apply(preprocess)).all()
True

Now, the timings -

%timeit preprocess2(s)   
1 loop, best of 3: 13.8 s per loop

%timeit s.apply(preprocess)
1 loop, best of 3: 9.72 s per loop

This is surprising, because apply is seldom faster than a non-loopy solution. But it makes sense in this case, because we've optimised preprocess quite a bit, and string operations in pandas are seldom truly vectorised (they have vectorised syntax, but the performance gain isn't as much as you'd expect).

Let's see if we can do better by bypassing apply, using np.vectorize -

preprocess3 = np.vectorize(preprocess)

%timeit preprocess3(s)
1 loop, best of 3: 9.65 s per loop

Which is identical to apply but happens to be a bit faster because of the reduced overhead around the "hidden" loop.
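
To make that "hidden" loop concrete, np.vectorize(preprocess) behaves roughly like the explicit loop below (a sketch of the semantics, not NumPy's actual implementation) -

import numpy as np

def preprocess3_sketch(s):
    # an element-wise python loop over the input, collected into an array
    return np.array([preprocess(x) for x in s])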
