NLTK-based text processing with pandas


Problem Description

Punctuation and digit removal, and lowercasing, are not working when using NLTK.

My code

import string

import nltk
from nltk.tokenize import word_tokenize

stopwords = nltk.corpus.stopwords.words('english') + list(string.punctuation)
user_defined_stop_words = ['st', 'rd', 'hong', 'kong']
new_stop_words = stopwords + user_defined_stop_words

def preprocess(text):
    return [word for word in word_tokenize(text) if word.lower() not in new_stop_words and not word.isdigit()]

miss_data['Clean_addr'] = miss_data['Adj_Addr'].apply(preprocess)

Sample Input

23FLOOR 9 DES VOEUX RD WEST     HONG KONG
PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL
C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG

Expected Output

 floor des voeux west
 pag consulting flat aia central connaught central
 co city lost studios flat f hillier sheung

Solution

Your function is slow and incomplete. First, the issues -

  1. You're not lowercasing your data.
  2. You're not getting rid of digits and punctuation properly.
  3. You're not returning a string (you should join the list using str.join and return it).
  4. Furthermore, a list comprehension doing text processing is a prime way to introduce readability issues, not to mention possible redundancies (you may end up calling a function multiple times, once for each if condition it appears in).

Next, there are a couple of glaring inefficiencies in your function, especially in the stopword removal code.

  1. Your stopwords structure is a list, and in checks on lists are slow. The first thing to do is convert it to a set, making the not in check constant time (see the micro-benchmark sketch after this list).

  2. You're using nltk.word_tokenize, which is unnecessarily slow.

  3. Lastly, you shouldn't always rely on apply, even when working with NLTK, where a vectorised solution is rarely available. There are almost always other ways to do the exact same thing. Oftentimes, even a plain Python loop is faster. But this isn't set in stone.
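
To make the first point concrete, here is a small, illustrative micro-benchmark (my addition, not part of the original answer) comparing membership checks on a list against a set. It assumes the NLTK stopwords corpus has already been downloaded -

import string
import timeit

import nltk

stop_list = nltk.corpus.stopwords.words('english') + list(string.punctuation)
stop_set = set(stop_list)

# 'zzz' is absent from both, so the list check has to scan every element,
# while the set check is a single hash lookup.
print(timeit.timeit("'zzz' in stop_list", globals=globals(), number=100000))
print(timeit.timeit("'zzz' in stop_set", globals=globals(), number=100000))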

First, create your enhanced stopwords as a set -

user_defined_stop_words = ['st','rd','hong','kong'] 

i = nltk.corpus.stopwords.words('english')
j = list(string.punctuation) + user_defined_stop_words

stopwords = set(i).union(j)
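
A quick, illustrative sanity check (not part of the original answer) shows the set now covers the NLTK stopwords, the punctuation, and the user-defined tokens -

print('the' in stopwords)   # True - from nltk's English stopwords
print(',' in stopwords)     # True - from string.punctuation
print('hong' in stopwords)  # True - user-defined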

The next fix is to get rid of the list comprehension and convert this into a multi-line function. This makes things much easier to work with. Each line of your function should be dedicated to one particular task (for example, getting rid of digits/punctuation, or removing stopwords, or lowercasing) -

import re

def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())            # lowercase, then strip digits/punctuation
    x = [w for w in x.split() if w not in stopwords]  # remove stopwords (stopwords is already a set)
    return ' '.join(x)                                # join back into a single string

That's just one example. This would then be applied to your column -

df['Clean_addr'] = df['Adj_Addr'].apply(preprocess)


As an alternative, here's an approach that doesn't rely on apply. It should work well for small sentences.

Load your data into a series -

v = miss_data['Adj_Addr']
v

0            23FLOOR 9 DES VOEUX RD WEST     HONG KONG
1    PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT...
2    C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIE...
Name: Adj_Addr, dtype: object

Now comes the heavy lifting.

  1. Lowercase with str.lower
  2. Remove noise using str.replace
  3. Split words into separate cells using str.split
  4. Apply stopword removal using pd.DataFrame.isin + pd.DataFrame.where
  5. Finally, join the dataframe using agg.

v = v.str.lower().str.replace(r'[^a-z\s]', '', regex=True).str.split(expand=True)

v.where(~v.isin(stopwords) & v.notnull(), '')\
 .agg(' '.join, axis=1)\
 .str.replace(r'\s+', ' ', regex=True)\
 .str.strip()

0                                 floor des voeux west
1    pag consulting flat aia central connaught central
2           co city lost studios flat f hillier sheung
dtype: object

To use this on multiple columns, place the code in a function preprocess2 and call apply -

def preprocess2(v):
    v = v.str.lower().str.replace(r'[^a-z\s]', '', regex=True).str.split(expand=True)

    return v.where(~v.isin(stopwords) & v.notnull(), '')\
            .agg(' '.join, axis=1)\
            .str.replace(r'\s+', ' ', regex=True)\
            .str.strip()

c = ['Col1', 'Col2', ...] # columns to operate on
df[c] = df[c].apply(preprocess2, axis=0)

You'll still need an apply call, but with a small number of columns it shouldn't scale too badly. If you dislike apply, here's a loopy variant instead -

for _c in c:
    df[_c] = preprocess2(df[_c])
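
For a concrete illustration (my addition; the column names below are made up), the same pattern on a tiny two-column frame looks like this -

import pandas as pd

toy = pd.DataFrame({
    'Addr_A': ['23FLOOR 9 DES VOEUX RD WEST     HONG KONG'],
    'Addr_B': ['PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL'],
})

c = ['Addr_A', 'Addr_B']
toy[c] = toy[c].apply(preprocess2, axis=0)

toy.loc[0, 'Addr_A']   # 'floor des voeux west'
toy.loc[0, 'Addr_B']   # 'pag consulting flat aia central connaught central'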


Let's see the difference between our non-loopy version and the original -

s = pd.concat([s] * 100000, ignore_index=True)  # s: the 3-row sample series from above, repeated 100,000 times

s.size
300000

First, a sanity check -

preprocess2(s).eq(s.apply(preprocess)).all()
True

Now come the timings.

%timeit preprocess2(s)   
1 loop, best of 3: 13.8 s per loop

%timeit s.apply(preprocess)
1 loop, best of 3: 9.72 s per loop

This is surprising, because apply is seldom faster than a non-loopy solution. But it makes sense in this case: we've optimised preprocess quite a bit, and string operations in pandas are rarely truly vectorised (they look vectorised, but the performance gain isn't as much as you'd expect).

Let's see if we can do better by bypassing apply, using np.vectorize -

import numpy as np

preprocess3 = np.vectorize(preprocess)

%timeit preprocess3(s)
1 loop, best of 3: 9.65 s per loop

Which is identical to apply, but happens to be a bit faster because of the reduced overhead around the "hidden" loop.
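
One caveat worth adding (my note, not from the original answer): np.vectorize is essentially a convenience wrapper around a Python-level loop and returns a plain ndarray, so if you need to keep the original index, wrap the result back into a Series -

cleaned = pd.Series(preprocess3(s), index=s.index)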

