NLTK-based text processing with pandas
Removing punctuation and digits and lowercasing the text are not working when I use NLTK.
My code
import string
import nltk
from nltk.tokenize import word_tokenize

stopwords = nltk.corpus.stopwords.words('english') + list(string.punctuation)
user_defined_stop_words = ['st', 'rd', 'hong', 'kong']
new_stop_words = stopwords + user_defined_stop_words
def preprocess(text):
    return [word for word in word_tokenize(text) if word.lower() not in new_stop_words and not word.isdigit()]
miss_data['Clean_addr'] = miss_data['Adj_Addr'].apply(preprocess)
Sample Input
23FLOOR 9 DES VOEUX RD WEST HONG KONG
PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL
C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG
Expected Output
floor des voeux west
pag consulting flat aia central connaught central
co city lost studios flat f hillier sheung
Your function is slow and incomplete. First, the issues -
- You're not lowercasing your data.
- You're not getting rid of digits and punctuation properly.
- You're not returning a string (you should join the list using str.join and return it).
- Furthermore, a list comprehension with text processing is a prime way to introduce readability issues, not to mention possible redundancies (you may call a function multiple times, once for each if condition it appears in).
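To make these issues concrete, here is a minimal sketch of what the comprehension from the question produces. It assumes a tiny hand-written stop list and uses str.split in place of word_tokenize so that it runs without any NLTK data; both are stand-ins for illustration, not the asker's actual setup.

```python
# Stand-in stop list (assumption): the real one is NLTK's English list
# plus string.punctuation and the user-defined words.
stop_list = ['and', 'st', 'rd', 'hong', 'kong']

def original_preprocess(text):
    # Same filter as in the question; str.split stands in for word_tokenize.
    return [word for word in text.split()
            if word.lower() not in stop_list and not word.isdigit()]

result = original_preprocess('23FLOOR 9 DES VOEUX RD WEST HONG KONG')
print(result)  # ['23FLOOR', 'DES', 'VOEUX', 'WEST']
```

The result is a list rather than a string, the tokens are still upper case (lower() is only used for the comparison), and '23FLOOR' keeps its digits because isdigit() is only true for tokens made up entirely of digits.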
Next, there are a few glaring inefficiencies in your function, especially in the stopword removal code.
- Your stopwords structure is a list, and membership (in) checks on lists take linear time. The first thing to do would be to convert it to a set, making the "not in" check constant time on average.
- You're using nltk.word_tokenize, which is unnecessarily slow.
- Lastly, you shouldn't always rely on apply, even if you are working with NLTK, where there's rarely any vectorised solution available. There are almost always other ways to do the exact same thing. Oftentimes, even a Python loop is faster. But this isn't set in stone.
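The list-versus-set point can be checked with a rough timing sketch like the one below. The 200 made-up words and the repetition count are arbitrary assumptions chosen only to make the difference visible; actual numbers depend on the machine.

```python
import timeit

# Hypothetical stop list of 200 made-up words (assumption, for illustration only).
words_list = ['word%d' % i for i in range(200)]
words_set = set(words_list)

token = 'hillier'  # not a stopword, so the list has to be scanned to the end

# Time 100,000 membership checks against each structure.
list_time = timeit.timeit(lambda: token in words_list, number=100_000)
set_time = timeit.timeit(lambda: token in words_set, number=100_000)

print('list: %.4f s, set: %.4f s' % (list_time, set_time))
```

Both checks give the same answer; only the lookup cost differs, since the list is scanned element by element while the set uses a hash lookup.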
First, create your enhanced stopwords as a set -
user_defined_stop_words = ['st','rd','hong','kong']
i = nltk.corpus.stopwords.words('english')
j = list(string.punctuation) + user_defined_stop_words
stopwords = set(i).union(j)
The next fix is to get rid of the list comprehension and convert this into a multi-line function. This makes things much easier to work with. Each line of your function should be dedicated to one particular task (for example, getting rid of digits and punctuation, getting rid of stopwords, or lowercasing) -
import re

def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())            # lowercase, then strip digits and punctuation
    x = [w for w in x.split() if w not in stopwords]  # remove stopwords (stopwords is already a set)
    return ' '.join(x)                                # join the list back into a string
That is just one example. You would then apply it to your column -
df['Clean_addr'] = df['Adj_Addr'].apply(preprocess)
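As a quick check without pandas or NLTK installed, the function can be run directly on the three sample addresses. This is a sketch that assumes a small stand-in stopword set; the real set would be the NLTK English list combined with string.punctuation and the user-defined words.

```python
import re

# Stand-in stopword set (assumption): just enough entries for the sample rows.
stopwords = {'and', 'st', 'rd', 'hong', 'kong'}

def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())            # lowercase, drop digits/punctuation
    x = [w for w in x.split() if w not in stopwords]  # remove stopwords
    return ' '.join(x)                                # return a single string

samples = [
    '23FLOOR 9 DES VOEUX RD WEST HONG KONG',
    'PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL',
    'C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG',
]

cleaned = [preprocess(s) for s in samples]
for line in cleaned:
    print(line)
# floor des voeux west
# pag consulting flat aia central connaught central
# co city lost studios flat f hillier sheung
```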
As an alternative, here's an approach that doesn't rely on apply. It should work well for short sentences.
Load your data into a series -
v = miss_data['Adj_Addr']
v
0 23FLOOR 9 DES VOEUX RD WEST HONG KONG
1 PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT...
2 C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIE...
Name: Adj_Addr, dtype: object
Now comes the heavy lifting.
- Lowercase with str.lower
- Remove noise using str.replace
- Split words into separate cells using str.split
- Apply stopword removal using pd.DataFrame.isin + pd.DataFrame.where
- Finally, join the dataframe using agg.
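# regex=True is needed on pandas 2.0+, where str.replace treats the pattern literally by default
v = v.str.lower().str.replace(r'[^a-z\s]', '', regex=True).str.split(expand=True)
v.where(~v.isin(stopwords) & v.notnull(), '')\
    .agg(' '.join, axis=1)\
    .str.replace(r'\s+', ' ', regex=True)\
    .str.strip()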
0 floor des voeux west
1 pag consulting flat aia central connaught central
2 co city lost studios flat f hillier sheung
dtype: object
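The final whitespace clean-up (collapsing runs of whitespace, then stripping) is easier to see in a plain-Python analogue of the same steps. This is not the pandas code itself, only a sketch of the logic, and it assumes a small stand-in stopword set.

```python
import re

stopwords = {'and', 'st', 'rd', 'hong', 'kong'}  # stand-in set (assumption)

rows = [
    '23FLOOR 9 DES VOEUX RD WEST HONG KONG',
    'C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG',
]

# Steps 1-3: lowercase, drop noise, split. Short rows are padded with None,
# much as str.split(expand=True) pads short rows with missing values.
tokens = [re.sub(r'[^a-z\s]', '', r.lower()).split() for r in rows]
width = max(len(t) for t in tokens)
grid = [t + [None] * (width - len(t)) for t in tokens]

# Step 4: blank out stopwords and padding cells instead of deleting them.
masked = [['' if (w is None or w in stopwords) else w for w in row]
          for row in grid]

# Step 5: join each row, then collapse the gaps left by the blanked cells.
joined = [' '.join(row) for row in masked]
cleaned = [re.sub(r'\s+', ' ', j).strip() for j in joined]

print(cleaned)
# ['floor des voeux west', 'co city lost studios flat f hillier sheung']
```

Because the masked cells become empty strings rather than disappearing, the joined rows contain doubled and trailing spaces, which is why the clean-up step is needed.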
To use this on multiple columns, place this code in a function preprocess2 and call apply -
def preprocess2(v):
    # regex=True is needed on pandas 2.0+, where str.replace treats the pattern literally by default
    v = v.str.lower().str.replace(r'[^a-z\s]', '', regex=True).str.split(expand=True)
    return v.where(~v.isin(stopwords) & v.notnull(), '')\
        .agg(' '.join, axis=1)\
        .str.replace(r'\s+', ' ', regex=True)\
        .str.strip()
c = ['Col1', 'Col2', ...] # columns to operate on
df[c] = df[c].apply(preprocess2, axis=0)
You'll still need an apply call, but with a small number of columns, it shouldn't scale too badly. If you dislike apply, then here's a loopy variant for you -
for _c in c:
    df[_c] = preprocess2(df[_c])
Let's see the difference between our non-loopy version and the original -
s = pd.concat([s] * 100000, ignore_index=True)
s.size
300000
First, a sanity check -
preprocess2(s).eq(s.apply(preprocess)).all()
True
Now come the timings.
%timeit preprocess2(s)
1 loop, best of 3: 13.8 s per loop
%timeit s.apply(preprocess)
1 loop, best of 3: 9.72 s per loop
This is surprising, because apply is seldom faster than a non-loopy solution. But it makes sense in this case, because we've optimised preprocess quite a bit, and string operations in pandas are seldom truly vectorised (the .str methods present a vectorised interface, but the performance gain isn't as much as you'd expect).
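Outside IPython, where the %timeit magic is not available, a comparable "best of 3" measurement can be sketched with the standard timeit module. The input below is a hypothetical list of 10,000 repeated addresses with a stand-in stopword set, not the 300,000-row series used above, so the numbers will not match.

```python
import re
import timeit

stopwords = {'and', 'st', 'rd', 'hong', 'kong'}  # stand-in set (assumption)

def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())
    x = [w for w in x.split() if w not in stopwords]
    return ' '.join(x)

# Hypothetical benchmark input: one sample address repeated 10,000 times.
data = ['23FLOOR 9 DES VOEUX RD WEST HONG KONG'] * 10_000

# repeat=3, number=1 mirrors the "1 loop, best of 3" report format.
runs = timeit.repeat(lambda: [preprocess(x) for x in data], repeat=3, number=1)
best = min(runs)

print('1 loop, best of 3: %.3f s per loop' % best)
```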
Let's see if we can do better by bypassing apply and using np.vectorize -
preprocess3 = np.vectorize(preprocess)
%timeit preprocess3(s)
1 loop, best of 3: 9.65 s per loop
This is essentially the same as apply, but happens to be a bit faster because of the reduced overhead around the "hidden" loop.