Fast punctuation removal with pandas


Question

This is a self-answered post. Below I outline a common problem in the NLP domain and propose a few performant methods to solve it.

Oftentimes the need arises to remove punctuation during text cleaning and pre-processing. Punctuation is defined as any character in string.punctuation:

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

This is a common enough problem and has been asked before ad nauseam. The most idiomatic solution uses pandas str.replace. However, for situations which involve a lot of text, a more performant solution may need to be considered.

What are some good, performant alternatives to str.replace when dealing with hundreds of thousands of records?

Answer

Setup

For the purpose of demonstration, let's consider this DataFrame.

import pandas as pd

df = pd.DataFrame({'text': ['a..b?!??', '%hgh&12', 'abc123!!!', '$$$1234']})
df
        text
0   a..b?!??
1    %hgh&12
2  abc123!!!
3    $$$1234

Below, I list the alternatives, one by one, in increasing order of performance.

str.replace

This option is included to establish the default method as a benchmark for comparing other, more performant solutions.

This uses pandas' built-in str.replace function, which performs regex-based replacement.

df['text'] = df['text'].str.replace(r'[^\w\s]+', '', regex=True)

df
     text
0      ab
1   hgh12
2  abc123
3    1234

This is very easy to code and quite readable, but slow.

re.sub

This involves using the sub function from the re library. Pre-compile a regex pattern for performance, and call regex.sub inside a list comprehension. If you can spare some memory, convert df['text'] to a list beforehand; you'll get a nice little performance boost out of this.

import re
p = re.compile(r'[^\w\s]+')
df['text'] = [p.sub('', x) for x in df['text'].tolist()]

df
     text
0      ab
1   hgh12
2  abc123
3    1234

Note: If your data has NaN values, this (as well as the next method below) will not work as is. See the section on "Other Considerations".
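To see why, here is a minimal sketch (reusing the compiled pattern p from above; the two-element sample list is my own): NaN is a float, not a string, so re.sub raises a TypeError inside the comprehension.

import re
import numpy as np

p = re.compile(r'[^\w\s]+')

# NaN is a float, so re.sub raises "TypeError: expected string or bytes-like object".
try:
    [p.sub('', x) for x in ['a..b?!??', np.nan]]
except TypeError as e:
    print(e)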

str.translate

Python's str.translate function is implemented in C, and is therefore very fast.
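As a quick single-string illustration of the primitive (str.maketrans builds the deletion table, str.translate applies it):

import string

# Map every ASCII punctuation character to None, i.e. delete it.
table = str.maketrans('', '', string.punctuation)
print('a..b?!??'.translate(table))  # -> 'ab'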

Here is how the column-wide approach works:

  1. First, join all your strings together to form one huge string using a single (or more) character separator that you choose. You must use a character/substring that you can guarantee will not belong inside your data.
  2. Perform str.translate on the large string, removing punctuation (the separator from step 1 excluded).
  3. Split the string on the separator that was used to join in step 1. The resultant list must have the same length as your initial column.

In this example, we consider the pipe separator |. If your data contains the pipe, then you must choose another separator.

import string

punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~'   # `|` is not present here
transtab = str.maketrans(dict.fromkeys(punct, ''))

df['text'] = '|'.join(df['text'].tolist()).translate(transtab).split('|')

df
     text
0      ab
1   hgh12
2  abc123
3    1234


Performance

str.translate performs the best, by far. Note that the graph below includes another variant, Series.str.translate, from MaxU's answer.

(Interestingly, I reran this a second time, and the results are slightly different from before. During the second run, it seems re.sub was winning out over str.translate for really small amounts of data.)

There is an inherent risk involved with using translate (particularly, the problem of automating the process of deciding which separator to use is non-trivial), but the trade-offs are worth the risk.
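One possible way to automate the separator choice is to scan a few candidate characters and pick one that appears nowhere in the column. The helper below (find_separator) is a name introduced purely for illustration, not part of the original answer; it assumes the sample df from the setup.

def find_separator(strings, candidates='|\x00\x01\x02\x03'):
    """Return the first candidate character absent from every string."""
    for ch in candidates:
        if not any(ch in s for s in strings):
            return ch
    raise ValueError('No safe separator found among the candidates.')

sep = find_separator(df['text'].tolist())  # '|' for the sample data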

Other Considerations

Handling NaNs with list comprehension methods: Note that this method (and the next) will only work as long as your data does not have NaNs. When handling NaNs, you will have to determine the indices of non-null values and replace those only. Try something like this:

import numpy as np

df = pd.DataFrame({'text': [
    'a..b?!??', np.nan, '%hgh&12', 'abc123!!!', '$$$1234', np.nan]})

# Find the positions of non-null values and replace only those,
# reusing the compiled pattern p from earlier.
idx = np.flatnonzero(df['text'].notna())
col_idx = df.columns.get_loc('text')
df.iloc[idx, col_idx] = [
    p.sub('', x) for x in df.iloc[idx, col_idx].tolist()]

df
     text
0      ab
1     NaN
2   hgh12
3  abc123
4    1234
5     NaN

Dealing with DataFrames: If you are dealing with DataFrames where every column requires replacement, the procedure is simple:

v = pd.Series(df.values.ravel())
df[:] = translate(v).values.reshape(df.shape)

Or,

v = df.stack()
v[:] = translate(v)
df = v.unstack()

Note that the translate function is defined below, with the benchmarking code.

Every solution has tradeoffs, so deciding which solution best fits your needs will depend on what you're willing to sacrifice. Two very common considerations are performance (which we've already seen) and memory usage. str.translate is a memory-hungry solution, so use it with caution.
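If memory is a concern, one possible workaround (not from the original answer, just a sketch) is to apply the join/translate/split trick in chunks so the concatenated string never grows too large; it reuses the translate function defined below.

# A rough sketch: run translate() on slices of ~10,000 rows at a time.
# The chunk size is an arbitrary, illustrative choice.
chunk_size = 10_000
parts = [
    translate(df.iloc[i:i + chunk_size])
    for i in range(0, len(df), chunk_size)
]
df = pd.concat(parts)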

Another consideration is the complexity of your regex. Sometimes you may want to remove anything that is not alphanumeric or whitespace. Other times, you will need to retain certain characters, such as hyphens, colons, and sentence terminators [.!?]. Specifying these explicitly adds complexity to your regex, which may in turn impact the performance of these solutions. Make sure you test these solutions on your data before deciding what to use.
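For example, a rough sketch of a pattern that keeps hyphens, colons, and the sentence terminators mentioned above while stripping other punctuation (the exact character class is an assumption you would adapt to your data):

import re

# Keep word characters, whitespace, '.', '!', '?', ':' and '-'; strip the rest.
keep_some = re.compile(r'[^\w\s.!?:-]+')
print(keep_some.sub('', 'e-mail me: today!! (or tomorrow?)'))
# -> 'e-mail me: today!! or tomorrow?'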

Lastly, unicode characters will be removed with this solution. You may want to tweak your regex (if using a regex-based solution), or just go with str.translate otherwise.
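As a hedged illustration (the sample string and the extra characters below are my own, not from the original answer): str.translate only removes the characters you explicitly put in the table, so non-ASCII punctuation passes through untouched unless you add it.

import string

s = 'héllo… «world»!'

# Only ASCII punctuation is in the table, so '…', '«' and '»' survive.
ascii_only = str.maketrans('', '', string.punctuation)
print(s.translate(ascii_only))   # -> 'héllo… «world»'

# Add the unicode punctuation you care about to strip it as well.
extended = str.maketrans('', '', string.punctuation + '…«»')
print(s.translate(extended))     # -> 'héllo world'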

For even more performance (for larger N), take a look at this answer by Paul Panzer.

Functions

import re
import string

import pandas as pd


def pd_replace(df):
    return df.assign(text=df['text'].str.replace(r'[^\w\s]+', '', regex=True))


def re_sub(df):
    p = re.compile(r'[^\w\s]+')
    return df.assign(text=[p.sub('', x) for x in df['text'].tolist()])

def translate(df):
    punct = string.punctuation.replace('|', '')
    transtab = str.maketrans(dict.fromkeys(punct, ''))

    return df.assign(
        text='|'.join(df['text'].tolist()).translate(transtab).split('|')
    )

# MaxU's version (https://stackoverflow.com/a/50444659/4909087)
def pd_translate(df):
    punct = string.punctuation.replace('|', '')
    transtab = str.maketrans(dict.fromkeys(punct, ''))

    return df.assign(text=df['text'].str.translate(transtab))

Performance Benchmarking Code

from timeit import timeit

import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
       index=['pd_replace', 're_sub', 'translate', 'pd_translate'],
       columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        l = ['a..b?!??', '%hgh&12','abc123!!!', '$$$1234'] * c
        df = pd.DataFrame({'text' : l})
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=30)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

