What’s a good Python profanity filter library?
Question
Like https://stackoverflow.com/questions/1521646/best-profanity-filter, but for Python — and I’m looking for libraries I can run and control myself locally, as opposed to web services.
(And whilst it’s always great to hear your fundamental objections of principle to profanity filtering, I’m not specifically looking for them here. I know profanity filtering can’t pick up every hurtful thing being said. I know swearing, in the grand scheme of things, isn’t a particularly big issue. I know you need some human input to deal with issues of content. I’d just like to find a good library, and see what use I can make of it.)
Recommended answer
I didn't find any Python profanity library, so I made one myself.
filterlist
A list of regular expressions that match forbidden words. Please do not use \b; it is inserted automatically depending on inside_words.
Example: ['bad', 'un\w+']
ignore_case
Default: True
Self-explanatory.
replacements
Default: "$@%-?!"
A string of characters from which the replacement strings are randomly generated.
Examples: "%&$?!" or "-", etc.
complete
Default: True
Controls whether the entire word is replaced or the first and last characters are kept.
inside_words
Default: False
Controls whether words are also searched for inside other words. Disabling this restricts matches to whole words.
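To make the effect of inside_words concrete, here is a minimal sketch of how the patterns might be combined. It assumes the token omitted from the description above is the word-boundary anchor `\b`, which is consistent with the example outputs at the end of the module. `build_pattern` is a hypothetical helper for illustration, not part of the module.

```python
import re

def build_pattern(words, inside_words=False, ignore_case=True):
    """Hypothetical helper: join the patterns, optionally as whole words only."""
    # Assumption: whole-word matching is done by wrapping the group in \b.
    template = r'(%s)' if inside_words else r'\b(%s)\b'
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(template % '|'.join(words), flags)

text = "I am doing bad ungood badlike things."
whole_words = build_pattern(['bad', r'un\w+'])
anywhere = build_pattern(['bad', r'un\w+'], inside_words=True)

whole_hits = [m.group() for m in whole_words.finditer(text)]
anywhere_hits = [m.group() for m in anywhere.finditer(text)]
print(whole_hits)     # ['bad', 'ungood']
print(anywhere_hits)  # ['bad', 'ungood', 'bad']
```

With the boundaries in place, "badlike" is left alone; without them, the "bad" inside it is matched as well.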
(Examples are at the end.)
"""
Module that provides a class that filters profanities
"""
__author__ = "leoluk"
__version__ = '0.0.1'

import random
import re


class ProfanitiesFilter(object):
    def __init__(self, filterlist, ignore_case=True, replacements="$@%-?!",
                 complete=True, inside_words=False):
        """
        Inits the profanity filter.

        filterlist -- a list of regular expressions that
                      match words that are forbidden
        ignore_case -- ignore capitalization
        replacements -- string with characters to replace the forbidden word
        complete -- completely remove the word or keep the first and last char?
        inside_words -- search inside other words?
        """
        self.badwords = filterlist
        self.ignore_case = ignore_case
        self.replacements = replacements
        self.complete = complete
        self.inside_words = inside_words

    def _make_clean_word(self, length):
        """
        Generates a random replacement string of a given length
        using the chars in self.replacements.
        """
        return ''.join(random.choice(self.replacements)
                       for _ in range(length))

    def __replacer(self, match):
        """Builds the replacement for a single regex match."""
        value = match.group()
        if self.complete:
            return self._make_clean_word(len(value))
        if len(value) <= 2:
            # Nothing lies between the first and last char, so keep the match.
            return value
        return value[0] + self._make_clean_word(len(value) - 2) + value[-1]

    def clean(self, text):
        """Cleans a string from profanity."""
        # Without inside_words, \b restricts matches to whole words.
        regexp_insidewords = {
            True: r'(%s)',
            False: r'\b(%s)\b',
        }
        regexp = (regexp_insidewords[self.inside_words] %
                  '|'.join(self.badwords))
        r = re.compile(regexp, re.IGNORECASE if self.ignore_case else 0)
        return r.sub(self.__replacer, text)


if __name__ == '__main__':
    f = ProfanitiesFilter(['bad', r'un\w+'], replacements="-")
    example = "I am doing bad ungood badlike things."
    print(f.clean(example))
    # Prints "I am doing --- ------ badlike things."
    f.inside_words = True
    print(f.clean(example))
    # Prints "I am doing --- ------ ---like things."
    f.complete = False
    print(f.clean(example))
    # Prints "I am doing b-d u----d b-dlike things."
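Because filterlist entries are treated as regular expressions, a plain word containing characters such as "." would be interpreted as a pattern. If the word list comes from a plain-text source, one option is to escape each entry first. The sketch below assumes that setup; `escape_words` is a hypothetical helper, and the pattern is compiled directly with `re` so the snippet runs on its own.

```python
import re

def escape_words(words):
    """Hypothetical helper: turn literal words into regex-safe patterns."""
    return [re.escape(word) for word in words]

plain_words = ['bad', 'a.b']
patterns = escape_words(plain_words)

# Whole-word wrapping, assumed to mirror the filter with inside_words=False.
regex = re.compile(r'\b(%s)\b' % '|'.join(patterns), re.IGNORECASE)

text = "axb is fine, but a.b and BAD are filtered."
matches = [m.group() for m in regex.finditer(text)]
print(matches)  # ['a.b', 'BAD']
```

Without escaping, the entry 'a.b' would also match "axb", since "." matches any character. Note that `\b` only behaves as expected when an entry starts and ends with a word character.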