What’s a good Python profanity filter library?

Question

Like https://stackoverflow.com/questions/1521646/best-profanity-filter, but for Python. I’m looking for libraries I can run and control myself locally, as opposed to web services.

(And whilst it’s always great to hear your fundamental objections of principle to profanity filtering, I’m not specifically looking for them here. I know profanity filtering can’t pick up every hurtful thing being said. I know swearing, in the grand scheme of things, isn’t a particularly big issue. I know you need some human input to deal with issues of content. I’d just like to find a good library, and see what use I can make of it.)

Recommended answer

I didn't find any Python profanity library, so I made one myself.

filterlist

A list of regular expressions that match forbidden words. Please do not use \b; it is inserted depending on inside_words.

Example: ['bad', 'un\w+']

ignore_case

Default: True

Self-explanatory.

replacements

Default: "$@%-?!"

A string of characters from which the replacement strings are randomly generated.

Examples: "%&$?!" or "-"

complete

Default: True

Controls whether the entire word is replaced or the first and last characters are kept.

inside_words

Default: False

Controls whether words are also searched for inside other words. Disabling this

(Example at the end)
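
To make the effect of inside_words concrete, here is a minimal sketch of how the word patterns might be combined into a single regular expression. The function name build_pattern is hypothetical and used only for illustration; the module itself does the same thing inside its clean method.

```python
import re

def build_pattern(badwords, inside_words=False, ignore_case=True):
    """Join the word patterns; add word-boundary anchors unless inside_words is set.

    build_pattern is a hypothetical helper, not part of the module.
    """
    alternation = '|'.join(badwords)
    template = r'(%s)' if inside_words else r'\b(%s)\b'
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(template % alternation, flags)

whole_words = build_pattern(['bad', r'un\w+'])
anywhere = build_pattern(['bad', r'un\w+'], inside_words=True)

text = "bad badlike ungood"
print(whole_words.findall(text))  # ['bad', 'ungood']
print(anywhere.findall(text))     # ['bad', 'bad', 'ungood']
```

With the anchors in place, "badlike" is left alone because no word boundary follows "bad"; without them, the "bad" inside it matches as well.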

"""
Module that provides a class that filters profanities

"""

__author__ = "leoluk"
__version__ = '0.0.1'

import random
import re

class ProfanitiesFilter(object):
    def __init__(self, filterlist, ignore_case=True, replacements="$@%-?!", 
                 complete=True, inside_words=False):
        """
        Inits the profanity filter.

        filterlist -- a list of regular expressions that
        matches words that are forbidden
        ignore_case -- ignore capitalization
        replacements -- string with characters to replace the forbidden word
        complete -- completely remove the word or keep the first and last char?
        inside_words -- search inside other words?

        """

        self.badwords = filterlist
        self.ignore_case = ignore_case
        self.replacements = replacements
        self.complete = complete
        self.inside_words = inside_words

    def _make_clean_word(self, length):
        """
        Generates a random replacement string of a given length
        using the chars in self.replacements.

        """
        return ''.join([random.choice(self.replacements) for i in
                  range(length)])

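    def __replacer(self, match):
        value = match.group()
        # Matches shorter than three characters have no interior to mask,
        # so they are replaced entirely even when complete is False.
        if self.complete or len(value) < 3:
            return self._make_clean_word(len(value))
        else:
            return value[0]+self._make_clean_word(len(value)-2)+value[-1]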

    def clean(self, text):
        """Cleans a string from profanity."""

        regexp_insidewords = {
            True: r'(%s)',
            False: r'\b(%s)\b',
            }

        regexp = (regexp_insidewords[self.inside_words] % 
                  '|'.join(self.badwords))

        r = re.compile(regexp, re.IGNORECASE if self.ignore_case else 0)

        return r.sub(self.__replacer, text)


if __name__ == '__main__':

    # Raw string so that \w reaches the regex engine unchanged.
    f = ProfanitiesFilter(['bad', r'un\w+'], replacements="-")
    example = "I am doing bad ungood badlike things."

    print(f.clean(example))
    # Prints "I am doing --- ------ badlike things."

    f.inside_words = True
    print(f.clean(example))
    # Prints "I am doing --- ------ ---like things."

    f.complete = False
    print(f.clean(example))
    # Prints "I am doing b-d u----d b-dlike things."
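
Because filterlist entries are treated as regular expressions, plain words that contain characters such as "." or "+" would be interpreted as regex syntax. One possible way to handle that, sketched here with a hypothetical helper named literal_filterlist that is not part of the module above, is to escape each word with re.escape first.

```python
import re

def literal_filterlist(words):
    """Escape plain words so regex metacharacters in them match literally.

    literal_filterlist is a hypothetical helper, not part of the module.
    """
    return [re.escape(word) for word in words]

patterns = literal_filterlist(['a.b', 'c++'])

# Same joining scheme as the inside_words=True case: no word-boundary anchors.
regexp = re.compile(r'(%s)' % '|'.join(patterns), re.IGNORECASE)

print(regexp.sub('***', 'a.b axb C++ c'))  # *** axb *** c
```

The escaped list could then be passed as filterlist. Words that end in a non-word character, such as "c++", probably need inside_words=True, since a trailing \b would not match after the "+".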
