How would you recommend filtering comments that contain offensive words defined in a huge list?

Problem description

In short:

Java / Hibernate / AJAX / Spring MVC

I would like every comment posted by a user to be read on the server side before it is stored in the database, and rejected if it contains offensive text.

The offensive text list is quite large (possibly thousands of entries). See this example list: http://onlineslangdictionary.com/lists/most-vulgar-words/

I suspect that iterating over this list and running a check like the following is not very fast. Is there a faster way to do this filtering? Do you think searching thousands of items will have a big impact on CPU/RAM resources? Any suggestions are welcome!

for (String offensiveText : offensiveTextList) {
    if (commentText.contains(offensiveText)) {
        // reject comment
    }
}

Update: An item in the offensive list can consist of several words (for example a three-word phrase, which may include stop words). It can even contain non-alphabetic characters such as *&^%.

If the comment contains an offensive item (exactly the same characters), it is rejected.

Solution

You would probably need a natural language processing library for this. If you compare each of the M words in a comment against each of the N offensive words in the list, the complexity is O(MN), which is close to O(N^2) when M and N are of similar size; that is quite high.
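One way to avoid the M-by-N comparison without an NLP library is a multi-pattern matcher. The sketch below is an assumption on my part, not something taken from the question: it builds an Aho-Corasick automaton over the list once and then scans each comment in a single pass. The class name `OffensiveTextMatcher` and the method `containsOffensiveText` are hypothetical. It keeps the same semantics as `commentText.contains(...)`: exact, case-sensitive substring matching, so multi-word entries and symbols such as *&^% work unchanged.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical helper: Aho-Corasick matcher over the offensive text list. */
class OffensiveTextMatcher {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;        // longest proper suffix of this path that is also a trie path
        boolean terminal; // an entry ends here, directly or through the fail chain
    }

    private final Node root = new Node();

    OffensiveTextMatcher(List<String> offensiveTextList) {
        // Step 1: insert every entry into a trie, character by character.
        for (String entry : offensiveTextList) {
            if (entry.isEmpty()) {
                continue; // an empty entry would match every comment
            }
            Node node = root;
            for (char c : entry.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.terminal = true;
        }
        // Step 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // A shorter entry may end at the fail target; propagate that fact.
                child.terminal |= child.fail.terminal;
                queue.add(child);
            }
        }
    }

    /** Returns true if any list entry occurs as a substring of the comment. */
    boolean containsOffensiveText(String commentText) {
        Node node = root;
        for (int i = 0; i < commentText.length(); i++) {
            char c = commentText.charAt(i);
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node target = node.next.get(c);
            node = (target != null) ? target : root;
            if (node.terminal) {
                return true;
            }
        }
        return false;
    }
}
```

The matcher would be built once at startup (for example `new OffensiveTextMatcher(offensiveTextList)`) and reused for every request, so the per-comment cost depends on the comment length rather than on the size of the list. Like `contains`, it also flags entries embedded in longer harmless words.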

Take a look at the Lucene stack; you may find some really good ideas there, for example how to tokenize a comment and reduce the input by removing meaningless words (stop words).
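To illustrate the tokenize-and-reduce idea, here is a minimal sketch in plain Java. It does not use the Lucene API; the class `TokenFilterSketch`, its methods, and the tiny stop-word list are all hypothetical. It lowercases the comment, splits it on anything that is not a letter or digit, drops stop words, and looks each remaining token up in a `HashSet`, so the cost is roughly one hash lookup per word instead of one comparison per list entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of tokenizing a comment and checking tokens against a set. */
class TokenFilterSketch {

    // Assumption: a deliberately tiny stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and", "to"));

    /** Lowercases, splits on non-letter/non-digit characters, and drops stop words. */
    static List<String> tokenize(String commentText) {
        List<String> tokens = new ArrayList<>();
        String lower = commentText.toLowerCase(Locale.ROOT);
        for (String raw : lower.split("[^\\p{L}\\p{Nd}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    /** Returns true if any remaining token is in the (lowercased) offensive word set. */
    static boolean containsOffensiveWord(String commentText, Set<String> offensiveWords) {
        for (String token : tokenize(commentText)) {
            if (offensiveWords.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```

This only covers single-word entries and is case-insensitive. Entries made of several words, entries containing stop words, or entries made of symbols such as *&^% would be lost by this tokenization and need phrase or substring matching instead.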

Also take a look at the thesis "Distinguishing Between Factual Information and Insulting or Abusive Messages bearing Words or Phrases in News Articles".
