Procedure searching for inappropriate language


Question


Does anyone have, or know where I can get, some code that will check a
TextBox for inappropriate language?

At the moment, we need to manually check submissions for language before
posting. This takes a lot of time and resources, and in some cases it is a
long time before a submission is actually made live.

Thanks,

Tom

Recommended answer

Tom,

First, you have to define what is "inappropriate". That's going to
range widely among people.

I assume that when you do it manually, you have established guidelines
indicating what is inappropriate language. That should serve as your design
spec (or at least serve as the basis for one).

Once you have that, the rest should be easy, as it will really boil down
to some regular expression code, or some calls to IndexOf on the string
class.
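
As a rough illustration of the two approaches just mentioned, here is a minimal sketch. It is written in Python rather than C# purely for brevity: `str.find` plays the role of `String.IndexOf`, and the `re` module plays the role of .NET's `Regex`. The word list and the function names are placeholders invented for this example; the real list would come from your own moderation guidelines.

```python
import re

# Hypothetical placeholder list, deliberately mild.
BLOCKED_WORDS = ["darn", "heck"]

def contains_blocked_substring(text, words=BLOCKED_WORDS):
    """IndexOf-style check: True if any word occurs anywhere in the text."""
    lowered = text.lower()
    return any(lowered.find(word) != -1 for word in words)

# One alternation wrapped in \b anchors so that only whole words match.
BLOCKED_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in BLOCKED_WORDS) + r")\b",
    re.IGNORECASE,
)

def find_blocked_words(text):
    """Regex-style check: whole-word matches, in order of appearance."""
    return [m.group(0).lower() for m in BLOCKED_PATTERN.finditer(text)]
```

The substring check is the simpler of the two, but it also fires on words that merely contain a listed word; the word-boundary pattern avoids that particular case.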

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com


What I do is keep a static Hashtable in global called "Badwords" that
contains my list of baddies, along with a static IsBadWord method. It's
pretty quick to pass a post through this and either remove the offenders or
decide not to accept the post at all. George Carlin would be proud.
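
A minimal sketch of that arrangement, assuming a Python `set` in place of the static Hashtable (the word list, the `remove_offenders` and `is_acceptable` helpers, and the masking behaviour are illustrative guesses, not the poster's actual code):

```python
import re

# Hypothetical word list; a set gives the same hashed lookup as a Hashtable.
BAD_WORDS = {"darn", "heck"}

# Treat runs of letters and apostrophes as words.
WORD_RE = re.compile(r"[A-Za-z']+")

def is_bad_word(word):
    """Case-insensitive membership test, constant time on average."""
    return word.lower() in BAD_WORDS

def remove_offenders(post, mask="*"):
    """Replace each listed word with mask characters of the same length."""
    return WORD_RE.sub(
        lambda m: mask * len(m.group(0)) if is_bad_word(m.group(0)) else m.group(0),
        post,
    )

def is_acceptable(post):
    """Reject the post outright if it contains any listed word."""
    return not any(is_bad_word(w) for w in WORD_RE.findall(post))
```

Because each word is looked up by hash, the cost of checking a post depends on the length of the post, not on the size of the word list.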

-- Peter
Site: http://www.eggheadcafe.com
UnBlog: http://petesbloggerama.blogspot.com
BlogMetaFinder(BETA): http://www.blogmetafinder.com


On Mon, 09 Jul 2007 09:53:10 -0700, tshad <t@home.com> wrote:

Does anyone have or know where I can get some code that will check a
TextBox for inappropriate language?

At the moment, we need to manually check submissions for language before
posting. This takes a lot of time and resources and in some cases a lot of
time before a submission is actually made live.




For what it's worth, you may want to reconsider making this an automated
process. You could, as Nicholas points out, use various text searching
mechanisms to match a dictionary of "inappropriate language" against
submissions. However, this runs the risk of either being too aggressive,
blocking text that is in some contexts perfectly fine, or too passive,
allowing people to easily bypass the filter, or even having both problems
at the same time, blocking things that shouldn't be blocked while at the
same time allowing offensive things through far too easily (usually when
the user intentionally obfuscates their offensive language in a way that
makes their text obvious to a human without a computer being able to
understand it).
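
Both failure modes are easy to reproduce. The sketch below is in Python for brevity; the one-word dictionary and the sample sentences are made up for illustration and do not come from any real word list.

```python
# Hypothetical one-word dictionary, deliberately mild.
BLOCKED = ["heck"]

def naive_filter(text):
    """Return True if any blocked word appears anywhere in the text."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKED)

# Too aggressive: "check" contains "heck", so a harmless sentence is flagged.
flags_innocent_text = naive_filter("Please check the schedule.")

# Too passive: light obfuscation that any human can read is not flagged.
misses_obfuscated_text = not naive_filter("What the h3ck, what the h.e.c.k!")
```

Tightening the match to whole words fixes the first case but does nothing for the second, and adding obfuscated spellings to the dictionary tends to bring new false positives with it.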

When dealing with problems that are unique to humans, it is usually best
to leave the solution to humans. You can either invest a lot of time and
effort into creating a dictionary-based text matching system that tries to
filter inappropriate language, or you can just put a little "report post"
link in the user's viewing UI and automatically block posts (and maybe
even users) when some threshold (probably based on proportion of total
user base) of users reports the post as inappropriate.
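
A report threshold of this kind could be sketched roughly as follows. The class name, the 1% fraction, and the minimum of three reports are all assumptions chosen for the example; the post does not specify any numbers.

```python
class ReportTracker:
    """Block a post once enough distinct users have reported it."""

    def __init__(self, total_users, threshold_fraction=0.01, min_reports=3):
        # Both defaults are illustrative assumptions, not values from the post.
        # The floor keeps a tiny user base from blocking on a single report.
        self.required = max(min_reports, int(total_users * threshold_fraction))
        self.reports = {}  # post id -> set of ids of users who reported it

    def report(self, post_id, user_id):
        """Record a report; return True if the post should now be blocked."""
        reporters = self.reports.setdefault(post_id, set())
        reporters.add(user_id)  # a set ignores repeat reports from one user
        return len(reporters) >= self.required

    def is_blocked(self, post_id):
        """True if the post has reached the report threshold."""
        return len(self.reports.get(post_id, ())) >= self.required
```

Counting distinct reporters rather than raw clicks means one annoyed user cannot take a post down alone.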

Using such a mechanism, a relative handful of users will still be
subjected to inappropriate language, but hopefully it's not really that
harmful to them, and the end result will be that inappropriate language is
much more accurately identified and blocked. That is, even though you're
guaranteed some users will always see the inappropriate language in any
post, on average all users are likely to see less inappropriate language
than would be the case with a completely automated system.

That said, if you do decide to go the dictionary route, you may find that
simple Regex or IndexOf as Nicholas suggested doesn't perform well. If
the submissions are short and the dictionary only has a small number of
words in it, that's probably fine. But otherwise, you are likely to find
that the algorithm cost scales out of control as submission length and
dictionary length get large.
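
To make the scaling concern concrete: calling IndexOf once per dictionary word scans the submission once for every word, so the work grows with submission length times dictionary size. One standard way around that is the Aho-Corasick algorithm, which compiles the whole dictionary into a single automaton and then reads the text once. The sketch below is a compact Python version written for this illustration; it is not the class linked in this post, and the function names are invented here.

```python
from collections import deque

def build_automaton(words):
    """Build trie transitions, failure links and outputs for the word list."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> dictionary words ending at this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first, so shallower states get their failure links first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, word) for every dictionary word found in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits
```

The automaton is built once and reused for every submission. Note that it matches substrings, so it inherits the same false-positive and obfuscation problems as any other dictionary match; it only addresses the cost.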

If so, you may want to consider something based on existing indexing
and/or spell-check functionality. I admit, I'm not that familiar with
what's already out there. I'd guess there are already good, full-featured
libraries (maybe even classes in .NET for all I know) that can handle that
sort of work. However, if not, you may find this class that I wrote as an
exercise for a similar problem useful:
<http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/msg/0f06f696d4500b77?dmode=source>

The original poster in that thread never mentioned whether he found it
useful or not. Maybe he didn't, and maybe you wouldn't either. But I
mention it anyway, just in case. :)

Pete

