正则表达式有助于消除噪音词或停止串词 [英] RegEx help to remove noise words or stop words from string

查看:279
本文介绍了正则表达式有助于消除噪音词或停止串词的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想删除从输入标签的所有噪音标签(字符串)
标签由逗号分隔。如果一个干扰词是一个大标签的一部分,它仍将



这是我有什么,但不工作:

 字符串input_string =这肯定一下,我们所有的,全部价值; 
串禁用词=这|是|关于我们|后|全部|也;
禁用词=的String.Format(@\s \b(?:?{0})\b\s禁用词);
字符串变量= Regex.Replace(input_string,停用词,,RegexOptions.IgnoreCase);

这是我从上面输入想要的:
,确定,,我们所有的,,值



这些话这,左右,所有将被替换为,因为它们是干扰词。
但是,我们所有的将保持即使它有它的噪音所有一词。
这是因为逗号标签边界



任何人都可以给我一个伸出援助之手?



我有放噪音词进入词典,然后搜索中输入字符串的每个单词的替代解决方案。但我更喜欢的正则表达式的方法。


解决方案

  VAR输入=这个,当然,一下,我们所有的,全部价值; 
变种禁用词=新的正则表达式(^(这|是|关于我们|后|全部|也)$);
变种结果=的string.join(,,input.Split(,)
其中(X =方式>!stopWords.IsMatch(x.Trim())));


I want to remove all noise tags from input tags (a string) The tags are separated by comma. If a noise word is part of a big tag, it will remain.

This is what I have but not working:

string input_string = "This,sure,about,all of our, all, values";
string stopWords = "this|is|about|after|all|also";
stopWords = string.Format(@"\s?\b(?:{0})\b\s?", stopWords);
string tags = Regex.Replace(input_string, stopWords, "", RegexOptions.IgnoreCase); 

This is what I want from above input: ",sure,,all of our,,values"

These words "This", "about", "all" will be replaced with "" since they are noise words. But "all of our" will remain even if it has the noise word "all" in it. This is because comma is the tag boundary

Anyone can give me a helping hand?

I had an alternate solution that puts the noise words into a dictionary and then search each word in input string. But I prefer RegEx approach.

解决方案

        var input = "This,sure,about,all of our, all, values";
        var stopWords = new Regex("^(this|is|about|after|all|also)$");
        var result = String.Join(",", input.Split(',').
            Where(x => !stopWords.IsMatch(x.Trim())));

这篇关于正则表达式有助于消除噪音词或停止串词的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆