Remove substring from a list of strings
Question
I have a list of strings that contain banned words. What's an efficient way of checking if a string contains any of the banned words and removing it from the string? At the moment, I have this:
cleaned = String.Join(" ", str.Split().Where(b => !bannedWords.Contains(b,
StringComparer.OrdinalIgnoreCase)).ToArray());
This works fine for single banned words, but not for phrases (e.g. "more than one word"). Any instance of "more than one word" should also be removed. An alternative I thought of trying is to use the List's Contains method, but that only returns a bool and not the index of the matching word. If I could get the index of the matching word, I could just use String.Replace(bannedWords[i],"");
Recommended answer
A simple String.Replace will not work because it also removes parts of words. If "sex" is a banned word and you have the word "sextet", which is not banned, you should keep it as is.
Using Regex you can find whole words and phrases in a text with
string text = "A sextet is a musical composition for six instruments or voices.";
string word = "sex";
var matches = Regex.Matches(text, @"(?<=\b)" + word + @"(?=\b)");
The matches collection will be empty in this case.
You can use the Regex.Replace method
foreach (string word in bannedWords) {
    text = Regex.Replace(text, @"(?<=\b)" + word + @"(?=\b)", "");
}
Note: I used the following Regex pattern
(?<=prefix)find(?=suffix)
where 'prefix' and 'suffix' are both \b, which matches a word boundary (the beginning or end of a word).
If your banned words or phrases can contain special characters, it would be safer to escape them with Regex.Escape(word).
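To illustrate why escaping matters, here is a minimal sketch that applies the same pattern with and without Regex.Escape. The banned term "a.b", the sample text and the EscapeDemo class are made up for illustration; without escaping, the dot acts as a wildcard and unrelated words get removed too.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // Removes whole-word occurrences of 'word' from 'text'.
    // When 'escape' is false, regex metacharacters in 'word' keep their special meaning.
    public static string Remove(string text, string word, bool escape)
    {
        string body = escape ? Regex.Escape(word) : word;
        return Regex.Replace(text, @"(?<=\b)" + body + @"(?=\b)", "");
    }

    static void Main()
    {
        string text = "Contact a.b or axb today"; // hypothetical sample text
        string word = "a.b";                      // hypothetical banned term with a metacharacter

        // Unescaped: '.' matches any character, so "axb" is removed as well.
        Console.WriteLine(Remove(text, word, false));

        // Escaped: only the literal text "a.b" is removed.
        Console.WriteLine(Remove(text, word, true));
    }
}
```

Note that \b only behaves as expected when the banned term starts and ends with a word character; terms that begin or end with punctuation would need a different boundary check.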
Using @zmbq's idea you could create a Regex pattern once with
string pattern =
@"(?<=\b)(" +
String.Join(
"|",
bannedWords
.Select(w => Regex.Escape(w))
.ToArray()) +
@")(?=\b)";
var regex = new Regex(pattern); // Parsed once here; pass RegexOptions.Compiled to compile it to IL
and then use
string result = regex.Replace(text, "");
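Putting the pieces together, the combined pattern could be wrapped in a small helper along the following lines. This is only a sketch: the BannedWordFilter class and its method names are hypothetical, and it adds three things that are assumptions rather than part of the answer above: RegexOptions.IgnoreCase (to mirror the question's StringComparer.OrdinalIgnoreCase), sorting longer phrases first so that a short banned word cannot match before a longer phrase that starts with it, and collapsing the whitespace left behind after removal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class BannedWordFilter
{
    // Builds a single alternation pattern from all banned words and phrases.
    public static Regex BuildRegex(IEnumerable<string> bannedWords)
    {
        string pattern =
            @"(?<=\b)(" +
            String.Join(
                "|",
                bannedWords
                    .OrderByDescending(w => w.Length) // assumption: try longer phrases first
                    .Select(w => Regex.Escape(w))
                    .ToArray()) +
            @")(?=\b)";
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }

    // Removes all matches, then collapses leftover runs of whitespace into single spaces.
    public static string Clean(string text, Regex regex)
    {
        string removed = regex.Replace(text, "");
        return Regex.Replace(removed, @"\s+", " ").Trim();
    }

    static void Main()
    {
        var bannedWords = new List<string> { "sex", "more than one word" };
        Regex regex = BuildRegex(bannedWords); // build once, reuse for every string
        Console.WriteLine(Clean("A sextet has More Than One Word here", regex));
    }
}
```

Building the Regex once and reusing it for every string avoids reconstructing the pattern on each call, which is the point of the single-pattern approach.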