Regex: how to get words from a string (C#)
Question
My input consists of user-posted strings.
What I want to do is create a dictionary with words, and how often they’ve been used. This means I want to parse a string, remove all garbage, and get a list of words as output.
For example, say the input is "#@!@LOLOLOL YOU'VE BEEN ***PWN3D*** !:') !!!1einszwei drei !"
The output I need is the list:
"LOLOLOL"
"YOU'VE"
"BEEN"
"PWN3D"
"einszwei"
"drei"
I’m no hero at regular expressions and have been Googling, but my Google kung fu seems to be weak…
How would I go from input to the wanted output?
Recommended answer
Simple regex:
\w+
This matches a string of "word" characters. That is almost what you want.
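To see why the simple pattern is only "almost" right, here is a minimal C# sketch, assuming the pattern is `\w+` applied with `Regex.Matches`. The class and method names (`SimpleWordExtractor`, `Extract`) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class SimpleWordExtractor
{
    // Returns every run of "word" characters (\w+) found in the input, in order.
    public static List<string> Extract(string input)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(input, @"\w+"))
        {
            words.Add(m.Value);
        }
        return words;
    }
}
```

Calling `SimpleWordExtractor.Extract` on the sample input from the question should give LOLOLOL, YOU, VE, BEEN, PWN3D, 1einszwei, drei: the apostrophe splits YOU'VE in two, and the leading digit stays attached to 1einszwei.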
This is slightly more accurate:
\w(?<!\d)[\w'-]*
It matches any number of word characters, ensuring that the first character was not a digit.
Here are my matches:
1 LOLOLOL
2 YOU'VE
3 BEEN
4 PWN3D
5 einszwei
6 drei
Now that's more like it.
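Since the question asks for a dictionary of words and how often they were used, here is a rough sketch of that step. It assumes the lookbehind pattern is `\w(?<!\d)[\w'-]*` and that counting should ignore case (the question does not say either way). `WordFrequency` and `Count` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class WordFrequency
{
    // \w(?<!\d) : one word character that is not a digit
    // [\w'-]*   : followed by any word characters, apostrophes, or hyphens
    private static readonly Regex WordPattern = new Regex(@"\w(?<!\d)[\w'-]*");

    // Builds a word -> count dictionary. Case-insensitive keys are an assumption.
    public static Dictionary<string, int> Count(string input)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in WordPattern.Matches(input))
        {
            // TryGetValue leaves n at 0 when the word has not been seen yet.
            counts.TryGetValue(m.Value, out int n);
            counts[m.Value] = n + 1;
        }
        return counts;
    }
}
```

With the case-insensitive comparer, the dictionary keeps the first spelling it sees as the key, so "drei" and "DREI" end up in one entry. Drop the comparer argument if you want them counted separately.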
The reason for the negative lookbehind is that some regex flavors support Unicode characters. Using [a-zA-Z] would miss quite a few "word" characters that are desirable. Allowing \w and disallowing \d includes all Unicode characters that would conceivably start a word in any block of text.
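A small comparison makes the Unicode point concrete. This is only a sketch: `UnicodeWordDemo` and `Find` are made-up names, and the sample text is my own. The accented letters are written as `\u` escapes so the result does not depend on the source file's encoding.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class UnicodeWordDemo
{
    // Returns all matches of the given pattern in the input, in order.
    public static List<string> Find(string pattern, string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

For the text `"na\u00EFve caf\u00E9"` (naïve café), `Find(@"[a-zA-Z]+", text)` returns na, ve, caf, while `Find(@"\w(?<!\d)[\w'-]*", text)` returns both words intact, because .NET's `\w` covers Unicode letters by default.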
EDIT 2:
I have found a more concise way to get the effect of the negative lookbehind: Double negative character class with a single negative exclusion.
[^\W\d][\w'-]*(?<=\w)
This is the same as the above with the exception that it also ensures that the word ends with a word character. And, finally, there is:
[^\W\d](\w|[-']{1,2}(?=\w))*
This ensures that there are no more than two non-word characters in a row. That is, it matches "word-up" and "word--up" as single words, but not "word---up", which makes sense. If you want it to match "word---up" but not "word----up", you can change the 2 to a 3.
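Here is a quick way to check how the last pattern treats runs of hyphens and apostrophes. It is a sketch that assumes the pattern is `[^\W\d](\w|[-']{1,2}(?=\w))*`. `HyphenRunDemo` and `Words` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HyphenRunDemo
{
    // [^\W\d]            : a word character that is not a digit
    // (\w|[-']{1,2}(?=\w))* : then word characters, or one or two hyphens/apostrophes
    //                         that are directly followed by a word character
    private static readonly Regex Pattern = new Regex(@"[^\W\d](\w|[-']{1,2}(?=\w))*");

    // Returns all words found by the pattern, in order.
    public static List<string> Words(string input)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

`HyphenRunDemo.Words("word-up word--up word---up")` returns word-up, word--up, word, up: the three-hyphen run is too long, so that token is split into two words. A trailing hyphen, as in "rock-", is left out of the match because the lookahead requires a word character after it.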