Regex: how to get words from a string (C#)

Question

My input consists of user-posted strings.

What I want to do is build a dictionary of words and how often each one has been used. This means I want to parse a string, remove all the garbage, and get a list of words as output.

For example, say the input is "#@!@LOLOLOL YOU'VE BEEN ***PWN3D*** !:') !!!1einszwei drei !"

The output I need is the list:

  • "LOLOLOL"
  • "YOU'VE"
  • "BEEN"
  • "PWN3D"
  • "einszwei"
  • "drei"

I'm no hero at regular expressions and have been Googling, but my Google-kungfu seems to be weak…

How would I go from the input to the wanted output?

Recommended answer

Simple regex:

\w+

This matches a string of "word" characters. That is almost what you want.
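To see why this is only "almost" right, here is a minimal sketch that runs \w+ over the sample input with .NET's Regex.Matches. The class and method names (SimpleWordDemo, Words) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SimpleWordDemo
{
    // Returns every run of \w characters found in the input, in order.
    public static string[] Words(string input) =>
        Regex.Matches(input, @"\w+")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        string input = "#@!@LOLOLOL YOU'VE BEEN ***PWN3D*** !:') !!!1einszwei drei !";
        Console.WriteLine(string.Join(", ", Words(input)));
        // Prints: LOLOLOL, YOU, VE, BEEN, PWN3D, 1einszwei, drei
    }
}
```

The apostrophe splits "YOU'VE" into two matches, and the leading digit stays attached to "1einszwei". Both are gaps in the plain \w+ approach.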

This is slightly more accurate:

\w(?<!\d)[\w'-]*

It matches any number of word characters, ensuring that the first character is not a digit.

Here are my matches:

1 LOLOLOL
2 YOU'VE
3 BEEN
4 PWN3D
5 einszwei
6 drei

Now that's more like it.
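Since the goal in the question is a dictionary of words and their frequencies, a minimal sketch of that step might look like the following. It assumes the pattern reads \w(?<!\d)[\w'-]*. The names WordFrequency and Count are hypothetical, and counting case-insensitively is an assumption of this sketch, not something the question asks for.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordFrequency
{
    // A word character that is not a digit, followed by any number of
    // word characters, apostrophes, or hyphens.
    private static readonly Regex WordPattern = new Regex(@"\w(?<!\d)[\w'-]*");

    // Counts how often each word occurs in the input.
    // Assumption: "Drei" and "drei" count as the same word.
    public static Dictionary<string, int> Count(string input)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in WordPattern.Matches(input))
        {
            counts.TryGetValue(m.Value, out int n); // n is 0 when the word is new
            counts[m.Value] = n + 1;
        }
        return counts;
    }

    public static void Main()
    {
        string input = "#@!@LOLOLOL YOU'VE BEEN ***PWN3D*** !:') !!!1einszwei drei !";
        foreach (var pair in Count(input))
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```

If case matters for your data, drop the StringComparer argument so the dictionary uses ordinary case-sensitive keys.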


The reason for the negative look-behind is that some regex flavors support Unicode characters. Using [a-zA-Z] would miss quite a few "word" characters that are desirable. Allowing \w and disallowing \d includes all Unicode characters that would conceivably start a word in any block of text.

EDIT 2:
I have found a more concise way to get the effect of the negative lookbehind: a double-negative character class with a single negative exclusion.

[^\W\d][\w'-]*(?<=\w)

This is the same as the above, except that it also ensures that the word ends with a word character. And, finally, there is:

[^\W\d](\w|[-']{1,2}(?=\w))*

This ensures that there are no more than two non-word characters in a row. That is, it matches "word-up" and "word--up" as single words, but not "word---up", which makes sense. If you want it to match "word-up" but not "word--up", replace the {1,2} quantifier with a single [-'].
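A quick way to check how the two EDIT 2 patterns treat runs of hyphens and trailing punctuation is to run them side by side. This is only a sketch. The class name HyphenRunDemo, the helper Words, and the sample strings are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HyphenRunDemo
{
    // Word must start with a non-digit word character and end with a word character.
    public const string EndsWithWordChar = @"[^\W\d][\w'-]*(?<=\w)";

    // Same start rule, but at most two consecutive - or ' characters inside a word.
    public const string LimitedRuns = @"[^\W\d](\w|[-']{1,2}(?=\w))*";

    // Collects all matches of the pattern in the input as plain strings.
    public static string[] Words(string pattern, string input) =>
        Regex.Matches(input, pattern)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        foreach (string sample in new[] { "word-up", "word--up", "word---up", "rock'n'roll-" })
        {
            string first = string.Join(", ", Words(EndsWithWordChar, sample));
            string second = string.Join(", ", Words(LimitedRuns, sample));
            Console.WriteLine($"{sample} -> [{first}] vs [{second}]");
        }
    }
}
```

With the {1,2} quantifier, "word--up" comes back as one match, while "word---up" is split into "word" and "up". The first pattern keeps "word---up" whole because it only constrains the first and last characters.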
