Regex: how to get words from a string (C#)

Views: 41  Published: 2021/12/25 8:55:13  Tags: c# regex string replace

Question

My input consists of user-posted strings.

What I want to do is create a dictionary of words and how often they have been used. This means I want to parse a string, remove all the garbage, and get a list of words as output.

For example, say the input is "#@!@LOLOLOL YOU'VE BEEN ***PWN3D*** ! :') !!!1einszwei drei !"

The output I need is the list:

"LOLOLOL"
"YOU'VE"
"BEEN"
"PWN3D"
"einszwei"
"drei"

I'm no hero at regular expressions and have been Googling, but my Google-kungfu seems to be weak…

How would I go from the input to the wanted output?

Recommended answer

Simple regex:

\w+

This matches a string of "word" characters. That is almost what you want.
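To see why \w+ is only "almost" right, here is a minimal C# sketch that runs it over the sample input from the question. The class and method names are hypothetical, chosen only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SimpleWordMatcher
{
    // Hypothetical helper: returns every run of \w characters in the input.
    public static List<string> Words(string input) =>
        Regex.Matches(input, @"\w+")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToList();
}
```

For the sample input this yields LOLOLOL, YOU, VE, BEEN, PWN3D, 1einszwei, drei: the apostrophe splits YOU'VE in two, and the stray leading digit stays attached to einszwei.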

This is slightly more accurate:

\w(?<!\d)[\w'-]*

It matches any number of word characters, ensuring that the first character is not a digit.
Here are my matches:

1 LOLOLOL
2 YOU'VE
3 BEEN
4 PWN3D
5 einszwei
6 drei

Now, that's more like it.
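The question asks for a dictionary of words and their frequencies, so the matches still have to be counted. The following is a rough sketch of one way to do that, assuming the pattern is \w(?<!\d)[\w'-]*. The WordFrequency class and its Count method are hypothetical names, and comparing words case-insensitively is an assumption that neither the question nor the answer states.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordFrequency
{
    // A word character that is not a digit, followed by any number of
    // word characters, apostrophes, or hyphens.
    private static readonly Regex WordPattern = new Regex(@"\w(?<!\d)[\w'-]*");

    // Hypothetical helper: counts how often each word occurs in the input.
    // Case-insensitive keys are an assumption, not part of the answer.
    public static Dictionary<string, int> Count(string input)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in WordPattern.Matches(input))
        {
            counts.TryGetValue(m.Value, out int n); // n stays 0 for a new word
            counts[m.Value] = n + 1;
        }
        return counts;
    }
}
```

Calling WordFrequency.Count on the sample input gives six entries, each with a count of 1. Each key keeps the casing of the word's first occurrence.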

The reason for the negative look-behind is that some regex flavors support Unicode characters. Using [a-zA-Z] would miss quite a few "word" characters that are desirable. Allowing \w and disallowing \d includes all Unicode characters that could conceivably start a word in any block of text.
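The difference shows up as soon as the text contains non-ASCII letters. The sketch below compares the two approaches in .NET, where \w and \d are Unicode-aware by default (unless RegexOptions.ECMAScript is set). The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class UnicodeWordDemo
{
    // Hypothetical helper: lists the matches of a pattern in the input.
    public static List<string> Find(string pattern, string input) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToList();

    // ASCII-only class: breaks words apart at every non-ASCII letter.
    public static List<string> AsciiOnly(string input) =>
        Find("[a-zA-Z]+", input);

    // \w with \d excluded for the first character, as described above.
    public static List<string> UnicodeAware(string input) =>
        Find(@"\w(?<!\d)[\w'-]*", input);
}
```

For the input "über 2naïve", AsciiOnly returns ber, na, ve, while UnicodeAware returns über, naïve.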
EDIT 2:

I have found a more concise way to get the effect of the negative look-behind: a double-negative character class with a single negative exclusion. [^\W\d] reads as "neither a non-word character nor a digit", that is, any word character that is not a digit.

[^\W\d][\w'-]*(?<=\w)
This is the same as the above, except that it also ensures that the word ends with a word character. And, finally, there is:

[^\W\d](\w|[-']{1,2}(?=\w))*
This ensures that there are no more than two non-word characters in a row. That is, it matches "word-up" and "word--up" as single words, but not "word---up", which makes sense. If you want it to match "word---up" as well, you can change the 2 to a 3; to allow only one hyphen or apostrophe at a time, change it to a 1.
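The effect of the upper bound in {1,2} is easy to check by making it a parameter. This is a small sketch under that assumption; Tokenize and maxRun are hypothetical names, and the pattern is otherwise the one shown above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HyphenRunDemo
{
    // Hypothetical helper: tokenizes the input, allowing at most maxRun
    // consecutive hyphens or apostrophes inside a word.
    public static List<string> Tokenize(string input, int maxRun)
    {
        // Same shape as the pattern above, with the upper bound configurable.
        string pattern = @"[^\W\d](\w|[-']{1," + maxRun + @"}(?=\w))*";
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

With maxRun set to 2, the input "word-up word--up word---up" produces word-up, word--up, word, up. With maxRun set to 1 it produces word-up, word, up, word, up. Because of the look-ahead, a trailing hyphen is never part of a match, so "word-" yields just word.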
