Regular expression to strip everything but words


Question

I'm hopeless with regular expressions, so please help me with this problem.

Basically, I am downloading web pages and RSS feeds and want to strip everything except plain words. No periods, commas, ifs, ands, or buts. I literally have a list of the most common words in English, and I want to strip those too, but I think I know how to do that without a regular expression, because it would be way too long.

How do I strip everything from a chunk of text except words that are delimited by spaces? Everything else goes in the trash.

Thanks Pavel: .split(/[^[:alpha:]]/).uniq!

Recommended answer

I think splitting the string into words fits you best. In this case, String::split is the better option. It accepts a regexp that matches the separators, that is, the substrings at which the source string should be split into array elements.

In your case, the separator should be "one or more non-alphabetic characters". The alphabetic character class is denoted by [:alpha:], so its negation is [^[:alpha:]]. Here's an example of what you need:

irb(main):001:0> "asd, < er >w , we., wZr,fq.".split(/[^[:alpha:]]+/)
=> ["asd", "er", "w", "we", "wZr", "fq"]
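One caveat with split: separators at the start of the text, or a pattern without the trailing +, leave empty strings in the result. The sketch below compares the two split variants with String#scan, which collects the letter runs directly. The helper name extract_words and the sample string are made up for illustration and are not part of the original answer.

```ruby
# Hypothetical helper (not from the original answer): collect runs of letters.
def extract_words(text)
  text.scan(/[[:alpha:]]+/)
end

sample = ", hello  world."

# Without "+", every single non-letter is its own separator,
# so adjacent separators produce empty strings.
p sample.split(/[^[:alpha:]]/)

# With "+", runs of separators collapse, but a separator at the very
# start of the text still yields one leading empty string.
p sample.split(/[^[:alpha:]]+/)

# scan returns only the matches, so no empty strings appear.
p extract_words(sample)
```

If you prefer to keep split, calling reject(&:empty?) on the result removes the empty strings as well.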

You can filter the result further by intersecting the resulting array with an array that contains only English words:

irb(main):001:0> ["asd", "er", "w", "we", "wZr", "fq"] & ["we","you","me"]
=> ["we"]
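Note that the intersection keeps only the listed words, whereas the question asks to drop the most common words. A minimal sketch of that opposite filter follows, assuming a short made-up stop-word list (STOP_WORDS) and a hypothetical helper name (content_words); lowercasing before the comparison is also an assumption about what the asker wants.

```ruby
require "set"

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = Set.new(%w[if and but the a we])

# Hypothetical helper: extract words, normalize case,
# drop stop words, and remove duplicates.
def content_words(text, stop_words = STOP_WORDS)
  text.scan(/[[:alpha:]]+/)
      .map(&:downcase)
      .reject { |word| stop_words.include?(word) }
      .uniq
end

p content_words("We came, and we saw; but the Ruby docs helped.")
```

For plain arrays that are already in the same case, the array difference operator does the same job in one step: words - common_words.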
