Can regex match all the words outside quotation marks?
Question
I recently typed an essay for my lit class, and my teacher specifically stated a word limit that does not include quotations from the piece. And I thought, why not make a script that calculates that for you? I could, of course, do this the boring way by going through the whole text and ignoring the words inside quotation marks, but I have a feeling that there's a neater way using Regex and Array.count. As I know next to nothing about Regex, can someone help me, or tell me that it's impossible with Regex?
TL;DR: use Regex to match all words (or spaces, it doesn't matter) that are outside quotation marks in a text, and count the items in the resulting array.
Recommended answer
This is easy enough using PCRE (or Perl of course):
".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+
Use the g modifier, and add s if you want to handle multiline quotes.
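For readers without PCRE at hand, the effect of the two modifiers can be sketched in Python. This is an illustration under assumptions, not the answer's method: Python's built-in `re` module has no `(*SKIP)` verb, so the sketch uses a simplified pattern that handles double quotes only and marks real words with a capture group. `re.finditer` stands in for `g` and `re.DOTALL` for `s`; the helper name `count_words` and the sample text are made up.

```python
import re

# Simplified stand-in for the answer's pattern: double quotes only.
# A quoted span is matched and ignored; group 1 captures a real word.
PATTERN = r'".*?"|(\w+)'

def count_words(text: str, flags: int = 0) -> int:
    """Count words outside double quotes; finditer plays the role of /g."""
    return sum(1 for m in re.finditer(PATTERN, text, flags) if m.group(1))

sample = 'before "line one\nline two" after'

# Without DOTALL (no /s), "." stops at the newline, so the quote never
# closes within one line and the quoted words are counted as well.
print(count_words(sample))             # 6
# With DOTALL (like /s), the quote spans both lines and is skipped.
print(count_words(sample, re.DOTALL))  # 2
```

So if quoted passages in the essay can wrap across lines, the `s` behaviour is the one you want.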
Here's the x version for readability:
".*?" (*SKIP)(?!)
| (?<!\w)'.*?'(?!\w) (*SKIP)(?!)
| [\w']+
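The first two alternatives match everything inside " or ' quotes and discard it with (*SKIP)(?!): the empty negative lookahead (?!) always fails, and (*SKIP) makes the engine resume after the quoted text instead of retrying inside it. The last alternative matches all words (I've included ' as being part of a word in this example). The ' character is counted as a quote boundary only at the start or end of a word, so that you can still use things like isn't, for instance.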
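The same idea can be approximated in engines that lack backtracking verbs. The following is a minimal sketch in Python, assuming the standard `re` module (which does not support `(*SKIP)`): instead of failing and skipping, the quoted alternatives are allowed to match and are then ignored, while a capture group marks the real words. The function name `count_unquoted_words` and the sample sentence are hypothetical.

```python
import re

PATTERN = re.compile(
    r"""
      ".*?"                  # double-quoted span: matched, then ignored
    | (?<!\w)'.*?'(?!\w)     # single-quoted span at word edges: ignored
    | ([\w']+)               # group 1: a word outside any quotes
    """,
    re.VERBOSE | re.DOTALL,  # VERBOSE is like /x, DOTALL is like /s
)

def count_unquoted_words(text: str) -> int:
    """Count matches where group 1 (a real word) took part in the match."""
    return sum(1 for m in PATTERN.finditer(text) if m.group(1) is not None)

text = "He said \"to be or not to be\" and it isn't 'so simple' really"
print(count_unquoted_words(text))  # 6
```

The six counted words are He, said, and, it, isn't and really; both quoted spans are consumed by the first two alternatives and never reach the word branch.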
Possible modifications:
- To count the text isn't as two words, replace [\w']+ with \w+.
- To count text like mother-in-law as one word instead of 3, replace [\w']+ with [-\w']+.
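The effect of these substitutions can be checked with a small Python sketch. As an assumption, it emulates the pattern with the standard `re` module (no `(*SKIP)` support) by ignoring matches of the quoted alternatives and counting only the captured word group; the helper `count_words`, its `word_class` parameter and the sample text are invented for illustration.

```python
import re

# The two quoted alternatives from the answer, without the PCRE verbs.
QUOTED = r'''".*?"|(?<!\w)'.*?'(?!\w)'''

def count_words(text: str, word_class: str) -> int:
    """Count unquoted words, using word_class as the final alternative."""
    pattern = f"{QUOTED}|({word_class})"
    return sum(1 for m in re.finditer(pattern, text, re.DOTALL) if m.group(1))

sample = "isn't mother-in-law"

print(count_words(sample, r"[\w']+"))   # 4: isn't, mother, in, law
print(count_words(sample, r"\w+"))      # 5: isn, t, mother, in, law
print(count_words(sample, r"[-\w']+"))  # 2: isn't, mother-in-law
```

Which variant is right depends on how the word limit is meant to be counted, so it is worth checking against a short passage counted by hand.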
You get the idea ;)
And here's a full Perl script that uses this regex:
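#!/usr/bin/env perl
use strict;
use warnings;
# Slurp the whole input (files named on the command line, or STDIN).
$_ = do { local $/; <> };
# Run the match in list context with /g, then count the results
# by assigning them to an empty list in scalar context.
print scalar (() = /".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+/gs), "\n";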
Run it, passing in a file (or STDIN) containing the text whose words you want to count, and it will print the word count to STDOUT.