Can regex match all the words outside quotation marks?

Question

I recently typed an essay for my lit class, and my teacher specifically stated a word limit that does not include quotations from the piece. And I thought, why not make a script that calculates that for you? I could, of course, do this the boring way by going through the whole text and ignoring the words inside quotation marks, but I have a feeling that there's a neater way using Regex and Array.count. As I know next to nothing about Regex, can someone help me, or tell me if it's impossible with Regex?

Tl;dr: use Regex to match all words (or spaces, doesn't matter) that are outside quotation marks from a text, and count the items in the resulting array.

Recommended answer

This is easy enough using PCRE (or Perl of course):

".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+

Use the g modifier to find every match, and add s if quotes may span multiple lines.
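For readers outside Perl/PCRE, here is a rough Python sketch of what the two modifiers change. Python's standard re module has no (*SKIP), so this swaps in a different technique: quoted spans are matched without a capture group and only words are captured. re.finditer plays the role of g and re.DOTALL the role of s. The names QUOTE_AWARE and count_words are made up for this illustration.

```python
import re

# Assumption: a capture group stands in for (*SKIP)(?!), which the
# standard-library re module does not support.
QUOTE_AWARE = r'''".*?"|(?<!\w)'.*?'(?!\w)|([\w']+)'''

def count_words(text, flags=0):
    """Count words outside quotes; quoted spans match with an empty group."""
    return sum(1 for m in re.finditer(QUOTE_AWARE, text, flags) if m.group(1))

text = 'one "two\nthree" four'
print(count_words(text))             # 4: "." stops at the newline, so the quote is not recognized
print(count_words(text, re.DOTALL))  # 2: the quote spanning two lines is skipped
```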

Here's the x version for readability:

  ".*?"              (*SKIP)(?!)
| (?<!\w)'.*?'(?!\w) (*SKIP)(?!)
| [\w']+

The first two alternatives match everything inside " or ' quotes and discard it with (*SKIP)(?!): the empty negative lookahead (?!) always fails, and (*SKIP) stops the engine from retrying at any position inside the text it just consumed. The last alternative matches all words (I've included ' as part of a word in this example). The ' character counts as a quote boundary only at the start/end of a word, so that words like isn't still work.
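If your engine lacks backtracking verbs, the same idea can be approximated with a capture group. The following is a minimal Python sketch, not the answer's method: the quoted alternatives consume their text without capturing, and only the word alternative captures, so quoted spans can be filtered out afterwards. PATTERN and words_outside_quotes are hypothetical names.

```python
import re

# Assumption: the standard re module has no (*SKIP), so a capture group
# marks the matches we want to keep.
PATTERN = re.compile(
    r'''
      ".*?"                  # double-quoted span: consumed, not captured
    | (?<!\w)'.*?'(?!\w)     # single-quoted span bounded by non-word chars
    | ([\w']+)               # a word (apostrophes allowed): captured
    ''',
    re.VERBOSE | re.DOTALL,
)

def words_outside_quotes(text):
    """Return the words in text that are not inside quotation marks."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]

sample = """He said "not these words" and left, but it isn't over."""
print(words_outside_quotes(sample))
# ['He', 'said', 'and', 'left', 'but', 'it', "isn't", 'over']
```

Because the quoted alternatives come first, a quote is always consumed whole before the word alternative gets a chance to match inside it.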

Possible modifications:

  • To count the text isn't as two words, replace [\w']+ with \w+.
  • To count text like mother-in-law as one word instead of 3, replace [\w']+ with [-\w']+.
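To see how the choice of word class changes the count, here is a small Python sketch. It uses a capture group instead of (*SKIP)(?!), since Python's standard re module does not support that verb; count_words and its word parameter are hypothetical names for this illustration.

```python
import re

def count_words(text, word=r"[\w']+"):
    """Count words outside quotes; `word` is the sub-pattern for one word."""
    # Assumption: quoted spans match uncaptured, words land in group 1.
    pattern = re.compile(rf'''".*?"|(?<!\w)'.*?'(?!\w)|({word})''', re.DOTALL)
    return sum(1 for m in pattern.finditer(text) if m.group(1))

text = "My mother-in-law isn't here"
print(count_words(text))               # 6: mother, in, law counted separately
print(count_words(text, r"\w+"))       # 7: isn't also splits into isn and t
print(count_words(text, r"[-\w']+"))   # 4: mother-in-law and isn't stay whole
```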

You get the idea ;)

And here's a full Perl script that uses this regex:

#!/usr/bin/env perl
use strict;
use warnings;

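# Slurp the whole input (files named on the command line, or STDIN).
$_ = do { local $/; <> };
# The empty-list assignment forces list context; scalar() then yields the match count.
print scalar (() = /".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+/gs), "\n";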

Run it with the text you want to count, passed either as a file argument or on STDIN, and it prints the word count to STDOUT.
