Creating more complex regexes from TAG format


Question

I can't figure out what's wrong with my regex here. (The original conversation, which includes an explanation of these TAG formats, can be found here: Translate from TAG format to Regex for Corpus.)

I am starting with a string like this:


Arms_NNS folded_VVN ,_,

The NNS could also be NN, and the VVN could also be VBG. I just want to find that string and other strings with the same tags (NNS or NN, followed by VVN or VBG, followed by a comma).

The following regex is what I am trying to use, but it is not finding anything:

[\w-]+_(?:NN|NNS)\W+[\w-]+ _(?:VBG|VVN)\W+[\w-]+ _,


Recommended answer

Given the input string

Arms_NNS folded_VVN ,_,

the following regex

(\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,)

matches the whole string (and captures it; if you don't know what that means, it probably doesn't matter to you).

Given a longer string (which I made up)

Dog_NN Arms_NNS folded_VVN ,_, burp_VV

it still matches the part you want.
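As a quick check, here is a minimal sketch that applies the pattern to both strings. It assumes Python's `re` module; the variable names are illustrative and not part of the original answer.

```python
import re

# Pattern from the answer: an NN/NNS word, then a VBG/VVN word,
# then the tagged comma, separated by single spaces.
PATTERN = re.compile(r"(\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,)")

short_text = "Arms_NNS folded_VVN ,_,"
long_text = "Dog_NN Arms_NNS folded_VVN ,_, burp_VV"

short_match = PATTERN.search(short_text)
long_match = PATTERN.search(long_text)

# Group 1 is the captured text; in both cases it is the same three tokens.
print(short_match.group(1))
print(long_match.group(1))
```

In the longer string the match starts at "Arms_NNS", because "Dog_NN" is not followed by a VBG/VVN word.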

If the _VVN part is optional, you can use

(\w+_(?:NN|NNS) (?:\w+_(?:VBG|VVN) )?,_,)

which matches either without, or with exactly one, word_VVN / word_VBG part.
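The optional variant can be tried the same way. This is a sketch assuming Python's `re` module, with made-up test strings (only the first one comes from the question):

```python
import re

# The VBG/VVN word and its trailing space sit in an optional
# non-capturing group, so it may appear once or not at all.
OPTIONAL = re.compile(r"(\w+_(?:NN|NNS) (?:\w+_(?:VBG|VVN) )?,_,)")

with_participle = OPTIONAL.search("Arms_NNS folded_VVN ,_,")
without_participle = OPTIONAL.search("Arms_NNS ,_,")
two_participles = OPTIONAL.search("Arms_NNS folded_VVN crossed_VVN ,_,")

print(with_participle.group(1))
print(without_participle.group(1))
print(two_participles)  # no match: two VVN words in a row are not allowed
```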

On your general question:

I find it hard to explain how these things work. I'll try to explain the constituent parts:


  • \w matches word characters: letters, digits, and the underscore
  • \w* matches zero or more of them; \w+, which the regexes above use, matches one or more
  • (NN|NNS) means "match NN or NNS"
  • ?: at the start of a group means "match but don't capture"; it is worth looking up what capturing means in relation to regexes.
  • ? alone means "match 0 or 1 of the thing before me", so x? would match "" or "x" but not "xx".
  • None of the characters in ,_, are special, so we can match them just by putting them in the regex.
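The individual pieces can be tried in isolation. The following sketch assumes Python's `re` module; the small test strings are made up for illustration.

```python
import re

# Quantifiers: + needs at least one word character, * also accepts none.
plus_on_empty = re.fullmatch(r"\w+", "")   # None
star_on_empty = re.fullmatch(r"\w*", "")   # a match object

# Capturing vs. non-capturing groups around the same alternation.
sample = "Arms_NNS folded_VVN"
capturing = re.search(r"\w+_(NN|NNS) ", sample)
non_capturing = re.search(r"\w+_(?:NN|NNS) ", sample)
print(capturing.groups())      # the tag is stored as group 1
print(non_capturing.groups())  # nothing is stored

# ? allows zero or one of the preceding item.
optional_results = [bool(re.fullmatch(r"x?", s)) for s in ["", "x", "xx"]]
print(optional_results)
```

Both group forms match the same text; the only difference is whether the tag is kept for later use.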

One problem with your regex is that \w will not match a comma (only "word characters"), so the final [\w-]+ cannot consume the comma at the start of ,_,.

As for [\w-]: it is a character class that matches one word character or a literal hyphen. It is valid in the common regex flavors, but you only need it if your words can contain hyphens.
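To see both points concretely, here is a sketch assuming Python's `re` module. In that flavor, at least, [\w-] compiles and matches a word character or a hyphen. The `repaired` pattern is a hypothetical fix for illustration, not something from the original thread, and "well-known_JJ" is a made-up token.

```python
import re

text = "Arms_NNS folded_VVN ,_,"

# Pattern exactly as posted in the question, including the spaces before "_".
posted = re.compile(r"[\w-]+_(?:NN|NNS)\W+[\w-]+ _(?:VBG|VVN)\W+[\w-]+ _,")
posted_result = posted.search(text)
print(posted_result)  # no match

# [\w-] is a character class: one word character or a literal hyphen.
hyphenated = re.findall(r"[\w-]+", "well-known_JJ")
print(hyphenated)

# Hypothetical repair: keep [\w-] for the words, drop the stray spaces,
# and match the tagged comma literally instead of with [\w-]+.
repaired = re.compile(r"[\w-]+_(?:NN|NNS)\W+[\w-]+_(?:VBG|VVN)\W+,_,")
repaired_result = repaired.search(text)
print(repaired_result.group(0))
```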

My solution assumes there is exactly one space, and nothing else, between your tagged words.
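If the corpus may contain several spaces, tabs, or line breaks between tokens, one possible relaxation is to replace each literal space with \s+. The sketch below assumes Python's `re` module; the `FLEXIBLE` pattern and the sample strings are illustrative assumptions, not part of the original answer.

```python
import re

# Strict pattern from the answer (single spaces only).
STRICT = re.compile(r"(\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,)")

# Variant with \s+ (one or more whitespace characters) between tokens.
FLEXIBLE = re.compile(r"(\w+_(?:NN|NNS)\s+\w+_(?:VBG|VVN)\s+,_,)")

samples = [
    "Arms_NNS folded_VVN ,_,",      # single spaces
    "Arms_NNS  folded_VVN\t,_,",    # double space and a tab
    "Arm_NN\nfolding_VBG ,_,",      # line break between tokens
]

strict_hits = [bool(STRICT.search(s)) for s in samples]
flexible_hits = [bool(FLEXIBLE.search(s)) for s in samples]
print(strict_hits)
print(flexible_hits)
```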
