Creating more complex regexes from TAG format
Question
I can't figure out what's wrong with my regex here. (The original conversation, which includes an explanation of these TAG formats, can be found here: Translate from TAG format to Regex for Corpus).
I am starting with a string like this:
Arms_NNS folded_VVN ,_,
The NNS could also be NN, and the VVN could also be VBG. I just want to find that string and any others with the same tags (NNS or NN, followed by VVN or VBG, followed by a comma).
The following regex is what I am trying to use, but it is not finding anything:
[\w-]+_(?:NN|NNS)\W+[\w-]+ _(?:VBG|VVN)\W+[\w-]+ _,
Recommended answer
Given the input string
Arms_NNS folded_VVN ,_,
the following regex
(\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,)
matches the whole string (and captures it - if you don't know what that means, it probably doesn't matter to you).
Given a longer string (which I made up)
Dog_NN Arms_NNS folded_VVN ,_, burp_VV
it still matches the part you want.
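As a quick check of the two cases above, here is a minimal Python sketch. It assumes the answer's pattern with its backslashes written out as `\w`, and the helper name `first_match` is made up for illustration.

```python
import re

# The answer's pattern, assuming the backslashes are written out as \w.
PATTERN = re.compile(r"(\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,)")

short = "Arms_NNS folded_VVN ,_,"
longer = "Dog_NN Arms_NNS folded_VVN ,_, burp_VV"


def first_match(text):
    """Return the first captured tag sequence in text, or None (hypothetical helper)."""
    m = PATTERN.search(text)
    return m.group(1) if m else None


print(first_match(short))   # Arms_NNS folded_VVN ,_,
print(first_match(longer))  # Arms_NNS folded_VVN ,_,
```

In the longer string, `Dog_NN` is skipped because it is not directly followed by a VBG/VVN word, and `burp_VV` falls outside the match.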
If the _VVN part is optional, you can use
(\w+_(?:NN|NNS) (?:\w+_(?:VBG|VVN) )?,_,)
which matches either without, or with exactly one, word_VVN / word_VBG part.
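A small sketch of the optional variant in Python, again assuming the backslashes are written out as `\w`. The three sample strings are made up to cover zero, one and two participle words.

```python
import re

# Optional VBG/VVN word: the inner (?: ... )? group may appear zero or one time.
OPTIONAL = re.compile(r"(\w+_(?:NN|NNS) (?:\w+_(?:VBG|VVN) )?,_,)")

samples = [
    "Arms_NNS folded_VVN ,_,",              # one participle
    "Arms_NNS ,_,",                         # no participle
    "Arms_NNS folded_VVN crossed_VVN ,_,",  # two participles
]

results = {}
for s in samples:
    m = OPTIONAL.search(s)
    results[s] = m.group(1) if m else None
    print(repr(s), "->", results[s])
```

The first two samples match in full, while the third yields no match, because the noun would have to be followed by either a single participle or the comma directly.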
On your more general question:
I find it hard to explain how these things work. I'll try to explain the constituent parts:
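- \w matches a word character - the kind of character you would normally expect to find in a word
- \w+ matches one or more of them
- (NN|NNS) means "match NN or NNS"
- ?: means "match but don't capture" - I suggest searching for what capturing means in the context of regexes.
- ? on its own means "match 0 or 1 of the thing before me" - so x? will match "" or "x" but not "xx".
- None of the characters in ,_, are special, so we can match them just by putting them in the regex.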
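The parts listed above can be tried one at a time. This is a minimal Python sketch using the standard `re` module; the variable names are made up for illustration.

```python
import re

# \w is one word character; \w+ is a run of one or more of them.
one_char = re.fullmatch(r"\w", "a")
one_word = re.fullmatch(r"\w+", "folded")

# (NN|NNS) captures whichever alternative matched; (?:NN|NNS) only groups.
capturing = re.fullmatch(r"\w+_(NN|NNS)", "Arms_NNS")
grouping = re.fullmatch(r"\w+_(?:NN|NNS)", "Arms_NNS")
print(capturing.groups())  # ('NNS',)
print(grouping.groups())   # ()

# x? accepts zero or one "x", so "" and "x" match but "xx" does not.
optional_hits = [re.fullmatch(r"x?", s) is not None for s in ("", "x", "xx")]
print(optional_hits)       # [True, True, False]

# ,_, contains no special characters, so it matches itself literally.
literal = re.search(r",_,", "folded_VVN ,_, burp_VV")
print(literal.group(0))    # ,_,
```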
One problem with your regex is that \w will not match a comma (only "word characters").
I don't know what [\w-] does. It looks a bit weird. I think it's probably not valid, but I don't know for sure.
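Both points can be checked quickly in Python. In the `re` module, at least, `[\w-]` does compile: it is a character class that accepts one word character or one hyphen. The `QUESTION` pattern below is an assumption: it is the question's regex with backslashes restored and the spaces before `_` kept as they appear in the post.

```python
import re

# \w covers letters, digits and the underscore, but not punctuation.
comma_is_word = re.fullmatch(r"\w", ",") is not None
print(comma_is_word)  # False

# [\w-] is a character class: one word character or one literal hyphen.
hyphenated = re.fullmatch(r"[\w-]+", "well-known")
print(hyphenated is not None)  # True

# Assumed reconstruction of the question's pattern (backslashes restored).
QUESTION = r"[\w-]+_(?:NN|NNS)\W+[\w-]+ _(?:VBG|VVN)\W+[\w-]+ _,"
result = re.search(QUESTION, "Arms_NNS folded_VVN ,_,")
print(result)  # None
```

Under that reconstruction, the final `[\w-]+ _,` needs word characters before the last tag, whereas the actual token is `,_,`, which is consistent with the comma problem described above.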
My solution assumes there is exactly one space, and nothing else, between your tagged words.
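If the corpus might contain runs of spaces or tabs between tagged words, one possible relaxation is to replace each literal space with `\s+`. This is a sketch of that variant, not part of the original answer, and the sample text is made up.

```python
import re

# Original form: exactly one space between tagged words.
STRICT = re.compile(r"(\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,)")
# Assumed variant: any run of whitespace between tagged words.
FLEXIBLE = re.compile(r"(\w+_(?:NN|NNS)\s+\w+_(?:VBG|VVN)\s+,_,)")

text = "Arms_NNS  folded_VVN\t,_,"  # two spaces, then a tab

strict_hit = STRICT.search(text)
flexible_hit = FLEXIBLE.search(text)
print(strict_hit)             # None
print(flexible_hit.group(1))
```

Note that `\s` also matches newlines, so this variant can match across line breaks, which may or may not be wanted.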