How to optionally add a comma and whitespace to a capture group?
Problem description
I am trying to match five substrings in each block of text (there are 100 blocks total).
I am matching 99% of the blocks of text, but with a few errors regarding groups 3 and 4.
Here is a demo link: https://regex101.com/r/cW2Is3/4
Group 3 is "parts of speech", and group 4 is an English translation.
In the first block of text, "det, pro" should all be in group 3, and "the; him, her, it, them" should be in group 4.
The same issue occurs again in the third block of text. Group 3 should be "adj, det, nm, pro" and group 4 should be "a, an, one".
Here is my pattern:
([0-9]+)\s+(\w+(?:, \w+)?)\s+(\N+?)\s+(\H.+).*?\r?\n•\s+([\s\S]*?)\s+[0-9]+\s\|.*\s*
Recommended answer
When you have to describe a long string with many parts, a good first reflex is to use free-spacing mode (the x modifier) and named groups. Even if named groups aren't very useful in a replacement context, they make the pattern more readable and easier to debug:
~^
(?<No> [0-9]+ ) \h+
(?<word> \pL+ ) \h+
(?<type> [\pL()]+ (?: , \h* [\pL()]+ )* ) \h+
(?<wd_tr> [^•]* [^•\s] ) \h* \R
• \h*
(?<sent_fr> [^–]* [^\s–] ) \s* – \s*
(?<sent_eng> .* (?:\R .*)*? ) \h* \R
(?<num1> [0-9]+ ) \h* \| \h*
(?<num2> .*\S )
~xum
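The key to the original problem is the "type" group in the pattern above: one tag followed by zero or more ", tag" repetitions, so the group takes every comma-separated part of speech and stops at the first whitespace that does not follow a comma. Below is a minimal Python sketch of just that header line, not a port of the full pattern. It assumes Python's built-in re module, which lacks \h, \R and \pL, so [ \t] and \w stand in for them (\w is slightly broader than \pL because it also accepts digits and underscores). The two sample lines and the words "le" and "un" are made up to fit the shape described in the question; they are not taken from the demo link.

```python
import re

# Header line only: number, word, part-of-speech tags, translation.
# [ \t] approximates PCRE's \h, and \w approximates \pL (assumption).
HEADER = re.compile(
    r"""
    ^
    (?P<no>    [0-9]+ )                  [ \t]+
    (?P<word>  \w+ )                     [ \t]+
    (?P<type>  [\w()]+                   # first part-of-speech tag
               (?: , [ \t]* [\w()]+ )*   # zero or more comma-separated extra tags
    )                                    [ \t]+
    (?P<wd_tr> .* \S )                   # rest of the line: the translation
    $
    """,
    re.VERBOSE | re.MULTILINE,
)

# Hypothetical sample lines, shaped like the blocks the question describes.
sample = (
    "1 le det, pro the; him, her, it, them\n"
    "3 un adj, det, nm, pro a, an, one\n"
)

rows = [m.groupdict() for m in HEADER.finditer(sample)]
for row in rows:
    print(row["type"], "|", row["wd_tr"])
# det, pro | the; him, her, it, them
# adj, det, nm, pro | a, an, one
```

Because the repetition only continues after a comma, the translation (which starts after plain whitespace) is left for the next group even when it contains commas itself.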
There is no magic recipe for building a pattern for a string with a loosely defined format. All you can do is be as restrictive as possible at the beginning and add flexibility when you encounter cases that don't match.