How to optionally add a comma and whitespace to a capture group?


Question

I am trying to match five substrings in each block of text (there are 100 blocks in total).

My pattern matches 99% of the blocks, but there are a few errors in groups 3 and 4.

Here is a demo link: https://regex101.com/r/cW2Is3/4

Group 3 is the "parts of speech", and group 4 is the English translation.

In the first block of text, "det, pro" should all be in group 3, and "the; him, her, it, them" should be in group 4.

The same issue occurs again in the third block of text: group 3 should be "adj, det, nm, pro" and group 4 should be "a, an, one".

Here is my pattern:

([0-9]+)\s+(\w+(?:, \w+)?)\s+(\N+?)\s+(\H.+).*?\r?\n•\s+([\s\S]*?)\s+[0-9]+\s\|.*\s*

Recommended answer

When you have to describe a long string with many parts, a good first reflex is to use free-spacing mode (the x modifier) and named groups. Even if named groups aren't very useful in a replacement context, they make the pattern more readable and easier to debug:

~^
(?<No> [0-9]+ )  \h+
(?<word> \pL+ )  \h+
(?<type> [\pL()]+ (?: , \h* [\pL()]+ )* )  \h+
(?<wd_tr> [^•]* [^•\s] )  \h* \R

• \h*
(?<sent_fr> [^–]* [^\s–] )   \s* – \s*
(?<sent_eng> .* (?:\R .*)*? )  \h* \R

(?<num1> [0-9]+ )  \h* \| \h*
(?<num2> .*\S )
~xum
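For groups 3 and 4, the part that matters is the `type` group: one part-of-speech token followed by zero or more ", token" repetitions, so the comma and whitespace are consumed only when another token actually follows. Below is a minimal sketch of just the header line, ported to Python's `re` module for illustration. It is an approximation of the PCRE pattern above, not a drop-in replacement: `re` has no `\pL` or `\h`, so `[^\W\d_]` and `[ \t]` stand in for them, and the sample lines are made up from the description in the question.

```python
import re

# Python port of the first four named groups of the PCRE pattern above.
# Assumptions: [^\W\d_] approximates \pL, [ \t] approximates \h, and the
# header (number, word, parts of speech, translation) sits on one line.
HEADER = re.compile(
    r"""
    ^
    (?P<No>    [0-9]+ )     [ \t]+
    (?P<word>  [^\W\d_]+ )  [ \t]+
    (?P<type>  (?:[^\W\d_]|[()])+ (?: , [ \t]* (?:[^\W\d_]|[()])+ )* )  [ \t]+
    (?P<wd_tr> .* \S )      [ \t]*
    $
    """,
    re.VERBOSE,
)


def parse_header(line):
    """Return the named groups of a header line as a dict, or None."""
    match = HEADER.match(line)
    return match.groupdict() if match else None


# Hypothetical sample lines modelled on the first and third blocks.
first = parse_header("1 le det, pro the; him, her, it, them")
third = parse_header("3 un adj, det, nm, pro a, an, one")
print(first["type"], "|", first["wd_tr"])
print(third["type"], "|", third["wd_tr"])
```

The repetition stops at "pro" because the next character is a space rather than a comma, which is what keeps the translation out of the `type` group.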


There is no magic recipe for building a pattern for a string with a loosely defined format. All you can do is be as restrictive as possible at the beginning and add flexibility when you encounter cases that don't match.
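One way to put this advice into practice is to run a strict pattern over all the data, list what it fails to match, and loosen only the part responsible. The sketch below does this in Python for the header line only. The sample lines and the helper `unmatched` are hypothetical, and `[^\W\d_]` is used as a stand-in for `\pL`.

```python
import re

# Hypothetical header lines, modelled on the examples in the question.
LINES = [
    "1 le det, pro the; him, her, it, them",
    "2 de prep of, from, some, any",
    "3 un adj, det, nm, pro a, an, one",
]


def unmatched(pattern, lines):
    """Return the lines that the compiled pattern does not match in full."""
    return [line for line in lines if pattern.fullmatch(line) is None]


# Step 1: strictest guess, exactly one part-of-speech token.
strict = re.compile(r"[0-9]+ [^\W\d_]+ [^\W\d_]+ \S.*")

# Step 2: after inspecting the failures, allow a comma-separated list.
relaxed = re.compile(r"[0-9]+ [^\W\d_]+ [^\W\d_]+(?:, [^\W\d_]+)* \S.*")

print(unmatched(strict, LINES))   # the lines with several parts of speech
print(unmatched(relaxed, LINES))  # nothing left once the list is allowed
```

Starting strict makes each failure point at a specific assumption in the pattern, which is easier to fix than a loose pattern that silently puts text in the wrong group.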
