.net regex - strings that don't contain full stop on last list item


Question

I'm trying to use .net regex to identify strings in XML data that don't contain a full stop before the last tag. I don't have much experience with regex, and I'm not sure what I need to change, or why, to get the result I'm looking for.

There are line breaks and carriage returns at the end of each line in the data.

A schema is used for the XML.

Example of good XML Data:

<randlist prefix="unorder">
    <item>abc</item>
    <item>abc</item>
    <item>abc.</item>
</randlist>

Example of bad XML Data - regexp should give matches - no full stop preceding last </item>:

<randlist prefix="unorder">
    <item>abc</item>
    <item>abc</item>
    <item>abc</item>
</randlist>

Reg exp pattern I tried that didn't work in the bad XML data (not tested on good XML data):

^<randlist \w*=[\S\s]*\.*[^.]<\/item>[\n]*<\/randlist>$

Results using http://regexstorm.net/tester:

0 matches

Results using https://regex101.com/:

0 matches

This question is different to the following imo, due to full stop and start of string criteria:

Regex for string not ending with given suffix

Explanation from 3:

/^<randlist \w*=[\S\s]*\.*[^.]<\/item>[\n]*<\/randlist>$/gm
^ asserts position at start of a line
<randlist  matches the characters <randlist  literally (case sensitive)
\w* matches any word character (equal to [a-zA-Z0-9_])
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
= matches the character = literally (case sensitive)
Match a single character present in the list below [\S\s]*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\S matches any non-whitespace character (equal to [^\r\n\t\f\v ])
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\.* matches the character . literally (case sensitive)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Match a single character not present in the list below [^.]
. matches the character . literally (case sensitive)
< matches the character < literally (case sensitive)
\/ matches the character / literally (case sensitive)
item> matches the characters item> literally (case sensitive)
Match a single character present in the list below [\n]*
< matches the character < literally (case sensitive)
\/ matches the character / literally (case sensitive)
randlist> matches the characters randlist> literally (case sensitive)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)

Solution

@Silvanas is absolutely correct. You should not use Regex for this problem; you should use some form of XML parser to read the data and find the lines with a ".". However, if for some horrible reason you MUST use Regex, and if your data is structured exactly like your example, then the Regex solution would be the following:

^\s+<item>[^<]*?(?<=\.)<\/item>$

If there ARE any matches with that regex, your xml is malformed. But again, this regex fails if the whitespace isn't correct, if there's anything else on the line, if the tags aren't <item>..</item>, and so on. Again, you would be far, far better off not using Regex for this problem unless you can absolutely guarantee that everything but the . is going to be well-formed XML.
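To make the parser route concrete, here is a minimal sketch. It is written in Python with the standard-library xml.etree.ElementTree, not in .NET, purely for illustration. The function name and the GOOD/BAD sample strings are hypothetical, and it assumes the data parses as well-formed XML with <item> elements as direct children of <randlist>.

```python
import xml.etree.ElementTree as ET


def lists_missing_final_full_stop(xml_text):
    """Return each <randlist> element whose last <item> does not end with '.'."""
    root = ET.fromstring(xml_text)
    bad = []
    # iter() also yields the root itself, so a document whose root is
    # <randlist> is handled the same way as one that nests several lists.
    for randlist in root.iter("randlist"):
        items = randlist.findall("item")  # direct <item> children only
        if not items:
            continue
        # itertext() gathers text from nested child elements as well.
        last_text = "".join(items[-1].itertext()).rstrip()
        if not last_text.endswith("."):
            bad.append(randlist)
    return bad


# Sample data modelled on the question's examples.
GOOD = """<randlist prefix="unorder">
    <item>abc</item>
    <item>abc</item>
    <item>abc.</item>
</randlist>"""

BAD = GOOD.replace("abc.", "abc")  # last item loses its full stop

print(len(lists_missing_final_full_stop(GOOD)))  # 0
print(len(lists_missing_final_full_stop(BAD)))   # 1
```

Because the parser works on elements and not on lines, indentation, attributes, and \r\n versus \n line endings do not affect the result.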

EDIT: If the opening and closing tag are on the same line, but it isn't necessarily titled 'item', and may have attributes, go ahead and try the following:

^\s+<([^<>\s]+)[^<>]*>[^<>]*?(?<=\.)<\/\1>$

Breakdown:
^           anchor to beginning of line
\s+         skip over any whitespace
<           found what looks like an opening tag
([^<>\s]+)  match the first word found after the "<", store in capture group 1
[^<>]*>     match whatever remain until the closing ">"
[^<>]*?     match all of the contents up until the next "<"
(?<=\.)     ensure the last character was a "."
<\/\1>      match a closing tag where the text after the / is the same as the first word of the opening tag (stored in capture group 1)
$           anchor to end of line

Make sure you have the Multiline regex option set (RegexOptions.Multiline in .NET); otherwise ^ and $ will only match the beginning/end of the entire string. As before, any matches with this regex mean the XML is poorly formed on that line.
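The sketch below exercises the generalized pattern against the question's sample data with \r\n line endings. It uses Python's re module as a stand-in for .NET; the constructs involved (lookbehind, backreference, multiline anchors) should behave the same way in both, but that is an assumption worth checking in your own environment. Three things here are additions and not part of the answer above: the optional \r before $ (in multiline mode $ matches just before \n, so a trailing \r would otherwise block the match), the negative-lookbehind variant (?<!\.), and the hypothetical LAST_ITEM_LACKS_STOP pattern aimed only at the last item.

```python
import re

# The generalized pattern exactly as posted; expected to miss CRLF-terminated lines.
STRICT = re.compile(
    r"^\s+<([^<>\s]+)[^<>]*>[^<>]*?(?<=\.)<\/\1>$", re.MULTILINE)

# Same pattern with an optional \r before the end-of-line anchor (an addition).
ENDS_WITH_STOP = re.compile(
    r"^\s+<([^<>\s]+)[^<>]*>[^<>]*?(?<=\.)<\/\1>\r?$", re.MULTILINE)

# Hypothetical variant: negative lookbehind flags lines that do NOT end with ".".
LACKS_STOP = re.compile(
    r"^\s+<([^<>\s]+)[^<>]*>[^<>]*?(?<!\.)<\/\1>\r?$", re.MULTILINE)

# Hypothetical pattern for the question's goal: only the last </item>,
# i.e. the one followed (after whitespace) by </randlist>.
LAST_ITEM_LACKS_STOP = re.compile(r"(?<!\.)</item>\s*</randlist>")

# The question's "good" sample, with carriage return + line feed endings.
DATA = (
    '<randlist prefix="unorder">\r\n'
    '    <item>abc</item>\r\n'
    '    <item>abc</item>\r\n'
    '    <item>abc.</item>\r\n'
    '</randlist>\r\n'
)

with_stop = [m.group(0).strip() for m in ENDS_WITH_STOP.finditer(DATA)]
without_stop = [m.group(0).strip() for m in LACKS_STOP.finditer(DATA)]

print(STRICT.findall(DATA))  # [] because \r sits between </item> and \n
print(with_stop)             # ['<item>abc.</item>']
print(without_stop)          # ['<item>abc</item>', '<item>abc</item>']
print(LAST_ITEM_LACKS_STOP.search(DATA))  # None: the last item ends with "."
```

Note that the line-based variants report every item line, not just the last one in the list. If only the final item matters, as in the question, something like the last pattern (or the XML parser route) is closer to the stated goal.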
