How do I extract a content control from an RTF file using REGEX?


Problem description

I need to extract content controls (Mainly Check and Combo boxes) from RTF files and replace them with their text value.
OpenXML is not an option because the XML part of the Word doc was lost when the content was saved as RTF.

I got as far as the following regex, but it has a problem with the last part.

((\}\{\\field\\fldpriv)([.\s\S]*)FORMCHECKBOX([.\s\S]*)(\{\\fldrslt \}\}))

This needs to be changed so that it finds the FIRST occurrence of {\fldrslt }} after FORMCHECKBOX. As it is, it seems to find the last.
I would welcome any suggestion as to how to make this work. I don't want to write code that finds FORMCHECKBOX and then steps back and forth a character at a time to the start and end of the section.

Thanking you in advance.

Solution

I have done some research and believe the problem is that the quantifiers are greedy: each * consumes as much text as it can and only backtracks as far as it must, so the match ends at the last {\fldrslt }} instead of the first. The fix is to make the quantifiers lazy (non-greedy) by writing *? in place of *.

Try using it like this:

((\}\{\\field\\fldpriv)([.\s\S]*?)FORMCHECKBOX([.\s\S]*?)(\{\\fldrslt \}\}))
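To see the difference between the two patterns, here is a minimal Python sketch. The RTF fragment is a hypothetical, heavily simplified sample (real Word output is far more verbose), and the helper name checkbox_text is made up for illustration. The sketch also assumes that \ffres1 inside the form field marks a checked box, which should be verified against your own files. It uses [\s\S] rather than [.\s\S], since a . inside a character class is only a literal dot and adds nothing here.

```python
import re

# Hypothetical, simplified RTF fragment containing two checkbox fields.
rtf = (
    r"{\rtlch A}{\field\fldpriv{\*\fldinst FORMCHECKBOX {\*\formfield{\fftype1\ffres1}}}{\fldrslt }}"
    r"{\rtlch B}{\field\fldpriv{\*\fldinst FORMCHECKBOX {\*\formfield{\fftype1\ffres0}}}{\fldrslt }}"
)

HEAD = r"\}\{\\field\\fldpriv"   # matches }{\field\fldpriv
TAIL = r"\{\\fldrslt \}\}"       # matches {\fldrslt }}

# Greedy quantifiers (*) versus lazy quantifiers (*?).
greedy = re.compile(HEAD + r"[\s\S]*FORMCHECKBOX[\s\S]*" + TAIL)
lazy = re.compile(HEAD + r"[\s\S]*?FORMCHECKBOX[\s\S]*?" + TAIL)

greedy_matches = [m.group(0) for m in greedy.finditer(rtf)]
lazy_matches = [m.group(0) for m in lazy.finditer(rtf)]


def checkbox_text(match):
    """Return replacement text for one matched checkbox field."""
    # Assumption: \ffres1 means "checked"; anything else is treated as unchecked.
    checked = re.search(r"\\ffres1(?!\d)", match.group(0)) is not None
    # The pattern consumed the "}" that closes the preceding group, so put it back.
    return "}" + ("[x]" if checked else "[ ]")


replaced = lazy.sub(checkbox_text, rtf)

print(len(greedy_matches), len(lazy_matches))
print(replaced)
```

With this sample the greedy pattern yields one match spanning both fields, while the lazy pattern yields one match per field. Note that even the lazy pattern starts at the leftmost }{\field\fldpriv, so if a non-checkbox field precedes a checkbox field, the match may begin at the earlier field; that case would need a stricter pattern or a check in the callback.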

A great tool to test this can be found here

