Search and replace in Word RTF


Question

I'm working on an application that has a workflow for postal mail. The letters are generated according to my application's business rules.

The templates are HTML or RTF, and everything works perfectly as long as the user does not create the RTF with Word. Word support is not in the specs, but my management would welcome it if it doesn't involve too much work, and it would make life easier for our customers.

The RTF templates contain tags which are replaced by application values. In most RTF, tags are not split, so search and replace works perfectly. I would like to handle Word output with as few modifications as possible.

Example data: [[FooBuzz]]. In most RTF it is not split.

In Word 2003:

{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}

Their Word (Word 2007) also splits the tag: Foo{garbage inside} Buzz.

So I want to keep handling ordinary RTF perfectly, and also detect tags when they are split.

I have two constraints. First, no regressions; second, it has to stay simple. Performance is not an issue here.

I'm using symfony 1.4. The relevant part of the current search code:

$regExpression = '/\[\[([^\[\]]*)\]\]/';  

preg_match_all($regExpression, $sTemplate, $outKeys); 
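As a rough illustration of the problem (a sketch, not part of the original post), the current pattern can be run against a tag kept in one run of text and against the Word 2003 output quoted above. The `$plain` string is a made-up example. The pattern still matches the Word version, but the captured key is no longer the bare tag name.

```php
<?php
// Search pattern from the question.
$regExpression = '/\[\[([^\[\]]*)\]\]/';

// Hypothetical unsplit tag, and the Word 2003 output quoted above.
$plain = '{\f0 Dear [[FooBuzz]],}';
$word  = '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}'
       . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}'
       . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}';

preg_match_all($regExpression, $plain, $plainKeys);
preg_match_all($regExpression, $word, $wordKeys);

var_dump($plainKeys[1][0]); // the bare tag name
var_dump($wordKeys[1][0]);  // tag name wrapped in braces and control words
```

Because `[^\[\]]*` accepts braces and backslashes, the second key contains everything Word inserted between `[[` and `]]`, so a lookup by tag name fails.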

Update:

I guess I mostly need to perfect this regex. I'm working on some regexes, but they still need improvement:

/([\a-zA-Z0-9]+)/  

Produces:

[0] => Array
    (
        [0] => \rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[
        [1] => \rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz
        [2] => \rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]
    )

Update 2:

I still have a few problems with the regex. The first one currently finds both tag values and plain text. I'm not sure that what I want is even possible in a reasonable amount of time.

I need to modify the regex so that it catches the same results, but only inside [[ ]]. At the moment it also matches plain text.

Even harder, I have to be able to catch all of my sample data (but not plain text), by whatever means necessary.

Then there is my replace regex, which has to replace my tag and all the garbage around it. I have almost succeeded:

/{.*?\[\[.*(?<!\\)\w+\b.*\]\].*?}/

But it is too greedy. I want to match the groups { [[}{tag}{ ]]}, and it matches {plain text}{ [[}{tag}{ ]]}{plain text}.

I added the ? because I read that it would make the .* non-greedy, but it doesn't work. Any ideas?

I can't see what's wrong with this regex (the one that finds the tag name):

\[\[(\b(?<!\\)\w+\b)\]\]

As I understand it, it says: inside [[ ]], find any word that does not start with a backslash, followed by any word characters. Am I right?

Update 3:

Sorry, I wasn't clear.

My first regex aims to catch FooBuzz in [[FooBuzz]], and the second to catch [[FooBuzz]] itself. So in the first regex I want to catch only the text FooBuzz and ignore everything else (like {} \eoeoe).

With the second regex I have to replace [[FooBuzz]] completely. So I have to catch {[[}{FooBuzz}}{]]} and nothing more.

Currently I'm catching {plain text I mustn't catch} {[[}{FooBuzz}}{]]}}. As you can see, I catch too much here: I'm catching "plain text I mustn't catch [[FooBuzz]]".

For the [[ part, I need to catch only this: {\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}. I guess the engine can't find a non-greedy match, so it falls back to greedy mode. It fails with this data sample:

{\toto toto}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}{\toto toto}

Recommended answer

After your edit: to find FooBuzz or any other tag, you can search for

(?<=\[\[).+?\b(?<!\\)(\w+)\b(?=.+?\]\])

and take the first capture group.

It finds a whole word not preceded by a \ using the negative lookbehind (?<!\\), and it also requires the word to be preceded by [[ and followed by ]].
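A minimal sketch of how this pattern might be used from PHP, run against the Word 2003 sample from the question. The variable names are illustrative. Note the escaping: in a single-quoted PHP string the lookbehind `(?<!\\)` has to be written with four backslashes.

```php
<?php
// Tag-finding pattern from the answer; group 1 holds the tag name.
$tagPattern = '/(?<=\[\[).+?\b(?<!\\\\)(\w+)\b(?=.+?\]\])/';

// Word 2003 sample from the question.
$sTemplate = '{\toto toto}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}'
           . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}'
           . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}{\toto toto}';

preg_match_all($tagPattern, $sTemplate, $outKeys);
print_r($outKeys[1]); // tag names found between [[ and ]]

// The same pattern on a tag that Word did not split.
$unsplitCount = preg_match_all($tagPattern, 'Dear [[FooBuzz]],', $unsplitKeys);
var_dump($unsplitCount);
```

Since `.+?` requires at least one character between `[[` and the tag name, an unsplit `[[FooBuzz]]` is not matched by this pattern. Keeping the original pattern for that case is one way to avoid a regression.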

Here is an example in which you can see the first group correctly containing FooBar :)

I found a good link for understanding RTF better. I think you could also consider a non-regex approach, although I have no concrete suggestion for this case.
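One possible non-regex approach, sketched under simplifying assumptions: walk the RTF character by character, skip group braces and simple control words (backslash, letters, optional numeric parameter, one optional trailing space), and keep the rest as visible text. The helper name `rtfVisibleText` is hypothetical, and the sketch ignores `\'hh` escapes, escaped braces, and hidden destinations. It only helps with finding tag names, not with replacing them.

```php
<?php
/**
 * Hypothetical helper: return the visible text of an RTF fragment by
 * skipping braces and simple control words.
 */
function rtfVisibleText($rtf)
{
    $text = '';
    $length = strlen($rtf);
    $i = 0;
    while ($i < $length) {
        $char = $rtf[$i];
        if ($char === '{' || $char === '}') {
            $i++; // group delimiters carry no text
        } elseif ($char === '\\' && $i + 1 < $length && ctype_alpha($rtf[$i + 1])) {
            $i++; // skip the backslash
            while ($i < $length && ctype_alpha($rtf[$i])) {
                $i++; // control word name
            }
            if ($i + 1 < $length && $rtf[$i] === '-' && ctype_digit($rtf[$i + 1])) {
                $i++; // sign of a negative parameter
            }
            while ($i < $length && ctype_digit($rtf[$i])) {
                $i++; // numeric parameter
            }
            if ($i < $length && $rtf[$i] === ' ') {
                $i++; // a single space ends the control word
            }
        } else {
            $text .= $char;
            $i++;
        }
    }
    return $text;
}

// Word 2003 sample from the question.
$sTemplate = '{\toto toto}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}'
           . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}'
           . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}{\toto toto}';

$visible = rtfVisibleText($sTemplate);

// The original search pattern now sees the tag in one piece.
preg_match_all('/\[\[([^\[\]]*)\]\]/', $visible, $outKeys);
print_r($outKeys[1]);
```

Text that contains no braces or control words passes through unchanged, so unsplit tags are still found by the original pattern.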

Your last regex is wrong because it expects the \w+ immediately after the opening square brackets and immediately before the closing ones, so it will only match something like [[wordWithoutSpaces]].

The first regex from your update correctly matches the whole string, because what it says is: "start at the first { and match nearly everything". Let's see:

  • {.*?\[\[ matches everything between { and [[
  • .*(?<!\\)\w+\b matches everything after [[ and before a run of word characters \w that is not preceded by a backslash (you probably want a \b before the negative lookbehind and the \w here)
  • .*\]\].*?}/ matches everything between ]] and the first } that follows (non-greedy)
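The first bullet is where the extra plain text comes from: a lazy quantifier only limits how far a match extends to the right. The match still begins at the leftmost { in the string. The sketch below compares it with a negated character class that cannot cross a brace. The opening brace is escaped as `\{` here only to be explicit.

```php
<?php
// Word 2003 sample from the question.
$rtf = '{\toto toto}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}'
     . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}'
     . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}{\toto toto}';

// Lazy quantifier: the match still starts at the first "{" of the string.
preg_match('/\{.*?\[\[/', $rtf, $lazy);

// Negated class: cannot cross "{" or "}", so only the group holding "[[" matches.
preg_match('/\{[^{}]*\[\[/', $rtf, $bounded);

var_dump($lazy[0]);    // begins with the unrelated {\toto toto} group
var_dump($bounded[0]); // begins at the group that contains [[
```

The same idea applies on the right-hand side: `[^{}]*\}` after `]]` stops at the closing brace of the group that contains `]]`.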

But if you want to match the individual parts, you need to create separate matches or separate groups.

Edit:

As a single regex, it is possible to merge the two regexes crafted in this answer:

{[^{]?[[.(?<=[[).+?\b(?]].?}

preg_match_all will return 2 arrays: the first contains the text matched by the regex, the second the tag.

Then, thanks to the strtr function, only tags that have a matching translation are replaced (3 rounds in the workflow).
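The workflow described above might look roughly like the following sketch. The template, the `$values` map, and the pattern are all assumptions made for illustration; in particular the pattern is not the exact expression from this answer. It matches from `[[` to `]]` and captures the first word that is not preceded by a backslash, so it would be confused by a control word with a negative parameter inside the tag.

```php
<?php
// Hypothetical template and values, for illustration only.
$sTemplate = '{\f0 Dear }{\f0 [[}{\f0 FirstName}{\f0 ]]}{\f0 , ref [[Unknown]]}';
$values = array('FirstName' => 'Alice');

// Assumed pattern: full match runs from "[[" to "]]", group 1 is the tag name.
$pattern = '/\[\[[^\[\]]*?\b(?<!\\\\)(\w+)\b[^\[\]]*?\]\]/';
preg_match_all($pattern, $sTemplate, $found);

// Map each matched span to its value, but only for tags that have one.
$replacements = array();
foreach ($found[1] as $index => $tag) {
    if (isset($values[$tag])) {
        $replacements[$found[0][$index]] = $values[$tag];
    }
}

$result = strtr($sTemplate, $replacements);
var_dump($result);
```

Replacing only the span from `[[` to `]]` keeps the RTF braces balanced: the opening brace before `[[` and the closing brace after `]]` end up enclosing the inserted value. Tags without a value, such as the made-up `[[Unknown]]`, are left untouched.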
