Complex text substitution algorithm or design pattern

Problem description

I need to perform multiple substitutions on text coming from a database before displaying it to the user.

My example uses data most likely found in a CRM, and the output is HTML for the web, but the question generalizes to any other text-substitution need. The question is general for any programming language. In my case I use PHP, but it's more an algorithm question than a PHP question.

Each of the 3 examples I'm writing below is super easy to do via regular expressions. But combining them in a single shot is not so direct, even if I do multi-step substitutions. They interfere.

Is there a design pattern for doing multiple interfering text substitutions?

Example #1 of substitution: The IDs

We work with IDs. The IDs are SHA-1 digests. IDs are universal and can represent any entity in the company, from a user to an airport, from an invoice to a car.

So in the database we can find text like this to be displayed to a user:

User d19210ac35dfc63bdaa2e495e17abe5fc9535f02 paid 50 EUR
in the payment 377b03b0b4e92502737eca2345e5bdadb1262230. We sent
an email a49c6737f80eadea0eb16f4c8e148f1c82e05c10 to confirm.

We want all IDs to be translated into links so the user viewing the info can click them. There's one general URL for decoding IDs. Let's assume it's http://example.com/id/xxx

The translated text would be:

User <a href="http://example.com/id/d19210ac35dfc63bdaa2e495e17abe5fc9535f02">d19210ac35dfc63bdaa2e495e17abe5fc9535f02</a> paid 50 EUR
in the payment <a href="http://example.com/id/377b03b0b4e92502737eca2345e5bdadb1262230">377b03b0b4e92502737eca2345e5bdadb1262230</a>. We sent
an email <a href="http://example.com/id/a49c6737f80eadea0eb16f4c8e148f1c82e05c10">a49c6737f80eadea0eb16f4c8e148f1c82e05c10</a> to confirm
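
Taken alone, this rule can be sketched with one regular expression. The snippet below is only an illustration in Python (the question is language-agnostic); the name link_ids and the assumption that an ID is exactly 40 lowercase hex characters are mine, not part of the question.

```python
import re

# Assumption: an ID is exactly 40 lowercase hex characters (a SHA-1 digest)
# standing alone between word boundaries.
ID_PATTERN = re.compile(r"\b[0-9a-f]{40}\b")
ID_BASE_URL = "http://example.com/id/"  # resolver URL given in the question


def link_ids(text: str) -> str:
    """Wrap every SHA-1-looking ID in an anchor pointing at the ID resolver."""
    return ID_PATTERN.sub(
        lambda m: f'<a href="{ID_BASE_URL}{m.group(0)}">{m.group(0)}</a>',
        text,
    )
```

Called on the first line of the sample text, link_ids wraps the digest and leaves the rest of the sentence untouched.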

Example #2 of substitution: The Links

We want anything that resembles a URI to be clickable. Let's focus only on the http and https protocols and forget the rest.

If we find this in the database:

Our website is http://mary.example.com and the info
you are requesting is in this page http://mary.example.com/info.php

it would become this:

Our website is <a href="http://mary.example.com">http://mary.example.com</a> and the info
you are requesting is in this page <a href="http://mary.example.com/info.php">http://mary.example.com/info.php</a>
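
As a rough sketch of this rule in isolation (Python, for illustration only): the URL pattern below is a simplifying assumption of mine, treating a URL as everything from http:// or https:// up to the next whitespace, quote or angle bracket. Real URL detection is more subtle, for instance around trailing punctuation. The name link_urls is hypothetical.

```python
import re

# Assumption: a URL runs from the scheme up to the next whitespace, quote
# or angle bracket. Characters such as & ; / are allowed inside it.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def link_urls(text: str) -> str:
    """Wrap every http/https URL in an anchor whose href is the URL itself."""
    return URL_PATTERN.sub(
        lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>',
        text,
    )
```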

Example #3 of substitution: The HTML

When the original text contains HTML it must not be sent raw, as it would be interpreted. We want to change the < and > chars into the escaped forms &lt; and &gt;. The translation table for HTML-5 also requires the & symbol to be converted to &amp;. This also affects the translation of the Message-IDs of emails, for example.

For example, if we find this in the database:

We need to change the CSS for the <code> tag to a pure green.
Sent to John&Partners in Message-ID: <aaa@bbb.ccc> this morning.

the result of the substitution is:

We need to change the CSS for the &lt;code&gt; tag to a pure green.
Sent to John&amp;Partners in Message-ID: &lt;aaa@bbb.ccc&gt; this morning.
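
For illustration, in Python this rule on its own is a single standard-library call. The wrapper name flatten_html is hypothetical; quote=False is chosen here so that only &, < and > are translated, matching the sample output above.

```python
import html


def flatten_html(text: str) -> str:
    """Escape &, < and > so that embedded markup is displayed, not interpreted."""
    # quote=False leaves " and ' untouched, as in the example in the question.
    return html.escape(text, quote=False)
```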

All right... But... combinations?

Up to here, every change "per se" is super easy.

But when we combine things we want them to still be "natural" to the user. Let's assume that the original text contains HTML, and one of the tags is an <a> tag. We still want to see the complete tag "displayed" and the HREF to be clickable. The text of the anchor should be clickable too if it is a link.

Let's say we have this in the database:

Paste this <a class="dark" href="http://example.com/data.xml">Download</a> into your text editor.

If we first apply #2 to transform the links and then #3 to encode HTML, we would have the following.

Applying rule #2 (inject links) to the original text, the link http://example.com/data.xml is detected and substituted by <a href="http://example.com/data.xml">http://example.com/data.xml</a>:

Paste this <a class="dark" href="<a href="http://example.com/data.xml">http://example.com/data.xml</a>">Download</a> into your text editor.

This is obviously broken HTML and makes no sense. In addition, applying rule #3 (flatten HTML) to the output of #2 we would have:

Paste this &lt;a class="dark" href="&lt;a href="http://example.com/data.xml"&gt;http://example.com/data.xml&lt;/a&gt;"&gt;Download&lt;/a&gt; into your text editor.

This in turn is merely the flat representation of the broken HTML, and it is not clickable. Wrong output: neither #2 nor #3 was satisfied.
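
The interference can be reproduced in a few lines. This is a sketch in Python under my own assumptions: link_urls and its URL pattern are hypothetical stand-ins for rule #2, and html.escape with quote=False stands in for rule #3.

```python
import html
import re

# Assumed URL shape: scheme, then anything up to whitespace, a quote or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def link_urls(text: str) -> str:
    """Rule #2: wrap each detected URL in an anchor."""
    return URL_PATTERN.sub(lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', text)


source = ('Paste this <a class="dark" href="http://example.com/data.xml">'
          'Download</a> into your text editor.')

after_rule_2 = link_urls(source)                        # anchor injected inside href="..."
after_rule_3 = html.escape(after_rule_2, quote=False)   # then everything gets flattened

# No real anchor survives: the injected link was escaped along with the rest.
print(after_rule_3)
```

The second pass cannot tell the markup injected by the first pass apart from markup that was in the source, which is the root of the problem.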

If I instead first apply rule #3 to "escape all HTML" and afterwards apply rule #2 to "inject link HTML", this happens:

Original text (same as above):

Paste this <a class="dark" href="http://example.com/data.xml">Download</a> into your text editor.

Result of applying #3 (flatten HTML):

Paste this &lt;a class="dark" href="http://example.com/data.xml"&gt;Download&lt;/a&gt; into your text editor.

Then we apply rule #2 (inject links) and it seems to work:

Paste this &lt;a class="dark" href="<a href="http://example.com/data.xml">http://example.com/data.xml</a>"&gt;Download&lt;/a&gt; into your text editor.

This works because " is not a valid URL character, so the URL detector stops exactly at the end of http://example.com/data.xml.

But... what if the original text also had a link inside the link text? This is a very common scenario. Like this original text:

Paste this <a class="dark" href="http://example.com/data.xml">http://example.com/data.xml</a> into your text editor.

Then applying #3 would give this:

Paste this &lt;a class="dark" href="http://example.com/data.xml"&gt;http://example.com/data.xml&lt;/a&gt; into your text editor.

We have a problem

As &, ; and / are all valid URL characters, the URL parser would take http://example.com/data.xml&lt;/a&gt; as the URL instead of ending at the .xml point.

This leads to this wrong output:

Paste this &lt;a class="dark" href="<a href="http://example.com/data.xml">http://example.com/data.xml</a>"&gt;<a href="http://example.com/data.xml&lt;/a&gt;">http://example.com/data.xml&lt;/a&gt;</a> into your text editor.

So http://example.com/data.xml&lt;/a&gt; got substituted by <a href="http://example.com/data.xml&lt;/a&gt;">http://example.com/data.xml&lt;/a&gt;</a>, and the problem is that the URL was not correctly detected.
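
The over-long match is easy to confirm. In this Python sketch the URL pattern is my own simplifying assumption (it accepts &, ; and /, as a URL detector legitimately would), and html.escape with quote=False plays the role of rule #3.

```python
import html
import re

# Assumed URL shape: & ; and / are accepted, so escaped markup that
# directly follows a URL is swallowed into the match.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

source = ('Paste this <a class="dark" href="http://example.com/data.xml">'
          'http://example.com/data.xml</a> into your text editor.')

escaped = html.escape(source, quote=False)   # rule #3 first
detected = URL_PATTERN.findall(escaped)      # what rule #2 would then link

# The first URL ends at the quote; the second one runs into the escaped
# closing tag because nothing in "&lt;/a&gt;" terminates the pattern.
print(detected)
```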

If rules #2 and #3 are a mess when processed together, imagine if we mix them with rule #1 and we have a URL which contains a SHA-1, like this database entry:

Paste this <a class="dark" href="http://example.com/id/89019b16ab155ba1c19e1ab9efdb9134c8f9e2b9">http://example.com/id/89019b16ab155ba1c19e1ab9efdb9134c8f9e2b9</a> into your text editor.

Can you imagine??

I have thought of creating a syntax tokenizer. But I feel it's overkill.

I wonder if there's a design pattern to read and study for doing multiple text substitutions, what it is called, and where it is documented.

If there isn't any pattern... then... is building a syntax tokenizer the only solution?

I feel there must be a much simpler way to do this. Do I really have to tokenize the text into a syntax tree and then re-render it by traversing the tree?

Recommended answer

The design pattern is the one you already rejected: left-to-right tokenisation. Of course, that's easier to do in languages for which there are code generators that produce lexical scanners.

There's no need to parse or to build a syntax tree. A linear sequence of tokens suffices. In effect, the scanner becomes a transducer: each token is either passed through unaltered or replaced immediately with the required translation.

Nor does the tokeniser need to be particularly complicated. The three regular expressions you currently have can be used, combined with a fourth token type representing any other character. The important part is that all patterns are tried at each point, one is selected, the indicated replacement is performed, and the scan resumes after the match.
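
A minimal sketch of such a transducer, written in Python purely for illustration (the same shape should carry over to PHP with a callback-based regex replace). The names TOKEN, render and transduce are hypothetical, and the URL and ID patterns are simplifying assumptions rather than the asker's actual expressions. The "any other character" token is implicit here: text that matches no alternative is copied through unchanged by re.sub.

```python
import html
import re

ID_BASE_URL = "http://example.com/id/"  # resolver URL from the question

# All patterns live in ONE alternation, so every pattern is tried at each
# position of the raw text, in the order listed. Assumed token shapes:
# URLs end at whitespace, quotes or angle brackets; IDs are 40 hex digits.
TOKEN = re.compile(
    r"""
      (?P<url>https?://[^\s"'<>]+)   # rule 2: links
    | (?P<id>\b[0-9a-f]{40}\b)       # rule 1: SHA-1 IDs
    | (?P<special>[&<>])             # rule 3: HTML special characters
    """,
    re.VERBOSE,
)


def render(match: re.Match) -> str:
    """Translate one token; the scan then resumes after the match."""
    token = match.group(0)
    if match.lastgroup == "url":
        safe = html.escape(token, quote=True)  # a URL may itself contain &
        return f'<a href="{safe}">{safe}</a>'
    if match.lastgroup == "id":
        return f'<a href="{ID_BASE_URL}{token}">{token}</a>'
    return html.escape(token, quote=False)     # one of & < >


def transduce(text: str) -> str:
    """Single left-to-right pass over the raw text; output is never rescanned."""
    return TOKEN.sub(render, text)
```

Because the scan only ever sees raw input, injected anchors are never re-escaped, a URL ends at the raw < that follows it, and a SHA-1 inside a URL is consumed as part of the URL token instead of being linked a second time. Listing the URL alternative before the ID alternative is a design choice for this sketch; a different priority rule (such as longest match) would be equally legitimate.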
