Replace all "\" characters which are *not* inside "<code>" tags


Problem description

First things first: neither this, this, this nor this answered my question, so I'll open a new one.

Okay, okay. I know that regexes are not the way to parse general HTML. Please note that the documents in question are written using a limited, controlled HTML subset, and the people writing the docs know what they're doing. They are all IT professionals!

Given the controlled syntax, it is possible to parse the documents I have here using regexes.

I am not trying to download arbitrary documents from the web and parse them!

And if the parsing does fail, the document is edited until it parses. The problem I am addressing here is more general than that (i.e. do not replace a pattern when it sits between two other patterns).

In our office we are supposed to "pretty print" our documentation, which is why some people suggested putting it all into Word documents. Thankfully we're not quite there yet, and if I get this done, we might not need to.

The main part of the docs is stored in a TikiWiki database. I've created a daft PHP script which converts the documents from HTML (via LaTeX) to PDF. One of the must-have features of the selected wiki system was a WYSIWYG editor, which, as expected, leaves us with documents with a less than formal DOM.

Consequently, I am transliterating the documents using "simple" regexes. It all works (mostly) fine so far, but I have run into one problem I haven't been able to figure out on my own.

Some special characters need to be replaced by LaTeX markup. For example, the \ character should be replaced by $\backslash$ (unless someone knows another solution?).

There is a catch, though. I do replace <code> tags with verbatim sections, but if a code block contains backslashes (as is the case for Windows folder names), the script still replaces those backslashes.

I reckon I could solve this using negative lookbehinds and/or lookaheads, but my attempts did not work.

Granted, I would be better off with a real parser. In fact, it is on my "in-brain roadmap", but it is currently out of scope. The script works well enough for our limited knowledge domain, and creating a parser would require me to start pretty much from scratch.

The Hello \ World document is located in:
<code>C:\documents\hello_world.txt</code>

Expected output

The Hello $\backslash$ World document is located in:
\begin{verbatim}C:\documents\hello_world.txt\end{verbatim}

This is the best I could come up with so far:

<?php
$patterns = array(
    "special_chars2" => array( '/(?<!<code[^>]*>.*)\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);

foreach( $patterns as $name => $p ){
    $tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>

Note that this is only an excerpt, and the [^$] is another LaTeX requirement.

Another attempt which seemed to work:

<?php
$patterns = array(
    "special_chars2" => array( '/\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);

foreach( $patterns as $name => $p ){
    $tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>

... in other words: leaving out the negative lookbehind.

But this looks more error-prone than using both the lookbehind and the lookahead.

As you may have noticed, the pattern is ungreedy (/.../U). So will this match as little as possible inside a <code> block, considering the look-arounds?
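
To make the concern about the lookahead-only attempt concrete, here is a minimal sketch that runs that pattern on two short strings. The sample strings and variable names are made up for illustration. It suggests two pitfalls: the [^$] consumes the character after the backslash, and the lookahead rejects any backslash that has a </code> later on the same line, whether or not the backslash is inside the block. The /U modifier only changes which match is tried first inside .*, not whether the lookahead can succeed.

```php
<?php
// The lookahead-only pattern and replacement from the question, unchanged.
$pattern = '/\\\\[^$](?!.*<\/code>)/U';
$replacement = '$\\backslash$';

// Pitfall 1: the character after the backslash is part of the match,
// so it is replaced as well (the space before "World" disappears).
$prose = 'The Hello \ World document';
$proseResult = preg_replace($pattern, $replacement, $prose);
echo $proseResult, "\n";

// Pitfall 2: a backslash in prose is skipped when a closing code tag
// follows anywhere later on the same line, although it is outside the block.
$sameLine = 'See \ here and <code>C:\tmp</code>';
$sameLineResult = preg_replace($pattern, $replacement, $sameLine);
echo $sameLineResult, "\n";
```

The first call yields `The Hello $\backslash$World document`; the second returns its input unchanged, so the backslash in the prose is never escaped.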

Recommended answer

If it were me, I would try to find an HTML parser and use that.

Another option is to split the string into <code>.*?</code> chunks and the remaining parts, update only the remaining parts, and then recombine everything.

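<?php
// Sample input: one backslash in prose, several inside a <code> block.
$x = "The Hello \\ World document is located in:\n<br>\n"
   . "<code>C:\\documents\\hello_world.txt</code>";

// Split on <code>...</code>. PREG_SPLIT_DELIM_CAPTURE keeps the code blocks
// in the result: even indexes are plain text, odd indexes are code blocks.
// The s modifier lets a code block span several lines.
$r = preg_split('/(<code>.*?<\/code>)/s', $x, -1, PREG_SPLIT_DELIM_CAPTURE);

// Replace backslashes only in the plain-text parts.
for ($i = 0; $i < count($r); $i += 2) {
    $r[$i] = str_replace('\\', '$\\backslash$', $r[$i]);
}

// Recombine the parts.
$x = implode('', $r);

echo $x;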

This is the result (as displayed in a browser, which does not show the <br> and <code> tags):

The Hello $\backslash$ World document is located in: 
C:\documents\hello_world.txt
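
The split-and-recombine idea can be taken one step further so that it also produces the verbatim sections from the question's expected output. The following is only a sketch under assumptions: the function name `html_to_tex_backslashes` is hypothetical, and it assumes that <code> blocks are never nested and may carry attributes.

```php
<?php
// Hypothetical helper (not from the original post): escapes backslashes
// outside <code> blocks and turns the blocks themselves into verbatim sections.
function html_to_tex_backslashes(string $html): string
{
    // Even indexes hold plain text, odd indexes the captured code blocks.
    $parts = preg_split(
        '/(<code\b[^>]*>.*?<\/code>)/s',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Plain text: every backslash becomes LaTeX markup.
            $parts[$i] = str_replace('\\', '$\\backslash$', $part);
        } else {
            // Code block: keep the content untouched, swap the tags.
            preg_match('/^<code\b[^>]*>(.*)<\/code>$/s', $part, $m);
            $parts[$i] = '\\begin{verbatim}' . $m[1] . '\\end{verbatim}';
        }
    }

    return implode('', $parts);
}

// Usage with the example document from the question.
$input = 'The Hello \\ World document is located in:' . "\n"
       . '<code>C:\\documents\\hello_world.txt</code>';
echo html_to_tex_backslashes($input), "\n";
```

For the example input this prints the two lines listed under the expected output in the question. Because the code blocks are separated out before any replacement happens, no lookaround is needed at all.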

Sorry if my approach is not suitable for you.
