自动插入 LTR 标记 [英] Inserting LTR marks automatically
问题描述
我正在为一个项目处理双向文本(混合英语和希伯来语).文本以 HTML 格式显示,因此有时需要 LTR 或 RTL 标记(‎
或 ‏
)来制作弱字符",如标点符号显示适当地.由于技术限制,这些标记在源文本中不存在,因此我们需要添加它们以使最终显示的文本正确显示.
I am working with bidirectional text (mixed English and Hebrew) for a project. The text is displayed in HTML, so sometimes a LTR or RTL mark (‎
or ‏
) is required to make 'weak characters' like punctuation display properly. These marks are not present in the source text due to technical limitations, so we need to add them in order for the final displayed text to appear correct.
例如,以下文本:(example: מדגם) sample
在从右到左模式下呈现为 sample (מדגם :example)
.更正后的字符串看起来像 ‎(example:‎ מדגם) sample
并将呈现为 sample (מדגם (example:
.
For instance, the following text: (example: מדגם) sample
renders as sample (מדגם :example)
in right-to-left mode. The corrected string would look like ‎(example:‎ מדגם) sample
and would render as sample (מדגם (example:
.
我们希望即时插入这些标记,而不是重新创作所有文本.起初这看起来很简单:只需在每个标点符号实例后附加一个 ‎
即可.但是,一些需要即时修改的文本包含 HTML 和 CSS.造成这种情况的原因令人遗憾且不可避免.
We'd like to do on-the-fly insertion of these marks rather than re-authoring all the text. At first this seems simple: just append an ‎
to each instance of punctuation. However, some of the text that needs to get modified on-the-fly contains HTML and CSS. The reasons for this are unfortunate and unavoidable.
解析 HTML/CSS 的短文,是否有一种已知的算法来即时插入 Unicode 方向标记(伪强字符)?
推荐答案
我不知道有什么算法可以安全地将方向标记插入 HTML 字符串而不对其进行解析.将 HTML 解析为 DOM 并操作文本节点是确保您不会意外地向 和
中的文本添加方向标记的最安全方法> 标签.
I don't know of an algorithm to insert directional marks into an HTML string safely without parsing it. Parsing the HTML into a DOM and manipulating the text nodes is the safest way of ensuring you don't accidentally add directional marks to text inside <script>
and <style>
tags.
这是一个简短的 Python 脚本,可以帮助您自动转换文件.如有必要,逻辑应该很容易翻译成其他语言.我对您尝试编码的 RTL 规则不够熟悉,但您可以调整正则表达式 '(\W([^\W]+)(\W)'
和替换模式ur"\u200e\1\2\3\u200e"
得到你的预期结果:
Here is a short Python script which might help you transform your files automatically. The logic should be easy to translate into other languages if necessary. I'm not familiar enough with the RTL rules you're trying to encode, but you can tweak the regexp '(\W([^\W]+)(\W)'
and substituion pattern ur"\u200e\1\2\3\u200e"
to get your expected result:
import re
import lxml.html
_RE_REPLACE = re.compile('(\W)([^\W]+)(\W)', re.M)
def _replace(text):
if not text:
return text
return _RE_REPLACE.sub(ur'\u200e\1\2\3\u200e', text)
text = u'''
<html><body>
<div>sample (\u05de\u05d3\u05d2\u05dd :example)</div>
<script type="text/javascript">var foo = "ignore this";</script>
<style type="text/css">div { font-size: 18px; }</style>
</body></html>
'''
# convert the text into an html dom
tree = lxml.html.fromstring(text)
body = tree.find('body')
# iterate over all children of <body> tag
for node in body.iterdescendants():
# transform text with trails after the current html tag
node.tail = _replace(node.tail)
# ignore text inside script and style tags
if node.tag in ('script','style'):
continue
# transform text inside the current html tag
node.text = _replace(node.text)
# render the modified tree back to html
print lxml.html.tostring(tree)
输出:
python convert.py
<html><body>
<div>sample (מדגם ‎:example)‎</div>
<script type="text/javascript">var foo = "ignore this";</script>
<style type="text/css">div { font-size: 18px; }</style>
</body></html>
这篇关于自动插入 LTR 标记的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!