Regex: Find groups of lowercase letters between HTML tags

Problem description

I'm attempting to develop a regular expression that can be run in Sigil, the ePub 2 editor.

Small-caps are a well-known problem within the current ePub reader ecosystem. Many readers, such as Adobe Digital Editions, do not support "font-variant: small-caps". After trying several different workarounds, I've settled on creating fake small caps by transforming the text to uppercase and setting the previously lowercase letters to "font-size: 0.75em".

This process is extremely tedious, especially when working with books that have lots of endnotes with citations of other books.

Say that I have a bunch of phrases in an HTML page tagged with an "SC" class. I've created a test phrase:

<span class="SC">Hello World! Testing: one tWo thrEE &amp; W.T.F.</span>
<span class="foo">Don't touch me!</span>

The goal is to write a regex that matches any lowercase letters within the "SC" span tag only, and replace them with:

<span class="FSC">LETTERS</span>

I can manage to match and replace the letters in the first word "Hello", but everything breaks down after that.

Here's what I've got so far:

Find:

(<span class="SC">.*?)([a-z]+)(.*</span>)

Replace:

\1<span class="FSC">\U\2\E</span>\3

The tricky part is then continuing to find the rest of the lowercase letters within that tag, now that a new "FSC" (Fake Small Caps) span tag has been introduced. Trying the same regex again results in "span" and then "class" getting the FSC treatment. Ideally, I'd like to be able to just keep hitting the "Replace All" button until no more matches are found.
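The failure described above can be reproduced outside Sigil. The following is a minimal sketch in Python, not Sigil's own engine: Python's `re` module has no `\U...\E` in replacement strings, so a callback that calls `upper()` stands in for it, and the helper name `replace_once` is made up for illustration.

```python
import re

# The Find pattern from the question, unchanged.
NAIVE = re.compile(r'(<span class="SC">.*?)([a-z]+)(.*</span>)')

def replace_once(html):
    """Apply the question's Find/Replace pair once (one "Replace All" click)."""
    def repl(m):
        # Stand-in for \1<span class="FSC">\U\2\E</span>\3
        return (m.group(1) + '<span class="FSC">' + m.group(2).upper()
                + '</span>' + m.group(3))
    return NAIVE.sub(repl, html)

sample = '<span class="SC">Hello World!</span>'
first = replace_once(sample)    # only "ello" is wrapped; group 3 swallows the rest
second = replace_once(first)    # now the tag name "span" is the first lowercase run
print(first)
print(second)
```

The first pass consumes the whole element in a single match, so only one run of letters can be replaced per pass. The second pass then finds the newly inserted markup before it finds "orld".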

The above example would look like this when finished:

<span class="SC">H<span class="FSC">ELLO</span> W<span class="FSC">ORLD</span>! T<span class="FSC">ESTING</span>: <span class="FSC">ONE</span> <span class="FSC">T</span>W<span class="FSC">O</span> <span class="FSC">THR</span>EE &amp; W.T.F.</span>
<span class="foo">Don't touch me!</span>

It's not pretty, but it works on every ePub reader that I've tested it on.

If you google "epub small caps regex", you'll come across a MobileRead wiki article that I edited to include this regex, which I've decided is not satisfactory:

(<span class="[a-zA-Z0-9\- ]*?(?<!F)SC[a-zA-Z0-9\-]*?">(?:.+?<span class="FSC">.+?</span>)*[\.|,|:|;|-|–|—|!|\?]? ?(?:&amp;)? ?[A-Z]+)([a-z'’\. ]+)(.*?</span>)

This ends up miniaturizing a bunch of punctuation and sometimes stops in the middle of a phrase. I started over, thinking there was probably a better solution that doesn't attempt to plan for every single possibility up front.
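Two details of that older pattern likely account for the stray punctuation. Its second capture group allows apostrophes, periods and spaces alongside letters, and the pipes inside its bracketed punctuation list are literal characters rather than alternation. A small Python sketch of both points (the sample strings are made up for illustration):

```python
import re

# The second capture group of the older regex, taken on its own.
OLD_GROUP = re.compile(r"[a-z'’\. ]+")

# Letters, the apostrophe, the period and the spaces all land in one match,
# so all of them would be wrapped in the FSC span together.
captured = OLD_GROUP.findall("Don't stop. now")
print(captured)

# Inside a character class, "|" is just another character to match.
pipe_is_literal = re.fullmatch(r"[.|,]", "|") is not None
print(pipe_is_literal)
```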

If someone comes up with a better solution to this, you'll be the hero of the entire ePub publishing industry.

Update

I've added the accepted (and only) answer to the MobileRead wiki. Please note that this regex has been altered specifically for use in Sigil; YMMV in other environments.

Solution

This is a perfect use case for the technique described in "Collapse and Capture a Repeating Pattern in a Single Regex Expression".

Modified it for your case:

(<span class="SC">(?:(?!<\/span>)(?:[^a-z&]|&[^;]+;))*|(?!^)\G(?:(?!<\/span>)(?:[^a-z&]|&[^;]+;))*)([a-z]+)

Replace with: \1<span class="FSC">\U\2\E</span>

And here's the RegEx explained: http://regex101.com/r/jU6bA5

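This works with "Replace All" in a single pass: the \G anchor makes each match start where the previous one ended, so with the global modifier (/g) the regex walks through every lowercase run inside the span.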
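For readers who want to run the same transformation outside Sigil, here is a rough Python equivalent. It is a sketch under stated assumptions, not the answer's regex: Python's built-in `re` module supports neither `\G` nor `\U...\E`, so this version swaps in a two-step callback approach. It assumes the "SC" spans contain no nested tags and that every `&` in them starts a well-formed entity. The names `fake_small_caps` and `wrap_lowercase` are hypothetical.

```python
import re

# An SC span whose content has no nested tags (assumption: plain text only).
SC_SPAN = re.compile(r'(<span class="SC">)([^<]*)(</span>)')
# Either an entity such as &amp; (kept as is) or a run of lowercase letters.
TOKEN = re.compile(r'(&[^;]+;)|([a-z]+)')

def wrap_lowercase(text):
    """Wrap each lowercase run in an FSC span, leaving entities untouched."""
    def fix_token(m):
        if m.group(1):
            return m.group(1)
        return '<span class="FSC">' + m.group(2).upper() + '</span>'
    return TOKEN.sub(fix_token, text)

def fake_small_caps(html):
    """Rewrite the content of every SC span; other spans are not matched."""
    return SC_SPAN.sub(
        lambda m: m.group(1) + wrap_lowercase(m.group(2)) + m.group(3),
        html,
    )

sample = (
    '<span class="SC">Hello World! Testing: one tWo thrEE &amp; W.T.F.</span>\n'
    '<span class="foo">Don\'t touch me!</span>'
)
result = fake_small_caps(sample)
print(result)
```

Because a processed span now contains nested FSC tags, `SC_SPAN` no longer matches it, so running the function a second time leaves the text unchanged. The trade-off is that SC spans that already contain other markup, such as italics, are skipped entirely.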
