Replace the string between opening and closing anchor tags with some other string


Question

I need to replace the string between a pair of anchor tags with some other string. To be more clear:

<a blah blah>Click Here</a>

I want to replace 'Click Here' with an <img src=... /> tag. I read around a couple of other resources, tried hard at Lars Olav Torvik's regex tool, but failed badly!

Please help me!

Recommended answer

Don't use regex to parse HTML!

Yes, in general, using regex to parse HTML is fraught with peril. Computer scientists will correctly point out that HTML is not a REGULAR language. However, contrary to what many here believe, there are cases where a regex solution is perfectly valid and appropriate. Read Jeff Atwood's blog post on this very subject: Parsing Html The Cthulhu Way. That disclaimer aside, let's forge ahead with a regex solution...

The original question is pretty vague. Here is a more precise (possibly not at all what the OP is asking) interpretation/reformulation of the question:

Given: We have some HTML text (either HTML 4.01 or XHTML 1.0). This text contains <A..>...</A> anchor elements. Some of these anchor elements are links to an image file resource (i.e. the HREF attribute points to a URI ending with a file extension of JPEG, JPG, PNG or GIF). Some of these links to images are simple text links, where the content of the anchor element is plain text with no other HTML elements, e.g. <a href="picture.jpg">Link text with no HTML tags</a>.

Find: Is there a regex solution that will take these "plain-text-link-to-image-resource-file" links, and replace the link text with an IMG element having a SRC attribute set to the same image URI resource? The following (valid HTML 4.01) example input has three paragraphs. All the links in the first paragraph are to be modified but all the links in the second and third paragraphs are NOT to be modified and left as-is:

<p title="Image links with plain text contents to be modified">
    This is a <a href="img1.png">LINK 1</a> simple anchor link to image.
    This <a title="<>" href="img2.jpg">LINK 2</a> has attributes before HREF.
    This <a href="img3.gif" title='<>'>LINK 3</a> has attributes after HREF.
</p>
<p title="NON-image links with plain text contents NOT to be modified">
    This is a <a href="tmp1.txt">LINK 1</a> simple anchor link to NON-image.
    This <a title="<>" href="tmp2.txt">LINK 2</a> has attributes before HREF.
    This <a href="tmp3.txt" title='<>'>LINK 3</a> has attributes after HREF.
</p>
<p title="Image links with NON-plain text contents NOT to be modified">
    This is a <a href="img1.png"><b>BOLD 1</b></a> anchor link to image.
    This is an <a href="img3.gif"><img src="img3.gif"/></a> image link to image.
</p>

Desired HTML output:

<p title="Image links with plain text contents to be modified">
    This is a <a href="img1.png"><img src="img1.png" /></a> simple anchor link to image.
    This <a title="<>" href="img2.jpg"><img src="img2.jpg" /></a> has attributes before HREF.
    This <a href="img3.gif" title='<>'><img src="img3.gif" /></a> has attributes after HREF.
</p>
<p title="NON-image links with plain text contents NOT to be modified">
    This is a <a href="tmp1.txt">LINK 1</a> simple anchor link to NON-image.
    This <a title="<>" href="tmp2.txt">LINK 2</a> has attributes before HREF.
    This <a href="tmp3.txt" title='<>'>LINK 3</a> has attributes after HREF.
</p>
<p title="Image links with NON-plain text contents NOT to be modified">
    This is a <a href="img1.png"><b>BOLD 1</b></a> anchor link to image.
    This is an <a href="img3.gif"><img src="img3.gif"/></a> image link to image.
</p>

Note that these examples include test-case <A..>...</A> anchor tags that have both single- and double-quoted attribute values, both before and after the desired HREF attribute, and that these values contain Cthulhu-tempting (yet perfectly valid HTML 4.01) angle brackets.

Note also that the replacement text is an (empty) IMG tag ending in: '/>' (which is NOT valid HTML 4.01).

The statement of the problem defines a highly specific pattern to be matched which has the following requirements:

  • The <A..> start tag may have any number of attributes before and/or after the HREF attribute.
  • The HREF attribute value must end with JPEG, JPG, PNG or GIF (case-insensitive).
  • The contents of the <A..>...</A> element may NOT contain any other HTML tags.
  • The <A..>...</A> element target pattern is NOT a nested structure.
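
The first two requirements can be tried out in isolation before tackling the whole element. The following is only a minimal sketch: the helper name imageHref is hypothetical, it assumes PHP 7.1 or later (for the nullable return type), and it makes no attempt to skip an "href=" that happens to sit inside another attribute's quoted value. It uses a PCRE branch reset group, (?|...), so that the HREF value lands in the same capture group whichever quoting style was used.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the HREF
// value of an <a> start tag if it ends in an image extension, else null.
function imageHref(string $startTag): ?string {
    // Branch reset group: every alternative stores its capture in group 1.
    $re = '%\shref\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([\w\-.:]+))%i';
    if (!preg_match($re, $startTag, $m)) {
        return null;                       // no HREF attribute found
    }
    // Case-insensitive check that the value ends with an image extension.
    return preg_match('%(?:jpe?g|png|gif)\z%i', $m[1]) ? $m[1] : null;
}

var_dump(imageHref('<a title="<>" href="img2.jpg">'));  // string "img2.jpg"
var_dump(imageHref("<a href='IMG3.GIF' title='<>'>"));  // string "IMG3.GIF"
var_dump(imageHref('<a href="tmp1.txt">'));             // NULL
```

Without the branch reset group, each quoting alternative would get its own group number and the caller would have to check three captures instead of one.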

When dealing with such highly specific sub-strings, a well-crafted regex solution can work very well (with very few edge cases that can trip it up). Here is a tested PHP function that will do a pretty good job (and correctly transform the above example input):

// Convert text-only contents of image links to IMG element.
function textLinksToIMG($text) {
    $re = '% # Match A element with image URL and text-only contents.
        (                     # Begin $1: A element start tag.
          <a                  # Start of A element start tag.
            (?:               # Zero or more attributes before HREF.
              \s+             # Whitespace required before attribute.
              (?!href\b)      # Match attributes other than HREF.
              [\w\-.:]+       # Attribute name (Non-HREF).
              (?:             # Attribute value is optional.
                \s*=\s*       # Attrib name and value separated by =.
                (?:           # Group for attrib value alternatives.
                  "[^"]*"     # Either double quoted,
                | \'[^\']*\'  # or single quoted,
                | [\w\-.:]+   # or unquoted value.
                )             # End group of value alternatives.
              )?              # Attribute value is optional.
            )*                # Zero or more attributes before HREF.
            \s+               # Whitespace required before attribute.
            href\s*=\s*       # HREF attribute name.
            (?|               # Branch reset group for $2: HREF value.
              "([^"]*)"       # Either $2.1: double quoted,
            | \'([^\']*)\'    # or $2.2: single quoted,
            | ([\w\-.:]+)     # or $2.3: unquoted value.
            )                 # End group of HREF value alternatives.
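            (?<=              # Look behind to assert HREF value ended in...
              jpeg[\'"]       # either JPEG (quoted value),
            | jpg[\'"]        # or JPG,
            | png[\'"]        # or PNG,
            | gif[\'"]        # or GIF,
            | jpeg | jpg      # or the same extensions at the
            | png  | gif      # end of an unquoted value.
            )                 # End look behind assertion.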
            (?:               # Zero or more attributes after HREF.
              \s+             # Whitespace required before attribute.
              [\w\-.:]+       # Attribute name.
              (?:             # Attribute value is optional.
                \s*=\s*       # Attrib name and value separated by =.
                (?:           # Group for attrib value alternatives.
                  "[^"]*"     # Either double quoted,
                | \'[^\']*\'  # or single quoted,
                | [\w\-.:]+   # or unquoted value.
                )             # End group of value alternatives.
              )?              # Attribute value is optional.
            )*                # Zero or more attributes after HREF.
          \s*                 # Allow whitespace before closing >
          >                   # End of A element start tag.
        )                     # End $1: A element start tag.
        ([^<>]*)              # $3: A element contents (text-only).
        (</a\s*>)             # $4: A element end tag.
        %ix';
    return preg_replace($re, '$1<img src="$2" />$4', $text);
}
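
For experimenting, it can help to have a shorter form of the same pattern in which the repeated attribute sub-pattern is written once. The sketch below is an illustrative restructuring, not the answer's own code: the name textLinksToImgCompact and the helper variables are hypothetical, and the look-behind accepts the image extension with or without a closing quote. The capture group numbering ($1 start tag, $2 HREF value, $3 contents, $4 end tag) is kept the same as above.

```php
<?php
// Hypothetical compact variant of textLinksToIMG(): same capture groups,
// with the attribute sub-pattern defined once and reused.
function textLinksToImgCompact(string $html): string {
    // Attribute value: double quoted, single quoted, or unquoted.
    $value   = '(?:"[^"]*"|\'[^\']*\'|[\w\-.:]+)';
    // Any attribute: whitespace, name, optional "= value".
    $attr    = '\s+[\w\-.:]+(?:\s*=\s*' . $value . ')?';
    // Same, but refusing to match an HREF attribute.
    $nonHref = '\s+(?!href\b)[\w\-.:]+(?:\s*=\s*' . $value . ')?';

    $re = '%(<a(?:' . $nonHref . ')*'                               // $1 opens
        . '\s+href\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([\w\-.:]+))'    // $2
        . '(?<=jpeg[\'"]|jpg[\'"]|png[\'"]|gif[\'"]|jpeg|jpg|png|gif)'
        . '(?:' . $attr . ')*\s*>)'                                 // $1 closes
        . '([^<>]*)(</a\s*>)%i';                                    // $3, $4

    return preg_replace($re, '$1<img src="$2" />$4', $html);
}

$input = 'This <a title="<>" href="img2.jpg">LINK 2</a> is an image link; '
       . 'this <a href="tmp1.txt">LINK 1</a> is not.';
// The first link's text is replaced by an IMG element; the second is untouched.
echo textLinksToImgCompact($input), "\n";
```

The free-spacing, commented form above is easier to review; the compact form is mainly convenient for quick tests.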

Yes, the regex in this solution is long, but this is mostly due to the extensive commenting, which also makes it highly readable. It also correctly handles quoted attribute values that may contain angle brackets. Yes, it is certainly possible to create some HTML markup that will break this solution, but the markup required to do so would be so convoluted as to be virtually unheard of.
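
For input where such edge cases do matter, the same transformation can be approximated with a parser instead. The following is a rough sketch for comparison only, not part of the original answer: the function name textLinksToImgDom and the wrapper id are hypothetical, it assumes PHP's DOM extension is installed, and it assumes the input is plain ASCII (loadHTML's default encoding handling is not addressed). Note that the output is re-serialized by libxml, so attribute quoting and escaping may differ from the input, and the IMG element is written without the trailing '/>'.

```php
<?php
// Hypothetical DOM-based counterpart to the regex function above.
// Requires the DOM extension; names here are illustrative only.
function textLinksToImgDom(string $html): string {
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);   // silence parser warnings
    // Wrap the fragment so it can be located again after parsing.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Anchors that have an HREF and no element children (text-only content).
    $anchors = iterator_to_array($xpath->query('//a[@href][not(*)]'));
    foreach ($anchors as $a) {
        $href = $a->getAttribute('href');
        if (!preg_match('%(?:jpe?g|png|gif)\z%i', $href)) {
            continue;                                // not an image link
        }
        while ($a->firstChild) {                     // drop the link text
            $a->removeChild($a->firstChild);
        }
        $img = $doc->createElement('img');
        $img->setAttribute('src', $href);
        $a->appendChild($img);
    }

    // Serialize only the children of the wrapper element.
    $wrap = $xpath->query('//div[@id="wrap"]')->item(0);
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo textLinksToImgDom('<p>A <a href="img1.png">LINK 1</a> image link.</p>'), "\n";
```

The trade-off is that the regex version leaves every untouched byte of the document exactly as it was, while the DOM version is more tolerant of unusual markup but rewrites the whole fragment.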
