Regex <img> Tag parsing with src, width, height


Problem description

    You may object that parsing HTML with regex is a thoroughly bad idea (see this, for example), and you are right.

    But in my case, the following HTML node is created by our own server, so we know it will always look like this. And because the regex will live in a mobile Android library, I don't want to pull in a library like Jsoup.

    What I want to parse: <img src="myurl.jpg" width="12" height="32">

    What should be parsed:

    • match a regular img tag, and group the src attribute value: <img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>
    • width and height attribute values: (width|height)\s*=\s*['"]([^'"]*)['"]*

    So the first regex will have a #1 group with the img url, and the second regex will have two matches with subgroups of their values.

    How can I merge both?

    Desired output:

    • img url
    • width value
    • height value

    Solution

    To match any img tag with src, height and width attributes that can come in any order and that are in fact optional, you can use

    "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3"
    

    See the regex demo and an IDEONE Java demo:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\"><link src=\"/test/test.css\"/><img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
    // Same pattern as shown above, written as a Java string literal
    Pattern pattern = Pattern.compile("(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");
    Matcher matcher = pattern.matcher(s);
    while (matcher.find()) {
        if (!matcher.group(1).isEmpty()) { // Group 1 is "<img": we have a new IMG tag
            System.out.println("\n--- NEW MATCH ---");
        }
        // Group 2 is the attribute name, Group 4 is its value
        System.out.println(matcher.group(2) + ": " + matcher.group(4));
    }
    

    The regex details:

    • (<img\\b|(?!^)\\G) - the initial boundary: either the start of an <img tag or the end of the previous successful match
    • [^>]*? - any attributes we are not interested in (0+ characters other than >, as few as possible, so the match stays inside the tag)
    • \\b(src|width|height)= - a whole word src=, width= or height=
    • ([\"']?) - a technical Group 3 that captures the attribute value delimiter, if any
    • ([^>]*?) - Group 4, containing the attribute value (0+ characters other than >, as few as possible, up to the first occurrence of the delimiter)
    • \\3 - the same attribute value delimiter that Group 3 captured (NOTE: if the delimiter may be empty, add (?=\\s|/?>) at the end of the pattern)
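The note about empty delimiters can be illustrated with a small sketch. It assumes the server may also emit unquoted attribute values; the class name `UnquotedAttrDemo` and the method `extract` are hypothetical names used only for this example. With the lookahead `(?=\\s|/?>)` appended, the lazy Group 4 has to extend until whitespace or the end of the tag follows, instead of stopping immediately with an empty value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnquotedAttrDemo {
    // The answer's pattern plus the suggested lookahead, so that a value
    // with an empty delimiter must be followed by whitespace, ">" or "/>".
    private static final Pattern ATTR = Pattern.compile(
            "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3(?=\\s|/?>)");

    // Hypothetical helper: returns "name: value" strings in order of appearance.
    public static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            result.add(m.group(2) + ": " + m.group(4));
        }
        return result;
    }

    public static void main(String[] args) {
        // Unquoted values: Group 3 is empty, the lookahead ends each value.
        System.out.println(extract("<img src=myurl.jpg width=12 height=32>"));
        // Quoted and unquoted values mixed in one tag.
        System.out.println(extract("<img src=\"my url.jpg\" width=12>"));
    }
}
```

Quoted values still work because Group 3 captures the quote first and the lookahead is only checked after the closing quote.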

    The logic:

    • Match the start of img tag
    • Then, match everything that is inside, but only capture the attributes we need
    • Since we are going to have multiple matches, not groups, we need to find a boundary for each new img tag. This is done by checking if the first group is not empty (if (!matcher.group(1).isEmpty()))
    • All that remains is to add a list to store the matches.
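The last step, keeping the matches in a list, could look roughly like the sketch below. It reuses the pattern from the answer; the class name `ImgTagParser`, the method `parse` and the choice of one `Map` per tag are assumptions made for illustration, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgTagParser {
    // Pattern from the answer: Group 1 is "<img" for a new tag
    // and empty when the match continues inside the same tag.
    private static final Pattern ATTR = Pattern.compile(
            "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");

    // Hypothetical helper: one map per <img> tag, attribute name -> value.
    public static List<Map<String, String>> parse(String html) {
        List<Map<String, String>> tags = new ArrayList<>();
        Map<String, String> current = null;
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            if (!m.group(1).isEmpty() || current == null) {
                // A non-empty Group 1 marks the boundary of a new <img> tag.
                current = new LinkedHashMap<>();
                tags.add(current);
            }
            current.put(m.group(2), m.group(4));
        }
        return tags;
    }

    public static void main(String[] args) {
        String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\">"
                + "<link src=\"/test/test.css\"/>"
                + "<img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
        for (Map<String, String> tag : parse(s)) {
            System.out.println(tag.get("src") + " " + tag.get("width") + "x" + tag.get("height"));
        }
    }
}
```

Because the attributes are optional, `tag.get(...)` returns `null` for any attribute that the tag does not carry, so callers may want to check for that before using the value.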
