Regex <img> tag parsing with src, width, height
Problem description
You may object that parsing HTML with regex is a thoroughly bad idea (see this, for example), and you are right.
But in my case, the following HTML node is created by our own server, so we know it will always look like this, and since the regex will live in a mobile Android library, I don't want to use a library like Jsoup.
What I want to parse: <img src="myurl.jpg" width="12" height="32">
What should be parsed:
- match a regular img tag, and group the src attribute value:
<img[^>]+src\s*=\s*['"]([^'"]+)['"][^>]*>
- width and height attribute values:
(width|height)\s*=\s*['"]([^'"]*)['"]*
So the first regex will have Group 1 holding the img URL, and the second regex will produce two matches, each with subgroups for the attribute name and its value.
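As a rough illustration, the two-pass approach described above might look like this in Java. The class name TwoPassImgParser and the method parseFirst are hypothetical, and restricting the second pass to the text of the matched tag is an assumption, not something stated in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoPassImgParser {
    // First regex from the question: group 1 is the src value.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>");
    // Second regex from the question: group 1 is the name, group 2 the value.
    private static final Pattern SIZE = Pattern.compile(
            "(width|height)\\s*=\\s*['\"]([^'\"]*)['\"]*");

    // Returns {src, width, height} for the first <img> tag; missing entries stay null.
    public static String[] parseFirst(String html) {
        String[] result = new String[3];
        Matcher img = IMG_SRC.matcher(html);
        if (!img.find()) {
            return result;
        }
        result[0] = img.group(1);
        // Second pass, limited to the matched tag (an assumption made for this sketch).
        Matcher size = SIZE.matcher(img.group());
        while (size.find()) {
            if (size.group(1).equals("width")) {
                result[1] = size.group(2);
            } else {
                result[2] = size.group(2);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] r = parseFirst("<img src=\"myurl.jpg\" width=\"12\" height=\"32\">");
        System.out.println(String.join(", ", r));
    }
}
```

This needs two patterns and two scans per tag, which is what motivates merging them.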
How can I merge both?
Desired output:
- img url
- width value
- height value
Solution
To match any img tag whose src, height and width attributes can come in any order and are in fact optional, you can use
"(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3"
See the regex demo and an IDEONE Java demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgAttributeDemo {
    public static void main(String[] args) {
        String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\"><link src=\"/test/test.css\"/><img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
        // Group 1: tag start (empty when continuing inside a tag), group 2: attribute name, group 4: value
        Pattern pattern = Pattern.compile("(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");
        Matcher matcher = pattern.matcher(s);
        while (matcher.find()) {
            if (!matcher.group(1).isEmpty()) { // We have a new IMG tag
                System.out.println("\n--- NEW MATCH ---");
            }
            System.out.println(matcher.group(2) + ": " + matcher.group(4));
        }
    }
}
The regex details:
- (<img\\b|(?!^)\\G) - the initial boundary matching the <img> tag start or the end of the previous successful match
- [^>]*? - match any optional attributes we are not interested in (0+ characters other than > so as to stay inside the tag)
- \\b(src|width|height)= - a whole word src=, width= or height=
- ([\"']?) - a technical 3rd group to check the attribute value delimiter
- ([^>]*?) - Group 4 containing the attribute value (0+ characters other than >, as few as possible, up to the first occurrence of the delimiter)
- \\3 - the attribute value delimiter matched with Group 3 (NOTE: if a delimiter may be empty, add (?=\\s|/?>) at the end of the pattern)
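The note about empty delimiters can be illustrated with a small sketch. The class name UnquotedImgAttributes, the method parse and its allowUnquoted flag are hypothetical names introduced here. The only change to the pattern is the (?=\\s|/?>) lookahead appended at the end, as the note suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnquotedImgAttributes {
    // Pattern from the answer: fine for quoted values, but yields empty values when quotes are absent.
    private static final Pattern QUOTED_ONLY = Pattern.compile(
            "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");
    // Same pattern plus the lookahead: the value must be followed by whitespace, "/>" or ">".
    private static final Pattern WITH_LOOKAHEAD = Pattern.compile(
            "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3(?=\\s|/?>)");

    // Returns "name: value" strings in document order.
    public static List<String> parse(String html, boolean allowUnquoted) {
        Pattern pattern = allowUnquoted ? WITH_LOOKAHEAD : QUOTED_ONLY;
        List<String> out = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            out.add(matcher.group(2) + ": " + matcher.group(4));
        }
        return out;
    }

    public static void main(String[] args) {
        String unquoted = "<img src=myurl.jpg width=12 height=32>";
        System.out.println(parse(unquoted, false)); // values come back empty
        System.out.println(parse(unquoted, true));  // values are captured
    }
}
```

Without the lookahead, the lazy Group 4 is satisfied by zero characters as soon as Group 3 is empty, so every value comes back empty. The lookahead forces Group 4 to extend to the end of the value. Quoted values that contain spaces still work because \\3 has to match the closing quote first.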
The logic:
- Match the start of the img tag
- Then, match everything that is inside, but only capture the attributes we need
- Since we are going to have multiple matches, not groups, we need to find a boundary for each new img tag. This is done by checking if the first group is not empty (if (!matcher.group(1).isEmpty()))
- All there remains to do is to add a list for keeping matches.
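The last step, keeping the matches in a list, could be sketched as follows. The class name ImgTagCollector and the method collect are hypothetical, and storing each tag as a map keyed by attribute name is just one possible choice.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgTagCollector {
    // Pattern from the answer: group 1 is non-empty only where a new <img> tag starts.
    private static final Pattern ATTR = Pattern.compile(
            "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");

    // Returns one map per <img> tag, holding whichever of src/width/height were found.
    public static List<Map<String, String>> collect(String html) {
        List<Map<String, String>> tags = new ArrayList<>();
        Map<String, String> current = null;
        Matcher matcher = ATTR.matcher(html);
        while (matcher.find()) {
            // A non-empty group 1 marks the boundary of a new tag; the null check is a safety net.
            if (current == null || !matcher.group(1).isEmpty()) {
                current = new LinkedHashMap<>();
                tags.add(current);
            }
            current.put(matcher.group(2), matcher.group(4));
        }
        return tags;
    }

    public static void main(String[] args) {
        String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\">"
                + "<link src=\"/test/test.css\"/>"
                + "<img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
        for (Map<String, String> tag : collect(s)) {
            System.out.println(tag);
        }
    }
}
```

Note that an img tag carrying none of the three attributes produces no match at all, so it does not appear in the list.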