Apply a Java Pattern/Matcher selectively to extract HTML elements, ignoring some characters


Question

I'm using this code:

Pattern pat_1 = Pattern.compile("class=\"\"pinyin\"\">(.*?)<script>");
Matcher mat_1 = pat_1.matcher( text );
while( mat_1.find() )
{
    System.out.println( mat_1.group(1) );
}

This is the input data being matched:

<br>
<span class=""b"">拼音:</span><span class=""pinyin"">xī<script>Setduyin('Duyin/xi1')</script></span> <span class=""b"">注音:</span><span class=""pinyin"">ㄒㄧ<script>Setduyin('Duyin/xi1')</script></span><br>
<span class=""b"">简体部首:</span>丨 <span class=""b"">部首笔画:</span>1 <span class=""b"">总笔画:</span>8<br><span class=""b"">繁体部首:</span>卜 <span class=""b"">部首笔画:</span>2 <span class=""b"">总笔画:</span>8<br><span class=""b"">康熙字典笔画</span>( 卥:8; )

The problem with my code is that it also picks up ㄒㄧ, because the elements before and after it are identical to those around xī. How can I exclude ㄒㄧ and select only xī? Maybe I could use the <br> tag, since that is unique to the first one, but that would require recognizing a new line and also skipping over 拼音:. How do I do that? I've been playing around with regex101.com but haven't been able to pin it down yet.

So, to be clear, the current output of that Java code is:

xī
ㄒㄧ

But I only want it to output xī.

Recommended answer

You could try the regex below.

Pattern pat_1 = Pattern.compile("class=\"\"pinyin\"\">(.*?)<script>(?:(?!<script>).)*");
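
Here is a minimal sketch of how this pattern behaves on the sample input from the question, assuming the text is already held in a Java string. The class name FirstPinyinTempered, the extract helper, and the SAMPLE constant are made up for illustration. Non-ASCII characters are written as Unicode escapes only so that the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstPinyinTempered {
    // Abbreviated copy of the question's input (hypothetical constant).
    // The escapes spell the two labels, the pinyin reading (x + i-macron)
    // and the two-letter zhuyin reading.
    static final String SAMPLE =
        "<br>\n"
        + "<span class=\"\"b\"\">\u62FC\u97F3:</span>"
        + "<span class=\"\"pinyin\"\">x\u012B<script>Setduyin('Duyin/xi1')</script></span> "
        + "<span class=\"\"b\"\">\u6CE8\u97F3:</span>"
        + "<span class=\"\"pinyin\"\">\u3112\u3127<script>Setduyin('Duyin/xi1')</script></span><br>";

    // After the first <script>, the tail (?:(?!<script>).)* keeps consuming
    // characters until the next <script> on the same line, so the second
    // class=""pinyin""> is already inside the first match.
    static final Pattern PAT = Pattern.compile(
        "class=\"\"pinyin\"\">(.*?)<script>(?:(?!<script>).)*");

    // Collects group 1 of every match, mirroring the loop in the question.
    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = PAT.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE));
    }
}
```

On this input, extract(SAMPLE) returns a one-element list holding xī. The trick relies on the spans coming in pairs: if a line held three pinyin spans, the third would match again, so this fits the two-span layout shown in the question rather than the general case.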

Alternatively, anchor the pattern to the start of each line:

(?m)^.*?class=\"\"pinyin\"\">(.*?)<script>

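(?m) is the multiline modifier. With it enabled, the anchors ^ and $ match at the start and end of every line instead of only at the start and end of the whole input. Because this pattern must begin at a line start, it can match at most once per line, so it captures only the first pinyin span on each line.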
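
The line-anchored variant can be sketched in the same way. Again, the class name FirstPinyinPerLine, the extract helper, and the SAMPLE constant are hypothetical names for illustration, and the non-ASCII characters are written as Unicode escapes to keep the source file ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstPinyinPerLine {
    // Abbreviated copy of the question's input (hypothetical constant).
    static final String SAMPLE =
        "<br>\n"
        + "<span class=\"\"b\"\">\u62FC\u97F3:</span>"
        + "<span class=\"\"pinyin\"\">x\u012B<script>Setduyin('Duyin/xi1')</script></span> "
        + "<span class=\"\"b\"\">\u6CE8\u97F3:</span>"
        + "<span class=\"\"pinyin\"\">\u3112\u3127<script>Setduyin('Duyin/xi1')</script></span><br>";

    // (?m) makes ^ match at every line start. The lazy .*? then skips to the
    // first class=""pinyin""> on that line; '.' does not cross line breaks,
    // so a later span on the same line can never start a new match.
    static final Pattern PAT = Pattern.compile(
        "(?m)^.*?class=\"\"pinyin\"\">(.*?)<script>");

    // Returns the first pinyin span of every line that contains one.
    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = PAT.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE));
        // Two records on separate lines give one result per line.
        System.out.println(extract(SAMPLE + "\n" + SAMPLE));
    }
}
```

Unlike the lookahead version, this one does not depend on the spans coming in pairs: however many pinyin spans a line holds, only the first is captured. It does assume that each record sits on its own line, as in the sample input.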
