Java: I have a big string of html and need to extract the href="..." text

Problem description

I have this string containing a large chunk of HTML and am trying to extract the link from the href="..." portion of the string. The href could be in one of the following forms:

<a href="..." />
<a class="..." href="..." />

I don't really have a problem with regex, but for some reason it does not work when I use the following code:

    String innerHTML = getHTML();
    Pattern p = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
    Matcher m = p.matcher(innerHTML);
    if (m.find()) {
        // Get all groups for this match
        for (int i = 0; i <= m.groupCount(); i++) {
            String groupStr = m.group(i);
            System.out.println(groupStr);
        }
    }

Can someone tell me what is wrong with my code? I did this stuff in PHP, but in Java I am somehow doing something wrong... What happens is that it prints the whole HTML string whenever I try to print it...

EDIT: Just so that everyone knows what kind of a string I am dealing with:

<a class="Wrap" href="item.php?id=43241"><input type="button">
    <span class="chevron"></span>
</a>
<div class="menu"></div>

Every time I run the code, it prints the whole string... That's the problem...

And about using jTidy... I'm on it but it would be interesting to know what went wrong in this case as well...

Solution

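.*

This is a greedy operation that will take any character, including the quotes. With Pattern.DOTALL it also runs across line breaks, so the group extends to the last quote in the whole string, which is why nearly all of the HTML gets printed.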

Try something like:

"href=\"([^\"]*)\""
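
To see the suggested pattern in context, here is a minimal sketch that applies it to the sample HTML from the question. The class name HrefExtractor and the helper extractHrefs are hypothetical names chosen for illustration; they are not part of the original post. The sketch assumes every href value is wrapped in double quotes, as in the question's examples, and it loops with find() so that more than one link can be collected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // [^"]* cannot match a quote, so the capture stops at the first closing quote.
    // Pattern.DOTALL is dropped because the pattern no longer contains a dot.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Hypothetical helper: returns the value of every double-quoted href in the input.
    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group(1) is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        // The sample string from the question.
        String html = "<a class=\"Wrap\" href=\"item.php?id=43241\"><input type=\"button\">\n"
                + "    <span class=\"chevron\"></span>\n"
                + "  </a>\n"
                + "  <div class=\"menu\"></div>";
        System.out.println(extractHrefs(html)); // prints [item.php?id=43241]
    }
}
```

Printing only group(1) also avoids echoing group(0), which is the whole match including the href=" prefix. For anything beyond simple, well-formed snippets, an HTML parser is likely the more robust route, as the question itself hints with jTidy.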
