Webpage data scraping using Java

Problem description

I am trying to implement a simple HTML webpage scraper using Java, and I have run into a small problem. Suppose I have the following HTML fragment.

<div id="sr-h-left" class="sr-comp">
    <a class="link-gray-underline" id="compare_header"  rel="nofollow" href="javascript:i18nCompareProd('/serv/main/buyer/ProductCompare.jsp?nxtg=41980a1c051f-0942A6ADCF43B802');">
        <span style="cursor: pointer;" class="sr-h-o">Compare</span>
    </a>
</div>
<div id="sr-h-right" class="sr-summary">
    <div id="sr-num-results">
        <div class="sr-h-o-r">Showing 1 - 30 of 1,439 matches, 

The data I am interested in is the integer 1,439 shown at the bottom. I am wondering how I can get that integer out of the HTML. I am considering using a regular expression with java.util.regex.Pattern to extract the data, but I am still not clear about the process. I would be grateful if you could give me some hints or ideas on this data scraping. Thanks a lot.

Regular expressions are probably the best way to do it. Something like:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("Showing [0-9,]+ - [0-9,]+ of ([0-9,]+) matches");
// matcher() creates the Matcher; Pattern has no instance method matches(String)
Matcher m = p.matcher(scrapedHTML);
// find() searches anywhere in the input; matches() would require the whole HTML to match
if (m.find()) {
    int num = Integer.parseInt(m.group(1).replaceAll(",", ""));
    // num == 1439
}

I'm not sure what you meant by understanding the "process", but here's what that code does: p is a regular expression pattern that matches the "Showing..." line. m is the Matcher obtained by applying that pattern to the scraped HTML. If m.find() returns true, the pattern was found somewhere in the HTML (m.matches() would only succeed if the entire HTML string matched the pattern, which it will not here). m.group(1) is then the first regular expression group (the expression in parentheses) in the pattern, which is ([0-9,]+). That group matches a string of digits and commas, so it will be "1,439". The replaceAll() call turns that into "1439", and Integer.parseInt() turns that into the integer 1439.
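To try the approach end to end, here is a minimal self-contained sketch. The class name MatchCountScraper and the method extractTotal are hypothetical names chosen for illustration, and the HTML string is just the last two lines of the fragment from the question. It assumes the summary line always has the "Showing X - Y of Z matches" shape; the \s* and \s+ pieces are an added tolerance for variable spacing, not something the original pattern had.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchCountScraper {
    // Compiled once and reused. Hypothetical helper, not part of any library.
    private static final Pattern TOTAL = Pattern.compile(
            "Showing\\s+[0-9,]+\\s*-\\s*[0-9,]+\\s+of\\s+([0-9,]+)\\s+matches");

    /** Returns the total match count, or empty if no summary line is found. */
    public static OptionalInt extractTotal(String html) {
        Matcher m = TOTAL.matcher(html);
        if (!m.find()) {            // search anywhere in the input
            return OptionalInt.empty();
        }
        // Strip thousands separators before parsing: "1,439" -> "1439"
        String digits = m.group(1).replace(",", "");
        return OptionalInt.of(Integer.parseInt(digits));
    }

    public static void main(String[] args) {
        String html = "<div id=\"sr-num-results\">\n"
                + "    <div class=\"sr-h-o-r\">Showing 1 - 30 of 1,439 matches, ";
        System.out.println(extractTotal(html)); // prints OptionalInt[1439]
    }
}
```

Returning an OptionalInt rather than a bare int makes the "line not found" case explicit, which is useful when the page layout changes and the pattern silently stops matching.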
