Groovy HtmlUnit getFirstByXPath returning null + OCR question


Question

I have had a few issues with HtmlUnit returning nulls lately and am looking for guidance. Each of my attempts to grab the first row of a website has returned null. I am wondering if someone can

A) explain why they might be returning null

B) explain better ways (if there are some) to go about getting the information

Here is my current code (URL is in the source):

client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false

def url = "http://www.hidemyass.com/proxy-list/"

page = client.getPage(url)

IpAddress = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[2]").getValue()
println "IP Address is: $IpAddress"     //returns null

//Port_Number is an Image

Country = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[4][@class='country']/@rel").getValue()
println "Country abbreviation is: $Country"

//differentiate speed and connection by name of gif?

Type = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[7]").getValue()
println "Proxy type is: $Type"

Anonymity = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[8]").getValue()
println "Anonymity Level is: $Anonymity"

client.closeAllWindows()

Right now all of my XPaths return null and .getValue() obviously doesn't work on null.

I also have a question about the port: since it is an image, is there a better alternative than downloading it and attempting to solve it with OCR?

Side Note

There is no significance to this site; I was just looking for a site that I could practice scraping on. (On the last one I ran into issues with fragment identities and couldn't get an answer; see "HtmlUnit getByXpath returns null" and "HtmlUnit and Fragment Identities".)

Solution

It looks like your XPath query is incorrect. Based on the URL provided in the code sample, the form element should be removed from the search path.

Here is an XPath query that will be less prone to breaking when the layout of the page changes:

//table[@id='proxylist-table']/tbody/tr/td[2]
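
To illustrate the difference between the two queries without depending on the live page, here is a minimal sketch that evaluates both against a small inline document using the JDK's own XPath engine rather than HtmlUnit. The markup is a hypothetical, simplified stand-in for the proxy list (the IP address, country code, and cell contents are made up), and `firstNode` is a helper name introduced only for this example. It also uses Groovy's safe-navigation operator (`?.`) so that a missing node produces null instead of a NullPointerException.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathConstants
import javax.xml.xpath.XPathFactory

// Hypothetical, simplified stand-in for the proxy-list markup (not the real page).
def markup = '''
<html><body><div><div>
  <table id="proxylist-table"><tbody>
    <tr><td>1 min</td><td>10.0.0.1</td><td>img</td><td class="country" rel="us">USA</td></tr>
  </tbody></table>
</div></div></body></html>
'''.trim()

def doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(markup.getBytes('UTF-8')))
def xpath = XPathFactory.newInstance().newXPath()

// Helper (name is illustrative): first matching node, or null when nothing matches.
def firstNode = { String expr -> xpath.evaluate(expr, doc, XPathConstants.NODE) }

// The original path expects a <form> that is not in this markup, so nothing matches.
def viaForm = firstNode('//html/body/div/div/form/table/tbody/tr/td[2]')

// Anchoring on the table id does not depend on the wrappers around the table.
def viaId = firstNode("//table[@id='proxylist-table']/tbody/tr/td[2]")

// Safe navigation (?.) yields null instead of throwing when the node is missing.
def missingIp = viaForm?.textContent
def ipAddress = viaId?.textContent
def country = firstNode("//table[@id='proxylist-table']/tbody/tr/td[4][@class='country']/@rel")?.nodeValue

println "Old path gives: $missingIp"
println "IP address is: $ipAddress"
println "Country abbreviation is: $country"
```

The same pattern should carry over to HtmlUnit: keep the id-anchored expression and guard the result of `getFirstByXPath` with `?.` before reading a value from it.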

As for the port number, the author of that page must have wanted to keep that portion of the data from being scraped for some reason. Doing OCR might be your best option.

However, one thing you could do is look at the size of the returned image to guess the port number. For example, I've noticed that images displaying port 80 all have a content length of 406 or 411, while images for port 8080 are either 402 or 409. Each image comes in two sizes so that it blends in with the row color: if the URL ends in 1 it has a white background; if it ends in 0 it has a light grey background and is always a few bytes larger. There are obvious drawbacks to this approach, but it may work.
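
A rough sketch of that size-based guess could be a plain lookup table. The four sizes below are the ones observed above; they are observations rather than anything the site guarantees, and the names `portByContentLength` and `guessPort` are made up for this example. An unknown size returns null, which would be the signal to fall back to OCR.

```groovy
// Content lengths observed above, mapped to the port each image displayed.
// These values are observations only and may change whenever the site changes.
def portByContentLength = [
    406: 80,   411: 80,
    402: 8080, 409: 8080
]

// Returns the guessed port, or null when the size is not in the table.
def guessPort = { int contentLength -> portByContentLength[contentLength] }

println "406 bytes -> port ${guessPort(406)}"
println "409 bytes -> port ${guessPort(409)}"
println "500 bytes -> port ${guessPort(500)}"   // unknown size, so null
```

The content length itself would have to come from the image response, for example from its Content-Length header. The table only covers the two ports mentioned here, so any other port would need its sizes collected by hand first.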
