Extracting information from a page with Jsoup
Question
I'm trying to extract information from this page with the Jsoup library, but I cannot get the information that comes after a js element.
I inspected each of the td elements on the page with Opera DragonFly. Here is the result:
<td class="t_port">
<script type="text/javascript">
//<![CDATA[
document.write(Socks^GrubMe^51959);
//]]>
</script>
"1080
"
</td>
When I use the view-source function of any browser, it returns the same lines of code but without "1080", which is the information I'm looking for. I get the same result when I try to grab this page with Jsoup. The js code is more or less similar in each cell, like:
document.write(SmallBlind^NineBeforeZero^64881);
or
document.write(ProxyMoxy^DexterProxy^29182);
or something like
document.write(Defender^Agile^57721);
Given this service's policy, I suppose this js code hides the information I need and loads it later dynamically, by editing the DOM and adding the "1080"-style value.
Any suggestions on how to grab this info?
P.S. Here is my code:
Document doc = Jsoup.connect(socks4URL).post();
Elements ips = doc.select("table.proxytbl td.t_ip");
for (Element e : ips) {
    System.out.println("e is " + e.text());
}
Elements ports = doc.select("table.proxytbl td.t_port");
for (Element e : ports) {
    System.out.println("port is " + e);
}
Recommended answer
First
I suppose the site uses this technique precisely to discourage people like you from scraping its information. Having said that, I'll just assume you understand this and give up.
This site does not load the port info via ajax. It simply defines some global variables in a script tag and uses the bitwise XOR operator (^) to calculate the port number. To understand what is going on, you need to understand the XOR operator and find the little script that is loaded inline in the source (hint: the script tag inside the div with id="incontent"). Here is what I got, but the script might be generated dynamically, so it might differ from call to call:
<script type="text/javascript">
//<![CDATA[
BigProxy = 13097;BigGoodProxy = 42249^BigProxy;GrubMe = BigGoodProxy^BigProxy;Defender = 16593^BigGoodProxy;Polymorth = 32164^60129;Xorg = Defender^BigProxy;DexterProxy = Defender^Defender;SmallBlind = 56306^22478;Agile = 7797^61126;Socks = BigProxy^SmallBlind;DontGrubMe = BigProxy^45134;Xinemara = 64225^38807;HttpSocks = Socks^BigGoodProxy;BigBlind = GrubMe^41530;NineBeforeZero = 8868^38743;SmallProxy = HttpSocks^Socks;ProxyMoxy = Polymorth^41915;
//]]>
</script>
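As a rough sketch of how definitions like the ones above could be evaluated outside a browser, the following Java snippet reads each `Name = expr;` statement in order and XORs its terms together. The class name `PortVariables` and the methods `parseVariables` and `evalXor` are made up for this illustration, and the sketch assumes every expression is a plain chain of decimal literals and previously defined names joined by `^`, as in the script shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PortVariables {

    // Matches one statement of the form "Name = term^term^...;".
    // Assumption: the script contains nothing more complex than this.
    private static final Pattern ASSIGNMENT =
            Pattern.compile("(\\w+)\\s*=\\s*([\\w^]+)\\s*;");

    /** XORs all terms of an expression such as "42249^BigProxy". */
    static int evalXor(String expr, Map<String, Integer> vars) {
        int result = 0;
        for (String term : expr.split("\\^")) {
            term = term.trim();
            // A term is either a decimal literal or an already defined variable.
            Integer value = term.matches("\\d+") ? Integer.valueOf(term) : vars.get(term);
            if (value == null) {
                throw new IllegalArgumentException("Unknown variable: " + term);
            }
            result ^= value;
        }
        return result;
    }

    /** Evaluates the statements in source order, so later ones can use earlier names. */
    static Map<String, Integer> parseVariables(String script) {
        Map<String, Integer> vars = new LinkedHashMap<>();
        Matcher m = ASSIGNMENT.matcher(script);
        while (m.find()) {
            vars.put(m.group(1), evalXor(m.group(2), vars));
        }
        return vars;
    }

    public static void main(String[] args) {
        // A shortened copy of the script above, including the CDATA comment lines.
        String script = "//<![CDATA[\n"
                + "BigProxy = 13097;BigGoodProxy = 42249^BigProxy;"
                + "SmallBlind = 56306^22478;Socks = BigProxy^SmallBlind;\n"
                + "//]]>";
        Map<String, Integer> vars = parseVariables(script);
        System.out.println("SmallBlind = " + vars.get("SmallBlind")); // SmallBlind = 35900
    }
}
```

With Jsoup, the raw text of the script element can probably be read with `Element.data()` and passed to `parseVariables`; since the script may change between requests, the variables would have to come from the same response as the port cells.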
Now you can parse the data and recreate the variables with the same values. Then just parse the port field and evaluate the little XOR calculation. For example:
document.write(SmallBlind^BigProxy^47917);
According to the above script, SmallBlind = 56306 ^ 22478 = 35900 and BigProxy = 13097 (after evaluation!)
so the calculation is 35900 ^ 13097 ^ 47917 = 1080
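The calculation above can be reproduced in Java, whose `^` operator is also a bitwise XOR on integers. The snippet below is only a sketch: `PortDecoder` and `decodePort` are hypothetical names, the variable values are hard-coded from the example, and it assumes the cell contains exactly one `document.write(...)` call whose argument is a chain of names and decimal literals joined by `^`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PortDecoder {

    // Captures the argument of document.write(...), e.g. "SmallBlind^BigProxy^47917".
    private static final Pattern WRITE_CALL =
            Pattern.compile("document\\.write\\(([^)]*)\\)");

    /** Finds the document.write call in the cell source and XORs its terms. */
    static int decodePort(String cellSource, Map<String, Integer> vars) {
        Matcher m = WRITE_CALL.matcher(cellSource);
        if (!m.find()) {
            throw new IllegalArgumentException("No document.write(...) call found");
        }
        int port = 0;
        for (String term : m.group(1).split("\\^")) {
            term = term.trim();
            // Decimal literals are used directly; names are looked up in vars.
            Integer value = term.matches("\\d+") ? Integer.valueOf(term) : vars.get(term);
            if (value == null) {
                throw new IllegalArgumentException("Unknown variable: " + term);
            }
            port ^= value;
        }
        return port;
    }

    public static void main(String[] args) {
        // Values taken from the worked example; a real run would compute them
        // from the variable script delivered with the same page.
        Map<String, Integer> vars = new HashMap<>();
        vars.put("SmallBlind", 35900);
        vars.put("BigProxy", 13097);

        String cell = "document.write(SmallBlind^BigProxy^47917);";
        System.out.println(decodePort(cell, vars)); // prints 1080
    }
}
```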
If you need them so badly, just subscribe to one of the many services that send you ready-to-use SOCKS proxy lists :)