Extracting information from a page with Jsoup


Question

I'm trying to extract information from this page with the Jsoup library, but I cannot grab the information that follows a js element.
I inspected each of the td elements on this page with Opera Dragonfly, and here is the result:

<td class="t_port">
      <script type="text/javascript">
      //<![CDATA[
        document.write(Socks^GrubMe^51959);
      //]]>
      </script>
     "1080
                "
    </td>

When I use the view-source function of any browser, it returns the same lines of code but without "1080", which is the information I'm looking for. I get the same result when I try to grab this page with Jsoup. The js code is more or less similar in every cell, like:

document.write(SmallBlind^NineBeforeZero^64881);

document.write(ProxyMoxy^DexterProxy^29182);

or something like:

 document.write(Defender^Agile^57721);


Given this service's policy, I suppose this js code hides the necessary information and loads it later dynamically, adding the "1080" type of information by editing the DOM. Any suggestions on how to grab this info?

P.S. Here is my code:

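import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// socks4URL is the address of the proxy list page
Document doc = Jsoup.connect(socks4URL).post();
// IP addresses are plain text, so text() is enough
Elements ips = doc.select("table.proxytbl td.t_ip");
for (Element e : ips) {
    System.out.println("e is " + e.text());
}
// Port cells only contain the script; the port number itself is missing
Elements ports = doc.select("table.proxytbl td.t_port");
for (Element e : ports) {
    System.out.println("port is " + e);
}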

Recommended answer

First

I suppose the site uses this technique precisely to discourage people like you from scraping its information. Having said that, I'll just assume you understand this and leave it at that.

This site does not load the port info via ajax. It simply defines some global variables in a script tag and uses the bitwise XOR operator (^) to calculate the port number. To understand what is going on, you need to understand the XOR operator and find the little script that is loaded inline in the source (hint: the script tag inside the div with id="incontent"). Here is what I got, but it might be a dynamically generated script, so it might differ from call to call:

<script type="text/javascript">
//<![CDATA[
  BigProxy = 13097;BigGoodProxy = 42249^BigProxy;GrubMe = BigGoodProxy^BigProxy;Defender = 16593^BigGoodProxy;Polymorth = 32164^60129;Xorg = Defender^BigProxy;DexterProxy = Defender^Defender;SmallBlind = 56306^22478;Agile = 7797^61126;Socks = BigProxy^SmallBlind;DontGrubMe = BigProxy^45134;Xinemara = 64225^38807;HttpSocks = Socks^BigGoodProxy;BigBlind = GrubMe^41530;NineBeforeZero = 8868^38743;SmallProxy = HttpSocks^Socks;ProxyMoxy = Polymorth^41915;
//]]>
</script>
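As a rough illustration of how such a definition script could be evaluated outside a browser, here is a minimal Java sketch. The class and method names (`XorVariableParser`, `parseDefinitions`, `evalXor`) are made up for this example, and it assumes the script contains only `Name = a^b^...;` assignments in which every variable is defined before it is used, as in the snippet above. Fetching the script text itself is left out; with Jsoup it could probably be read via `Element.data()` on the selected script element.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class XorVariableParser {

    // Resolves one operand: a decimal literal or a previously defined variable.
    static int resolve(String token, Map<String, Integer> vars) {
        String name = token.trim();
        if (name.matches("\\d+")) {
            return Integer.parseInt(name);
        }
        Integer value = vars.get(name);
        if (value == null) {
            throw new IllegalArgumentException("Unknown variable: " + name);
        }
        return value;
    }

    // Evaluates an expression such as "42249^BigProxy" by XOR-ing all operands.
    static int evalXor(String expr, Map<String, Integer> vars) {
        int result = 0;
        for (String token : expr.split("\\^")) {
            result ^= resolve(token, vars);
        }
        return result;
    }

    // Parses "Name = expr;Name2 = expr2;..." in order, so that later
    // definitions can refer to earlier ones (assumption: no forward references).
    static Map<String, Integer> parseDefinitions(String script) {
        // Drop comment lines such as //<![CDATA[ and //]]>
        String cleaned = script.replaceAll("(?m)^\\s*//.*$", "");
        Map<String, Integer> vars = new LinkedHashMap<>();
        for (String statement : cleaned.split(";")) {
            String[] parts = statement.split("=", 2);
            if (parts.length != 2) {
                continue; // skip blank leftovers
            }
            vars.put(parts[0].trim(), evalXor(parts[1], vars));
        }
        return vars;
    }

    public static void main(String[] args) {
        // A shortened copy of the script shown above
        String script = "//<![CDATA[\n"
                + "  BigProxy = 13097;BigGoodProxy = 42249^BigProxy;"
                + "GrubMe = BigGoodProxy^BigProxy;SmallBlind = 56306^22478;\n"
                + "//]]>";
        Map<String, Integer> vars = parseDefinitions(script);
        System.out.println("SmallBlind = " + vars.get("SmallBlind"));
    }
}
```

Because the script may be regenerated on every request, the variables would have to be parsed from the same response as the port cells they belong to.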

Now you can parse the data and recreate the variables with the same values. Then just parse the port field and evaluate the little XOR calculation. For example:

document.write(SmallBlind^BigProxy^47917);

According to the above script, SmallBlind = 35900 (56306 ^ 22478) and BigProxy = 13097 (after evaluation!)

so the calculation is 35900 ^ 13097 ^ 47917 = 1080
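The same arithmetic can be sketched in Java, whose `^` operator on `int` is also a bitwise XOR. This is only an illustrative sketch: the class name `PortDecoder` and the method `decodePort` are invented for the example, the variable values are hard-coded from the calculation above, and it assumes the port cell contains a single `document.write(...)` call whose argument is a chain of names and decimal literals joined by `^`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PortDecoder {

    // Captures the argument of document.write(...), e.g. "SmallBlind^BigProxy^47917".
    private static final Pattern WRITE_CALL =
            Pattern.compile("document\\.write\\(([^)]*)\\)");

    // XORs every operand of the document.write argument;
    // names are looked up in vars, decimal literals are used directly.
    static int decodePort(String scriptText, Map<String, Integer> vars) {
        Matcher m = WRITE_CALL.matcher(scriptText);
        if (!m.find()) {
            throw new IllegalArgumentException("No document.write(...) call found");
        }
        int port = 0;
        for (String raw : m.group(1).split("\\^")) {
            String token = raw.trim();
            if (token.matches("\\d+")) {
                port ^= Integer.parseInt(token);
            } else {
                Integer value = vars.get(token);
                if (value == null) {
                    throw new IllegalArgumentException("Unknown variable: " + token);
                }
                port ^= value;
            }
        }
        return port;
    }

    public static void main(String[] args) {
        // Values taken from the worked example above
        Map<String, Integer> vars = new HashMap<>();
        vars.put("SmallBlind", 35900);
        vars.put("BigProxy", 13097);

        String cell = "document.write(SmallBlind^BigProxy^47917);";
        System.out.println(decodePort(cell, vars)); // prints 1080
    }
}
```

Since XOR is associative and commutative, the operands can be folded in any order, which is why a simple left-to-right loop is enough here.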

If you need them so badly, just subscribe to one of the many services that send you ready-to-use SOCKS proxy lists :)
