Extract value from javascript object in site using xpath and import.io

Problem description

I want to extract a number that a JavaScript object writes into a page, but I really don't understand what I am doing.

I tried different versions based on similar examples and guidelines from the import.io site and other tutorial sites, but I only ever got one of two results: every number on the page was extracted, or nothing at all.

For example, I tried //[contains(.,"Unikālo apmeklējumu skaits:")]@type and //[contains(.,"Unikālo apmeklējumu skaits:")]. Most likely something else needs to be added there, but I don't know what.

The page I want to extract from is https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html, and the information I need is the number after the text "Unikālo apmeklējumu skaits:", which is filled in by JavaScript.

Hopefully someone will be able to help me with this problem.

Solution

For someone who is new to web scraping this is a fairly hard task, so I'll try to explain it. First of all, the XPath to get to that location could be something like this:

'//td[@class="msg_footer" and contains(text(), "Unik")]'

That gives you the tag and what it contains, but if you check the response, it doesn't contain the number you need. That content is loaded dynamically by JavaScript, and this is the script responsible:

<script type="text/javascript"><!-- 

var ss_w='rādīt numuru';
document.write( '<scr'+'ipt id="contacts_js" src="/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t='+new Date()+'"></scr'+'ipt>' );

--></script>

The text of that script can be extracted from the response with this XPath:

'//script[contains(text(), "contacts_js")]/text()'

From that string, you should reproduce the URL that appears in src, which in this example is:

/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=

Then append the current date to the end, since the JavaScript builds it with new Date(). Next, make a request to that URL (prefixed with the domain of the previous response), so something like:

https://www.ss.lv/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=Wed%20Oct%2028%202015%2020:56:42%20GMT-0500%20(PET)
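The steps so far could be sketched in Python roughly as follows. This is only an illustration using the standard library: the function names (`extract_src_path`, `js_date_string`, `build_script_url`) are made up for this example, the regular expression assumes the inline script keeps the exact `id="contacts_js" src="..."` layout shown above, and the date string only approximates what `new Date()` produces (it omits the trailing zone name such as `(PET)`).

```python
import re
from datetime import datetime, timezone
from urllib.parse import quote, urljoin

# Assumption: the inline script writes id="contacts_js" immediately before src="...".
# The capture stops at the first quote, i.e. right after "?t=".
SRC_RE = re.compile(r'id="contacts_js"\s+src="([^"\']+)')


def extract_src_path(script_text):
    """Return the src prefix (ending in '?t=') from the inline script, or None."""
    match = SRC_RE.search(script_text)
    return match.group(1) if match else None


def js_date_string(moment):
    """Approximate the string new Date() yields, without the zone name in parentheses."""
    return moment.strftime("%a %b %d %Y %H:%M:%S GMT%z")


def build_script_url(page_url, src_path, date_string):
    """Resolve src_path against the page URL and append the percent-encoded date."""
    # Spaces become %20; ':' and parentheses stay literal, as in the URL above.
    return urljoin(page_url, src_path) + quote(date_string, safe=":()")


if __name__ == "__main__":
    script_text = (
        "document.write( '<scr'+'ipt id=\"contacts_js\" "
        "src=\"/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t='"
        "+new Date()+'\"></scr'+'ipt>' );"
    )
    page_url = "https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html"
    src_path = extract_src_path(script_text)
    now = datetime.now(timezone.utc)
    print(build_script_url(page_url, src_path, js_date_string(now)))
```

Whether the server actually inspects the value of `t` is not something the page tells us, so mirroring the browser's format is simply the cautious choice here.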

Check that the date is URL-encoded. The request should return a response like:

var PHONE_CNT=-1;var PHONE_CNT2=-1;var PHONE_CNT3=-1;var EMAIL_CNT=-1;var SHOW_CNT=22;var PH_c="";var PH_1=0;var PH_2=0;var PH_3=0;

pcc_id=0;PH_1=gpzd("JTg3aCU3QyU1QnolN0MlN0JYcWh6JTVCdCU5NSU4QyU5MnV4ayU5QXElN0IlOTQlNUNweiU5MGtvJTdCJThFJTVF","55937369");

where the value of SHOW_CNT is the number you want.
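Pulling the number out of that response needs no JavaScript engine; a regular expression is enough. A minimal sketch, assuming the response keeps the `var NAME=value;` layout shown above (`parse_js_int` is a hypothetical helper name, not part of any library):

```python
import re


def parse_js_int(js_source, name):
    """Return the integer assigned to `name` in the JS source, or None if absent."""
    # \b keeps the match from starting in the middle of a longer identifier.
    pattern = r"\b" + re.escape(name) + r"\s*=\s*(-?\d+)"
    match = re.search(pattern, js_source)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    response_text = (
        "var PHONE_CNT=-1;var PHONE_CNT2=-1;var PHONE_CNT3=-1;var EMAIL_CNT=-1;"
        'var SHOW_CNT=22;var PH_c="";var PH_1=0;var PH_2=0;var PH_3=0;'
    )
    print(parse_js_int(response_text, "SHOW_CNT"))
```

If the site changes how the variable is declared, the pattern will need adjusting.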

If you want to know how I figured out which request and which script populate that tag: I used Firebug to search for SHOW_CNT in all of the responses triggered by loading your URL. That pointed to the request described above, and from there I checked which script was issuing it.

Hope it helped.
