Extract value from javascript object in site using xpath and import.io
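Problem description
I want to extract a number provided by a JavaScript object on a site, but I really don't understand what I am doing.
I tried different versions, following similar examples and guidelines from the import.io site and other tutorial sites, but I only ever got one of two results: every number on the given page was extracted, or nothing at all.
I tried, e.g.:
//[contains(.,"Unikālo apmeklējumu skaits:")]@type
//[contains(.,"Unikālo apmeklējumu skaits:")]
Most likely something else needs to be added there, but I just don't know what.
The link I am interested in extracting from is: https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html and the information needed is the number after the text "Unikālo apmeklējumu skaits:", which is supplied by JavaScript.
Hopefully someone will be able to help me with this problem.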
For someone who is new to web scraping this can be a hard task, so I'll try to explain it.
First of all, the XPath to get to that location could be something like this:
'//td[@class="msg_footer" and contains(text(), "Unik")]'
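To see why selecting that cell is not enough on its own, here is a minimal sketch. The markup in SAMPLE_HTML is a hypothetical, simplified stand-in for the static page (the real page's structure may differ), and because the standard library's ElementTree supports only a subset of XPath (no contains()), the "Unik" text check is done in Python instead of inside the expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the static HTML of the page.
# The counter cell holds only the label; the number is filled in later by JavaScript.
SAMPLE_HTML = """<table>
  <tr><td class="msg_footer">Datums: 27.10.2015</td></tr>
  <tr><td class="msg_footer">Unikālo apmeklējumu skaits: <span></span></td></tr>
</table>"""


def find_counter_cells(markup):
    """Return td.msg_footer elements whose leading text contains 'Unik'."""
    root = ET.fromstring(markup)
    cells = root.findall(".//td[@class='msg_footer']")
    # ElementTree has no contains(), so filter on the leading text node in Python.
    return [td for td in cells if "Unik" in (td.text or "")]


cells = find_counter_cells(SAMPLE_HTML)
cell_text = "".join(cells[0].itertext()).strip()
print(cell_text)  # the label only, with no number in it
```

With a full XPath engine the original expression can be used directly; the point of the sketch is only that the selected cell carries the label and no digits.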
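Now you have that tag (and what it contains), but if you check, it doesn't contain the number you need. That content is loaded dynamically with JavaScript, and the JavaScript is this one:
<script type="text/javascript"><!--
var ss_w='rādīt numuru';
document.write( '<scr'+'ipt id="contacts_js" src="/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t='+new Date()+'"></scr'+'ipt>' );
--></script>
which could be gotten from the response with this XPath:
'//script[contains(text(), "contacts_js")]/text()'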
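Once the script text has been selected, the src path has to be pulled out of it. One possible way, sketched here with a regular expression, is shown below; the function name extract_src_prefix is made up for illustration, and the pattern assumes the script keeps the src="...?t=" shape quoted in this answer.

```python
import re

# Inline script text as quoted in the answer (what the script XPath would return).
SCRIPT_TEXT = """<!--
var ss_w='rādīt numuru';
document.write( '<scr'+'ipt id="contacts_js" src="/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t='+new Date()+'"></scr'+'ipt>' );
-->"""


def extract_src_prefix(script_text):
    """Return the src value up to and including '?t=', or None if not found."""
    match = re.search(r'src="([^"]+\?t=)', script_text)
    return match.group(1) if match else None


src_prefix = extract_src_prefix(SCRIPT_TEXT)
print(src_prefix)
```

Returning None instead of raising keeps the caller free to decide what to do when the page layout changes.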
From that string, you should replicate the URL that comes in src, so this URL for example:
/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=
and add the current date to the end, as the JavaScript creates it with new Date(). Then you should make a request to that URL (adding the domain of the previous response), so something like:
https://www.ss.lv/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=Wed%20Oct%2028%202015%2020:56:42%20GMT-0500%20(PET)
Check that the date is URL-encoded.
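The URL construction can be sketched with the standard library as follows. The helper names js_style_date and build_counter_url are made up for illustration, and the date format only imitates what new Date() typically stringifies to in a browser; whether the server actually checks the format is not something this answer establishes.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote, urljoin

BASE_URL = "https://www.ss.lv/"
SRC_PREFIX = "/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t="


def js_style_date(moment):
    """Format a timezone-aware datetime roughly like a browser's Date string."""
    return moment.strftime("%a %b %d %Y %H:%M:%S GMT%z (%Z)")


def build_counter_url(base_url, src_prefix, date_string):
    """Join the domain and the src path, then append the URL-encoded date."""
    # Spaces become %20; ':' and parentheses are left as in the example URL.
    encoded = quote(date_string, safe="():")
    return urljoin(base_url, src_prefix) + encoded


# Fixed timestamp matching the example URL in the answer, for reproducibility.
pet = timezone(timedelta(hours=-5), "PET")
stamp = js_style_date(datetime(2015, 10, 28, 20, 56, 42, tzinfo=pet))
url = build_counter_url(BASE_URL, SRC_PREFIX, stamp)
print(url)
```

For a live request, datetime.now(timezone.utc).astimezone() could be passed to js_style_date instead of the fixed timestamp.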
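It should return a response like:
var PHONE_CNT=-1;var PHONE_CNT2=-1;var PHONE_CNT3=-1;var EMAIL_CNT=-1;var SHOW_CNT=22;var PH_c="";var PH_1=0;var PH_2=0;var PH_3=0;
pcc_id=0;PH_1=gpzd("JTg3aCU3QyU1QnolN0MlN0JYcWh6JTVCdCU5NSU4QyU5MnV4ayU5QXElN0IlOTQlNUNweiU5MGtvJTdCJThFJTVF","55937369");
where you can check that the value inside SHOW_CNT is the number you want.
If you want to know how I figured out which request and which script was populating that tag: I did that using Firebug, searching for SHOW_CNT inside all of the responses triggered by loading your URL, which pointed to the request I specified, and then checking what was requesting it.
Hope it helped.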
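For the last step, reading SHOW_CNT out of the returned JavaScript, a regular expression is probably enough. This is a minimal sketch: parse_js_int is a made-up helper name, and it assumes the value is a plain integer assignment as in the response quoted above.

```python
import re

# First line of the response quoted in the answer.
RESPONSE = (
    'var PHONE_CNT=-1;var PHONE_CNT2=-1;var PHONE_CNT3=-1;var EMAIL_CNT=-1;'
    'var SHOW_CNT=22;var PH_c="";var PH_1=0;var PH_2=0;var PH_3=0;'
)


def parse_js_int(js_source, name):
    """Return the integer assigned to `name` in js_source, or None if absent."""
    pattern = r"\b" + re.escape(name) + r"\s*=\s*(-?\d+)"
    match = re.search(pattern, js_source)
    return int(match.group(1)) if match else None


show_cnt = parse_js_int(RESPONSE, "SHOW_CNT")
print(show_cnt)  # 22 for the response quoted above
```

The same helper also reads the other counters (for example PHONE_CNT2), since the pattern requires the equals sign to follow the full variable name.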