Extract value from javascript object in site using xpath and import.io

Views: 149, Published: 2020/6/26 18:43:50, Tags: xpath, import.io

Problem description

I want to extract a number provided by a JavaScript object on a site, but I really don't understand what I am doing.

I tried different versions based on similar examples and guidelines from the import.io site and other tutorial sites, but I only ever got one of two results: either every number on the page was extracted, or nothing at all.

For example, I tried //[contains(.,"Unikālo apmeklējumu skaits:")]@type and //[contains(.,"Unikālo apmeklējumu skaits:")]. Most likely something else needs to be added there, but I just don't know what.

The page I want to extract from is https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html and the information I need is the number after the text "Unikālo apmeklējumu skaits:", which is supplied by JavaScript.

Hopefully someone will be able to help me with this problem.

Solution

For someone who is new to web scraping this can be a hard task, so I'll try to explain it. First of all, the XPath to get to that location could be something like this:

'//td[@class="msg_footer" and contains(text(), "Unik")]'

Now you have that tag (and what it contains), but if you check, it doesn't contain the number you need. That content is loaded dynamically by JavaScript, and the script responsible is this one:

<script type="text/javascript"><!-- 

var ss_w='rādīt numuru';
document.write( '<scr'+'ipt id="contacts_js" src="/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t='+new Date()+'"></scr'+'ipt>' );

--></script>

which can be extracted from the response with this XPath:

'//script[contains(text(), "contacts_js")]/text()'

From that string, you should reconstruct the URL that appears in src, for example:

/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=
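As a rough sketch of this step in Python (standard library only), a regular expression can pull the src prefix out of the inline script text. The function name extract_src_prefix is made up for illustration, and the pattern assumes the script keeps the src="...?t=' shape shown above; if the site changes that markup, the pattern would need adjusting.

```python
import re

# Inline script text, as returned by the //script[...]/text() XPath above.
SCRIPT_TEXT = """
var ss_w='rādīt numuru';
document.write( '<scr'+'ipt id="contacts_js" src="/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t='+new Date()+'"></scr'+'ipt>' );
"""


def extract_src_prefix(script_text):
    """Return the src value up to and including '?t=', or None if not found.

    Hypothetical helper: it stops at the first '?t=' because the rest of
    the URL is generated at runtime by new Date().
    """
    match = re.search(r'src="([^"]*?\?t=)', script_text)
    return match.group(1) if match else None


print(extract_src_prefix(SCRIPT_TEXT))
```

Returning None instead of raising lets the caller decide what to do when a page has no such script.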

and append the current date to the end, since the JavaScript builds it with new Date(). Then make a request to that URL (prefixed with the domain of the previous response), so something like:

https://www.ss.lv/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=Wed%20Oct%2028%202015%2020:56:42%20GMT-0500%20(PET)
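The URL construction could be sketched in Python as follows. The helper names js_date_string and build_script_url are illustrative, and the date format only approximates what a browser produces from new Date(): the trailing zone name such as "(PET)" is left out. Whether the server checks the t value at all is not stated in the answer, so treat the exact format as an assumption.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote, urljoin

PAGE_URL = "https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html"
SRC_PREFIX = "/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t="

# Spelled out by hand so the result does not depend on the system locale.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def js_date_string(when):
    """Approximate the browser's Date string; `when` must be timezone-aware."""
    return "{} {} {}".format(
        DAYS[when.weekday()],
        MONTHS[when.month - 1],
        when.strftime("%d %Y %H:%M:%S GMT%z"),
    )


def build_script_url(page_url, src_prefix, when):
    """Resolve the src against the page URL and append the URL-encoded date."""
    absolute = urljoin(page_url, src_prefix)
    # Keep ':' unescaped to match the example URL; spaces become %20.
    return absolute + quote(js_date_string(when), safe=":")


when = datetime(2015, 10, 28, 20, 56, 42, tzinfo=timezone(timedelta(hours=-5)))
print(build_script_url(PAGE_URL, SRC_PREFIX, when))
```

Using urljoin rather than string concatenation means the same code works whether the src is root-relative, as here, or already absolute. The actual HTTP request (for instance with urllib.request.urlopen) is left out so the sketch runs offline.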

Check that the date is URL-encoded. It should return a response like:

var PHONE_CNT=-1;var PHONE_CNT2=-1;var PHONE_CNT3=-1;var EMAIL_CNT=-1;var SHOW_CNT=22;var PH_c="";var PH_1=0;var PH_2=0;var PH_3=0;

pcc_id=0;PH_1=gpzd("JTg3aCU3QyU1QnolN0MlN0JYcWh6JTVCdCU5NSU4QyU5MnV4ayU5QXElN0IlOTQlNUNweiU5MGtvJTdCJThFJTVF","55937369");

where you can check that the value of SHOW_CNT is the number you want.
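Once that response text is in hand, reading the counter is a small parsing job. A minimal sketch in Python, assuming the response keeps the `var NAME=<integer>;` form shown above (parse_js_int is a hypothetical helper name, and the sample string is a shortened copy of the response):

```python
import re

# Shortened copy of the response body shown above.
RESPONSE_TEXT = (
    'var PHONE_CNT=-1;var PHONE_CNT2=-1;var PHONE_CNT3=-1;'
    'var EMAIL_CNT=-1;var SHOW_CNT=22;var PH_c="";var PH_1=0;'
)


def parse_js_int(js_text, name):
    """Return the integer assigned in `var <name>=<int>;`, or None if absent."""
    pattern = r"\bvar\s+" + re.escape(name) + r"\s*=\s*(-?\d+)\s*;"
    match = re.search(pattern, js_text)
    return int(match.group(1)) if match else None


print(parse_js_int(RESPONSE_TEXT, "SHOW_CNT"))
```

A regular expression is enough here because the value is a plain integer literal; anything more complex in the script would call for a real JavaScript parser.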

If you want to know how I figured out which request and which script was populating that tag: I used Firebug to search for SHOW_CNT in all of the responses triggered by loading your URL. That pointed to the request I described above, and from there I checked what was issuing it.

Hope it helped.
