How to read a JavaScript object with XPath/HTMLAgilityPack


Question

For my crawler project, I need to get product details from a JavaScript object.

How can I efficiently extract the object's details from the following JavaScript? I use XPath and HTMLAgilityPack.

<script type="text/javascript">
    var product = {
        identifier: '2051189775',     //PRODUCT ID
        fn: 'Fit- Whiskered Dark Wash Skirt',
        category: ['sale'],
        brand: 'Brand Name',
        price: '22.90',  // this would be the discount price
        amount: '31.80',  // this would be the original price
        currency: 'USD',
        // The list can be even longer.
    };
</script>

I have not extracted details from JavaScript objects before. For my other crawlers, I read the details directly from the HTML.

Recommended answer

Since the HTML Agility Pack does not evaluate any of the HTML's contents, the JavaScript code should be treated as plain text. Use the SelectSingleNode method to find the script element, then read its InnerHtml to get the contents.
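A minimal sketch of that first step, assuming the HtmlAgilityPack NuGet package is referenced. The inline HTML string and the XPath filter on the text 'var product' are illustrative assumptions, not part of the original answer; a real crawler would load the downloaded page instead.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Assumption: a trimmed-down stand-in for the crawled page.
        const string html = @"<html><head>
<script type='text/javascript'>
    var product = { identifier: '2051189775', price: '22.90' };
</script>
</head><body></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the first <script> whose text mentions the variable we want.
        HtmlNode script = doc.DocumentNode.SelectSingleNode(
            "//script[contains(text(), 'var product')]");

        if (script == null)
        {
            Console.WriteLine("No matching script element found.");
            return;
        }

        // The script body is returned as raw, unevaluated text.
        Console.WriteLine(script.InnerHtml);
    }
}
```

SelectSingleNode returns null when nothing matches, so the null check matters on pages where the script block is absent.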

Either find a JavaScript parser for C# (IronJS, for example) or write your own using standard text manipulation techniques (String.* or Regex) to extract the bits you're after.
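The Regex route could look roughly like the sketch below. The pattern is an assumption tailored to the sample in the question: it captures everything from the opening curly bracket of `var product = {` up to the first closing bracket that is followed by a semicolon. It would stop too early if a string value itself contained `};`, and it does not handle nested objects robustly.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Assumption: the script text as obtained from the script node.
        const string scriptText = @"
    var product = {
        identifier: '2051189775',     //PRODUCT ID
        fn: 'Fit- Whiskered Dark Wash Skirt',
        price: '22.90'
    };";

        // Singleline lets '.' match newlines so the literal can span lines.
        Match match = Regex.Match(
            scriptText,
            @"var\s+product\s*=\s*(\{.*?\})\s*;",
            RegexOptions.Singleline);

        if (!match.Success)
        {
            Console.WriteLine("Object literal not found.");
            return;
        }

        // Group 1 holds the object literal, including its curly brackets.
        string objectLiteral = match.Groups[1].Value;
        Console.WriteLine(objectLiteral);
    }
}
```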

Once you have the text between the curly brackets, you could parse it with one of the parsers mentioned above or with a library like Json.NET, since that text is close to JSON. Strictly speaking, the unquoted keys, single-quoted strings, comments and trailing comma are not valid JSON, so a lenient parser is needed.
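A rough sketch of the Json.NET step, assuming the Newtonsoft.Json NuGet package is referenced. The literal below is a simplified version of the question's object: it keeps the unquoted keys and single-quoted strings, which Json.NET accepts, but leaves out the comments and the trailing comma. Whether your Json.NET version tolerates those too should be verified, or they can be stripped beforehand.

```csharp
using System;
using System.Globalization;
using Newtonsoft.Json.Linq;

class Program
{
    static void Main()
    {
        // Assumption: the object literal already extracted from the script,
        // with comments and the trailing comma removed.
        const string objectLiteral =
            "{ identifier: '2051189775', fn: 'Fit- Whiskered Dark Wash Skirt', " +
            "category: ['sale'], brand: 'Brand Name', " +
            "price: '22.90', amount: '31.80', currency: 'USD' }";

        JObject product = JObject.Parse(objectLiteral);

        // Values are stored as strings in the source, so convert explicitly.
        string id = (string)product["identifier"];
        decimal price = decimal.Parse(
            (string)product["price"], CultureInfo.InvariantCulture);
        decimal original = decimal.Parse(
            (string)product["amount"], CultureInfo.InvariantCulture);

        Console.WriteLine($"{id}: {price} (was {original}) {product["currency"]}");
    }
}
```

Parsing with the invariant culture avoids misreading '22.90' on machines whose locale uses a comma as the decimal separator.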
