从< script>获取数据使用Scrapy在HTML中标记 [英] Get data from <script> tag in HTML using Scrapy
问题描述
我一直在尝试使用Scrapy(xpath)从Kbb的HTML中提取脚本标记中的数据。但我的主要问题是识别正确的div和脚本标签。我是新手使用xpath并感谢任何帮助!
I've been trying to extract data from script tag in Kbb's HTML using Scrapy(xpath). But my main issue is with identifying the correct div and script tags. I'm new to using xpath and would appreciate any help!
<script type="text/javascript" src="http://s1.kbb.com/combine/IncentivesPilotJs/949332058"></script>
<input type="hidden" id="ResaleValueUrl" value="/ymmt/resalevalue/?vehicleid=392396" />
<input type="hidden" id="Intent" value="buy-used" />
<!--[if lt IE 9]>
<script>
window.FlashCanvasOptions = {
swfPath: "/js/canvas/FlashCanvas/UCMarketMeter/"
};
</script>
<script type="text/javascript" src="http://s1.kbb.com/combine/YmmtMarketMeterFlashCanvasJs/795892638"></script>
<![endif]-->
<script type="text/javascript" src="http://s1.kbb.com/combine/YMMTOverview/1527402533"></script>
<script type="text/javascript" src="http://s1.kbb.com/combine/YmmtPricingOverviewBuyUsedJs/-1416499456"></script>
<script language="javascript" type="text/javascript">
$(document).ready(function() {
KBB.Vehicle.Pages.PricingOverview.Buyers.setup({
//Workaround until we get cross domain working for Flash
imageDir: window.FlashCanvasOptions ? "/Content/images" : "http://file.kelleybluebookimages.com/kbb/images/marketmeter",
vehicleId: "392396",
zipCode: "78701",
mileage: "10000",
intent: "buy-used",
priceType: "retail",
condition: "good",
options: "392396|53635|78701|100|10|",
price: "17074",
manufacturer: "Nissan",
model: "Altima",
year: "2014",
style: "2.5 S Sedan 4D",
category: "",
hasCpo: true,
meetsCpoReq: true,
showOthersPaid: false,
data: {
"values": {
"cpo": {
"priceMin": 17335.0,
"price": 18275.0,
"priceMax": 19214.0
},
"fpp": {
"priceMin": 15286.0,
"price": 17074.0,
"priceMax": 18861.0
},
"privatepartyexcellent": {
"priceMin": 0.0,
"price": 16064.0,
"priceMax": 0.0
},
"privatepartyfair": {
"priceMin": 0.0,
"price": 14081.0,
"priceMax": 0.0
},
"privatepartygood": {
"priceMin": 0.0,
"price": 15454.0,
"priceMax": 0.0
},
"privatepartyverygood": {
"priceMin": 0.0,
"price": 15715.0,
"priceMax": 0.0
},
"retail": {
"priceMin": 0.0,
"price": 17875.0,
"priceMax": 0.0
}
},
"timAmount": 0.0,
"monthlyPayments": {
"cpo": {
"vehiclePrice": 18275.0,
"rate": 2.9,
"terms": 60.0,
"taxAndTitle": 6.5,
"downPay": 0.0,
"amount": 348.0
},
"fpp": {
"vehiclePrice": 17074.0,
"rate": 4.9,
"terms": 60.0,
"taxAndTitle": 6.5,
"downPay": 0.0,
"amount": 342.0
},
"privatepartyexcellent": {
"vehiclePrice": 16064.0,
"rate": 4.9,
"terms": 60.0,
"taxAndTitle": 6.5,
"downPay": 0.0,
"amount": 322.0
},
"privatepartyfair": {
"vehiclePrice": 14081.0,
"rate": 4.9,
"terms": 60.0,
"taxAndTitle": 6.5,
"downPay": 0.0,
"amount": 282.0
},
"privatepartygood": {
"vehiclePrice": 15454.0,
"rate": 4.9,
"terms": 60.0,
"taxAndTitle": 6.5,
"downPay": 0.0,
"amount": 309.0
},
"privatepartyverygood": {
"vehiclePrice": 15715.0,
"rate": 4.9,
"terms": 60.0,
"taxAndTitle": 6.5,
"downPay": 0.0,
"amount": 315.0
},
"retail": {
"vehiclePrice": 17875.0,
"rate": 4.9,
"terms": 60.0,
"taxAndTitle": 6.5,
"downPay": 0.0,
"amount": 358.0
}
},
"scale": {
"scaleLow": 14081.0,
"scaleHigh": 19214.0
},
"transactions": {
"below": 7,
"between": 17,
"above": 3
}
},
adPriceRanges: {"AdPriceRange":[{"PriceMin":0,"PriceMax":8499,"AdPRValue":1},{"PriceMin":8500,"PriceMax":18499,"AdPRValue":2},{"PriceMin":18500,"PriceMax":23499,"AdPRValue":3},{"PriceMin":23500,"PriceMax":28499,"AdPRValue":4},{"PriceMin":28500,"PriceMax":33499,"AdPRValue":5},{"PriceMin":33500,"PriceMax":38499,"AdPRValue":6},{"PriceMin":38500,"PriceMax":43499,"AdPRValue":7},{"PriceMin":43500,"PriceMax":48499,"AdPRValue":8},{"PriceMin":48500,"PriceMax":53499,"AdPRValue":9},{"PriceMin":53500,"PriceMax":63499,"AdPRValue":10},{"PriceMin":63500,"PriceMax":73499,"AdPRValue":11},{"PriceMin":73500,"PriceMax":1000000,"AdPRValue":12}]}});
});
$('.foot-note').hide();
$(window).on('popstate', function() {
KBB.Vehicle.Pages.PricingOverview.Buyers.stateChangeHandler();
});
</script>
Scrapy Code:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
import scrapy
from kbb.items import kbbItem
class kbbSpider(scrapy.Spider):
name = "kbb"
allowed_domains = ["kbb.com"]
start_urls = [
"http://www.kbb.com/nissan/altima/2014/25-s-sedan-4d/?vehicleid=392396&intent=buy-used&10000&good&pricetype=retail"
]
def parse(self, response):
sel=Selector(response)
#sites=sel.xpath('//div')
items=[]
#for site in sites:
item=kbbItem
item['priceMin']=site.xpath('//div/script').extract[35][915:922]
return items
我最终想要填充 priceMin
,价格
, priceMax
来自 fpp
以及从零售
字段到我的商品的价格。目前我正在使用索引来获取这些值但是想知道是否有更简单的方法。
I finally want to populate priceMin
, price
, priceMax
from fpp
and price from retail
field into my items. Currently I'm using indices to get those values but was wondering if there is an easier way.
推荐答案
问题是所需数据在Javascript代码中。并且,您当前依赖行索引的方法非常脆弱且不可靠。
The problem is that the desired data is inside the Javascript code. And, your current approach where you rely on line indexes is quite fragile and unreliable.
想法是找到脚本
包含所需数据的标签,请使用正则表达式到达包含价格的对象/字典,在 json
模块并获取所需信息。
The idea is to locate the script
tag containing the desired data, use regular expressions to get to the object/dictionary containing prices, load the object into a python dictionary with the help of json
module and get the desired information.
来自 Scrapy的演示Shell :
In [1]: import re
In [2]: import json
In [3]: pattern = re.compile(r"KBB\.Vehicle\.Pages\.PricingOverview\.Buyers\.setup\(.*?data: ({.*?}),\W+adPriceRanges", re.MULTILINE | re.DOTALL)
In [4]: data = response.xpath("//script[contains(., 'KBB.Vehicle.Pages.PricingOverview.Buyers.setup')]/text()").re(pattern)[0]
In [5]: data = data.replace("//Workaround until we get cross domain working for Flash", "")
In [6]: data_obj = json.loads(data)
In [7]: data_obj['values']['fpp']
Out[7]: {u'price': 15569.0, u'priceMax': 17356.0, u'priceMin': 13781.0}
In [8]: data_obj['values']['retail']
Out[8]: {u'price': 16370.0, u'priceMax': 0.0, u'priceMin': 0.0}
这篇关于从< script>获取数据使用Scrapy在HTML中标记的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!