Extract multi-line JavaScript content from a <script> tag using Scrapy
Question
I'm trying to extract data from this script tag using Scrapy:
<script>
var hardwareTemplateFunctions;
var storefrontContextUrl = '';
jq(function() {
var data = new Object();
data.hardwareProductCode = '9054832';
data.offeringCode = 'SMART_BASIC.TLF12PLEAS';
data.defaultTab = '';
data.categoryId = 10001;
data.bundles = new Object();
data.bundles['SMART_SUPERX.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('199'),
upfrontPrice: parsePrice('1099'),
monthlyPrice: parsePrice('499'),
commitmentTime: parsePrice('12'),
offeringTitle: 'SMART Super',
offeringType: 'VOICE',
monthlyPrice: parsePrice('499'),
commitmentTime: 12
};
data.bundles['SMART_PLUSS.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('199'),
upfrontPrice: parsePrice('1599'),
monthlyPrice: parsePrice('399'),
commitmentTime: parsePrice('12'),
offeringTitle: 'SMART Pluss',
offeringType: 'VOICE',
monthlyPrice: parsePrice('399'),
commitmentTime: 12
};
data.bundles['SMART_BASIC.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('199'),
upfrontPrice: parsePrice('2199'),
monthlyPrice: parsePrice('299'),
commitmentTime: parsePrice('12'),
offeringTitle: 'SMART Basis',
offeringType: 'VOICE',
monthlyPrice: parsePrice('299'),
commitmentTime: 12
};
data.bundles['SMART_MINI.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('199'),
upfrontPrice: parsePrice('2999'),
monthlyPrice: parsePrice('199'),
commitmentTime: parsePrice('12'),
offeringTitle: 'SMART Mini',
offeringType: 'VOICE',
monthlyPrice: parsePrice('199'),
commitmentTime: 12
};
data.bundles['KONTANT_KOMPLETT.REGULAR'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('0'),
upfrontPrice: parsePrice('3499'),
monthlyPrice: parsePrice('0'),
commitmentTime: parsePrice('0'),
offeringTitle: 'SMART Kontant',
offeringType: 'PREPAID',
monthlyPrice: parsePrice('0'),
commitmentTime: 0
};
data.reviewJson = new Object();
hardwareTemplateFunctions = hardwareTemplateFunctions(data);
hardwareTemplateFunctions.init();
data.reviewSummaryBox = hardwareTemplateFunctions.reviewSummaryBox;
accessoryFunctions(data).init();
additionalServiceFunctions(data).init();
});
function parsePrice(str) {
var price = parseFloat(str);
return isNaN(price) ? 0 : price;
}
var offerings = {};
</script>
I want to get the data from each section that looks like this:
data.bundles['SMART_SUPERX.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('199'),
upfrontPrice: parsePrice('1099'),
monthlyPrice: parsePrice('499'),
commitmentTime: parsePrice('12'),
offeringTitle: 'SMART Super',
offeringType: 'VOICE',
monthlyPrice: parsePrice('499'),
commitmentTime: 12
};
and then fetch the value of each field, for example upfrontPrice (1099 in this example).
I have tried fetching each object using this:
items = response.xpath('//script/text()').re("data.bundles\[.*\](.*)")
However, that only gives me the first line of data ( = { ). So how should I do this? Is there a better way of extracting this data from the script tag?
Edit: When I use items = response.xpath('//script/text()').re("data.bundles\[.*\] = {((?s).*) };") I seem to get only the last block (the one with data.bundles['KONTANT_KOMPLETT.REGULAR']).
How do I get a list of all of them?
Recommended answer
The following regex seems to be correct:
r"data\.bundles\[[^\]]*\] = {([^}]*)}"
* in regexes is greedy: it always tries to match as much as possible, so I use [^\]] to make sure that the match stops at the closest ]. I do the same with the {} braces, using [^}] to stop at the closest }. Additionally, I don't have to worry about . not matching newlines, because a negated character class such as [^}] matches every character except the ones listed, newlines included.
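As a rough illustration of how the pattern above could be applied, here is a minimal sketch using only Python's standard re module on a shortened copy of the script text. The names script_text, BUNDLE_RE, FIELD_RE and parse_bundles are made up for this example, the extra capture group around the bundle key is an addition to the answer's pattern, and the field pattern assumes every value is a parsePrice('...') call, a quoted string, or a bare number, as in the question's data.

```python
import re

# Shortened stand-in for the text of the <script> tag (two bundles only).
script_text = """
data.bundles['SMART_SUPERX.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
upfrontPrice: parsePrice('1099'),
offeringTitle: 'SMART Super',
commitmentTime: 12
};
data.bundles['KONTANT_KOMPLETT.REGULAR'] = {
signupFee: parsePrice('0'),
upfrontPrice: parsePrice('3499'),
offeringTitle: 'SMART Kontant',
commitmentTime: 0
};
"""

# The answer's pattern, with one extra group (an assumption of this sketch)
# that also captures the bundle key between the square brackets.
BUNDLE_RE = re.compile(r"data\.bundles\[([^\]]*)\] = {([^}]*)}")

# Hypothetical helper pattern: a field name followed by parsePrice('N'),
# a quoted string, or a bare number.
FIELD_RE = re.compile(r"(\w+):\s*(?:parsePrice\('([^']*)'\)|'([^']*)'|([\d.]+))")


def parse_bundles(text):
    """Return {bundle key: {field name: raw value string}}."""
    bundles = {}
    for key, body in BUNDLE_RE.findall(text):
        fields = {}
        for name, price, string, number in FIELD_RE.findall(body):
            # Exactly one of the three value groups is non-empty per match;
            # a repeated field name overwrites the earlier value.
            fields[name] = price or string or number
        bundles[key.strip("'\"")] = fields
    return bundles


bundles = parse_bundles(script_text)
print(bundles["SMART_SUPERX.TLF12PLEAS"]["upfrontPrice"])  # 1099
```

Values are kept as strings here; converting them with float() or int() is left to the caller. Inside a spider, the answer's single-group pattern can presumably be passed straight to response.xpath('//script/text()').re(...), which should then return one block body per bundle.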