Extract multi-line JavaScript content from a <script> tag using Scrapy
Question
I'm trying to extract data from this script tag using Scrapy:
<script>
var hardwareTemplateFunctions;
var storefrontContextUrl = '';
jq(function() {
var data = new Object();
data.hardwareProductCode = '9054832';
data.offeringCode = 'SMART_BASIC.TLF12PLEAS';
data.defaultTab = '';
data.categoryId = 10001;
data.bundles = new Object();
data.bundles['SMART_SUPERX.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('199'),
upfrontPrice: parsePrice('1099'),
monthlyPrice: parsePrice('499'),
commitmentTime: parsePrice('12'),
offeringTitle: 'SMART Super',
offeringType: 'VOICE',
monthlyPrice: parsePrice('499'),
commitmentTime: 12
};
data.bundles['SMART_PLUSS.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('199'),
upfrontPrice: parsePrice('1599'),
monthlyPrice: parsePrice('399'),
commitmentTime: parsePrice('12'),
offeringTitle: 'SMART Pluss',
offeringType: 'VOICE',
monthlyPrice: parsePrice('399'),
commitmentTime: 12
};
data.bundles['SMART_BASIC.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('199'),
upfrontPrice: parsePrice('2199'),
monthlyPrice: parsePrice('299'),
commitmentTime: parsePrice('12'),
offeringTitle: 'SMART Basis',
offeringType: 'VOICE',
monthlyPrice: parsePrice('299'),
commitmentTime: 12
};
data.bundles['SMART_MINI.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('199'),
upfrontPrice: parsePrice('2999'),
monthlyPrice: parsePrice('199'),
commitmentTime: parsePrice('12'),
offeringTitle: 'SMART Mini',
offeringType: 'VOICE',
monthlyPrice: parsePrice('199'),
commitmentTime: 12
};
data.bundles['KONTANT_KOMPLETT.REGULAR'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('0'),
upfrontPrice: parsePrice('3499'),
monthlyPrice: parsePrice('0'),
commitmentTime: parsePrice('0'),
offeringTitle: 'SMART Kontant',
offeringType: 'PREPAID',
monthlyPrice: parsePrice('0'),
commitmentTime: 0
};
data.reviewJson = new Object();
hardwareTemplateFunctions = hardwareTemplateFunctions(data);
hardwareTemplateFunctions.init();
data.reviewSummaryBox = hardwareTemplateFunctions.reviewSummaryBox;
accessoryFunctions(data).init();
additionalServiceFunctions(data).init();
});
function parsePrice(str) {
var price = parseFloat(str);
return isNaN(price) ? 0 : price;
}
var offerings = {};
</script>
I want to get the data from each section that looks like this:
data.bundles['SMART_SUPERX.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
newMsisdnFee: parsePrice('199'),
upfrontPrice: parsePrice('1099'),
monthlyPrice: parsePrice('499'),
commitmentTime: parsePrice('12'),
offeringTitle: 'SMART Super',
offeringType: 'VOICE',
monthlyPrice: parsePrice('499'),
commitmentTime: 12
};
and then read each field to get its final value, for example upfrontPrice (1099 in this example).
I have tried fetching each object using this:
items = response.xpath('//script/text()').re("data.bundles\[.*\](.*)")
However, that only gives me the first line of data (= {). So how should I do this? Is there a better way of extracting this data from the script tag?
When I use items = response.xpath('//script/text()').re("data.bundles\[.*\] = {((?s).*) };") I seem to get only the last block (the one with data.bundles['KONTANT_KOMPLETT.REGULAR']).
How do I get a list of all of them?
Recommended answer
The following regex seems to be correct:
r"data\.bundles\[[^\]]*\] = \{([^}]*)\}"
The * quantifier in regexes is greedy: it always tries to match as much as possible, so a pattern like \[.*\] runs on to the last ] it can find instead of stopping at the first one. I use [^\]] to make sure the match stops at the closest ]. I do the same for the {} brackets with [^}]. Additionally, I don't have to worry about . not matching newlines, because a negated character class such as [^}] matches newline characters as well.
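To show how the negated character classes behave on multi-line input, here is a minimal sketch using only Python's standard re module. It is a variation on the regex above, not the answerer's exact code: the pattern also captures the bundle key, and a second pattern splits each block into fields. SCRIPT_TEXT, BUNDLE_RE, FIELD_RE and parse_bundles are hypothetical names introduced for illustration, and the sample text is a shortened stand-in for the script in the question.

```python
import re

# Shortened stand-in for the text of the <script> tag in the question.
SCRIPT_TEXT = """
data.bundles['SMART_SUPERX.TLF12PLEAS'] = {
signupFee: parsePrice('0'),
upfrontPrice: parsePrice('1099'),
offeringTitle: 'SMART Super',
commitmentTime: 12
};
data.bundles['KONTANT_KOMPLETT.REGULAR'] = {
signupFee: parsePrice('0'),
upfrontPrice: parsePrice('3499'),
offeringTitle: 'SMART Kontant',
commitmentTime: 0
};
"""

# One match per bundle. Group 1 is the key between the quotes, group 2 is
# everything between { and the closest }, newlines included.
BUNDLE_RE = re.compile(r"data\.bundles\['([^']*)'\] = \{([^}]*)\}")

# One match per "name: value" entry inside a block. The value may be
# parsePrice('N'), a quoted string, or a bare number.
FIELD_RE = re.compile(
    r"(\w+):\s*(?:parsePrice\('([^']*)'\)|'([^']*)'|([\d.]+))"
)


def parse_bundles(script_text):
    """Return {bundle_key: {field_name: value_as_string}}."""
    bundles = {}
    for key, body in BUNDLE_RE.findall(script_text):
        fields = {}
        for name, price, text, number in FIELD_RE.findall(body):
            # Exactly one of the three value groups is non-empty per match.
            # A repeated field name overwrites the earlier value, which is
            # also what the JavaScript object literal does.
            fields[name] = price or text or number
        bundles[key] = fields
    return bundles


bundles = parse_bundles(SCRIPT_TEXT)
print(bundles["SMART_SUPERX.TLF12PLEAS"]["upfrontPrice"])
```

In a spider, SCRIPT_TEXT would be replaced by the text of the relevant script element taken from the response. All values are kept as strings here; converting prices to numbers is left to the caller.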