Extract multi-line JavaScript content from <script> tag using Scrapy

Question

I'm trying to extract data from this script tag using Scrapy:

<script>
        var hardwareTemplateFunctions;
        var storefrontContextUrl = '';

        jq(function() {
            var data = new Object();
            data.hardwareProductCode = '9054832';
            data.offeringCode = 'SMART_BASIC.TLF12PLEAS';
            data.defaultTab = '';
            data.categoryId = 10001;

            data.bundles = new Object();
                            data.bundles['SMART_SUPERX.TLF12PLEAS'] = {
                    signupFee: parsePrice('0'),
                    newMsisdnFee: parsePrice('199'),
                    upfrontPrice: parsePrice('1099'),
                    monthlyPrice: parsePrice('499'),
                    commitmentTime: parsePrice('12'),
                    offeringTitle: 'SMART Super',
                    offeringType: 'VOICE',
                    monthlyPrice: parsePrice('499'),
                    commitmentTime: 12
                };
                            data.bundles['SMART_PLUSS.TLF12PLEAS'] = {
                    signupFee: parsePrice('0'),
                    newMsisdnFee: parsePrice('199'),
                    upfrontPrice: parsePrice('1599'),
                    monthlyPrice: parsePrice('399'),
                    commitmentTime: parsePrice('12'),
                    offeringTitle: 'SMART Pluss',
                    offeringType: 'VOICE',
                    monthlyPrice: parsePrice('399'),
                    commitmentTime: 12
                };
                            data.bundles['SMART_BASIC.TLF12PLEAS'] = {
                    signupFee: parsePrice('0'),
                    newMsisdnFee: parsePrice('199'),
                    upfrontPrice: parsePrice('2199'),
                    monthlyPrice: parsePrice('299'),
                    commitmentTime: parsePrice('12'),
                    offeringTitle: 'SMART Basis',
                    offeringType: 'VOICE',
                    monthlyPrice: parsePrice('299'),
                    commitmentTime: 12
                };
                            data.bundles['SMART_MINI.TLF12PLEAS'] = {
                    signupFee: parsePrice('0'),
                    newMsisdnFee: parsePrice('199'),
                    upfrontPrice: parsePrice('2999'),
                    monthlyPrice: parsePrice('199'),
                    commitmentTime: parsePrice('12'),
                    offeringTitle: 'SMART Mini',
                    offeringType: 'VOICE',
                    monthlyPrice: parsePrice('199'),
                    commitmentTime: 12
                };
                            data.bundles['KONTANT_KOMPLETT.REGULAR'] = {
                    signupFee: parsePrice('0'),
                    newMsisdnFee: parsePrice('0'),
                    upfrontPrice: parsePrice('3499'),
                    monthlyPrice: parsePrice('0'),
                    commitmentTime: parsePrice('0'),
                    offeringTitle: 'SMART Kontant',
                    offeringType: 'PREPAID',
                    monthlyPrice: parsePrice('0'),
                    commitmentTime: 0
                };

            data.reviewJson = new Object();


            hardwareTemplateFunctions = hardwareTemplateFunctions(data);
            hardwareTemplateFunctions.init();

            data.reviewSummaryBox = hardwareTemplateFunctions.reviewSummaryBox;

            accessoryFunctions(data).init();
            additionalServiceFunctions(data).init();
        });

        function parsePrice(str) {
            var price = parseFloat(str);
            return isNaN(price) ? 0 : price;
        }

        var offerings = {};
    </script>

I want to get the data from each section that looks like this:

 data.bundles['SMART_SUPERX.TLF12PLEAS'] = {
                signupFee: parsePrice('0'),
                newMsisdnFee: parsePrice('199'),
                upfrontPrice: parsePrice('1099'),
                monthlyPrice: parsePrice('499'),
                commitmentTime: parsePrice('12'),
                offeringTitle: 'SMART Super',
                offeringType: 'VOICE',
                monthlyPrice: parsePrice('499'),
                commitmentTime: 12
            };

Then I want to read each field in that section and extract its final value, for example 1099 from upfrontPrice in this example.

I have tried fetching each object using this:

items = response.xpath('//script/text()').re("data.bundles\[.*\](.*)")

However, that only gives me the first line of data (= {). So how should I do this? Is there a better way of extracting this data from the script tag?

When I use items = response.xpath('//script/text()').re("data.bundles\[.*\] = {((?s).*) };") I seem to get only the last block (the one with data.bundles['KONTANT_KOMPLETT.REGULAR']).

How do I get a list of all of them?

Recommended answer

The following regex seems to be correct:

r"data.bundles\[[^]]*\] = {([^}]*)}"

* in regexes is greedy: it always tries to match as much as possible. So I use [^]] (any character except ]) to make sure the match stops at the closest ]. I do the same with the {} brackets, using [^}]. Additionally, because a negated character class also matches newlines, I don't have to worry about . not matching a newline.
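As a rough illustration of the point about greedy matching, here is a minimal sketch using only Python's standard re module on a shortened copy of the script text from the question. The names SCRIPT_TEXT, BUNDLE_RE, FIELD_RE and parse_bundles are made up for this example and are not part of Scrapy. The field regex assumes every value is either parsePrice('...'), a quoted string, or a bare integer, which holds for the snippet above but may not hold for other pages.

```python
import re

# Shortened copy of the script text from the question (two bundles only).
SCRIPT_TEXT = """
    data.bundles = new Object();
    data.bundles['SMART_SUPERX.TLF12PLEAS'] = {
        signupFee: parsePrice('0'),
        newMsisdnFee: parsePrice('199'),
        upfrontPrice: parsePrice('1099'),
        monthlyPrice: parsePrice('499'),
        commitmentTime: parsePrice('12'),
        offeringTitle: 'SMART Super',
        offeringType: 'VOICE',
        monthlyPrice: parsePrice('499'),
        commitmentTime: 12
    };
    data.bundles['KONTANT_KOMPLETT.REGULAR'] = {
        signupFee: parsePrice('0'),
        newMsisdnFee: parsePrice('0'),
        upfrontPrice: parsePrice('3499'),
        monthlyPrice: parsePrice('0'),
        commitmentTime: parsePrice('0'),
        offeringTitle: 'SMART Kontant',
        offeringType: 'PREPAID',
        monthlyPrice: parsePrice('0'),
        commitmentTime: 0
    };
"""

# Greedy version, similar to the attempt in the question: with DOTALL the
# ".*" runs to the end of the text and backtracks, so only one match remains.
greedy = re.findall(r"data\.bundles\[.*\] = \{(.*)\}", SCRIPT_TEXT, re.S)

# Negated character classes stop at the closest "'" and "}" instead.
# Group 1 is the bundle key, group 2 the object body.
BUNDLE_RE = re.compile(r"data\.bundles\['([^']*)'\] = \{([^}]*)\}")

# Field name followed by parsePrice('...'), a quoted string, or a bare integer.
FIELD_RE = re.compile(r"(\w+):\s*(?:parsePrice\('([^']*)'\)|'([^']*)'|(\d+))")


def parse_bundles(script_text):
    """Return {bundle_key: {field_name: value_as_string}} (hypothetical helper)."""
    bundles = {}
    for key, body in BUNDLE_RE.findall(script_text):
        fields = {}
        for name, price, text, number in FIELD_RE.findall(body):
            # At most one alternative matched; a repeated field name
            # overwrites the earlier one, as it would in JavaScript.
            fields[name] = price or text or number
        bundles[key] = fields
    return bundles


bundles = parse_bundles(SCRIPT_TEXT)
print(len(greedy), "greedy match;", len(bundles), "bundles with negated classes")
print(bundles["SMART_SUPERX.TLF12PLEAS"]["upfrontPrice"])
```

In a Scrapy callback, the script text could presumably be obtained with something like "".join(response.xpath('//script/text()').getall()) and passed to parse_bundles. Note that calling .re() with a pattern that has several groups returns one flat list of strings, so pairing keys with bodies is easier with re.findall on the joined text.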
