Python - How can I scrape JavaScript code with bs4?

Question

I have been trying to scrape a value out of an HTML page, where the value sits inside JavaScript. There is a lot of JavaScript in the page, but I only want to print this one block:

var spConfig = new Product.Config({
  "attributes": {
    "531": {
      "id": "531",
      "options": [
        {
          "id": "18",
          "hunter": "0",
          "products": [
            "128709"
          ]
        },
        {
          "label": "40 1\/2",
          "hunter": "0",
          "products": [
            "120151"
          ]
        },
        {
          "id": "33",
          "hunter": "0",
          "products": [
            "120152"
          ]
        },
        {
          "id": "36",
          "hunter": "0",
          "products": [
            "128710"
          ]
        },
        {
          "id": "42",
          "hunter": "0",
          "products": [
            "125490"
          ]
        }
      ]
    }
  },

  "Id": "120153"
});

So I started with code that looks like this:

test = bs4.find_all('script', {'type': 'text/javascript'})
print(test)

The output is huge, so I can't post all of it here, but one of the scripts is the JavaScript shown at the top, and I want to print only var spConfig = new Product.Config.

How can I print just var spConfig = new Product.Config..., so that I can later pass it to json.loads and work with the parsed data more easily?

If anything is unclear or poorly explained, let me know and I will clarify in the comments, so I can get better at asking on Stack Overflow! :)

More examples of what bs4 prints out for the script tags:

<script type="text/javascript">var optionsPrice = new Product.Options({
  "priceFormat": {
    "pattern": "%s\u00a0\u20ac",
    "precision": 2,
    "requiredPrecision": 2,
    "decimalSymbol": ",",
    "groupSymbol": "\u00a0",
    "groupLength": 3,
    "integerRequired": 1
  },
  "showBoths": false,
  "idSuffix": "_clone",
  "skipCalculate": 1,
  "defaultTax": 20,
  "currentTax": 20,
  "tierPrices": [

  ],
  "tierPricesInclTax": [

  ],
  "swatchPrices": null
});</script>,
<script type="text/javascript">var spConfig = new Product.Config({
  "attributes": {
    "531": {
      "id": "531",
      "options": [
        {
          "id": "18",
          "hunter": "0",
          "products": [
            "128709"
          ]
        },
        {
          "label": "40 1\/2",
          "hunter": "0",
          "products": [
            "120151"
          ]
        },
        {
          "id": "33",
          "hunter": "0",
          "products": [
            "120152"
          ]
        },
        {
          "id": "36",
          "hunter": "0",
          "products": [
            "128710"
          ]
        },
        {
          "id": "42",
          "hunter": "0",
          "products": [
            "125490"
          ]
        }
      ]
    }
  },

  "Id": "120153"
});</script>,
<script type="text/javascript">document.observe('dom:loaded',
function(){
  var swatchesConfig = new Product.ConfigurableSwatches(spConfig);
});</script>

Edit, update 2:

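import re
import sys

# `bs4` is assumed to be the parsed BeautifulSoup object, as in the snippet above
try:
    product_li_tags = bs4.find_all('script', {'type': 'text/javascript'})
except Exception:
    product_li_tags = []

# Capture everything between "Product.Config(" and the closing ");"
pat = r"Product\.Config\((.+)\);"

for product_li_tag in product_li_tags:
    # re.search needs a string, so match against the tag's text, not the Tag object
    match = re.search(pat, product_li_tag.text, flags=re.DOTALL)
    if match:
        json_str = match.group(1)
        print(json_str)

#json.loads(json_str)
print("Nothing")
sys.exit()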

Recommended answer

You can use the .text attribute to get the content of each tag. Then, if you know you want the code that starts with "var optionsPrice", you can filter for that:

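from bs4 import BeautifulSoup

# myhtml holds the page's HTML source as a string
soup = BeautifulSoup(myhtml, 'lxml')

script_blocks = soup.find_all('script', {'type': 'text/javascript'})
special_code = ''
for s in script_blocks:
    # Keep the first script whose code starts with the variable declaration
    if s.text.strip().startswith('var optionsPrice'):
        special_code = s.text
        break

print(special_code)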
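
The same filter can target the spConfig block that the question asks about. Below is a minimal, self-contained sketch. It assumes a shortened stand-in for the page HTML; `sample_html` and `find_script_starting_with` are hypothetical names that are not part of the original answer. It uses the built-in `html.parser` so that lxml is not required, and reads `.string`, which returns the single text node inside each script tag:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page (assumption); only the structure matters.
sample_html = """
<script type="text/javascript">var optionsPrice = new Product.Options({"defaultTax": 20});</script>
<script type="text/javascript">var spConfig = new Product.Config({"Id": "120153"});</script>
"""


def find_script_starting_with(html, prefix):
    """Return the code of the first script tag that starts with prefix, or ''."""
    soup = BeautifulSoup(html, 'html.parser')
    for s in soup.find_all('script', {'type': 'text/javascript'}):
        # .string is None for an empty tag, so fall back to an empty string
        code = (s.string or '').strip()
        if code.startswith(prefix):
            return code
    return ''


sp_config_code = find_script_starting_with(sample_html, 'var spConfig')
print(sp_config_code)
```

Matching on the prefix only works while the site keeps that exact variable declaration, so the helper returns an empty string rather than raising when nothing matches.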

To answer your question in the comments, there are a couple of ways to extract the part of the text that contains the JSON. You could pass it through a regular expression to grab everything between the first left parenthesis and the ); at the end. If you want to avoid regular expressions completely, you could do something like:

json_stuff = special_code[special_code.find('(')+1:special_code.rfind(')')]

Then to make a usable dictionary out of it:

import json
j = json.loads(json_stuff)
print(j['defaultTax'])  # This should return a value of 20
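
The regular-expression route mentioned above can be sketched as follows, applied here to a shortened copy of the spConfig script from the question. The sample string, the pattern, and the helper name `extract_json_arg` are illustrative assumptions rather than part of the original answer:

```python
import json
import re

# Shortened copy of the spConfig script text from the question (assumption).
sp_config_code = '''var spConfig = new Product.Config({
  "attributes": {
    "531": {
      "id": "531",
      "options": [
        {"id": "18", "hunter": "0", "products": ["128709"]},
        {"id": "33", "hunter": "0", "products": ["120152"]}
      ]
    }
  },
  "Id": "120153"
});'''


def extract_json_arg(js_code):
    """Return the text between the first '(' and the final ');' of js_code."""
    # DOTALL lets '.' span newlines; the greedy '.+' runs to the last ');'
    match = re.search(r'\((.+)\)\s*;', js_code, flags=re.DOTALL)
    if match is None:
        raise ValueError('no parenthesised argument found')
    return match.group(1)


config = json.loads(extract_json_arg(sp_config_code))

# Collect the product IDs listed under every option of every attribute.
product_ids = [
    product
    for attribute in config['attributes'].values()
    for option in attribute['options']
    for product in option['products']
]
print(config['Id'])
print(product_ids)
```

If `json.loads` raises `json.JSONDecodeError`, the script argument is probably a JavaScript object literal that is not strict JSON (for example, one with a trailing comma), and it would need cleaning up first.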
