Scraping hidden elements using BeautifulSoup

Problem description

I was trying to scrape data from a website for my project. The problem is that the tags I see in my browser's developer tools do not appear in my output. The following is a snapshot of the DOM from which I want to scrape the data:

<div class="bigContainer">
      <!-- ngIf: products.grid_layout.length > 0 --><div ng-if="products.grid_layout.length > 0">
        <div class="fl">
          <!-- ngRepeat: product in products.grid_layout --><!-- ngIf: $index%3==0 -->
          <div ng-repeat="product in products.grid_layout" ng-if="$index%3==0" class="GridItems">
          <grid-item product="product" gakey="ga_key" idx="$index" ancestors="products.ancestors" is-search-item="isSearchItem" is-filter="isFilter">
              <a ng-href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" ng-click="searchProductTrack(product, idx+1)" tabindex="0" href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" class="" style="">
           </grid-item>   

I am able to get the div tag with class "bigContainer", but I am not able to scrape the tags inside it. For example, if I try to get the grid-item tag, I get an empty list, which suggests there is no such tag. Why is this happening? Please help!

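Solution

You can use the underlying web API to extract the grid-item details. The grid items are rendered in the browser by the AngularJS JavaScript framework, so the HTML is not static: the page source you download does not contain them yet, because the script that inserts them has not run.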
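To see why the lookup comes back empty, it can help to parse the kind of markup the server sends before AngularJS runs. The snippet below is a minimal sketch: `raw_html` is a hypothetical stand-in for the downloaded page source, assuming it contains only the empty container and the Angular comment placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical page source as served, before AngularJS fills the container.
raw_html = """
<div class="bigContainer">
  <!-- ngIf: products.grid_layout.length > 0 -->
</div>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# The container exists in the static markup...
container = soup.find("div", class_="bigContainer")
# ...but the grid-item elements are only created later, in the browser.
grid_items = soup.find_all("grid-item")

print("container found:", container is not None)
print("grid-item tags found:", len(grid_items))
```

BeautifulSoup only parses the text it is given; it does not execute JavaScript, so elements created client-side never show up.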

One way to get the data would be to use Selenium, but identifying the web API is pretty simple using the browser's developer tools.

EDIT: I use the Firebug add-on with Firefox to see the GET requests listed in the "Net" tab. The GET request for the page is:

https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2
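The query string of that request can be taken apart with the standard library to see which parameters the page sends. This is only an illustrative sketch; reading `page_count` and `items_per_page` as paging controls is an assumption based on their names, not something the API documents here.

```python
from urllib.parse import parse_qs, urlsplit

url = (
    "https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones"
    "?page_count=1&items_per_page=30&resolution=960x720&quality=high"
    "&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2"
)

parts = urlsplit(url)
# parse_qs returns a list of values per key; each key appears once here.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

for key, value in params.items():
    print(f"{key} = {value}")
```

The `callback` parameter is the one that matters most for parsing: its value is the function name that wraps the data in the response.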

It returned a JavaScript callback script (a JSONP response), which was almost completely JSON data.
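A response of this kind is a function call wrapped around a JSON payload, so the wrapper has to be removed before the data can be decoded. Here is a rough sketch of one way to do that; `unwrap_jsonp` is a hypothetical helper, and `sample` is a shortened, made-up response body modelled on the callback name in the URL.

```python
import json
import re


def unwrap_jsonp(text):
    """Return the decoded JSON payload of a JSONP response body."""
    # Expect: callbackName( <json> ) with an optional trailing semicolon.
    match = re.fullmatch(r"\s*[\w.$]+\s*\((.*)\)\s*;?\s*", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


# Hypothetical, shortened response body for illustration.
sample = 'angular.callbacks._3({"grid_layout": [{"name": "Samsung Galaxy Z1 (Black)"}]});'
payload = unwrap_jsonp(sample)
print(payload["grid_layout"][0]["name"])
```

The greedy match keeps everything up to the last closing parenthesis, so parentheses inside the data, such as "(Black)", do not cut the payload short.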

The JSON it returned contained the details for the grid items. Each grid item was described as a JSON object like the one below:

{
    "product_id": 23491960,
    "complex_product_id": 7287171,
    "name": "Samsung Galaxy Z1 (Black)",
    "short_desc": "",
    "bullet_points": {
        "salient_feature": ["Screen: 10.16 cm (4\")", "Camera: 3.1 MP Rear/VGA Front", "RAM: 768 MB", "ROM: 4 GB", "Dual-core 1.2 GHz Cortex-A7", "Battery: 1500 mAh/Li-Ion"]
    },
    "url": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745",
    "seourl": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745",
    "url_type": "product",
    "promo_text": null,
    "image_url": "https://assetscdn.paytm.com/images/catalog/product/M/MO/MOBSAMSUNG-Z1-BSMAR2320696B3C745/2.jpg",
    "vertical_id": 18,
    "vertical_label": "Mobile",
    "offer_price": 5090,
    "actual_price": 5799,
    "merchant_name": "SMARTBUY",
    "authorised_merchant": false,
    "stock": true,
    "brand": "Samsung",
    "tag": "+5% Cashback",
    "product_tag": "+5% Cashback",
    "shippable": true,
    "created_at": "2015-09-17T08:28:25.000Z",
    "updated_at": "2015-12-29T05:55:29.000Z",
    "img_width": 400,
    "img_height": 400,
    "discount": "12"
}
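Once decoded, each object is an ordinary Python dict, and the fields of interest can be copied into a small record type. The sketch below uses a trimmed copy of the object above; `GridItem` is a hypothetical class, and treating `discount` as the rounded percentage off `actual_price` is an inference from this single sample, not a documented rule.

```python
import json
from dataclasses import dataclass


@dataclass
class GridItem:
    brand: str
    name: str
    offer_price: int
    actual_price: int

    @property
    def discount_percent(self):
        # Percentage saved relative to the listed (actual) price.
        return round(100 * (self.actual_price - self.offer_price) / self.actual_price)


# Trimmed copy of the sample object shown above.
raw_item = json.loads("""
{"name": "Samsung Galaxy Z1 (Black)", "brand": "Samsung",
 "offer_price": 5090, "actual_price": 5799, "discount": "12"}
""")

item = GridItem(
    brand=raw_item["brand"],
    name=raw_item["name"],
    offer_price=raw_item["offer_price"],
    actual_price=raw_item["actual_price"],
)
print(item.name, item.discount_percent, raw_item["discount"])
```

Note that `discount` arrives as a string while the prices are numbers, so it needs `int()` before any comparison or arithmetic.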

So you can get the details without even using BeautifulSoup, in the following way:

import json

import requests

# JSONP endpoint found in the browser's network tab.
url = "https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2"

response = requests.get(url, timeout=10)
response.raise_for_status()

# Strip the JSONP wrapper "angular.callbacks._3( ... );" to leave plain JSON.
# Slicing between the first "(" and the last ")" avoids splitting on ");",
# which would truncate the payload if that sequence appeared inside the data.
body = response.text
json_response = body[body.index("(") + 1:body.rindex(")")]
data = json.loads(json_response)
grid_data = data["grid_layout"]

for grid_item in grid_data:
    print("Brand:", grid_item["brand"])
    print("Product Name:", grid_item["name"])
    print("Current Price: Rs", grid_item["offer_price"])
    print("==================")

You would get output like this:

Brand: Samsung
Product Name: Samsung Galaxy Z1 (Black)
Current Price: Rs 4990
==================
Brand: Samsung
Product Name: Samsung Galaxy A7 (Gold)
Current Price: Rs 22947
==================

Hope this helps.
