Scraping hidden elements using BeautifulSoup


Problem description

I was trying to scrape data from a website for my project. The problem is that the tags I see in my browser's developer tools do not appear in my output. The following is a snapshot of the DOM from which I wanted to scrape the data:

<div class="bigContainer">
      <!-- ngIf: products.grid_layout.length > 0 --><div ng-if="products.grid_layout.length > 0">
        <div class="fl">
          <!-- ngRepeat: product in products.grid_layout --><!-- ngIf: $index%3==0 -->
          <div ng-repeat="product in products.grid_layout" ng-if="$index%3==0" class="GridItems">
          <grid-item product="product" gakey="ga_key" idx="$index" ancestors="products.ancestors" is-search-item="isSearchItem" is-filter="isFilter">
              <a ng-href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" ng-click="searchProductTrack(product, idx+1)" tabindex="0" href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" class="" style="">
           </grid-item>   

I am able to get the div tag with class "bigContainer", but I am not able to scrape the tags inside it. For example, if I try to get the grid-item tag, I get an empty list, which suggests there is no such tag. Why is this happening? Please help!

Solution

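You can use the underlying web API to extract the grid-item details. The grid items are rendered in the browser by the AngularJS JavaScript framework, so the HTML is not static. BeautifulSoup only parses the HTML the server sends and does not run JavaScript, which is why those tags never show up in its output.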
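To make the empty result concrete, here is a minimal sketch. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages, and `STATIC_HTML` is a hypothetical stand-in for what the server might send before AngularJS runs, not the actual Paytm markup.

```python
from html.parser import HTMLParser

# Hypothetical static HTML, standing in for the server response before
# AngularJS has run: the container exists, the grid items do not.
STATIC_HTML = """
<div class="bigContainer">
  <div ng-view></div>
</div>
"""


class TagCounter(HTMLParser):
    """Counts how often each start tag appears in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    """Return a dict mapping tag name to number of occurrences."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


counts = count_tags(STATIC_HTML)
print("div tags:", counts.get("div", 0))
print("grid-item tags:", counts.get("grid-item", 0))
```

Any parser that works on the raw response, BeautifulSoup included, is in the same position: it can only find tags that are present in the text it is given.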

One option would be to use Selenium to load the page and get the rendered data, but identifying the web API is pretty simple using the browser's developer tools.

EDIT: I use the Firebug add-on for Firefox to see the GET requests listed in its "Net" tab.

The GET request for the page is:

https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2
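As a quick aid to reading that URL, the sketch below splits it into its query parameters using only the standard library. The `callback` value is the name of the function the response is wrapped in, which is why the response is a script rather than bare JSON. What the other parameters do is not documented here, so the code only lists them.

```python
from urllib.parse import parse_qs, urlsplit

URL = (
    "https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/"
    "smart-phones?page_count=1&items_per_page=30&resolution=960x720"
    "&quality=high&sort_popular=1&cat_tree=1"
    "&callback=angular.callbacks._3&channel=web&version=2"
)

parts = urlsplit(URL)

# parse_qs returns a list of values per key; each key appears once here,
# so keep only the first value.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

for key, value in params.items():
    print(f"{key} = {value}")
```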

And it returned a callback JS script, which was almost completely JSON data.
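A response of this shape, a function call with JSON as its only argument, is usually called JSONP. Below is a minimal sketch of stripping the wrapper. `unwrap_jsonp` is a hypothetical helper name, the sample string is made up for illustration (the `note` field is not part of the real response), and the pattern assumes the whole response is a single call.

```python
import json
import re

# Matches "name.with.dots(" ... ")" with an optional trailing semicolon.
# The greedy (.*) runs to the LAST closing parenthesis, so parentheses
# inside the JSON strings do not cut the payload short.
_JSONP = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(text):
    """Return the parsed JSON payload of a JSONP response body."""
    match = _JSONP.match(text)
    if match is None:
        raise ValueError("response does not look like a JSONP callback")
    return json.loads(match.group(1))


# Made-up sample in the same shape as the response described above.
sample = (
    'angular.callbacks._3({"grid_layout": '
    '[{"name": "Samsung Galaxy Z1 (Black)", "note": "a);b"}]});'
)

data = unwrap_jsonp(sample)
print(data["grid_layout"][0]["name"])
```

Matching up to the last parenthesis, instead of splitting on the first `);`, keeps the payload intact even when a product name or description happens to contain that character sequence.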

The JSON it returned contained the details for the grid items.

Each grid item was described as a JSON object like the one below:

{
        "product_id": 23491960,
        "complex_product_id": 7287171,
        "name": "Samsung Galaxy Z1 (Black)",
        "short_desc": "",
        "bullet_points": {
            "salient_feature": ["Screen: 10.16 cm (4\")", "Camera: 3.1 MP Rear/VGA Front", "RAM: 768 MB", "ROM: 4 GB", "Dual-core 1.2 GHz Cortex-A7", "Battery: 1500 mAh/Li-Ion"]
        },
        "url": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745",
        "seourl": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745",
        "url_type": "product",
        "promo_text": null,
        "image_url": "https://assetscdn.paytm.com/images/catalog/product/M/MO/MOBSAMSUNG-Z1-BSMAR2320696B3C745/2.jpg",
        "vertical_id": 18,
        "vertical_label": "Mobile",
        "offer_price": 5090,
        "actual_price": 5799,
        "merchant_name": "SMARTBUY",
        "authorised_merchant": false,
        "stock": true,
        "brand": "Samsung",
        "tag": "+5% Cashback",
        "product_tag": "+5% Cashback",
        "shippable": true,
        "created_at": "2015-09-17T08:28:25.000Z",
        "updated_at": "2015-12-29T05:55:29.000Z",
        "img_width": 400,
        "img_height": 400,
        "discount": "12"
    }
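Once parsed, each such object is an ordinary Python dict. The sketch below uses a trimmed copy of the sample item to pull out a few fields. `summarize` is a hypothetical helper, and treating `discount` as the rounded percentage off `actual_price` is an assumption: the sample values are consistent with it, but the response itself does not say so.

```python
# Trimmed copy of the sample item above; JSON null becomes None in Python.
item = {
    "name": "Samsung Galaxy Z1 (Black)",
    "brand": "Samsung",
    "offer_price": 5090,
    "actual_price": 5799,
    "promo_text": None,
    "discount": "12",
}


def summarize(item):
    """Pick out display fields and derive the saving from the two prices."""
    actual = item["actual_price"]
    offer = item["offer_price"]
    return {
        "brand": item.get("brand", ""),
        "name": item.get("name", ""),
        "price": offer,
        "saving": actual - offer,
        # Assumption: discount is the percentage off the actual price.
        "percent_off": round(100 * (actual - offer) / actual),
        # promo_text may be null, so fall back to an empty string.
        "promo": item.get("promo_text") or "",
    }


summary = summarize(item)
print(summary)
```

Note that `discount` arrives as a string (`"12"`) while the prices are numbers, so it needs `int(...)` before any comparison or arithmetic.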

So you can get the details without even using beautifulSoup in the following way.

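import json

import requests

# Endpoint found in the browser's network tab; it returns a JS callback, not plain JSON.
url = "https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2"
response = requests.get(url)

# Keep only what sits between the first "(" and the last ")", i.e. the JSON
# argument of angular.callbacks._3(...). Using the last ")" avoids cutting the
# payload short if a string inside the data contains ");".
text = response.text
jsonResponse = text[text.index("(") + 1:text.rindex(")")]
data = json.loads(jsonResponse)
print(data["grid_layout"])
grid_data = data["grid_layout"]

# Print a few fields of each grid item.
for grid_item in grid_data:
    print("Brand:", grid_item["brand"])
    print("Product Name:", grid_item["name"])
    print("Current Price: Rs", grid_item["offer_price"])
    print("==================")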

You would get output like:

Brand: Samsung
Product Name: Samsung Galaxy Z1 (Black)
Current Price: Rs 4990
==================
Brand: Samsung
Product Name: Samsung Galaxy A7 (Gold)
Current Price: Rs 22947
==================

Hope this helps.
