How to scrape dynamic content from a website?


Question

I'm using Scrapy to scrape data from the Amazon books section, but I found out that some of its data is dynamic. I want to know how dynamic data can be extracted from a website. Here's what I've tried so far:

import scrapy
from ..items import AmazonsItem

class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    start_urls = ['https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6']

    def parse(self, response):
        items = AmazonsItem()  # created here but not populated yet
        # Collect the title attribute of each search result.
        products_name = response.css('.s-access-title::attr("data-attribute")').getall()
        for product_name in products_name:
            print(product_name)
        # Follow the "next page" link, if there is one.
        next_page = response.css('li.a-last a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

I was using SelectorGadget to pick the class I need to scrape, but on a dynamic website it doesn't work.

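  1. So how do I scrape a website that has dynamic content?
  2. What exactly is the difference between dynamic and static content?
  3. How do I extract other information, such as price and image, from the website? And how do I get a particular class, for example the one for the price?
  4. How can I tell that data is dynamically created?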

Recommended answer

So how do I scrape a website that has dynamic content?

There are a few options:

  1. Use Selenium, which lets you drive a real browser: open the page, let it render, then pull the resulting HTML source.
  2. Sometimes you can look at the XHR requests the page makes and fetch the data directly (for example from an API).
  3. Sometimes the data sits inside the <script> tags of the HTML source. You can search through those and call json.loads() once you have trimmed the text down to valid JSON.
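
As a rough illustration of the third option, here is a minimal sketch that pulls a JSON object out of a <script> tag using only the standard library. The sample HTML, the variable name window.__DATA__, and the helper extract_embedded_json are all made up for the example; a real page will use its own names and may need more careful handling.

```python
import json
import re

# Hypothetical page source: the data is assigned to a JavaScript variable
# inside a <script> tag instead of being present in the visible markup.
SAMPLE_HTML = """
<html><body>
<div id="app"></div>
<script>window.__DATA__ = {"title": "Murder on the Orient Express", "tags": ["mystery"]};</script>
</body></html>
"""


def extract_embedded_json(html, variable_name):
    """Return the JSON value assigned to `variable_name`, or None if it is absent."""
    # Find "<variable_name> = " and remember where the JSON text starts.
    match = re.search(re.escape(variable_name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the trailing ";" and "</script>").
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


data = extract_embedded_json(SAMPLE_HTML, "window.__DATA__")
print(data["title"])
```

This only works when the embedded text is strict JSON; a JavaScript object literal with unquoted keys or single quotes would need cleaning up first.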

What exactly is the difference between dynamic and static content?

Dynamic means the data is generated by a request made after the initial page request. Static means all the data is already there in the response to the original request.

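How do I extract other information, such as price and image, from the website? And how do I get a particular class, for example the one for the price?

See the answer to your first question.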
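
Once you have the HTML that actually contains the data (by any of the options above), selecting a price or an image comes down to matching on class names. Below is a minimal standard-library sketch; the markup and the class names "price" and "cover" are invented for the example and are not Amazon's real ones. In a Scrapy spider the rough equivalent would be selectors such as response.css('.price::text').get() and response.css('img.cover::attr(src)').get().

```python
from html.parser import HTMLParser

# Hypothetical product markup; class names are assumptions for illustration.
SAMPLE_HTML = """
<div class="product">
  <img class="cover" src="https://example.com/cover.jpg">
  <span class="price">199</span>
</div>
"""


class ProductParser(HTMLParser):
    """Collect text inside class="price" elements and src of <img class="cover">."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self.images = []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "img" and "cover" in classes:
            self.images.append(attributes.get("src"))
        elif "price" in classes:
            self._in_price = True

    def handle_data(self, data):
        # Keep only non-empty text found directly inside a price element.
        if self._in_price and data.strip():
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the price element,
        # so nested markup inside the price is not supported.
        self._in_price = False


parser = ProductParser()
parser.feed(SAMPLE_HTML)
print(parser.prices, parser.images)
```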

How can I tell that data is dynamically created?

You'll know it's dynamically created if you see it in the rendered page in the dev tools, but not in the HTML source returned by your first request. You can also check whether the data comes from additional requests by looking at Network -> XHR in the dev tools.
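
The same check can be scripted: take the raw HTML of the first response (for instance response.text in a Scrapy callback) and test whether a value you can see in the browser appears in it. The sketch below uses two made-up HTML strings and a hypothetical helper, is_in_initial_html; it ignores <script> and <style> contents so that only markup text counts.

```python
from html.parser import HTMLParser

# Hypothetical responses: one with the title in the markup, one that
# only ships an empty container plus a script that fills it in later.
STATIC_HTML = "<html><body><h2>The ABC Murders</h2></body></html>"
DYNAMIC_HTML = (
    '<html><body><div id="results"></div>'
    "<script>loadResults();</script></body></html>"
)


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def is_in_initial_html(raw_html, expected_text):
    """Return True if `expected_text` appears in the markup text of `raw_html`."""
    collector = TextCollector()
    collector.feed(raw_html)
    return expected_text in " ".join(collector.chunks)


print(is_in_initial_html(STATIC_HTML, "The ABC Murders"))
print(is_in_initial_html(DYNAMIC_HTML, "The ABC Murders"))
```

If the value is missing from the markup text, it is worth searching the <script> contents and the Network -> XHR requests before reaching for a browser-based tool.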

Finally

Amazon does offer an API to access the data. Look into that as well.
