How to yield fragment URLs in scrapy using Selenium?

Problem description

Given my limited knowledge of web scraping, I've come across an issue that is very complex for me, and I'll try to explain it as best I can (so I'm open to suggestions or edits to my post).

I started using the web-crawling framework Scrapy for my scraping a long time ago, and it's still the one I use nowadays. Lately I came across this website and found that my framework (Scrapy) was not able to iterate over the pages, since this site uses fragment URLs (#) to load the data for the next pages. I then made a post about that problem (having no idea of the root cause yet): my post
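
As an aside (not part of the original question): the fragment in those URLs is just base64-encoded JSON describing the search state, which is easy to verify by decoding the fragment used in the spider's start URL below.

# Illustrative only: decode the fragment from the spider's start URL to see
# the state it carries; "config.page" is the value that drives pagination.
import base64
import json

fragment = "eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="
state = json.loads(base64.b64decode(fragment).decode("utf-8"))
# the decoded JSON contains the search filters plus "config": {"page": "0"},
# which is the value the spider increments to move through the result pages
print(state)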

After that, I realized that my framework can't do it without a JavaScript interpreter or a browser imitation, and the Selenium library was suggested. I read as much as I could about it (i.e. example1, example2, example3 and example4). I also found this StackOverflow post, which gives some leads on my issue.

So finally, my biggest questions are:

1 - Is there any way to iterate/yield over the pages of the website shown above, using Selenium along with Scrapy? So far this is the code I'm using, but it doesn't work...

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# The required imports (reconstructed from the code below)
import re
import json
import base64

import scrapy
from scrapy import Spider
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

def getBrowser():
    path_to_phantomjs = "/some_path/phantomjs-2.1.1-macosx/bin/phantomjs"
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87")
    browser = webdriver.PhantomJS(executable_path=path_to_phantomjs, desired_capabilities=dcap)

    return browser

class MySpider(Spider):
    name = "myspider"

    browser = getBrowser()

    def start_requests(self):
        the_url = "http://www.atraveo.com/es_es/islas_canarias#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="

        yield scrapy.Request(url=the_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.get_page_links()

    def get_page_links(self):
        """ This first part, goes through all available pages """

        for i in xrange(1, 3):  # 210
            new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1},
                    "config": {"page": str(i)}}
            json_data = json.dumps(new_data)
            new_url = "http://www.atraveo.com/es_es/islas_canarias#" + base64.b64encode(json_data)
            self.browser.get(new_url)
            print "\nThe new URL is -> ", new_url, "\n"
            content = self.browser.page_source
            self.get_item_links(content)

    def get_item_links(self, body=""):
        if body:
            """ This second part, goes through all available items """
            raw_links = re.findall(r'listclickable.+?>', body)
            links = []
            if raw_links:
                for raw_link in raw_links:
                    new_link = re.findall(r'data-link=\".+?\"', raw_link)[0].replace("data-link=\"", "").replace("\"", "")
                    links.append(str(new_link))

                if links:
                    ids = self.get_ids(links)
                    for link in links:
                        current_id = self.get_single_id(link)
                        print "\nThe Link -> ", link
                        # If the line below is commented out the code works; otherwise it doesn't
                        yield scrapy.Request(url=link, callback=self.parse_room, dont_filter=True)

    def get_ids(self, list1=[]):
        if list1:
            ids = []
            for elem in list1:
                raw_id = re.findall(r'/[0-9]+', elem)[0].replace("/", "")
                ids.append(raw_id)

            return ids

        else:
            return []

    def get_single_id(self, text=""):
        if text:
            raw_id = re.findall(r'/[0-9]+', text)[0].replace("/", "")
            return raw_id

        else:
            return ""

    def parse_room(self, response):
        # More scraping code...
        pass

So this is mainly my problem. I'm almost sure that what I'm doing isn't the best way, which is why I'm asking my second question. And to avoid having to deal with this kind of issue in the future, I'm asking my third question.

2 - If the answer to the first question is no, how should I tackle this issue? I'm open to other approaches.

3 - Can anyone tell me about, or show me, pages where I can learn how to handle web scraping combined with JavaScript and Ajax? Nowadays more and more websites load their content with JavaScript and Ajax scripts.

Thanks a lot!

Recommended answer

Selenium is one of the best tools for scraping dynamic data. You can use Selenium with any web browser to fetch data that is loaded by scripts; it works exactly like browser click operations. But I don't prefer it.
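
As a rough illustration of that Selenium approach (my addition, not from the original answer): the sketch below assumes a recent Selenium with headless Chrome instead of the now-discontinued PhantomJS, and reuses the fragment-building logic from the question's get_page_links.

# Minimal Selenium-only sketch: build the fragment URLs as the question does
# and read the fully rendered HTML from a headless Chrome browser.
import json
import base64

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

for page in range(1, 3):
    payload = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1},
               "config": {"page": str(page)}}
    fragment = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    driver.get("http://www.atraveo.com/es_es/islas_canarias#" + fragment)
    html = driver.page_source  # rendered HTML, including script-loaded content
    # ...parse `html` here with your HTML parser of choice...

driver.quit()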

To get dynamic data you can use the scrapy + splash combo. From Scrapy you will get all the static data, and Splash handles the rest of the dynamic content.
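
For reference, here is a minimal sketch of what that scrapy + splash wiring can look like (my addition; it assumes the scrapy-splash plugin is installed and a Splash instance is running on localhost:8050, and the spider name and selectors are purely illustrative).

# settings.py -- wiring scrapy-splash into the project (assumes a Splash
# instance is running, e.g. via: docker run -p 8050:8050 scrapinghub/splash)
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# spider -- Splash renders the JavaScript before Scrapy sees the response
import scrapy
from scrapy_splash import SplashRequest

class MySplashSpider(scrapy.Spider):
    name = "myspider_splash"

    def start_requests(self):
        # Fragment URLs built as in the question (base64-encoded JSON) can be
        # requested the same way, since Splash loads them in a real browser engine
        url = "http://www.atraveo.com/es_es/islas_canarias"
        yield SplashRequest(url, self.parse, args={"wait": 2})  # give scripts 2s to run

    def parse(self, response):
        # response.text is now the rendered HTML, so normal Scrapy selectors work
        for link in response.css("a::attr(href)").extract():
            self.logger.info("Found link: %s", link)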
