How to yield fragment URLs in scrapy using Selenium?

Problem description

Given my limited knowledge of web scraping, I've come across an issue that is very complex for me, and I'll try to explain it as best I can (so I'm open to suggestions or edits to my post).

I started using the web-crawling framework Scrapy for my scraping a long time ago, and it's still the one I use nowadays. Lately I came across this website and found that my framework (Scrapy) was not able to iterate over the pages, since this site uses fragment URLs (#) to load the data for the next pages. I then made a post about that problem (having no idea of the root cause yet): my post
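
As an aside (not part of the original question): the fragment in those URLs is just base64-encoded JSON describing the search state, which is easy to verify by decoding the fragment used in the spider's start URL below.

# Illustrative only: decode the fragment from the spider's start URL to see
# the state it carries; "config.page" is the value that drives pagination.
import base64
import json

fragment = "eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="
state = json.loads(base64.b64decode(fragment).decode("utf-8"))
# the decoded JSON contains the search filters plus "config": {"page": "0"},
# which is the value the spider increments to move through the result pages
print(state)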

After that, I realized that my framework can't do it without a JavaScript interpreter or a browser imitation, and the Selenium library was suggested. I read as much as I could about it (i.e. example1, example2, example3 and example4). I also found this StackOverflow post, which gives some leads on my issue.

So finally, my biggest questions are:

1 - Is there any way to iterate/yield over the pages of the website shown above, using Selenium along with Scrapy? So far this is the code I'm using, but it doesn't work...

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# The required imports (reconstructed from the code below)
import re
import json
import base64

import scrapy
from scrapy import Spider
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

def getBrowser():
    path_to_phantomjs = "/some_path/phantomjs-2.1.1-macosx/bin/phantomjs"
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87")
    browser = webdriver.PhantomJS(executable_path=path_to_phantomjs, desired_capabilities=dcap)

    return browser

class MySpider(Spider):
    name = "myspider"

    browser = getBrowser()

    def start_requests(self):
        the_url = "http://www.atraveo.com/es_es/islas_canarias#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="

        yield scrapy.Request(url=the_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.get_page_links()

    def get_page_links(self):
        """ This first part, goes through all available pages """

        for i in xrange(1, 3):  # 210
            new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1},
                    "config": {"page": str(i)}}
            json_data = json.dumps(new_data)
            new_url = "http://www.atraveo.com/es_es/islas_canarias#" + base64.b64encode(json_data)
            self.browser.get(new_url)
            print "\nThe new URL is -> ", new_url, "\n"
            content = self.browser.page_source
            self.get_item_links(content)

    def get_item_links(self, body=""):
        if body:
            """ This second part, goes through all available items """
            raw_links = re.findall(r'listclickable.+?>', body)
            links = []
            if raw_links:
                for raw_link in raw_links:
                    new_link = re.findall(r'data-link=\".+?\"', raw_link)[0].replace("data-link=\"", "").replace("\"", "")
                    links.append(str(new_link))

                if links:
                    ids = self.get_ids(links)
                    for link in links:
                        current_id = self.get_single_id(link)
                        print "\nThe Link -> ", link
                        # If the line below is commented out the code works; otherwise it doesn't
                        yield scrapy.Request(url=link, callback=self.parse_room, dont_filter=True)

    def get_ids(self, list1=[]):
        if list1:
            ids = []
            for elem in list1:
                raw_id = re.findall(r'/[0-9]+', elem)[0].replace("/", "")
                ids.append(raw_id)

            return ids

        else:
            return []

    def get_single_id(self, text=""):
        if text:
            raw_id = re.findall(r'/[0-9]+', text)[0].replace("/", "")
            return raw_id

        else:
            return ""

    def parse_room(self, response):
        # More scraping code...
        pass

So this is mainly my problem. I'm almost sure that what I'm doing isn't the best way, which is why I'm asking my second question. And to avoid having to deal with this kind of issue in the future, I'm asking my third question.

2 - If the answer to the first question is no, how should I tackle this issue? I'm open to other approaches.

3 - Can anyone tell me about, or show me, pages where I can learn how to handle web scraping combined with JavaScript and Ajax? Nowadays more and more websites load their content with JavaScript and Ajax scripts.

Thanks a lot!

Recommended answer

Selenium is one of the best tools for scraping dynamic data. You can use Selenium with any web browser to fetch data that is loaded by scripts; it works exactly like browser click operations. But I don't prefer it.
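
As a rough illustration of that Selenium approach (my addition, not from the original answer): the sketch below assumes a recent Selenium with headless Chrome instead of the now-discontinued PhantomJS, and reuses the fragment-building logic from the question's get_page_links.

# Minimal Selenium-only sketch: build the fragment URLs as the question does
# and read the fully rendered HTML from a headless Chrome browser.
import json
import base64

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

for page in range(1, 3):
    payload = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1},
               "config": {"page": str(page)}}
    fragment = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    driver.get("http://www.atraveo.com/es_es/islas_canarias#" + fragment)
    html = driver.page_source  # rendered HTML, including script-loaded content
    # ...parse `html` here with your HTML parser of choice...

driver.quit()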

To get dynamic data you can use the scrapy + splash combo. From Scrapy you will get all the static data, and Splash handles the rest of the dynamic content.
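
For reference, here is a minimal sketch of what that scrapy + splash wiring can look like (my addition; it assumes the scrapy-splash plugin is installed and a Splash instance is running on localhost:8050, and the spider name and selectors are purely illustrative).

# settings.py -- wiring scrapy-splash into the project (assumes a Splash
# instance is running, e.g. via: docker run -p 8050:8050 scrapinghub/splash)
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# spider -- Splash renders the JavaScript before Scrapy sees the response
import scrapy
from scrapy_splash import SplashRequest

class MySplashSpider(scrapy.Spider):
    name = "myspider_splash"

    def start_requests(self):
        # Fragment URLs built as in the question (base64-encoded JSON) can be
        # requested the same way, since Splash loads them in a real browser engine
        url = "http://www.atraveo.com/es_es/islas_canarias"
        yield SplashRequest(url, self.parse, args={"wait": 2})  # give scripts 2s to run

    def parse(self, response):
        # response.text is now the rendered HTML, so normal Scrapy selectors work
        for link in response.css("a::attr(href)").extract():
            self.logger.info("Found link: %s", link)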
