Problems while trying to crawl links inside visited links with Scrapy


Question

In order to learn Scrapy, I am trying to crawl some inner URLs from a list of start_urls. The problem is that not all elements of start_urls have inner URLs (here I would like to return NaN). So how can I return the following two-column dataframe (**):

visited_link, extracted_link
https://www.example1.com, NaN
https://www.example2.com, NaN
https://www.example3.com, https://www.extracted-link3.com

So far I have tried:

In:

# -*- coding: utf-8 -*-
import scrapy
import pandas as pd

from ..items import ToyCrawlerItem  # assumed: the item is defined in the project's items.py


class ToySpider(scrapy.Spider):
    name = "toy_example"

    allowed_domains = ["www.example.com"]

    start_urls = ['https://www.example1.com',
                  'https://www.example2.com',
                  'https://www.example3.com']

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a")

        lis_ = []

        for l in links:
            item = ToyCrawlerItem()
            item['visited_link'] = response.url
            item['extracted_link'] = l.xpath('@href').extract_first()
            yield item

        lis_.append(item)
        df = pd.DataFrame(lis_)

        print('\n\n\n\n\n', df, '\n\n\n\n\n')

        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)

However, the above code is returning:

Out:

extracted_link,visited_link
https://www.extracted-link.com,https://www.example1.com

I tried to handle the None values with:

if l == None:
    item['visited_link'] = 'NaN'
else:
    item['visited_link'] = response.url

But it is not working. Any idea of how to get (**)?

** Yes, a dataframe. I know that I can use -o, but I will be doing dataframe operations.
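
For reference, a minimal sketch of that -o alternative (not what I want here): export the yielded items with Scrapy's feed export and load the file into pandas afterwards. The file name output.csv is just an example.

# Run the spider and export all yielded items (example file name):
#   scrapy crawl toy_example -o output.csv
# Then do the dataframe operations on the exported file:
import pandas as pd

df = pd.read_csv('output.csv')
print(df.head())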

UPDATE:

After reading @rrschmidt's answer, I tried:

def parse(self, response):
    links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")

    lis_ = []

    for l in links:

        item = ToyItem()

        if len(l) == 0:
            item['visited_link'] = 'NaN'
        else:
            item['visited_link'] = response.url

        #item['visited_link'] = response.url

        item['extracted_link'] = l.xpath('@href').extract_first()

        yield item

        print('\n\n\n Aqui:\n\n', item, "\n\n\n")

    lis_.append(item)
    df = pd.DataFrame(lis_)

    print('\n\n\n\n\n', df, '\n\n\n\n\n')

    df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)

Nevertheless, it still returned the same wrong output. Could anybody help me clarify this issue?

Answer

As far as I can see there are two problems with your scraper:

1. As parse is called for every element in start_urls and you are creating and saving a new dataframe for each link, the dataframes you are generating are overwriting each other.

That's why you only ever see the results of the last processed link in crawled_table.csv.

Solution for this: create the dataframe only once and push all items into the same dataframe object.

Then save the dataframe in each parse call, in case the scraper has to stop before finishing.

2. if l == None: won't work, as response.xpath returns an empty list if no matches were found. So doing if len(l) == 0: should do it (see the sketch below).
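
A minimal sketch of why that check works, using a standalone Selector on a hypothetical HTML snippet (nothing from the actual site):

from scrapy.selector import Selector

# Hypothetical page with no matching <a> elements
sel = Selector(text="<html><body><p>no links here</p></body></html>")
links = sel.xpath("//a")

print(links == None)    # False -- an empty SelectorList, not None
print(len(links) == 0)  # True  -- so checking the length is the right test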

Here, in a gist, is how I would structure the scraper (code not tested!):

# -*- coding: utf-8 -*-
import scrapy
import pandas as pd

from ..items import ToyItem  # assumed: the item is defined in the project's items.py


class ToySpider(scrapy.Spider):
    name = "toy_example"

    allowed_domains = ["www.example.com"]

    start_urls = ['https://www.example1.com',
                  'https://www.example2.com',
                  'https://www.example3.com']

    # one dataframe shared by all parse calls
    df = pd.DataFrame()

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
        items = []

        if len(links) == 0:
            item = ToyItem()
            # no inner link found: keep the visited url and mark extracted_link as NaN
            item['visited_link'] = response.url
            item['extracted_link'] = 'NaN'
            items.append(item)
        else:
            for l in links:
                item = ToyItem()
                # build the item as you previously did
                item['visited_link'] = response.url
                item['extracted_link'] = l.xpath('@href').extract_first()
                items.append(item)

        # push this page's items into the shared dataframe and save it,
        # in case the scraper has to stop before finishing
        items_df = pd.DataFrame(items)
        self.df = pd.concat([self.df, items_df], ignore_index=True)

        print('\n\n\n\n\n', self.df, '\n\n\n\n\n')
        self.df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)

        return items
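
If it helps, here is a minimal sketch of running the spider programmatically (assuming ToySpider and ToyItem are importable from your project); inside a Scrapy project you would normally just run scrapy crawl toy_example:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(ToySpider)
process.start()  # blocks until the crawl finishes; parse() writes crawled_table.csv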
