How to assign the url that's being scraped from to an item?


Question


I'm pretty new to Python and Scrapy and this site has been an invaluable resource so far for my project, but now I'm stuck on a problem that seems like it'd be pretty simple. I'm probably thinking about it the wrong way. What I want to do is add a column to my output CSV that lists the URL that each row's data was scraped from. In other words, I want the table to look like this:

item1    item2    item_url
a        1        http://url/a
b        2        http://url/a
c        3        http://url/b
d        4        http://url/b    


I'm using psycopg2 to get a bunch of urls stored in database that I then scrape from. The code looks like this:

class MySpider(CrawlSpider):
    name = "spider"

    # querying the database here...

    #getting the urls from the database and assigning them to the rows list
    rows = cur.fetchall()

    allowed_domains = ["www.domain.com"]

    start_urls = []

    for row in rows:

        #adding the urls from rows to start_urls
        start_urls.append(row)

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select("a bunch of xpaths here...")
            items = []
            for site in sites:
                item = SettingsItem()
                # a bunch of items and their xpaths...
                # here is my non-working code
                item['url_item'] = row
                items.append(item)
            return items


As you can see, I wanted to make an item that just takes the url that the parse function is currently on. But when I run the spider, it gives me "exceptions.NameError: global name 'row' is not defined." I think that this is because Python doesn't recognize row as a variable within the XPathSelector function, or something like that? (Like I said, I'm new.) Anyway, I'm stuck, and any help would be much appreciated.
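The asker's guess is close to the mark: a name bound in a class body (including a `for`-loop variable like `row`) is not visible from inside method bodies, because name lookup in a function skips the enclosing class scope. A minimal, self-contained sketch of the same failure (the class and names here are illustrative, not taken from the question's spider):

```python
class Demo:
    urls = []
    # 'row' is bound in the class namespace by this loop...
    for row in ["http://url/a", "http://url/b"]:
        urls.append(row)

    def parse(self):
        # ...but name lookup inside a function skips the class scope,
        # so calling this raises NameError: name 'row' is not defined,
        # even though Demo.row exists as a class attribute.
        return row
```

This is exactly the `exceptions.NameError: global name 'row' is not defined` the spider produces when `parse()` finally runs.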

Answer


Put the start-request generation in start_requests() rather than in the class body:

class MySpider(CrawlSpider):

    name = "spider"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        # querying the database here...

        #getting the urls from the database and assigning them to the rows list
        rows = cur.fetchall()

        for url, ... in rows:
            yield self.make_requests_from_url(url)


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("a bunch of xpaths here...")

        for site in sites:
            item = SettingsItem()
            # a bunch of items and their xpaths...
            # response.url is the url this response was downloaded from
            item['url_item'] = response.url

            yield item
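One detail worth keeping in mind when adapting this: `cursor.fetchall()` returns a list of tuples, one per row, which is why the answer unpacks with `for url, ... in rows:`. The original `start_urls.append(row)` was appending whole tuples rather than url strings. A runnable sketch of that unpacking, using stdlib `sqlite3` in place of psycopg2 (both follow the DB-API cursor interface; the table and column names here are assumptions):

```python
import sqlite3

# In-memory stand-in for the real database; 'pages'/'url' are assumed names.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE pages (url TEXT)")
cur.executemany("INSERT INTO pages (url) VALUES (?)",
                [("http://url/a",), ("http://url/b",)])

cur.execute("SELECT url FROM pages ORDER BY rowid")
rows = cur.fetchall()          # [('http://url/a',), ('http://url/b',)]

# Unpack each 1-tuple to get the url string itself.
start_urls = [url for (url,) in rows]
print(start_urls)              # ['http://url/a', 'http://url/b']
```

Also note that on recent Scrapy versions `HtmlXPathSelector` and `make_requests_from_url` have been deprecated; the modern equivalents are `response.xpath(...)` and yielding `scrapy.Request(url)` directly from `start_requests()`.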

