Class crawler written in Python throws attribute error


Problem description

After writing some code in Python, I've got stuck in deep trouble. I'm a newbie at writing code following OOP design in Python. The xpaths I've used in my code are flawless. I get lost when it comes to running the "passing_links" method of my "info_grabber" class through an instance of the "page_crawler" class. Every time I run my code I get the error "'page_crawler' object has no attribute 'passing_links'". Perhaps the way I've written my class crawler is not how it should be. However, as I've spent a few hours on it, I suppose I might get some suggestions as to which lines I should rectify to make it work. Thanks in advance for taking a look:
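For context, the error follows from a general Python rule: a method defined on a subclass exists only on instances of that subclass, not on instances of the parent class. A stripped-down illustration (class names reused from the question, bodies hypothetical):

```python
class page_crawler(object):
    def crawler(self):
        pass

class Info_grabber(page_crawler):
    def passing_links(self):
        pass

parent = page_crawler()   # only has crawler()
child = Info_grabber()    # inherits crawler() and adds passing_links()

print(hasattr(parent, "passing_links"))  # False: calling it raises AttributeError
print(hasattr(child, "passing_links"))   # True
```

So an error message naming a 'page_crawler' object suggests the code that failed was calling passing_links on a page_crawler instance rather than on an Info_grabber instance.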

from lxml import html
import requests

class page_crawler(object):

    main_link = "https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=San%20Francisco%2C%20CA"
    base_link = "https://www.yellowpages.com"

    def __init__(self):

        self.links = [self.main_link]


    def crawler(self):
        for link in self.links:
            self.get_link(link)

    def get_link(self, link):

        print("Running page "+ link)
        page = requests.get(link)
        tree = html.fromstring(page.text)
        item_links = tree.xpath('//h2[@class="n"]/a[@class="business-name"][not(@itemprop="name")]/@href')
        for item_link in item_links:
            return self.base_link + item_link

        links = tree.xpath('//div[@class="pagination"]//li/a/@href')
        for url in links:
            if not self.base_link + url in self.links:
                self.links += [self.base_link + url]



class Info_grabber(page_crawler):

    def __init__(self, plinks):
        page_crawler.__init__(self)
        self.plinks = [plinks]

    def passing_links(self):
        for nlink in self.plinks:
            print(nlink)
            self.crawling_deep(nlink)

    def crawling_deep(self, uurl):

        page = requests.get(uurl)
        tree = html.fromstring(page.text)

        name = tree.findtext('.//div[@class="sales-info"]/h1')
        phone = tree.findtext('.//p[@class="phone"]')
        try:
            email = tree.xpath('//div[@class="business-card-footer"]/a[@class="email-business"]/@href')[0]
        except IndexError:
            email=""

        print(name, phone, email)


if __name__ == '__main__':

    crawl = Info_grabber(page_crawler)
    crawl.crawler()
    crawl.passing_links()

Now upon execution I get a new error, "raise MissingSchema(error)", when it hits the line "self.crawling_deep(nlink)".
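The likely cause: `Info_grabber(page_crawler)` passes the class object itself rather than a URL string, so `requests.get()` receives something like `"<class '__main__.page_crawler'>"`, which has no `http://` scheme. requests raises `MissingSchema` for any URL without a scheme, before attempting any network connection (a minimal reproduction, not the question's exact traceback):

```python
import requests

try:
    # Any string without an http:// or https:// prefix triggers this error;
    # no network request is ever sent.
    requests.get("www.yellowpages.com/search")
except requests.exceptions.MissingSchema as exc:
    print(type(exc).__name__)  # MissingSchema
```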

Answer

I'm not sure I understand what you're trying to do in page_crawler.get_link, but I think you should have a separate method for collecting the "pagination" links.
I renamed Info_grabber.plinks to Info_grabber.links so that page_crawler.crawler can access them, and managed to extract info from several pages; however, the code is far from ideal.

from lxml import html
import requests


class page_crawler(object):

    main_link = "https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=San%20Francisco%2C%20CA"
    base_link = "https://www.yellowpages.com"

    def __init__(self):
        self.links = []
        self.pages = []

    def crawler(self):
        for link in self.links:
            self.get_link(link)

    def get_link(self, link):
        print("Running page "+ link)
        page = requests.get(link)
        tree = html.fromstring(page.text)
        item_links = tree.xpath('//h2[@class="n"]/a[@class="business-name"][not(@itemprop="name")]/@href')
        for item_link in item_links:
            if not self.base_link + item_link in self.links:
                self.links += [self.base_link + item_link]

    def get_pages(self, link):
        page = requests.get(link)
        tree = html.fromstring(page.text)
        links = tree.xpath('//div[@class="pagination"]//li/a/@href')
        for url in links:
            if not self.base_link + url in self.pages:
                self.pages += [self.base_link + url]


class Info_grabber(page_crawler):

    def __init__(self, plinks):
        page_crawler.__init__(self)
        self.links += [plinks]

    def passing_links(self):
        for nlink in self.links:
            print(nlink)
            self.crawling_deep(nlink)

    def crawling_deep(self, uurl):
        page = requests.get(uurl)
        tree = html.fromstring(page.text)
        name = tree.findtext('.//div[@class="sales-info"]/h1')
        phone = tree.findtext('.//p[@class="phone"]')
        try:
            email = tree.xpath('//div[@class="business-card-footer"]/a[@class="email-business"]/@href')[0]
        except IndexError:
            email = ""
        print(name, phone, email)


if __name__ == '__main__':
    url = page_crawler.main_link
    crawl = Info_grabber(url)
    crawl.crawler()
    crawl.passing_links()

You'll notice that I added a pages property and a get_pages method in page_crawler; I'll leave the implementation part to you.
You might need to add more methods to page_crawler later on, as they could be of use if you develop more child classes. Finally, consider looking into composition, as it is also a strong OOP feature.
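As a sketch of what composition could look like here (names like `PageFetcher` are hypothetical, not from the answer): instead of `Info_grabber` inheriting from `page_crawler`, the grabber can hold a fetcher object and delegate to it. One benefit is testability, since a stub fetcher can stand in for real HTTP calls:

```python
import requests
from lxml import html

class PageFetcher:
    """Downloads a URL and parses it into an lxml tree."""
    def fetch(self, url):
        page = requests.get(url)
        return html.fromstring(page.text)

class InfoGrabber:
    """Holds a fetcher instead of inheriting from a crawler (composition)."""
    def __init__(self, fetcher):
        self.fetcher = fetcher

    def grab_name(self, url):
        tree = self.fetcher.fetch(url)
        return tree.findtext('.//div[@class="sales-info"]/h1')

# A stub fetcher makes the grabber testable without any network access.
class FakeFetcher:
    def fetch(self, url):
        return html.fromstring(
            '<html><body><div class="sales-info">'
            '<h1>Pizza Place</h1></div></body></html>')

print(InfoGrabber(FakeFetcher()).grab_name("any-url"))  # Pizza Place
```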
