Requiring assistance in debugging a Python web crawler


Problem description


I can't get my crawler (named searchengine.py) to run, despite my best efforts over the past couple of hours. It seems it cannot successfully index pages as it goes. I will give you the full crawler code. The kind of errors I'm receiving look like this:

Indexing http://www.4futureengineers.com/company.html
Could not parse page http://www.4futureengineers.com/company.html

I am calling searchengine.py by entering the following commands in my Python interactive session (shell).

>> import searchengine
>> crawler=searchengine.crawler('searchindex.db')
>> pages= \
.. ['http://www.4futureengineers.com/company.html']
>> crawler.crawl(pages)

It gives the error (i.e. the unsuccessful parsing message) immediately after the command crawler.crawl(pages).

Here is the complete source code of searchengine.py

import urllib2
from BeautifulSoup import *
from urlparse import urljoin
from pysqlite2 import dbapi2 as sqlite


# Create a list of words to ignore
ignorewords={'the':1,'of':1,'to':1,'and':1,'a':1,'in':1,'is':1,'it':1}


class crawler:
  # Initialize the crawler with the name of database
  def __init__(self,dbname):
    self.con=sqlite.connect(dbname)

  def __del__(self):
    self.con.close()

  def dbcommit(self):
    self.con.commit()


  # Auxilliary function for getting an entry id and adding 
  # it if it's not present
  def getentryid(self,table,field,value,createnew=True):
    cur=self.con.execute(
    "select rowid from %s where %s='%s'" % (table,field,value))
    res=cur.fetchone()
    if res==None:
      cur=self.con.execute(
      "insert into %s (%s) values ('%s')" % (table,field,value))
      return cur.lastrowid
    else:
      return res[0]


  # Index an individual page
  def addtoindex(self,url,soup):
    if self.isindexed(url): return
    print 'Indexing '+url

    # Get the individual words
    text=self.gettextonly(soup)
    words=self.separatewords(text)

    # Get the URL id
    urlid=self.getentryid('urllist','url',url)

    # Link each word to this url
    for i in range(len(words)):
      word=words[i]
      if word in ignorewords: continue
      wordid=self.getentryid('wordlist','word',word)
      self.con.execute("insert into wordlocation(urlid,wordid,location) values (%d,%d,%d)" % (urlid,wordid,i))


  # Extract the text from an HTML page (no tags)
  def gettextonly(self,soup):
    v=soup.string
    if v==Null:   
      c=soup.contents
      resulttext=''
      for t in c:
        subtext=self.gettextonly(t)
        resulttext+=subtext+'\n'
      return resulttext
    else:
      return v.strip()

  # Seperate the words by any non-whitespace character
  def separatewords(self,text):
    splitter=re.compile('\\W*')
    return [s.lower() for s in splitter.split(text) if s!='']



  def isindexed(self,url):
    u=self.con.execute \
      ("select rowid from urllist where url='%s'" % url).fetchone()
    if u!=None:
      #Check if it has actually been crawled
      v=self.con.execute(
      'select * from wordlocation where urlid=%d' % u[0]).fetchone()
      if v!=None: return True
    return False



  def crawl(self,pages,depth=2):
    for i in range(depth):
      newpages={}
      for page in pages:
        try:
          c=urllib2.urlopen(page)
        except:
          print "Could not open %s" % page
          continue

        try:
          soup=BeautifulSoup(c.read())
          self.addtoindex(page,soup)

          links=soup('a')
          for link in links:
            if ('href' in dict(link.attrs)):
              url=urljoin(page,link['href'])
              if url.find("'")!=-1: continue
              url=url.split('#')[0]  # remove location portion
              if url[0:4]=='http' and not self.isindexed(url):
                newpages[url]=1
              linkText=self.gettextonly(link)
              self.addlinkref(page,url,linkText)

          self.dbcommit()
        except:
          print "Could not parse page %s" % page


      pages=newpages



  # Create the database tables
  def createindextables(self): 
    self.con.execute('create table urllist(url)')
    self.con.execute('create table wordlist(word)')
    self.con.execute('create table wordlocation(urlid,wordid,location)')
    self.con.execute('create table link(fromid integer,toid integer)')
    self.con.execute('create table linkwords(wordid,linkid)')
    self.con.execute('create index wordidx on wordlist(word)')
    self.con.execute('create index urlidx on urllist(url)')
    self.con.execute('create index wordurlidx on wordlocation(wordid)')
    self.con.execute('create index urltoidx on link(toid)')
    self.con.execute('create index urlfromidx on link(fromid)')
    self.dbcommit()

Solution

The error handling in crawl has made debugging extremely difficult:

try:
    # too much stuff here
except: # bare except
    print "Could not parse page %s" % page # generic message

Although this is very stable (i.e. if anything goes wrong the program keeps running), it makes it impossible to figure out what is actually going wrong; all you know is that one of the thirteen lines in the try block failed somehow. Refactor this section of the code into shorter try blocks that catch specific errors (see "the evils of except").
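As a rough sketch of what that refactor might look like (keeping the book's Python 2 style and the method's original behaviour; the only exception handled explicitly is urllib2.URLError, which also covers HTTPError, and everything else is left to raise so the real traceback is visible):

  def crawl(self,pages,depth=2):
    for i in range(depth):
      newpages={}
      for page in pages:
        # Narrow try block: only the network call, catching a specific error
        try:
          c=urllib2.urlopen(page)
        except urllib2.URLError, e:
          print "Could not open %s: %s" % (page,e)
          continue

        # No blanket try/except around the rest while debugging:
        # let parsing and indexing errors propagate with a full traceback
        soup=BeautifulSoup(c.read())
        self.addtoindex(page,soup)

        for link in soup('a'):
          if 'href' not in dict(link.attrs): continue
          url=urljoin(page,link['href'])
          if url.find("'")!=-1: continue
          url=url.split('#')[0]  # remove location portion
          if url[0:4]=='http' and not self.isindexed(url):
            newpages[url]=1
          linkText=self.gettextonly(link)
          self.addlinkref(page,url,linkText)  # note: addlinkref is not defined in the posted code

        self.dbcommit()

      pages=newpages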

You could also try running without any error handling at all (comment out the try:, except:, and print ... lines, and dedent the lines currently inside the try blocks), and read the specific error tracebacks to help you along; then put appropriate error handling back in later.
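If you would rather keep the crawler moving from page to page while still seeing what actually failed, a middle ground (not part of the answer above, just the standard-library traceback module) is to leave the broad except in place but have it print the full traceback:

import traceback   # add this next to the other imports at the top of searchengine.py

        try:
          soup=BeautifulSoup(c.read())
          self.addtoindex(page,soup)
          # ... rest of the original try block unchanged ...
          self.dbcommit()
        except:
          print "Could not parse page %s" % page
          traceback.print_exc()  # shows the real exception type, message and offending line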
