Please help parse this HTML table using BeautifulSoup and lxml the pythonic way

Problem description


I have searched a lot about BeautifulSoup, and some suggested lxml as the future of BeautifulSoup. While that makes sense, I am having a tough time parsing the table below out of a whole list of tables on the webpage.

I am interested in the three columns, with a varying number of rows depending on the page and the time it was checked. Both a BeautifulSoup and an lxml solution would be much appreciated; that way I can ask the admin to install lxml on the dev machine.

Desired output:

Website                    Last Visited          Last Loaded
http://google.com          01/14/2011 
http://stackoverflow.com   01/10/2011
...... more if present

Following is a code sample from a messy web page:

                   <table border="2" width="100%">
                      <tbody><tr>
                        <td width="33%" class="BoldTD">Website</td>
                        <td width="33%" class="BoldTD">Last Visited</td>
                        <td width="34%" class="BoldTD">Last Loaded</td>
                      </tr>
                      <tr>
                        <td width="33%">
                          <a href="http://google.com"</a>
                        </td>
                        <td width="33%">01/14/2011
                                </td>
                        <td width="34%">
                                </td>
                      </tr>
                      <tr>
                        <td width="33%">
                          <a href="http://stackoverflow.com"</a>
                        </td>
                        <td width="33%">01/10/2011
                                </td>
                        <td width="34%">
                                </td>
                      </tr>
                    </tbody></table>

Solution

Here's a version that uses HTMLParser. I tried it against the contents of pastebin.com/tu7dfeRJ. It copes with the meta tag and the doctype declaration, both of which foiled the ElementTree version.

from HTMLParser import HTMLParser  # Python 2 standard-library parser

class MyParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.line = ""         # text collected for the current row
    self.in_tr = False     # True while inside a <tr> of the target table
    self.in_table = False  # True once the "Website" header has been seen

  def handle_starttag(self, tag, attrs):
    # A new row in the target table starts a fresh output line.
    if self.in_table and tag == "tr":
      self.line = ""
      self.in_tr = True
    # The Website column holds only an anchor, so grab its href.
    if tag == 'a':
      for attr in attrs:
        if attr[0] == 'href':
          self.line += attr[1] + " "

  def handle_endtag(self, tag):
    if tag == 'tr':
      self.in_tr = False
      if len(self.line):
        print self.line
    elif tag == "table":
      self.in_table = False

  def handle_data(self, data):
    # The header cell "Website" marks the table we care about.
    if data == "Website":
      self.in_table = True
    elif self.in_tr:
      data = data.strip()
      if data:
        self.line += data + " "

if __name__ == '__main__':
  myp = MyParser()
  myp.feed(open('table.html').read())

Hopefully this addresses everything you need and you can accept this as the answer. Updated as requested.
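
Since the question explicitly asks for a BeautifulSoup or lxml approach as well, here is a minimal lxml sketch of the same idea for comparison; it is not part of the accepted answer. It assumes lxml is available (the question notes it would still need to be installed), reuses the table.html filename from the code above, and picks out the target table by its "Website" header cell just as the HTMLParser version does; parse_table is only an illustrative name.

from lxml import html  # assumption: lxml is installed

def parse_table(path):
    # lxml's HTML parser is lenient, so messy markup usually still parses.
    tree = html.parse(path)
    # Select the table whose header row contains a cell reading "Website".
    for table in tree.xpath('//table[.//td[normalize-space(text())="Website"]]'):
        for row in table.xpath('.//tr')[1:]:  # skip the header row
            cells = row.xpath('.//td')
            if len(cells) < 3:
                continue
            # Website column: prefer the anchor's href if one is present.
            hrefs = cells[0].xpath('.//a/@href')
            website = hrefs[0] if hrefs else cells[0].text_content().strip()
            visited = cells[1].text_content().strip()  # Last Visited
            loaded = cells[2].text_content().strip()   # Last Loaded
            print website, visited, loaded

if __name__ == '__main__':
    parse_table('table.html')

A BeautifulSoup version would follow the same pattern (find the "Website" header cell, take its enclosing table, then walk the remaining rows), and either parser should be checked against the real page at pastebin.com/tu7dfeRJ, since the sample's anchor tags are truncated.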
