Fastest, easiest, and best way to parse an HTML table?

Question

I'm trying to get this table http://www.datamystic.com/timezone/time_zones.html into array format so I can do whatever I want with it. Preferably in PHP, Python, or JavaScript.

This is the kind of problem that comes up a lot, so rather than looking for help with this specific problem, I'm looking for ideas on how to solve all similar problems.

BeautifulSoup is the first thing that comes to mind. Another possibility is copying/pasting it in TextMate and then running regular expressions.

What do you suggest?

This is the script that I ended up writing, but as I said, I'm looking for a more general solution.

from BeautifulSoup import BeautifulSoup
import urllib2


url = 'http://www.datamystic.com/timezone/time_zones.html';
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
tables = soup.findAll("table")
table = tables[1]
rows = table.findAll("tr")
for row in rows:
    tds = row.findAll('td')
    if(len(tds)==4):
        countrycode = tds[1].string
        timezone = tds[2].string
        if(type(countrycode) is not type(None) and type(timezone) is not type(None)):
            print "\'%s\' => \'%s\'," % (countrycode.strip(), timezone.strip())

Comments and suggestions for improving my Python code are welcome, too ;)

Recommended answer

For your general problem: try lxml.html from the lxml package (think of it as the stdlib's xml.etree on steroids: the same XML API, but with HTML support, XPath, XSLT, etc.).

A quick example for your specific case:

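from lxml import html

# Parse the page directly from its URL.
tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')
# The time zone data sits in the second <table> on the page.
table = tree.xpath('//table')[1]
# One inner list per <tr>, holding the stripped text of each <td>.
data = [
    [td.text_content().strip() for td in row.findall('td')]
    for row in table.findall('tr')
]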

This will give you a nested list: each sub-list corresponds to a row in the table and contains the data from the cells. The sneakily inserted advertisement rows are not filtered out yet, but it should get you on your way. (and by the way: lxml is fast!)
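One way to drop the advertisement rows from that nested list is to keep only rows with the expected number of cells, which is the same check the script in the question uses. The sketch below is only an illustration: the function name rows_to_mapping, the column positions, and the sample rows are assumptions, not taken from the page. Values are collected in lists because one country code can have several time zones.

```python
def rows_to_mapping(rows, expected_cells=4, key_index=1, value_index=2):
    """Keep only well-formed data rows and group one column by another."""
    mapping = {}
    for row in rows:
        # Assumption: ad rows and header rows have a different cell count.
        if len(row) != expected_cells:
            continue
        key, value = row[key_index], row[value_index]
        # Skip rows where either cell is empty.
        if key and value:
            mapping.setdefault(key, []).append(value)
    return mapping


# Made-up sample rows in the shape the list comprehension produces.
sample = [
    ["Advertisement"],
    ["1", "AD", "Europe/Andorra", "UTC+01"],
    [],
    ["2", "US", "America/New_York", "UTC-05"],
    ["3", "US", "America/Chicago", "UTC-06"],
    ["4", "", "Asia/Kabul", "UTC+04:30"],
]
print(rows_to_mapping(sample))
```

If the real table uses a different layout, only expected_cells, key_index, and value_index should need to change.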

BUT: more specifically for your particular use case: there are better ways to get at time zone database information than scraping that particular web page (aside: note that the web page actually mentions that you are not allowed to copy its contents). There are even existing libraries that already use this information; see, for example, python-dateutil.
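As a rough illustration of reading time zone names without scraping anything, here is a minimal sketch. Note that it swaps in the standard library's zoneinfo module (Python 3.9 and later) rather than python-dateutil, and that the helper name zones_by_region is made up for this example. What it lists depends on the time zone data installed on the machine.

```python
from zoneinfo import available_timezones  # standard library, Python 3.9+


def zones_by_region(names):
    """Group zone names such as 'Europe/Paris' by their leading region."""
    grouped = {}
    for name in sorted(names):
        region, _, rest = name.partition("/")
        # Names without a region part (for example 'UTC') are skipped.
        if rest:
            grouped.setdefault(region, []).append(name)
    return grouped


if __name__ == "__main__":
    # Contents depend on the time zone data installed on this machine.
    grouped = zones_by_region(available_timezones())
    for region in sorted(grouped):
        print(region, len(grouped[region]))
```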
