Parsing with lxml xpath


Problem Description

I was trying to implement lxml/XPath code to parse HTML from this link: https://www.theice.com/productguide/ProductSpec.shtml?specId=251 Specifically, I was trying to parse the <tr class="last"> table near the end of the page.

I wanted to obtain the text in that sub-table, for example "New York" and the hours listed next to it (and do the same for London and Singapore).

I have the following code (which doesn't work properly):

doc = lxml.html.fromstring(page)
tds = doc.xpath('//table[@class="last"]//table[@id"tradingHours"]/tbody/tr/td/text()')

With BeautifulSoup:

table = soup.find('table', attrs={'id':'tradingHours'})
for td in table.findChildren('td'):
    print td.text

What is the best method to achieve this? I want to use lxml rather than BeautifulSoup (just to see the difference).

Recommended Answer

Your lxml code is very close to working. The main problem is that the table tag is not the one with the class="last" attribute. Rather, it is a tr tag that has that attribute:

    </tr><tr class="last"><td>TRADING HOURS</td>&#13;

Thus

//table[@class="last"]

has no matches. There is also a minor syntax error: @id"tradingHours" should be @id="tradingHours".

You can also omit //table[@class="last"] entirely, since table[@id="tradingHours"] is specific enough.
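Putting both fixes together, here is a minimal sketch on a hypothetical HTML fragment that mirrors the page's structure (the class="last" sits on a tr, while id="tradingHours" is on the nested table):

```python
import lxml.html

# Hypothetical fragment mirroring the page's structure.
html = ('<table><tr class="last"><td>'
        '<table id="tradingHours">'
        '<tr><td>NEW YORK</td><td>20:00-14:15</td></tr>'
        '</table>'
        '</td></tr></table>')

doc = lxml.html.fromstring(html)
# Corrected expression: @id="tradingHours" with the equals sign,
# and no //table[@class="last"] step.
cells = doc.xpath('//table[@id="tradingHours"]//td/text()')
print(cells)
# ['NEW YORK', '20:00-14:15']
```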

The closest analog to your BeautifulSoup code would be:

import urllib2
import lxml.html as LH

url = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=251'
doc = LH.parse(urllib2.urlopen(url))
for td in doc.xpath('//table[@id="tradingHours"]//td/text()'):
    print(td.strip())


The grouper recipe, zip(*[iterable]*n), is often very useful when parsing tables. It collects the items in iterable into groups of n items. We could use it here like this:

texts = iter(doc.xpath('//table[@id="tradingHours"]//td/text()'))
for group in zip(*[texts]*5):
    row = [item.strip() for item in group]
    print('\n'.join(row))
    print('-'*80)

I'm not terribly good at explaining how the grouper recipe works, but I've made an attempt here.
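As a minimal illustration of the recipe on made-up cell data: all n references in zip(*[iter(seq)]*n) share one iterator, so each output tuple pulls the next n consecutive items.

```python
# Flat list of cells, as xpath's text() would return them (sample data).
cells = ['NEW YORK', '20:00', '14:15', 'LONDON', '01:00', '19:15']

# zip sees three references to the SAME iterator, so each tuple
# consumes three consecutive items.
rows = list(zip(*[iter(cells)] * 3))
print(rows)
# [('NEW YORK', '20:00', '14:15'), ('LONDON', '01:00', '19:15')]
```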

This page uses JavaScript to reformat the dates. To scrape the page after the JavaScript has altered its contents, you could use selenium:

import lxml.html as LH
import contextlib
import selenium.webdriver as webdriver

url = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=251'
with contextlib.closing(webdriver.PhantomJS('phantomjs')) as driver:
    driver.get(url)
    content = driver.page_source
    doc = LH.fromstring(content)
    texts = iter(doc.xpath('//table[@id="tradingHours"]//td/text()'))
    for group in zip(*[texts]*5):
        row = [item.strip() for item in group]
        print('\n'.join(row))
        print('-'*80)

which yields

NEW YORK
8:00 PM-2:15 PM *
20:00-14:15
7:30 PM
19:30
--------------------------------------------------------------------------------
LONDON
1:00 AM-7:15 PM
01:00-19:15
12:30 AM
00:30
--------------------------------------------------------------------------------
SINGAPORE
8:00 AM-2:15 AM *
08:00-02:15
7:30 AM
07:30
--------------------------------------------------------------------------------

Note that in this particular case, if you did not want to use selenium, you could use pytz to parse and convert the times yourself:

import dateutil.parser as parser
import pytz

text = 'Tue Jul 30 20:00:00 EDT 2013'
date = parser.parse(text)
date = date.replace(tzinfo=None)
print(date.strftime('%I:%M %p'))
# 08:00 PM

ny = pytz.timezone('America/New_York')
london = pytz.timezone('Europe/London')
london_date = ny.localize(date).astimezone(london)
print(london_date.strftime('%I:%M %p'))
# 01:00 AM
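The same conversion extends to the third city in the output above; a sketch assuming the standard 'Asia/Singapore' zone name:

```python
import dateutil.parser as parser
import pytz

# Same raw timestamp as above; dateutil ignores the 'EDT' abbreviation,
# so we localize explicitly to the New York zone.
text = 'Tue Jul 30 20:00:00 EDT 2013'
date = parser.parse(text).replace(tzinfo=None)

ny = pytz.timezone('America/New_York')
singapore = pytz.timezone('Asia/Singapore')
sg_date = ny.localize(date).astimezone(singapore)
print(sg_date.strftime('%I:%M %p'))
# 08:00 AM
```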

