Python parsing HTML Using Regular Expressions


Question


I am trying to go through the HTML of a website and parse it looking for the max enrollment of a class. I tried checking for a substring in each line of the HTML file, but that would try to parse the wrong lines. So I am now using Regular Expressions. I have \t\t\t\t\t\t\t<td class="odd">([0-9])|([0-9][0-9])|([0-9][0-9][0-9])<\/td>\r\n as my regular expression right now, but this regular expression matches the max enrollment as well as the section number. Is there another way to go about what I am trying to extract from the webpage? The HTML code snippet is below:

<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>

<tr>
    <td class="tableHeader">Credits</td>
    <td class="even" align="left">  4.00</td>
</tr>

<tr>
<td class="tableHeader">Title</td>
<td class="odd">Linear Algebra</td>
</tr>

<tr>
    <td class="tableHeader">Campus</td>
    <td class="even" align="left">University City</td>
</tr>

<tr>
    <td class="tableHeader">Instructor(s)</td>
    <td class="odd">Guang  Yang</td>
</tr>
<tr>
    <td class="tableHeader">Instruction Type</td>
    <td class="even">Lecture</td>
</tr>

<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>

Recommended answer


Use the right tool for the right job.


Let's make an analogy to explain why regular expressions are the wrong tool here: it's like asking a 5-year-old to understand Hamlet when he does not yet have the vocabulary and grammar to read Shakespeare, which he will only acquire once he is able to process more abstract concepts.
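More concretely, the pattern in the question over-matches for two separate reasons, which the following sketch tries to show. It assumes a shortened copy of the question's HTML (the `html` string) and drops the leading tabs and trailing `\r\n` from the pattern for readability; the variable names are illustrative only.

```python
import re

# Shortened copy of the HTML from the question (assumed layout).
html = """
<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>
<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>
"""

# Pattern shaped like the one in the question. "|" has the lowest
# precedence, so this is really three unrelated alternatives:
#   <td class="odd">[0-9]   OR   [0-9][0-9]   OR   [0-9][0-9][0-9]</td>
loose = re.compile(r'<td class="odd">([0-9])|([0-9][0-9])|([0-9][0-9][0-9])</td>')
loose_hits = [m.group(0) for m in loose.finditer(html)]

# Putting the digit alternatives in one group fixes the precedence problem,
# but both the Section cell and the Max Enroll cell still match, because
# the pattern knows nothing about which row a cell belongs to.
grouped = re.compile(r'<td class="odd">([0-9]{1,3})</td>')
grouped_hits = grouped.findall(html)

print(loose_hits)    # ['<td class="odd">0', '01', '<td class="odd">3']
print(grouped_hits)  # ['001', '30']
```

Even the corrected pattern cannot tell the two cells apart without also encoding the neighbouring header cell, which is exactly the kind of structural relationship an HTML parser exposes directly.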


Use either lxml or BeautifulSoup to parse the HTML instead.


As an example: to get a list of all the evens and all the odds:

>>> from lxml import etree
>>> tree = etree.HTML(your_html_text)
>>> odds = tree.xpath('//td[@class="odd"]/text()')
>>> evens = tree.xpath('//td[@class="even"]/text()')
>>> odds
['001', 'Linear Algebra', 'Guang  Yang', '30']
>>> evens
['  4.00', 'University City', 'Lecture']

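I am just trying to extract the contents in a way that does not return both the section number and the max enrollment number. I only need help getting the max enrollment number.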


OK, now I understand what you want, so here's a solution using lxml:

>>> for elt in tree.xpath('//tr'):
...     if elt.xpath('td[@class="tableHeader"]')[0].text == "Max Enroll":
...         elt.xpath('td[@class="odd"]|td[@class="even"]')[0].text
... 
'30'

That gives you only the max enrollment number.


Using BeautifulSoup it's a bit easier:

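>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup(your_html_text, 'html.parser')
>>> # find every header cell, then read the cell that follows "Max Enroll"
>>> for t in bs.find_all('td', attrs={'class': 'tableHeader'}):
...     if t.text == "Max Enroll":
...         print(t.find_next('td').text)
...
30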
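If neither library can be installed, the same "find the header cell, then read the next cell" idea can be sketched with the standard library's `html.parser`. This is only a minimal sketch under the assumption that every row looks like the snippet in the question (one `tableHeader` cell followed by one value cell); `FieldTableParser`, its attributes, and `sample_html` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class FieldTableParser(HTMLParser):
    """Collect {header text: value text} pairs from header/value table rows."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_cell = False     # True while inside a <td>
        self._cell_class = None   # class attribute of the current <td>
        self._header = None       # header text waiting for its value cell
        self._buf = []            # text chunks of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._header = None   # a new row starts with no pending header
        elif tag == "td":
            self._in_cell = True
            self._cell_class = dict(attrs).get("class")
            self._buf = []

    def handle_data(self, data):
        if self._in_cell:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != "td" or not self._in_cell:
            return
        text = "".join(self._buf).strip()
        if self._cell_class == "tableHeader":
            self._header = text
        elif self._header is not None:
            # first non-header cell after a header is treated as its value
            self.fields[self._header] = text
            self._header = None
        self._in_cell = False


# Shortened copy of the HTML from the question (assumed layout).
sample_html = """
<tr><td class="tableHeader">Section</td><td class="odd">001</td></tr>
<tr><td class="tableHeader">Credits</td><td class="even" align="left">  4.00</td></tr>
<tr><td class="tableHeader">Max Enroll</td><td class="odd">30</td></tr>
"""

parser = FieldTableParser()
parser.feed(sample_html)
print(parser.fields["Max Enroll"])  # 30
```

Keying the result by header text means the Section and Max Enroll values can no longer be confused, even though both cells share the `odd` class.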
