Parsing HTML in Python using regular expressions
Question
I am trying to go through the HTML of a website and parse it to find the max enrollment of a class. I first tried checking for a substring in each line of the HTML file, but that picked up the wrong lines, so I am now using regular expressions. My current pattern is
\t\t\t\t\t\t\t<td class="odd">([0-9])|([0-9][0-9])|([0-9][0-9][0-9])<\/td>\r\n
but it matches the section number as well as the max enrollment. Is there another way to extract this value from the page? The HTML snippet is below:
<tr>
<td class="tableHeader">Section</td>
<td class="odd">001</td>
</tr>
<tr>
<td class="tableHeader">Credits</td>
<td class="even" align="left"> 4.00</td>
</tr>
<tr>
<td class="tableHeader">Title</td>
<td class="odd">Linear Algebra</td>
</tr>
<tr>
<td class="tableHeader">Campus</td>
<td class="even" align="left">University City</td>
</tr>
<tr>
<td class="tableHeader">Instructor(s)</td>
<td class="odd">Guang Yang</td>
</tr>
<tr>
<td class="tableHeader">Instruction Type</td>
<td class="even">Lecture</td>
</tr>
<tr>
<td class="tableHeader">Max Enroll</td>
<td class="odd">30</td>
</tr>
Recommended answer
Use the right tool for the right job.
Let's use an analogy to explain why regular expressions are the wrong tool here: it's like asking a five-year-old to understand Hamlet when he does not yet have the vocabulary and grammar needed to read Shakespeare, which he will only acquire once he can process more abstract concepts.
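To make the problem concrete, here is a minimal sketch of how the pattern from the question behaves. It assumes two of the rows from the question collapsed onto single lines, and it drops the leading tabs and trailing \r\n from the pattern; the variable names are illustrative only.

```python
import re

# Two rows from the question, collapsed onto single lines for brevity.
html = (
    '<tr><td class="tableHeader">Section</td><td class="odd">001</td></tr>\n'
    '<tr><td class="tableHeader">Max Enroll</td><td class="odd">30</td></tr>\n'
)

# The pattern from the question (minus the tabs and \r\n). The | operator has
# the lowest precedence, so this is really three unrelated alternatives:
#   '<td class="odd">' + one digit   OR   two digits   OR   three digits + '</td>'
question_pattern = re.compile(
    r'<td class="odd">([0-9])|([0-9][0-9])|([0-9][0-9][0-9])<\/td>'
)
loose_matches = [m.group(0) for m in question_pattern.finditer(html)]

# Grouping the digit alternatives fixes the precedence problem, but the
# pattern still cannot tell which header cell the value belongs to.
grouped_pattern = re.compile(r'<td class="odd">([0-9]{1,3})</td>')
grouped_matches = grouped_pattern.findall(html)

print(loose_matches)
print(grouped_matches)
```

Even the grouped pattern returns both the section number and the max enrollment, because what distinguishes the two values is the neighbouring header cell, which is structural information that a parser exposes directly.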
Use either lxml or BeautifulSoup to do that.
As an example, to get a list of all the cells with class "odd" and all the cells with class "even":
>>> from lxml import etree
>>> tree = etree.HTML(your_html_text)
>>> odds = tree.xpath('//td[@class="odd"]/text()')
>>> evens = tree.xpath('//td[@class="even"]/text()')
>>> odds
['001', 'Linear Algebra', 'Guang Yang', '30']
>>> evens
[' 4.00', 'University City', 'Lecture']
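I was just trying to extract the content in a way that does not return both the section number and the max enrollment. I only need help getting the max enrollment.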
OK, now I understand what you want, so here's a solution using lxml:
>>> for elt in tree.xpath('//tr'):
...     if elt.xpath('td[@class="tableHeader"]')[0].text == "Max Enroll":
...         elt.xpath('td[@class="odd"]|td[@class="even"]')[0].text
...
'30'
This gives you only the max enrollment.
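The same lookup can also be written as a single XPath query that selects the header cell by its text and then steps to the cell beside it. This is only a sketch under a few assumptions: the sample rows are copied from the question and wrapped in a table element, and field_value is a hypothetical helper name, not part of lxml.

```python
from lxml import etree

# Sample rows copied from the question; in practice this would be the
# downloaded page source.
your_html_text = """
<table>
<tr>
<td class="tableHeader">Section</td>
<td class="odd">001</td>
</tr>
<tr>
<td class="tableHeader">Max Enroll</td>
<td class="odd">30</td>
</tr>
</table>
"""


def field_value(tree, label):
    """Hypothetical helper: text of the cell right after the header `label`, or None."""
    cells = tree.xpath(
        '//td[@class="tableHeader"][normalize-space()=$label]'
        '/following-sibling::td[1]/text()',
        label=label,  # lxml passes keyword arguments in as XPath variables
    )
    return cells[0].strip() if cells else None


tree = etree.HTML(your_html_text)
max_enroll = field_value(tree, "Max Enroll")
print(max_enroll)
```

Passing the label as an XPath variable avoids building the query with string formatting, and returning None for a missing header avoids the IndexError that indexing with [0] would raise on an unexpected page.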
Using BeautifulSoup it's a bit easier:
>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup(your_html_text, 'html.parser')
>>> for t in bs.find_all('td', attrs={'class': 'tableHeader'}):
...     if t.text == "Max Enroll":
...         print(t.find_next('td').text)
...
30
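If more than one field is needed, the header cells can be paired with their values in one pass. The following is a rough sketch rather than a definitive implementation: it assumes the bs4 package, sample rows copied from the question and wrapped in a table element, and the names course and max_enroll are illustrative.

```python
from bs4 import BeautifulSoup

# Sample rows copied from the question; in practice this would be the
# downloaded page source.
your_html_text = """
<table>
<tr>
<td class="tableHeader">Section</td>
<td class="odd">001</td>
</tr>
<tr>
<td class="tableHeader">Credits</td>
<td class="even" align="left"> 4.00</td>
</tr>
<tr>
<td class="tableHeader">Max Enroll</td>
<td class="odd">30</td>
</tr>
</table>
"""

soup = BeautifulSoup(your_html_text, "html.parser")

# Map each header label to the text of the cell that follows it in the same row.
course = {}
for header in soup.find_all("td", class_="tableHeader"):
    value_cell = header.find_next_sibling("td")
    if value_cell is not None:
        course[header.get_text(strip=True)] = value_cell.get_text(strip=True)

# The values are strings; convert explicitly where a number is needed.
max_enroll = int(course["Max Enroll"])
print(course)
print(max_enroll)
```

Looking values up by header label keeps the section number and the max enrollment apart even though both cells share the class "odd".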