How to scrape HTML tags spread over multiple lines in Python?


Problem description

I am trying to scrape a webpage in Python. I can easily get results for tags that sit on a single line, but for tags spread over multiple lines my code retrieves nothing.

In the HTML source, single-line tags appear as:

<td><span class="facultyName">John Matthew Falletta, MD</span>

and multi-line tags appear as:

<td><span class="label">Division:</span>
            &nbsp;&nbsp;
                  </td><td>Hematology/Oncology</td>

This is what I wrote:

patFinderFullname = re.compile('<span class="facultyName">(.*)</span>')

fullname = re.findall(patFinderFullname,webpage)         #works fine

patFinderDivision = re.compile('<span class="label">Division:</span>&nbsp;&nbsp;</td><td>(.*)</td>')

division = re.findall(patFinderDivision,webpage)       #doesn't work

Here my webpage variable contains the URL that has to be scraped. Can someone point out what I am missing, or where I went wrong?

Recommended answer

I highly recommend you use BeautifulSoup. It is a Python library for parsing HTML documents, and because it works on the parsed tree rather than the raw text, line breaks and indentation between tags do not affect the lookup.
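As a rough illustration of that suggestion, here is a minimal sketch that runs BeautifulSoup over the two snippets from the question. It assumes the `beautifulsoup4` package is installed and that the snippets sit inside an ordinary table row; the `html` string and the surrounding `<table>`/`<tr>` wrapper are made up for the example, not taken from the real page.

```python
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

# Hypothetical page fragment built from the snippets in the question.
html = """
<table><tr>
<td><span class="facultyName">John Matthew Falletta, MD</span></td>
</tr><tr>
<td><span class="label">Division:</span>
            &nbsp;&nbsp;
                  </td><td>Hematology/Oncology</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Single-line case: look the span up by its class.
fullname = soup.find("span", class_="facultyName").get_text(strip=True)

# Multi-line case: find the "Division:" label, then take the next <td>.
label = soup.find("span", class_="label", string="Division:")
division = label.find_parent("td").find_next_sibling("td").get_text(strip=True)

print(fullname)
print(division)
```

The lookup walks from the label cell to its sibling cell, so the whitespace and `&nbsp;` entities between the tags never need to be matched explicitly.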

P.S.: If you want to stick with your own code, use \s* in the regex to skip whitespace, including the line breaks and indentation between the tags.

patFinderDivision = re.compile(r'<span class="label">Division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*)</td>')
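To check the whitespace fix, the following sketch runs both patterns against the multi-line snippet from the question. The `webpage` string here is a hard-coded stand-in for the downloaded HTML (an assumption for the example), and the capture group is written as the non-greedy `(.*?)`, a small deviation from the pattern above so the match stops at the first closing `</td>`.

```python
import re

# Stand-in for the downloaded HTML: the multi-line snippet from the question.
webpage = (
    '<td><span class="label">Division:</span>\n'
    '            &nbsp;&nbsp;\n'
    '                  </td><td>Hematology/Oncology</td>'
)

# Original pattern: expects the tags and entities to be directly adjacent.
pat_original = re.compile(
    r'<span class="label">Division:</span>&nbsp;&nbsp;</td><td>(.*?)</td>'
)

# Fixed pattern: \s* absorbs newlines and indentation between the pieces.
pat_fixed = re.compile(
    r'<span class="label">Division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*?)</td>'
)

no_match = pat_original.findall(webpage)
division = pat_fixed.findall(webpage)

print(no_match)   # []
print(division)   # ['Hematology/Oncology']
```

The raw-string prefix (`r'...'`) keeps `\s` from being interpreted as a string escape, which newer Python versions warn about.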
