How to scrape HTML tags spread over multiple lines in Python?

Views: 48 Posted: 2021/9/24 18:55:39 Tags: python scripting web-scraping


Question

I am trying to scrape a webpage in Python. I can easily get results for tags that sit on a single line, but for tags spread over multiple lines my code retrieves nothing.

In the HTML source, single-line tags appear as:

<td><span class="facultyName">John Matthew Falletta, MD</span>

and multi-line tags appear as:

<td><span class="label">Division:</span>
            &nbsp;&nbsp;
                  </td><td>Hematology/Oncology</td>

This is what I wrote:

import re

patFinderFullname = re.compile('<span class="facultyName">(.*)</span>')
fullname = re.findall(patFinderFullname, webpage)  # works fine

patFinderDivision = re.compile('<span class="label">Division:</span>&nbsp;&nbsp;</td><td>(.*)</td>')
division = re.findall(patFinderDivision, webpage)  # doesn't work

Here my webpage variable contains the url which has to be scraped. Can someone point out what I am missing, or where I am going wrong?

Recommended answer

I highly recommend you use BeautifulSoup. It is a Python library for parsing HTML documents.
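As a rough illustration of the BeautifulSoup route, here is a minimal sketch. It assumes the `bs4` package is installed, and the sample markup is pieced together from the two snippets in the question; the enclosing `<table>` and `<tr>` elements and the variable name `html` are assumptions added so the fragment is well-formed. Because the parser works on the document tree rather than on raw text, the line breaks and `&nbsp;` padding between the cells stop mattering.

```python
from bs4 import BeautifulSoup

# Sample markup built from the question's snippets. The <table>/<tr> wrappers
# are assumed here; the real page layout may differ.
html = """
<table>
<tr><td><span class="facultyName">John Matthew Falletta, MD</span></td></tr>
<tr><td><span class="label">Division:</span>
            &nbsp;&nbsp;
                  </td><td>Hematology/Oncology</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Single-line case: read the text of the span directly.
fullname = soup.find("span", class_="facultyName").get_text(strip=True)

# Multi-line case: locate the label span, climb to its <td>,
# then read the next <td> sibling, which holds the value.
label = soup.find("span", class_="label", string="Division:")
division = label.find_parent("td").find_next_sibling("td").get_text(strip=True)

print(fullname)
print(division)
```

If the real page has several label/value rows, the same parent-then-sibling walk could be repeated for each label found with `find_all`.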

P.S.: If you want to stick with your own code, use \s* in the regex to skip the whitespace (including the line breaks) between the tags.

patFinderDivision = re.compile(r'<span class="label">Division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*)</td>')
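To see why the whitespace matters, the sketch below runs the original pattern and a whitespace-tolerant pattern against the multi-line snippet from the question. The names `strict` and `tolerant` are illustrative, and `webpage` here is assumed to hold the page's HTML text rather than a URL. The tolerant pattern also uses a raw string and a non-greedy `(.*?)`, a small precaution in case several cells share one line.

```python
import re

# The multi-line fragment from the question, with its line breaks preserved.
webpage = (
    '<td><span class="label">Division:</span>\n'
    '            &nbsp;&nbsp;\n'
    '                  </td><td>Hematology/Oncology</td>'
)

# Original pattern: expects the tags to be directly adjacent.
strict = re.compile(
    r'<span class="label">Division:</span>&nbsp;&nbsp;</td><td>(.*)</td>'
)

# Tolerant pattern: \s* absorbs spaces and newlines around the &nbsp; padding.
tolerant = re.compile(
    r'<span class="label">Division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*?)</td>'
)

strict_matches = strict.findall(webpage)
tolerant_matches = tolerant.findall(webpage)

print(strict_matches)    # []
print(tolerant_matches)  # ['Hematology/Oncology']
```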
