Error in regex formulation for web scraping in Python
Question
I am trying to scrape some information from a website. I need 8 fields of information. I am getting 5 of them, but 3 fields always come back empty. I think there is a mistake in how I have written my regular expressions. I am doing this in Python and I don't have to use BS. Here are the HTML fields I need to scrape. This is an example from one of the web pages.
<td><span class="facultyName">John Matthew Falletta, MD</span>
<span class="primaryTitle">Professor of Pediatrics</span>
<span class="secondaryTitle">Professor in the School of Nursing</span>
<td><span class="label">Department:</span>
</td><td>Pediatrics</td>
<td><span class="label">Division:</span>
</td><td>Hematology/Oncology</td>
<td><span class="label">Address:</span></td><td>Box 2991<br>DUMC<br>Durham, NC 27710 </td>
<td><span class="label">Phone:</span></td><td>
(919)
668-5111<br>
<td><span class="label">FAX:</span></td><td>
(919)
688-5125</td>
Here is my code, with one regular expression for each type of tag:
patFinderFullname = re.compile('<span class="facultyName">(.*)</span>')
patFinderPTitle = re.compile('<span class="primaryTitle">(.*)</span>')
patFinderSTitle = re.compile('<span class="secondaryTitle">(.*)</span>')
patFinderDepartment = re.compile('<span class="label">Department:</span>\s+ \s+</td><td>(.*)</td>')
patFinderDivision = re.compile('<span class="label">Division:</span>\s+ \s+</td><td>(.*)')
patFinderAddress = re.compile(' <span class="label">Address:</span>\s+(.*)\s+</td>')
patFinderPhone = re.compile('<span class="label">Phone:</span></td><td>\s*(.*?)\s*<br>')
patFinderFax = re.compile('<td><span class="label">FAX:</span>\s+</td><td>\s+(.*)</td>')
The first five fields come back correct, but the last three fields (Address, Phone and Fax) always come back empty. Can anyone point out what I am missing, or what is wrong with the regular expressions for the last three fields? I posted an earlier question, but these problems came up after it, so I am asking them separately.
Recommended answer
import re
patFinderAddress = re.compile(r'<td><span class="label">Address:</span></td>.*?</td>')
patFinderPhone = re.compile(r'<td><span class="label">Phone:</span>\s*</td><td>\s*^\s*.*\s*^\s*.*<br>', re.M)
patFinderFax = re.compile(r'<td><span class="label">FAX:</span>\s*</td><td>\s*^\s*.*\s*^\s*.*</td>', re.M)
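Here are some regexes that work with your data. Your Phone and Fax patterns weren't matching because the data spans multiple lines and `.` does not match a newline by default. Your Address pattern didn't match because it was simply wrong for this markup: it starts with a literal space that is not in the HTML, and it requires whitespace (`\s+`) right after `</span>`, where the page has `</td><td>` with nothing in between.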
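The patterns above match the whole cell, tags included, and have no capture groups. If only the field text is wanted, one possible variation is sketched below. It assumes the markup looks exactly like the sample in the question; `SAMPLE_HTML`, `field_pattern` and `extract` are hypothetical names made up for this illustration. It uses `re.S` so that `.` also matches newlines, and a non-greedy group that stops at the first terminator.

```python
import re

# Sample taken from the question; a real page may differ.
SAMPLE_HTML = '''<td><span class="label">Address:</span></td><td>Box 2991<br>DUMC<br>Durham, NC 27710 </td>
<td><span class="label">Phone:</span></td><td>
(919)
668-5111<br>
<td><span class="label">FAX:</span></td><td>
(919)
688-5125</td>'''


def field_pattern(label, terminator):
    """Build a pattern capturing the cell text that follows a label span."""
    return re.compile(
        r'<span class="label">' + re.escape(label) + r'</span>\s*</td>\s*<td>(.*?)'
        + terminator,
        re.S,  # let "." match newlines so multi-line cells are captured
    )


pat_address = field_pattern("Address:", r"</td>")
pat_phone = field_pattern("Phone:", r"<br>")
pat_fax = field_pattern("FAX:", r"</td>")


def extract(pattern, text):
    """Return the captured field with <br> and whitespace tidied, or ''."""
    match = pattern.search(text)
    if match is None:
        return ""
    value = match.group(1).replace("<br>", ", ")
    return " ".join(value.split())  # collapse newlines and repeated spaces


print(extract(pat_address, SAMPLE_HTML))
print(extract(pat_phone, SAMPLE_HTML))
print(extract(pat_fax, SAMPLE_HTML))
```

The `\s*` between `</span>`, `</td>` and `<td>` tolerates both the "same line" and the "next line" layouts seen in the sample, which is what the original `\s+` did not.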
But for HTML parsing, use an HTML parser: it is far more robust, and it gives you the data you want rather than this eyesore of HTML strings.
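As a rough illustration of the parser route without BeautifulSoup, here is a minimal sketch built on the standard library's `html.parser.HTMLParser`. It assumes every field is a `<span class="label">` followed by a value in the next `<td>`, as in the question's sample; `LabelValueParser` and `SAMPLE_HTML` are hypothetical names for this example, not part of any library.

```python
from html.parser import HTMLParser

# Sample taken from the question; note the unclosed Phone cell.
SAMPLE_HTML = '''<td><span class="label">Department:</span>
</td><td>Pediatrics</td>
<td><span class="label">Address:</span></td><td>Box 2991<br>DUMC<br>Durham, NC 27710 </td>
<td><span class="label">Phone:</span></td><td>
(919)
668-5111<br>
<td><span class="label">FAX:</span></td><td>
(919)
688-5125</td>'''


class LabelValueParser(HTMLParser):
    """Collect the text of the <td> that follows each label span."""

    def __init__(self):
        super().__init__()
        self._segments = {}       # label -> list of text runs in the value cell
        self._label_buf = []      # text seen inside the current label span
        self._in_label = False
        self._current = None      # label waiting for (or collecting) its value
        self._collecting = False  # True while inside the value cell

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "label":
            self._in_label = True
            self._label_buf = []
            self._current = None
            self._collecting = False
        elif tag == "td":
            if self._collecting:
                # A new cell opened before the value cell was closed.
                self._current = None
                self._collecting = False
            elif self._current is not None:
                self._collecting = True

    def handle_endtag(self, tag):
        if tag == "span" and self._in_label:
            self._in_label = False
            self._current = "".join(self._label_buf).strip().rstrip(":")
            self._segments[self._current] = []
        elif tag == "td" and self._collecting:
            self._current = None
            self._collecting = False

    def handle_data(self, data):
        if self._in_label:
            self._label_buf.append(data)
        elif self._collecting:
            self._segments[self._current].append(data)

    @property
    def fields(self):
        # Text runs are separated by tags such as <br>; join them with ", ".
        return {
            label: ", ".join(" ".join(s.split()) for s in runs if s.strip())
            for label, runs in self._segments.items()
        }


parser = LabelValueParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.fields)
```

Because the parser is event based, it copes with the unclosed Phone cell in the sample, and a new field only needs a new label in the page, not a new regular expression.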