Error in regex formulation for web scraping in Python


Problem description

I am trying to scrape some information from a website. I need 8 fields, and I get 5 of them, but 3 fields always come back empty. I think there is a mistake in my regular expressions. I am doing this in Python and I don't have to use BS. Here are the HTML fields I need to scrape, taken from one of the web pages as an example.


<td><span class="facultyName">John Matthew Falletta, MD</span>

<span class="primaryTitle">Professor of Pediatrics</span>

<span class="secondaryTitle">Professor in the School of Nursing</span>

<td><span class="label">Department:</span>
        &nbsp;&nbsp;
    </td><td>Pediatrics</td>

<td><span class="label">Division:</span>
        &nbsp;&nbsp;
    </td><td>Hematology/Oncology</td>

<td><span class="label">Address:</span></td><td>Box 2991<br>DUMC<br>Durham, NC &nbsp;27710   </td>

<td><span class="label">Phone:</span></td><td>
       (919)
       668-5111<br>

<td><span class="label">FAX:</span></td><td>                
        (919)
        688-5125</td>

Here is my code, with the regular expression for each type of tag:


patFinderFullname = re.compile('<span class="facultyName">(.*)</span>')
patFinderPTitle = re.compile('<span class="primaryTitle">(.*)</span>')
patFinderSTitle = re.compile('<span class="secondaryTitle">(.*)</span>')
patFinderDepartment = re.compile('<span class="label">Department:</span>\s+&nbsp;&nbsp;\s+</td><td>(.*)</td>')
patFinderDivision = re.compile('<span class="label">Division:</span>\s+&nbsp;&nbsp;\s+</td><td>(.*)')

patFinderAddress = re.compile(' <span class="label">Address:</span>\s+(.*)\s+</td>')
patFinderPhone = re.compile('<span class="label">Phone:</span></td><td>\s*(.*?)\s*<br>')
patFinderFax = re.compile('<td><span class="label">FAX:</span>\s+</td><td>\s+(.*)</td>')

The first five fields come out correct, but the last three (Address, Phone and Fax) are always empty. Can anyone point out what I am missing, or what is wrong with the regular expressions for the last three fields? I posted an earlier question [1], but these problems came up afterwards, so I am asking them separately.

[1]: How to scrape HTML tags that are spread over multiple lines in Python?

Recommended answer

patFinderAddress = re.compile(r'<td><span class="label">Address:</span></td>.*?</td>')
patFinderPhone = re.compile(r'<td><span class="label">Phone:</span>\s*</td><td>\s*^\s*.*\s*^\s*.*<br>', re.M)
patFinderFax = re.compile(r'<td><span class="label">FAX:</span>\s*</td><td>\s*^\s*.*\s*^\s*.*</td>', re.M)

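Here are some regexes that work with your data. The last two weren't working because the data spans multiple lines, and `.` does not match a newline by default. The first didn't work because the pattern did not fit the markup: it starts with a literal space and requires whitespace (`\s+`) right after `</span>`, where the HTML has `</td><td>`.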
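The patterns above match the whole cell but contain no capture group, so the value still has to be cut out of the match. As a rough sketch of one way to capture the values directly, the helper below uses `re.S` so that `.` also matches newlines. The function name `find_field` and the cleanup rules (replacing `&nbsp;` and `<br>`) are assumptions made for illustration, not part of the original answer.

```python
import re

# Sample markup copied from the question (only the three problem fields).
html = '''<td><span class="label">Address:</span></td><td>Box 2991<br>DUMC<br>Durham, NC &nbsp;27710   </td>

<td><span class="label">Phone:</span></td><td>
       (919)
       668-5111<br>

<td><span class="label">FAX:</span></td><td>
        (919)
        688-5125</td>
'''

def find_field(label, terminator, text):
    """Return the text between a label cell and `terminator`, whitespace-collapsed.

    `find_field` is a hypothetical helper; `terminator` is the tag that ends the value.
    """
    pattern = re.compile(
        r'<span class="label">' + re.escape(label) + r'</span>\s*</td><td>(.*?)'
        + re.escape(terminator),
        re.S,  # let "." match newlines so a value may span several lines
    )
    match = pattern.search(text)
    if match is None:
        return ''
    # Assumed cleanup: treat &nbsp; as a space and <br> as a separator.
    value = match.group(1).replace('&nbsp;', ' ').replace('<br>', ', ')
    return ' '.join(value.split())

address = find_field('Address:', '</td>', html)
phone = find_field('Phone:', '<br>', html)
fax = find_field('FAX:', '</td>', html)
```

The lazy `(.*?)` stops at the first terminator after the label, which is why Phone ends at `<br>` while Address and FAX end at `</td>`.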

But for HTML parsing, use an HTML parser: it is far more robust and gives you the data you want rather than this eyesore of HTML strings.
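As a minimal sketch of the parser approach, assuming the standard-library `html.parser` module is acceptable when BS is not used: the class below collects the text of each classed `<span>` and pairs every `label` span with the table cell that follows it. The class name `FacultyParser`, the `collapse` helper, and the choice of dictionary keys are hypothetical and were not part of the original answer.

```python
from html.parser import HTMLParser

# Sample markup copied from the question.
SAMPLE = '''<td><span class="facultyName">John Matthew Falletta, MD</span>
<span class="primaryTitle">Professor of Pediatrics</span>
<span class="secondaryTitle">Professor in the School of Nursing</span>
<td><span class="label">Department:</span>
        &nbsp;&nbsp;
    </td><td>Pediatrics</td>
<td><span class="label">Division:</span>
        &nbsp;&nbsp;
    </td><td>Hematology/Oncology</td>
<td><span class="label">Address:</span></td><td>Box 2991<br>DUMC<br>Durham, NC &nbsp;27710   </td>
<td><span class="label">Phone:</span></td><td>
       (919)
       668-5111<br>
<td><span class="label">FAX:</span></td><td>
        (919)
        688-5125</td>
'''

def collapse(parts):
    """Join text fragments and collapse runs of whitespace (including &nbsp;)."""
    return ' '.join(''.join(parts).split())

class FacultyParser(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = {}
        self._span_class = None  # class of the <span> being read, if any
        self._span_text = []
        self._label = None       # label whose value cell comes next
        self._in_value = False
        self._value_text = []

    def _finish_value(self):
        if self._in_value:
            self.fields[self._label] = collapse(self._value_text).strip(' ,')
            self._label, self._in_value, self._value_text = None, False, []

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self._span_class = dict(attrs).get('class')
            self._span_text = []
        elif tag == 'td':
            self._finish_value()  # also ends a cell that was never closed
            if self._label is not None:
                self._in_value = True
        elif tag == 'br' and self._in_value:
            self._value_text.append(', ')  # assumed separator for <br>

    def handle_endtag(self, tag):
        if tag == 'span' and self._span_class is not None:
            text = collapse(self._span_text)
            if self._span_class == 'label':
                self._label = text.rstrip(':')
            else:
                self.fields[self._span_class] = text
            self._span_class = None
        elif tag == 'td':
            self._finish_value()

    def handle_data(self, data):
        if self._span_class is not None:
            self._span_text.append(data)
        elif self._in_value:
            self._value_text.append(data)

    def close(self):
        super().close()
        self._finish_value()  # flush a value cell left open at end of input

parser = FacultyParser()
parser.feed(SAMPLE)
parser.close()
fields = parser.fields
```

Because the parser reacts to tags rather than to exact character sequences, line breaks and indentation inside a cell no longer matter, and the unclosed Phone cell is ended by the next `<td>`.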
