Scrapy: Extract number from page text with regex


Question

I have spent a few hours trying to work out how to search all the text on a page and extract whatever matches a regex. My spider is set up as follows:

def parse(self, response):
    title = response.xpath('//title/text()').extract()
    units = response.xpath('//body/text()').re(r"Units: (\d)")
    print title, units

I would like to pull out the number after "Units: " on each page. When I run Scrapy on a page with Units: 351 in the body, I get only the page title, with a bunch of escape characters before and after it, and nothing for units.

I am new to Scrapy and have a little Python experience. Any help with how to extract the integer after Units: and remove the extra escape characters "u'\r\n\t..." from the title would be much appreciated.

As requested in a comment, here is a partial HTML extract of an example page. Note that the text could be inside tags other than the p in this example:

<body>
<div> Some content and multiple Divs here </div>
<h1>This is the count for Dala</h1>
<p><strong>Number of Units:</strong> 801</p>
<p>We will have other content here and more divs beyond</p>
</body>

Based on the answer below, this is what got me most of the way there. I am still working on removing "Units: " and the extra escape characters.

units = response.xpath('string(//body)').re("(Units: [\d]+)")

Recommended answer

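Try matching against the string value of the body instead. //body/text() selects only the text nodes that are direct children of body, so text nested inside p or strong never reaches the regex, whereas string(//body) concatenates the text of all descendants. Note the + after \d, which is needed to capture more than one digit; because only the digits are inside the capture group, "Units: " is not part of the result:

response.xpath('string(//body)').re(r"Units: (\d+)")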
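
To illustrate the regex and the title cleanup on their own, here is a minimal sketch that uses only the standard library. The strings body_text and raw_title are made-up stand-ins for what string(//body) and //title/text() might return, and the helper names extract_units and clean_title are hypothetical, not part of Scrapy. It uses Python 3 print syntax, unlike the spider in the question.

```python
import re

# Hypothetical stand-ins for values a spider would get from the response,
# e.g. the string value of //body and the text of //title.
body_text = (
    "\r\n\tSome content and multiple Divs here\r\n"
    "\tThis is the count for Dala\r\n"
    "\tNumber of Units: 801\r\n"
    "\tWe will have other content here\r\n"
)
raw_title = "\r\n\t  Dala unit count \r\n\t"

# Only the digits are inside the capture group, so "Units: " is dropped.
# \d+ matches one or more digits; \s* tolerates extra or missing spaces.
UNITS_RE = re.compile(r"Units:\s*(\d+)")


def extract_units(text):
    """Return the integer after 'Units:' in text, or None if there is no match."""
    match = UNITS_RE.search(text)
    return int(match.group(1)) if match else None


def clean_title(title):
    """Collapse runs of whitespace (\\r, \\n, \\t, spaces) into single spaces."""
    return " ".join(title.split())


units = extract_units(body_text)
title = clean_title(raw_title)
print(title, units)
```

Splitting on whitespace and rejoining removes the leading and trailing \r\n\t characters as well as any runs inside the title; if only the ends need trimming, title.strip() would be enough.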
