Scrapy: Extract number from page text with regex
Question
I have spent a few hours trying to work out how to search all the text on a page and extract whatever matches a regex. My spider is set up as follows:
def parse(self, response):
    title = response.xpath('//title/text()').extract()
    units = response.xpath('//body/text()').re(r"Units: (\d)")
    print title, units
I would like to pull out the number after "Units: " on each page. When I run Scrapy on a page with Units: 351 in the body, I get only the page title, with a bunch of escape characters before and after it, and nothing for units.
I am new to Scrapy and have a little Python experience. Any help with how to extract the integer after Units: and remove the extra escape characters "u'\r\n\t..." from the title would be much appreciated.
As requested in a comment, here is a partial HTML extract of an example page. Note that the text could sit inside tags other than the p used in this example:
<body>
<div> Some content and multiple Divs here </div>
<h1>This is the count for Dala</h1>
<p><strong>Number of Units:</strong> 801</p>
<p>We will have other content here and more divs beyond</p>
</body>
Based on the answer below, this is what got me most of the way there. I am still working on removing "Units: " from the match and stripping the extra escape characters.
units = response.xpath('string(//body)').re("(Units: [\d]+)")
Recommended answer
Try:
response.xpath('string(//body)').re(r"Units: (\d+)")
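The difference between capturing one digit and capturing the whole number can be checked without running a spider. The following is a minimal sketch that uses only the standard `re` module. `body_text` and `raw_title` are made-up stand-ins for what `string(//body)` and the title text might return for the sample page (the exact whitespace and the title wording are assumptions), and `extract_units` and `clean_title` are hypothetical helper names, not part of Scrapy.

```python
import re

# Assumed stand-in for the text that string(//body) would yield for the
# sample page; the real whitespace will differ.
body_text = """
 Some content and multiple Divs here
This is the count for Dala
Number of Units: 801
We will have other content here and more divs beyond
"""

# Assumed stand-in for a title padded with the escapes the question describes.
raw_title = "\r\n\t  Count for Dala \r\n\t"


def extract_units(text):
    """Return the integer that follows 'Units:', or None if there is no match."""
    # \d+ captures the whole run of digits; a bare \d would capture only '8'.
    match = re.search(r"Units:\s*(\d+)", text)
    return int(match.group(1)) if match else None


def clean_title(title):
    """Collapse runs of whitespace (including \\r, \\n, \\t) and trim the ends."""
    return " ".join(title.split())


units = extract_units(body_text)
title = clean_title(raw_title)
print(title, units)
```

Two notes on carrying this back into the spider. First, `//body/text()` selects only the text nodes that are direct children of `body`, so text nested inside `p` or `strong` is never seen by the regex, whereas `string(//body)` concatenates all descendant text. Second, the same pattern should work with the selector's `.re_first(r"Units:\s*(\d+)")`, which returns a single string rather than a list, and the XPath function `normalize-space(//title)` is one way to get the title without the surrounding whitespace.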