Scrapy: Extract number from page text with regex

Views: 120 | Published: 2021/6/26 19:30:40 | Tags: regex, python-2.7, scrapy

Problem description

I have spent a few hours trying to work out how to search all the text on a page and extract whatever matches a regex. My spider is set up as follows:

def parse(self, response):
        title = response.xpath('//title/text()').extract()
        units = response.xpath('//body/text()').re(r"Units: (\d)")
        print title, units

I would like to pull out the number after "Units: " on each page. When I run Scrapy on a page with Units: 351 in the body, I get only the page title, with a bunch of escape characters before and after it, and nothing for units.

I am new to Scrapy and have a little Python experience. Any help with how to extract the integer after Units: and how to remove the extra escape characters "u'\r\n\t..." from the title would be much appreciated.
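On the title: the "u'\r\n\t..." output is most likely the Python 2 repr of a list holding one unicode string, so the "escapes" are ordinary whitespace (carriage returns, newlines, tabs) around the title text. A minimal sketch of trimming it with plain Python follows; the title string here is made up for illustration.

```python
# Made-up stand-in for what .extract() returns: a list with one string.
extracted = [u"\r\n\t\tExample page title\r\n\t"]

# Take the first item (if any), then collapse every run of whitespace
# into a single space and drop leading/trailing whitespace.
raw_title = extracted[0] if extracted else u""
clean_title = u" ".join(raw_title.split())
```

Inside the spider, something like `response.xpath('normalize-space(//title)').extract_first()` should give a similar result, since `normalize-space` is a standard XPath 1.0 function.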

As requested in a comment, here is a partial HTML extract of an example page. Note that the value could sit inside tags other than the p used in this example:

<body>
<div> Some content and multiple Divs here </div>
<h1>This is the count for Dala</h1>
<p><strong>Number of Units:</strong> 801</p>
<p>We will have other content here and more divs beyond</p>
</body>
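One likely reason the first attempt finds nothing: `//body/text()` selects only the text nodes that are direct children of `<body>`, whereas in this excerpt the label and the number sit inside `<p>` and `<strong>`. The sketch below illustrates the difference using only the standard library. `xml.etree.ElementTree` stands in for Scrapy's selectors here (an assumption, so the example runs without Scrapy), and the markup is a tidied, well-formed copy of the excerpt.

```python
import re
import xml.etree.ElementTree as ET

# Tidied, well-formed copy of the excerpt (the first <div> is closed here).
HTML = (
    "<body>"
    "<div> Some content and multiple Divs here </div>"
    "<h1>This is the count for Dala</h1>"
    "<p><strong>Number of Units:</strong> 801</p>"
    "<p>We will have other content here and more divs beyond</p>"
    "</body>"
)

body = ET.fromstring(HTML)

# Rough equivalent of //body/text(): text nodes directly under <body> only.
direct_text = "".join([body.text or ""] + [child.tail or "" for child in body])

# Rough equivalent of string(//body): all descendant text, concatenated.
full_text = "".join(body.itertext())

pattern = re.compile(r"Units:\s*(\d+)")
direct_units = pattern.findall(direct_text)  # nothing to match against
full_units = pattern.findall(full_text)      # finds the number
```

Because the number lives in nested elements, only the concatenated text contains "Units: 801"; the direct text nodes of `<body>` are empty in this copy.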

Based on the answer below, this is what got me most of the way there. I am still working on removing "Units: " from the result and stripping the extra escape characters.

units = response.xpath('string(//body)').re("(Units: [\d]+)")
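For the remaining step, one option is to move the parentheses so that only the digits are captured. Scrapy's `.re()` is built on Python's `re` module and returns the captured groups when the pattern has any, so the sketch below uses `re.findall` directly as a stand-in. The sample text is made up for illustration.

```python
import re

# Made-up sample of what string(//body) might contain.
text = "Number of Units: 801"

# Parentheses around the whole pattern: the label is part of the result.
with_label = re.findall(r"(Units: [\d]+)", text)

# Original attempt: \d without + captures a single digit only.
one_digit = re.findall(r"Units: (\d)", text)

# Parentheses around the digits only, with + to take the whole number
# and \s* to tolerate varying whitespace after the colon.
digits_only = re.findall(r"Units:\s*(\d+)", text)
```

In the spider this would presumably become `response.xpath('string(//body)').re(r"Units:\s*(\d+)")`. The results are still strings, so `int(...)` would be needed if an integer is wanted.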
