Scrapy extracting text from div

Views: 33 | Published: 2021/7/17 18:35:44 | Tags: xpath, scrapy

Problem description

I am building a simple scraper with Scrapy but am having issues extracting certain parts of the data. The website contains about 20 of the following blocks of code:

 <div class="row result">
    <div class="updateCont date col-md-2 col-sm-2 col-xs-3">
         <span>    
            <strong>Fri. 10 Feb</strong> <br />0:00 AM
         </span>
    </div>
    <div class="updateCont eventIcon col-md-1 col-sm-1 col-xs-3">
        <div class="icon ">
            <i class="fa fa-update"></i>
        </div>
    </div>
    <div class="updateCont event col-md-9 col-sm-8 col-xs-6">
        <span> 
              The buyer has been notified of this update. <br />
              <span class="inner department">
                  124
              </span>
        </span>
    </div>
</div>

I have managed to extract each one of these with:

sel = Selector(text=response.body)
updates =  sel.xpath("//div[@class='row result']")

I would now like to isolate the date and convert it into a datetime object, and also extract the updateCont event string: "The buyer has been notified of this update."

I tried:

for update in updates:
    date = update.xpath('//span').extract()
    print(len(date))

which results in 7. I was expecting it to be 3. More worryingly, if I print out just date, it prints the same data three times. I was expecting three different lots of data, as there are three separate spans in the HTML.

Is

sel = Selector(text=response.body)
updates =  sel.xpath("//div[@class='row result']")

the correct code to isolate the sections? What would be the best approach to extract the spans?

Solution

In [19]: for update in updates:
    ...:     spans = update.xpath('.//span')
    ...:     for span in spans:
    ...:         text = span.xpath('normalize-space()').extract_first()
    ...:         print(text)
    ...:

out:

Fri. 10 Feb 0:00 AM
The buyer has been notified of this update. 124
124

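Use a leading . (.//span instead of //span) to restrict the query to the current node. An XPath that starts with // is evaluated from the document root even when it is called on a sub-selector, so each update returned every span on the page.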
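
The question also asks about turning the date text into a datetime object, which the answer does not cover. Here is one possible sketch using only the standard library. The function name parse_update_date and the year argument are assumptions: the page text carries no year, so the caller has to supply one. A regular expression is used rather than strptime because, as far as I can tell, the %I directive expects hours from 1 to 12 and would not accept "0:00 AM".

```python
import re
from datetime import datetime

# English month abbreviations, hard-coded so parsing does not depend on the locale.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}

# Matches text such as "10 Feb 0:00 AM"; a leading weekday like "Fri." is skipped.
PATTERN = re.compile(
    r"(?P<day>\d{1,2})\s+(?P<month>[A-Za-z]{3})\s+"
    r"(?P<hour>\d{1,2}):(?P<minute>\d{2})\s*(?P<ampm>AM|PM)",
    re.IGNORECASE,
)


def parse_update_date(text, year):
    """Parse text like 'Fri. 10 Feb 0:00 AM' into a datetime.

    The year is not present in the text, so it must be passed in (assumption).
    """
    match = PATTERN.search(text)
    if match is None:
        raise ValueError(f"unrecognised date text: {text!r}")

    # Map both 0 and 12 to the start of the half-day, then shift for PM.
    hour = int(match["hour"]) % 12
    if match["ampm"].upper() == "PM":
        hour += 12

    return datetime(
        year,
        MONTHS[match["month"].title()],
        int(match["day"]),
        hour,
        int(match["minute"]),
    )


# The year 2017 is an illustrative assumption, not something stated on the page.
print(parse_update_date("Fri. 10 Feb 0:00 AM", 2017))
```

The input string is the kind of value that normalize-space() returns for the date span, so the helper could be called on the first text extracted from each update.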
