Scrapy extracting text from div


Problem description

I am building a simple scraper with Scrapy but am having issues extracting certain parts of the data. The website contains about 20 of the following blocks of code:

 <div class="row result">
    <div class="updateCont date col-md-2 col-sm-2 col-xs-3">
         <span>    
            <strong>Fri. 10 Feb</strong> <br />0:00 AM
         </span>
    </div>
    <div class="updateCont eventIcon col-md-1 col-sm-1 col-xs-3">
        <div class="icon ">
            <i class="fa fa-update"></i>
        </div>
    </div>
    <div class="updateCont event col-md-9 col-sm-8 col-xs-6">
        <span> 
              The buyer has been notified of this update. <br />
              <span class="inner department">
                  124
              </span>
        </span>
    </div>
</div>

I have managed to extract each one of these with:

sel = Selector(text=response.body)
updates =  sel.xpath("//div[@class='row result']")

I would now like to isolate the date and convert it into a datetime object, and also extract the updateCont event string: "The buyer has been notified of this update."

I tried:

for update in updates:
    date = update.xpath('//span').extract()
    print(len(date))

which prints 7. I was expecting 3. More worryingly, if I print date itself, it prints the same data three times. I was expecting three different sets of data, since there are three separate sections in the HTML.

Is

sel = Selector(text=response.body)
updates =  sel.xpath("//div[@class='row result']")

the correct code to isolate the sections? What would be the best approach to extract the spans?

Solution

In [19]: for update in updates:
    ...:     spans = update.xpath('.//span')  # leading . keeps the search inside this div
    ...:     for span in spans:
    ...:         text = span.xpath('normalize-space()').extract_first()
    ...:         print(text)
    ...:

out:

Fri. 10 Feb 0:00 AM
The buyer has been notified of this update. 124
124

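Start the XPath with . (that is, .//span) to restrict it to the current node. A path that begins with // is evaluated from the document root even when it is called on a nested selector, which is why every update returned the same spans.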
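
The question also asks for the date text as a datetime object. The sketch below is one possible approach, not something the answer provides. It assumes the text always looks like "Fri. 10 Feb 0:00 AM" with English month abbreviations, and because the page shows no year, the caller has to supply one. The name parse_update_date and the year 2017 in the usage line are hypothetical. The hour is handled by hand because "0:00 AM" is not a valid 12-hour value for strptime's %I directive.

```python
import re
from datetime import datetime

# English month abbreviations, so the result does not depend on the locale.
MONTHS = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()

# Weekday (ignored), day, month abbreviation, hour, minute, AM/PM.
DATE_RE = re.compile(
    r"^\w+\.\s+(\d{1,2})\s+([A-Za-z]{3})\s+(\d{1,2}):(\d{2})\s+(AM|PM)$"
)


def parse_update_date(text, year):
    """Hypothetical helper: turn 'Fri. 10 Feb 0:00 AM' into a datetime.

    The year is an assumption supplied by the caller, since the scraped
    text does not contain one.
    """
    match = DATE_RE.match(text.strip())
    if match is None:
        raise ValueError("unexpected date format: %r" % text)
    day, month, hour, minute, meridiem = match.groups()
    # Map both 0 and 12 to the start of the half-day, then add 12 for PM.
    hour24 = int(hour) % 12 + (12 if meridiem == "PM" else 0)
    return datetime(year, MONTHS.index(month) + 1, int(day), hour24, int(minute))


print(parse_update_date("Fri. 10 Feb 0:00 AM", 2017))
```

The input string here is the whitespace-normalized text of the date span, such as the first line of the output above.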
