Scrapy: extracting text from a div
Question
I am building a simple scraper with Scrapy but am having issues extracting certain parts of the data. The website contains about 20 of the following blocks of code:
<div class="row result">
    <div class="updateCont date col-md-2 col-sm-2 col-xs-3">
        <span>
            <strong>Fri. 10 Feb</strong> <br />0:00 AM
        </span>
    </div>
    <div class="updateCont eventIcon col-md-1 col-sm-1 col-xs-3">
        <div class="icon ">
            <i class="fa fa-update"></i>
        </div>
    </div>
    <div class="updateCont event col-md-9 col-sm-8 col-xs-6">
        <span>
            The buyer has been notified of this update. <br />
            <span class="inner department">
                124
            </span>
        </span>
    </div>
</div>
I have managed to extract each one of these with:
from scrapy.selector import Selector

sel = Selector(text=response.body)
updates = sel.xpath("//div[@class='row result']")
I would now like to isolate the date and convert it into a datetime object, and also extract the updateCont event string: "The buyer has been notified of this update."
I tried:
for update in updates:
    date = update.xpath('//span').extract()
    print(len(date))
which results in 7. I was expecting it to be 3. More worryingly, if I print out just date, it prints the same data three times. I was expecting three different sets of data, as there are three separate spans in the HTML.
Is
sel = Selector(text=response.body)
updates = sel.xpath("//div[@class='row result']")
the correct code to isolate the sections? What would be the best approach to extract the spans?
Solution
In [19]: for update in updates:
    ...:     # The leading dot makes the XPath relative to this update
    ...:     spans = update.xpath('.//span')
    ...:     for span in spans:
    ...:         text = span.xpath('normalize-space()').extract_first()
    ...:         print(text)
    ...:
out:
Fri. 10 Feb 0:00 AM
The buyer has been notified of this update. 124
124
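Use a leading . in the XPath (.//span rather than //span) to restrict the search to the current node. Without the dot, //span is evaluated against the whole document, so every update returns the same full set of spans.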
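
The question also asks how to turn the date text into a datetime object, which the answer leaves open. The following is a minimal sketch of one way to do that, not a definitive implementation. It assumes the date string has already been extracted and normalised to the form "Fri. 10 Feb 0:00 AM", as in the output above. The helper name parse_update_date, the MONTHS table and the year argument are all hypothetical: the page shows no year, so the caller has to supply one (2017 is used here purely as an illustration). The time is converted by hand because the sample writes midnight as "0:00 AM", and the %I directive of strptime does not accept an hour of 0.

```python
import re
from datetime import datetime

# Month abbreviations as they appear in the sample ("Feb"); spelled out here
# so the parsing does not depend on the system locale.
MONTHS = {abbr: number for number, abbr in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Matches strings shaped like "Fri. 10 Feb 0:00 AM".
DATE_PATTERN = re.compile(
    r"^[A-Za-z]+\.\s+(\d{1,2})\s+([A-Za-z]{3})\s+(\d{1,2}):(\d{2})\s+(AM|PM)$"
)


def parse_update_date(text, year):
    """Hypothetical helper: turn 'Fri. 10 Feb 0:00 AM' into a datetime.

    The year is an assumption supplied by the caller, because the
    sample HTML does not contain one.
    """
    match = DATE_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised date text: {text!r}")
    day, month_abbr, hour, minute, meridiem = match.groups()
    # Convert the 12-hour clock by hand: both 0 and 12 map to hour 0,
    # then PM times are shifted by 12 hours.
    hour = int(hour) % 12
    if meridiem == "PM":
        hour += 12
    return datetime(year, MONTHS[month_abbr], int(day), hour, int(minute))


print(parse_update_date("Fri. 10 Feb 0:00 AM", year=2017))
# prints: 2017-02-10 00:00:00
```

Inside the loop from the solution, the first span of each update holds the date, so its normalised text is what would be passed to the helper. The weekday prefix is matched but ignored, since it is implied by the date once a year is chosen.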