How to scrape text included between various tags using scrapy

Question

I am trying to scrape a product description from this link. How do I scrape the whole text, including the text inside nested tags? Here is the hxs call I am using: hxs.select('//div[@class="overview"]/div/text()').extract(). The original HTML looks like this:

These classic sneakers from
<b>Puma</b>
are best known for their neat and simple design. These basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a
<b>leather and synthetic upper.</b>
A vulcanized non-slip rubber sole that is
<b>abrasion resistant ensures good traction.</b>

If I use the hxs call mentioned above, I get this:

hxs.select('//div[@class="overview"]/div/text()').extract()
Output: 
[u'These classic sneakers from ',
 u' are best known for their neat and simple design. These basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a ',
 u' A vulcanized non-slip rubber sole that is ',
 u' sportswear, jeans and tees.',
 u' Gently brush away dust or dirt using a soft cleaning brush.',
 u'\r\nUse a leather conditioner/wax and a brush for added shine.',
 u'Avoid contact with liquids.\xa0']

What I want is this:

These classic sneakers from Puma are best known for their neat and simple design. These
 basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a leather and synthetic upper.A vulcanized non-slip rubber sole 
that is abrasion resistant ensures good traction.

As you can see, the text between the <b> tags is missing. Can you tell me how to extract the whole text from the page?

Recommended answer

Try using

 //div[@class="overview"]/div

This selects the whole element, markup included, so the text inside the nested tags is no longer dropped. You can then remove the tags with a regex, or leave them in if they are not a problem.

Something like this regex:

 re.sub('<[^>]*>', '', mystring)
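
To make the regex step concrete, here is a minimal sketch using only the standard library. The sample_html string and the strip_tags helper are assumptions made up for illustration; sample_html stands in for one item of whatever hxs.select('//div[@class="overview"]/div').extract() returns on the real page. The whitespace cleanup is an extra step that the answer does not mention.

```python
import re

# Hypothetical sample: stands in for one extracted <div> from the real page.
sample_html = (
    '<div>These classic sneakers from\n'
    '<b>Puma</b>\n'
    'are best known for their neat and simple design. '
    'The pair is equipped with a\n'
    '<b>leather and synthetic upper.</b>\n'
    '</div>'
)


def strip_tags(fragment):
    """Remove anything that looks like a tag, then collapse whitespace."""
    # Same pattern as in the answer: '<', any run of non-'>' characters, '>'.
    text = re.sub(r'<[^>]*>', '', fragment)
    # Extra step (not in the answer): squeeze newlines and repeated spaces.
    return re.sub(r'\s+', ' ', text).strip()


print(strip_tags(sample_html))
```

Note that a regex like this only deletes markup. It does not decode entities such as &amp;, and it can misfire on HTML where a literal '>' appears inside an attribute value, so it is best suited to simple fragments like the one above.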
