How to extract exact tags in scrapy


Problem description

I wrote a spider class for Scrapy to get a piece of a page's content, like so:

#!/usr/bin/python
import html2text
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class StockSpider(BaseSpider):
    name = "stock_spider"
    allowed_domains = ["www.hamshahrionline.ir"]
    start_urls = ["http://www.hamshahrionline.ir/details/261730/Health/publichealth"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
#       sample = hxs.select("WhatShouldIputHere").extract()[AndHere]
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print converter.handle(sample)

My main problem is the statement that I commented out.

How should I set the path and the extract parameter for it?

Can you guide me through this and give me some examples?

Thank you

Solution

First, you need to decide what data you want to get out of the page, then define an Item class with a set of Fields. Then, to fill the item fields with data, you need to use XPath expressions in the parse() method of your spider.

Here's an example that retrieves all of the paragraphs out of the body (all news, I suppose):

from scrapy.item import Item, Field
from scrapy.spiders import Spider  # scrapy.spider was renamed to scrapy.spiders in Scrapy 1.0
from scrapy.selector import Selector


class MyItem(Item):
    content = Field()


class StockSpider(Spider):
    name = "stock_spider"
    allowed_domains = ["www.hamshahrionline.ir"]
    start_urls = ["http://www.hamshahrionline.ir/details/261730/Health/publichealth"]

    def parse(self, response):
        sel = Selector(response)
        paragraphs = sel.xpath("//div[@class='newsBodyCont']/p/text()").extract()
        for p in paragraphs:
            item = MyItem()
            item['content'] = p
            yield item

Note that I'm using the Selector class because HtmlXPathSelector is deprecated. For the same reason, I'm using the xpath() method instead of select().
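
To see what the XPath expression selects, and what an index after extract() does, here is a minimal sketch that runs without a crawl. It rests on two assumptions: the markup below is a made-up, well-formed stand-in for the article page, and the lookup uses the standard library's ElementTree (which supports only a subset of XPath) instead of Scrapy's Selector. The helper name extract_paragraphs is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the article page (not the real markup).
SAMPLE_PAGE = """
<html>
  <body>
    <div class="header"><p>Site navigation</p></div>
    <div class="newsBodyCont">
      <p>First paragraph of the article.</p>
      <p>Second paragraph of the article.</p>
    </div>
  </body>
</html>
"""


def extract_paragraphs(page):
    """Return the text of every <p> directly inside div.newsBodyCont."""
    root = ET.fromstring(page.strip())
    # ElementTree has no text() step, so select the <p> elements
    # and read their .text attribute instead.
    nodes = root.findall(".//div[@class='newsBodyCont']/p")
    return [node.text for node in nodes]


paragraphs = extract_paragraphs(SAMPLE_PAGE)
print(paragraphs)     # every match, as a list
print(paragraphs[0])  # a single match, picked by index
```

Scrapy's extract() also returns a list of all matches, so an index such as [0] after it picks one of them and raises IndexError when nothing matched. Real pages are often not well-formed XML, which is one reason to do the actual extraction with Scrapy's Selector rather than ElementTree.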

Also, note that it is better to move your Item definition into a separate Python module to follow the Scrapy project structure.

Hope that helps.
