How to extract exact tags in scrapy
Problem description
I wrote a spider class for Scrapy in order to get a piece of a page's content, like so:
#!/usr/bin/python
import html2text
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class StockSpider(BaseSpider):
    name = "stock_spider"
    allowed_domains = ["www.hamshahrionline.ir"]
    start_urls = ["http://www.hamshahrionline.ir/details/261730/Health/publichealth"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # sample = hxs.select("WhatShouldIputHere").extract()[AndHere]
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print converter.handle(sample)
My main problem is the statement that I commented out.
How should I set the path and the index after extract() there?
Can you guide me through this and give me some examples?
Thank you
First you need to decide what data you want to get out of the page, then define an Item class with a set of Fields. Then, in order to fill the item fields with data, you need to use XPath expressions in the parse() method of your spider.
Here's an example that retrieves the text of all the paragraphs in the news body (the whole article, I suppose):
from scrapy.item import Item, Field
from scrapy.spider import Spider
from scrapy.selector import Selector


class MyItem(Item):
    content = Field()


class StockSpider(Spider):
    name = "stock_spider"
    allowed_domains = ["www.hamshahrionline.ir"]
    start_urls = ["http://www.hamshahrionline.ir/details/261730/Health/publichealth"]

    def parse(self, response):
        sel = Selector(response)
        # Text of every <p> directly inside the div with class "newsBodyCont"
        paragraphs = sel.xpath("//div[@class='newsBodyCont']/p/text()").extract()
        # Yield one item per extracted paragraph
        for p in paragraphs:
            item = MyItem()
            item['content'] = p
            yield item
Note that I'm using the Selector class since HtmlXPathSelector is deprecated. Also, I'm using the xpath() method instead of select() for the same reason.
Also, note that you'd better move your Item definition into a separate Python module to follow the Scrapy project structure.
Hope that helps.