Using scrapy to find specific text from multiple websites

Question

I would like to crawl multiple pages (on the same domain) and check them for a specific keyword. I found this script, but I can't work out how to add the keyword to search for. The script needs to find the keyword and report the link where it was found. Could anyone point me to where I can read more about this? I have been reading Scrapy's documentation, but I can't seem to find it.

Thanks.

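import scrapy
from scrapy import Request

# URL, starting_number and number_of_pages are assumed to be defined elsewhere in the script.

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self, *args, **kwargs):
        # let scrapy.Spider run its own initialisation first
        super().__init__(*args, **kwargs)
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        # parsing data from the webpage goes here
        pass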

Recommended answer

You'll need to use a parser or a regex to find the text you are looking for inside the response body.

Every Scrapy callback receives a response object that holds the response body, which you can inspect with response.body (for example, inside the parse method). From there, use a regex or, better, XPath or CSS selectors to reach your text, based on the markup structure of the page you crawled.
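As a rough illustration of the regex route, here is a minimal sketch that checks a page's text for keywords and reports the URL where they were found. The keyword list, the helper names find_keywords and keyword_hits, and the sample page are all hypothetical. The helper only relies on the .url and .text attributes (response.text is the decoded form of response.body on Scrapy's text responses), so a stand-in object is used here instead of a real crawl.

```python
import re
from types import SimpleNamespace

KEYWORDS = ["refund", "warranty"]  # hypothetical keywords, not from the question


def find_keywords(text, keywords):
    """Return the keywords that occur in text as whole words, ignoring case."""
    found = []
    for kw in keywords:
        if re.search(r"\b" + re.escape(kw) + r"\b", text, flags=re.IGNORECASE):
            found.append(kw)
    return found


def keyword_hits(response, keywords=KEYWORDS):
    """Yield one item for a page that contains at least one keyword.

    Only .url and .text are used, so this works on any object that has them.
    """
    matches = find_keywords(response.text, keywords)
    if matches:
        yield {"url": response.url, "keywords": matches}


# Stand-in for a Scrapy response so the sketch runs without a crawl.
page = SimpleNamespace(
    url="https://example.com/page/1000",
    text="<p>Our Warranty covers parts.</p>",
)
hits = list(keyword_hits(page))
```

Inside the spider from the question, the body of parse could then be a single line, yield from keyword_hits(response), so each matching page produces an item carrying its link.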

Scrapy lets you use the response object as a Selector, so you can get the title of the page with response.xpath('//head/title/text()'), for example.

Hope it helps.
