Find a word on a website and get its page link


Question

I want to scrape a few websites and see if the word "katalog" is present there. If yes, I want to retrieve the links of all the tabs/sub-pages where that word is present. Is it possible to do so?

I tried following this tutorial, but the wordlist.csv I get at the end is empty, even though the word "catalog" does exist on the website.

https://www.phooky.com/blog/find-specific-words-on-web-pages-with-scrapy/

import re

from scrapy.item import Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

wordlist = [
    "katalog",
    "downloads",
    "download",
]

def find_all_substrings(string, sub):
    starts = [match.start() for match in re.finditer(re.escape(sub), string)]
    return starts

class WebsiteSpider(CrawlSpider):

    name = "webcrawler"
    allowed_domains = ["www.reichelt.com/"]
    start_urls = ["https://www.reichelt.com/"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    crawl_count = 0
    words_found = 0                                 

    def check_buzzwords(self, response):

        self.__class__.crawl_count += 1

        crawl_count = self.__class__.crawl_count

        url = response.url
        contenttype = response.headers.get("content-type", "").decode('utf-8').lower()
        data = response.body.decode('utf-8')

        for word in wordlist:
            substrings = find_all_substrings(data, word)
            print("substrings", substrings)
            for pos in substrings:
                ok = False
                if not ok:
                    self.__class__.words_found += 1
                    print(word + ";" + url + ";")
        return Item()

    def _requests_to_follow(self, response):
        if getattr(response, "encoding", None) != None:
            return CrawlSpider._requests_to_follow(self, response)
        else:
            return []

How can I find all instances of a word on a website and obtain the link of the page where the word is found?

Answer

The main problem is a wrong allowed_domains - it must not include the path /:

    allowed_domains = ["www.reichelt.com"]

Another problem may be that the tutorial is 3 years old (it links to the documentation for Scrapy 1.5, while the newest version is 2.5.0).

It also uses some unnecessary lines of code.

It gets contenttype but never uses it to decode response.body. Your URL uses iso8859-1 for the original language and utf-8 for ?LANGUAGE=PL - but you can simply use response.text and it will decode automatically.
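For illustration, a minimal sketch of the difference (not part of the original answer; the response body and encoding below are made up for the demo):

# sketch only - compare manual decoding with Scrapy's automatic decoding
from scrapy.http import HtmlResponse

body = 'Produktübersicht: katalog'.encode('iso8859-1')
response = HtmlResponse(url='https://www.reichelt.com/', body=body, encoding='iso8859-1')

print(response.body.decode('iso8859-1'))  # manual: you have to know the page encoding
print(response.text)                      # Scrapy decodes it for you using the detected encoding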

It also sets ok = False and checks it later, but that is completely useless.

Minimal working code - you can copy it to a single file and run it as python script.py without creating a project.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re

wordlist = [
    "katalog",
    "catalog",
    "downloads",
    "download",
]

def find_all_substrings(string, sub):
    return [match.start() for match in re.finditer(re.escape(sub), string)]

class WebsiteSpider(CrawlSpider):

    name = "webcrawler"
    
    allowed_domains = ["www.reichelt.com"]
    start_urls = ["https://www.reichelt.com/"]
    #start_urls = ["https://www.reichelt.com/?LANGUAGE=PL"]
    
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    #crawl_count = 0
    #words_found = 0                                 

    def check_buzzwords(self, response):
        print('[check_buzzwords] url:', response.url)
        
        #self.crawl_count += 1

        #content_type = response.headers.get("content-type", "").decode('utf-8').lower()
        #print('content_type:', content_type)
        #data = response.body.decode('utf-8')
        
        data = response.text

        for word in wordlist:
            print('[check_buzzwords] check word:', word)
            substrings = find_all_substrings(data, word)
            print('[check_buzzwords] substrings:', substrings)
            
            for pos in substrings:
                #self.words_found += 1
                # only display
                print('[check_buzzwords] word: {} | pos: {} | sub: {} | url: {}'.format(word, pos, data[pos-20:pos+20], response.url))
                # send to file
                yield {'word': word, 'pos': pos, 'sub': data[pos-20:pos+20], 'url': response.url}

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(WebsiteSpider)
c.start() 
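
Once the crawl finishes you can read the results back from output.csv - a small sketch (the file name and the word/pos/sub/url columns come from the FEEDS setting and the yielded dict above):

import csv

# sketch only - print which word was found on which page
with open('output.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        print(row['word'], row['url'])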


I added data[pos-20:pos+20] to the yielded data to see where the substring is, and sometimes it is in a URL like .../elements/adw_2018/catalog/... or in another place like <img alt="catalog" - so using a regex is not necessarily a good idea. It may be better to use an xpath or css selector to search the text only in certain places or only in links.
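
For example, a rough sketch of an alternative callback for the spider above that searches the word only in the visible text of links (my own illustration of that idea, not part of the original answer):

# sketch only - look at link text instead of the raw HTML of the whole page
def check_buzzwords(self, response):
    for word in wordlist:
        for link in response.xpath('//a[contains(text(), "{}")]'.format(word)):
            yield {
                'word': word,
                'text': link.xpath('normalize-space()').get(),
                'url': link.attrib.get('href'),
                'page': response.url,
            }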

Below is a version that searches for links containing words from the list. It uses response.xpath to find all links and then checks whether the word is in the href - so it doesn't need a regex.

One problem is that it treats a link with -downloads- (with s) as a link with both the word download and the word downloads, so it would need a more complex check (e.g. using a regex) to treat it only as a link with the word downloads (see the sketch after the code below).

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

wordlist = [
    "katalog",
    "catalog",
    "downloads",
    "download",
]

class WebsiteSpider(CrawlSpider):

    name = "webcrawler"
    
    allowed_domains = ["www.reichelt.com"]
    start_urls = ["https://www.reichelt.com/"]
    
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    def check_buzzwords(self, response):
        print('[check_buzzwords] url:', response.url)
        
        links = response.xpath('//a[@href]')
        
        for word in wordlist:
            
            for link in links:
                url = link.attrib.get('href')
                if word in url:
                    print('[check_buzzwords] word: {} | url: {} | page: {}'.format(word, url, response.url))
                    # send to file
                    yield {'word': word, 'url': url, 'page': response.url}

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(WebsiteSpider)
c.start() 
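
A minimal sketch of the stricter check mentioned above, using a word-boundary regex (url_has_word is a hypothetical helper name, not something from the original answer):

import re

# sketch only - \b word boundaries make "-downloads-" match "downloads" but not "download"
def url_has_word(url, word):
    return re.search(r'\b{}\b'.format(re.escape(word)), url) is not None

print(url_has_word('/elements/-downloads-/index.html', 'download'))   # False
print(url_has_word('/elements/-downloads-/index.html', 'downloads'))  # True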
