Count words on the page

Views: 51 Published: 2021/7/16 22:14:35 Tags: python-3.x scrapy

Question

How can I scrape any site, search for a given word, and display how many times it occurs?

import scrapy

class LinkedinScraper(scrapy.Spider):
    name = "linked"

    def start_requests(self):
        urls = ['https://www.linkedin.com/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # second-to-last URL segment, e.g. 'www.linkedin.com'
        page = response.url.split("/")[-2]
        # the %s placeholder is required for the % formatting to work
        filename = 'linkedin-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Recommended answer

You can apply a regex to the response body (response.text is the decoded string form of response.body) to find all occurrences anywhere in the page.

For example:

import re

# \b marks a word boundary, so only the standalone word "cat" matches
matches = re.findall(r'\bcat\b', "cat catalog cattering")
print(len(matches), 'cat(s)')

This prints "1 cat(s)" rather than "3 cat(s)", because the word boundaries exclude "catalog" and "cattering".
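To generalize this to any given word, the pattern can be built at run time. The following is a minimal sketch, not part of the original answer: count_word and its ignore_case parameter are hypothetical names chosen for illustration, and it assumes the word consists of ordinary word characters so that \b behaves as expected.

```python
import re


def count_word(text, word, ignore_case=True):
    """Count whole-word occurrences of `word` in `text` (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps characters such as "." or "+" in the word literal
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text, flags))


print(count_word("cat catalog cattering", "cat"))                 # 1
print(count_word("Cat, cat and CAT", "cat"))                      # 3
print(count_word("Cat, cat and CAT", "cat", ignore_case=False))   # 1
```

Inside a spider you would presumably call it as count_word(response.text, 'cat').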

If you need the word only inside certain tags, first select those tags with response.css(), response.xpath(), etc., and then search the selected text.
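The idea of restricting the search to certain tags can be tried without Scrapy. The sketch below uses only the standard library's html.parser as a stand-in; TagTextCollector and count_in_tag are hypothetical helpers, and the sample HTML is made up. In a real spider, something like response.css('p::text').getall() would supply the text instead.

```python
import re
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect text that appears inside one kind of tag (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many target tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # keep text only while we are inside the target tag
        if self.depth > 0:
            self.chunks.append(data)


def count_in_tag(html, tag, pattern):
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    # joining with a space keeps text from neighbouring elements apart,
    # at the cost of splitting a word that spans an inline tag
    return len(re.findall(pattern, ' '.join(collector.chunks)))


sample = '<h1>cat facts</h1><p>A cat sleeps.</p><p>The <b>cat</b> purrs.</p>'
print(count_in_tag(sample, 'p', r'\bcat\b'))   # 2, the <h1> hit is ignored
```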

The complete spider further below shows how to use

 re.findall(pattern, response.text)

but this also matches text inside the HTML tags themselves (tag names and attribute values), not only the visible text.
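The difference is easy to see on a small, made-up HTML fragment. The sketch below is an illustration only: TextOnly and visible_text are hypothetical helpers built on the standard library's html.parser, not Scrapy APIs, and this simple version would also keep the contents of script and style elements.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep only text nodes, dropping tag names and attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)


sample = '<div class="views"><a href="/views">10 views</a> <span>25 views</span></div>'
pattern = r'\bviews\b'

raw_hits = re.findall(pattern, sample)                  # includes class="views" and href="/views"
text_hits = re.findall(pattern, visible_text(sample))   # only "10 views" and "25 views"
print(len(raw_hits), len(text_hits))                    # 4 2
```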

It also shows how to use

response.css('body').re(pattern)

It counts matches for the regex patterns view, \bviews\b and \d+ views on the Stack Overflow home page and displays the first three matches for each.

You can run it without creating a project.

import scrapy
import re

class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = ['https://stackoverflow.com/']

    def parse(self, response):
        print('url:', response.url)

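        # raw strings keep the regex backslashes intact
        for pattern in ['view', r'\bviews\b', r'\d+ views']:
            print('>>> pattern:', pattern)

            # search the whole HTML source, including markup
            result = re.findall(pattern, response.text)
            print('>>>          re:', len(result), result[0:3])

            # search only the serialized <body> element
            result = response.css('body').re(pattern)
            print('>>> response.re:', len(result), result[0:3])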

# --- runs without a project; results are printed to the console ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
c.crawl(MySpider)
c.start()
