Count words on the page
Question
How can I scrape any site, search it for a given word, and display how many times that word occurs?
import scrapy


class LinkedinScraper(scrapy.Spider):
    name = "linked"

    def start_requests(self):
        urls = ['https://www.linkedin.com/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # for 'https://www.linkedin.com/' this is 'www.linkedin.com'
        page = response.url.split("/")[-2]
        # the original format string had no %s placeholder, which raises TypeError
        filename = 'linkedin-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Recommended answer
You can run a regex over the page content (response.body, or its decoded form response.text) to find all occurrences anywhere in the page.
For example:
import re
r = re.findall('\\bcat\\b', "cat catalog cattering")
print(len(r), 'cat(s)')
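This prints "1 cat(s)" rather than "3 cat(s)", because \b matches only at word boundaries, so "catalog" and "cattering" are not counted.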
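The word-boundary idea above can be wrapped in a small helper. This is only a sketch: the function name count_word and its ignore_case parameter are made up for illustration, and it assumes the search word consists of letters, digits, or underscores (for a word such as "c++", \b would not behave as expected).

```python
import re


def count_word(text, word, ignore_case=True):
    """Count whole-word occurrences of `word` in `text` (hypothetical helper)."""
    # re.escape keeps any special regex characters in the word literal
    pattern = r'\b' + re.escape(word) + r'\b'
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(pattern, text, flags))


print(count_word("cat catalog cattering", "cat"))       # 1
print(count_word("Cat catalog cattering cat.", "cat"))  # 2
```

Inside a spider's parse() method this could be called as count_word(response.text, 'cat').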
If you need the word only within certain tags, first select those tags with response.css(), response.xpath(), etc., and then search the result.
The example below shows how to use
re.findall(pattern, response.text)
Note that this searches the raw HTML, so it also matches text inside the tags themselves.
It also shows how to use
response.css('body').re(pattern)
It counts matches for 'view', '\\bviews\\b' and '\d+ views' on Stack Overflow and displays the first three matches for each.
You can run it without creating a project.
import re

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://stackoverflow.com/']

    def parse(self, response):
        print('url:', response.url)
        for pattern in ['view', '\\bviews\\b', r'\d+ views']:
            print('>>> pattern:', pattern)
            # search the whole raw HTML, markup included
            result = re.findall(pattern, response.text)
            print('>>> re:', len(result), result[0:3])
            # apply the regex to the selected <body> element only
            result = response.css('body').re(pattern)
            print('>>> response.re:', len(result), result[0:3])


# --- runs as a plain script, without a Scrapy project ---
c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
c.crawl(MySpider)
c.start()
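Because re.findall(pattern, response.text) scans the raw HTML, it also counts matches in attribute names and values. The sketch below illustrates the difference on a made-up HTML snippet. It deliberately uses the standard library's html.parser instead of Scrapy selectors so that it runs without a network request; the class TextExtractor and the function visible_text are hypothetical names, not part of Scrapy.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def visible_text(html):
    """Return the text nodes of `html` joined by spaces (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.parts)


# made-up sample markup, not taken from any real page
html = '<div class="views" data-views="3"><p>120 views</p><p>No views yet</p></div>'
pattern = r'\bviews\b'

print(len(re.findall(pattern, html)))                # 4 (two hits are in attributes)
print(len(re.findall(pattern, visible_text(html))))  # 2 (text nodes only)
```

In a spider, a comparable effect can probably be had by joining the text nodes selected with response.css() or response.xpath() and running the regex over that string instead of over response.text.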