The pages of the crawled result links do not open
Problem description
This is my Google search result crawl code.
import scrapy
import pandas as pd

from ..items import GoogleScraperItem  # adjust to your project's items module


class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']
    start_urls = [
        f'https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0']

    def parse(self, response):
        titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
        links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        items = []
        for idx in range(len(titles)):
            item = GoogleScraperItem()
            item['title'] = titles[idx]
            item['link'] = links[idx].lstrip("/url?q=")
            items.append(item)
        df = pd.DataFrame(items, columns=['title', 'link'])
        writer = pd.ExcelWriter('test1.xlsx', engine='xlsxwriter')
        df.to_excel(writer, sheet_name='test1.xlsx')
        writer.save()
        return items
I get nine item results, each with a title and a link, for this search:
https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0
When I open the Excel file (test1.xlsx), none of the links open properly. I added the following to settings.py:
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
ROBOTSTXT_OBEY = False
If you look closely at the URLs you extracted, all of them carry sa, ved, and usg query parameters. These are not part of the target sites' URLs; they are Google search result query parameters. To get only the target URL, parse each link with the urllib library and extract just the q query parameter:
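As an aside, the str.lstrip("/url?q=") call in the question cannot work in general: lstrip treats its argument as a set of characters to strip, not as a prefix, and it never touches the trailing parameters. A small sketch with a made-up href:

```python
# lstrip strips any leading characters from the given *set* ({'/', 'u', 'r',
# 'l', '?', 'q', '='} here), so it can eat into the target URL itself:
href = "/url?q=url-shortener.example.com&sa=U"
print(href.lstrip("/url?q="))        # -shortener.example.com&sa=U

# On Python 3.9+, str.removeprefix removes an actual prefix, but the
# Google query params after the target URL still remain:
print(href.removeprefix("/url?q="))  # url-shortener.example.com&sa=U
```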
from urllib.parse import urlparse, parse_qs
parsed_url = urlparse(url)
query_params = parse_qs(parsed_url.query)
target_url = query_params["q"][0]
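Run against a sample href of the shape Google's result links use (the sa/ved/usg values here are made up), this yields a clean target URL:

```python
from urllib.parse import urlparse, parse_qs

# A made-up href in the shape Google's result links use:
href = "/url?q=https://www.apple.com/iphone-12/&sa=U&ved=2ahUKE&usg=AOvVa"

parsed_url = urlparse(href)                 # path "/url", query string after "?"
query_params = parse_qs(parsed_url.query)   # {'q': [...], 'sa': [...], ...}
target_url = query_params["q"][0]
print(target_url)  # https://www.apple.com/iphone-12/
```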
Full working code:
from urllib.parse import urlparse, parse_qs

import scrapy
import pandas as pd

from ..items import GoogleScraperItem  # adjust to your project's items module


class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']
    start_urls = [
        f'https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0']

    def parse(self, response):
        titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
        links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        items = []
        for idx in range(len(titles)):
            item = GoogleScraperItem()
            item['title'] = titles[idx]
            # Parse the item url and keep only the q query param (the target url)
            parsed_url = urlparse(links[idx])
            query_params = parse_qs(parsed_url.query)
            item['link'] = query_params["q"][0]
            items.append(item)
        df = pd.DataFrame(items, columns=['title', 'link'])
        writer = pd.ExcelWriter('test1.xlsx', engine='xlsxwriter')
        df.to_excel(writer, sheet_name='test1.xlsx')
        writer.save()
        return items
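One caveat beyond the original answer: ExcelWriter.save() was deprecated in pandas 1.5 and removed in pandas 2.0, so on current pandas the export step should use the writer as a context manager, which saves and closes the file automatically. Building the DataFrame is unchanged (plain dicts stand in for the scraped GoogleScraperItem objects here):

```python
import pandas as pd

# Plain dicts as stand-ins for the scraped GoogleScraperItem objects:
items = [
    {"title": "iPhone 12 - Apple", "link": "https://www.apple.com/iphone-12/"},
    {"title": "iPhone 12 - Wikipedia", "link": "https://en.wikipedia.org/wiki/IPhone_12"},
]
df = pd.DataFrame(items, columns=["title", "link"])

# On pandas 2.0+, ExcelWriter.save() is gone; let a context manager
# save and close the file instead:
#     with pd.ExcelWriter("test1.xlsx", engine="xlsxwriter") as writer:
#         df.to_excel(writer, sheet_name="results", index=False)

print(df.shape)  # (2, 2)
```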