The page of the crawl result link does not open

Problem description

This is my code for crawling Google search results.

import scrapy
import pandas as pd
from myproject.items import GoogleScraperItem  # adjust to your project's items module

class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']

    start_urls = [
        f'https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0']

    def parse(self, response):
        titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
        links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        items = []

        for idx in range(len(titles)):
            item = GoogleScraperItem()
            item['title'] = titles[idx]
            item['link'] = links[idx].lstrip("/url?q=")
            items.append(item)
            df = pd.DataFrame(items, columns=['title', 'link'])
            writer = pd.ExcelWriter('test1.xlsx', engine='xlsxwriter')
            df.to_excel(writer, sheet_name='test1.xlsx')
            writer.save()
        return items

I get nine result items, each with a title and a link.

https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0

When I open the Excel file (test1.xlsx), none of the links open properly. I have added the following to settings.py:

USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"

ROBOTSTXT_OBEY = False

Solution

If you pay close attention to the URLs you have extracted, all of them have sa, ved and usg query params. Obviously, these are not part of the target sites' URLs but are Google search result query params.
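
For illustration, an extracted href has this general shape (the parameter values below are made up):

/url?q=https://www.apple.com/iphone-12/&sa=U&ved=2ahUKEwj-example&usg=AOvVaw-example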

To get only the target URLs, you should parse the extracted links with the urllib library and keep just the q query param. (The original lstrip("/url?q=") cannot do this: str.lstrip strips a set of leading characters rather than a literal prefix, and it never removes the trailing params.)

from urllib.parse import urlparse, parse_qs

parsed_url = urlparse(url)
query_params = parse_qs(parsed_url.query)
target_url = query_params["q"][0]
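
Applied to a sample link (a made-up example of the /url?q= form shown above), this recovers the bare target URL:

from urllib.parse import urlparse, parse_qs

url = '/url?q=https://www.apple.com/iphone-12/&sa=U&ved=2ahUKEwj-example&usg=AOvVaw-example'
parsed_url = urlparse(url)
query_params = parse_qs(parsed_url.query)
print(query_params["q"][0])  # -> https://www.apple.com/iphone-12/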

Full working code:

from urllib.parse import urlparse, parse_qs

import scrapy
import pandas as pd
from myproject.items import GoogleScraperItem  # adjust to your project's items module

class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']

    start_urls = [
        f'https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0']

    def parse(self, response):
        titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
        links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        items = []

        for idx in range(len(titles)):
            item = GoogleScraperItem()
            item['title'] = titles[idx]

            # Parse the item url and keep only the q query param (the target url)
            parsed_url = urlparse(links[idx])
            query_params = parse_qs(parsed_url.query)
            item['link'] = query_params["q"][0]

            items.append(item)

        # Write the spreadsheet once, after the loop has collected all items
        df = pd.DataFrame(items, columns=['title', 'link'])
        writer = pd.ExcelWriter('test1.xlsx', engine='xlsxwriter')
        df.to_excel(writer, sheet_name='test1.xlsx')
        writer.save()  # on pandas >= 2.0 this is writer.close()
        return items
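
Assuming the spider file lives inside a Scrapy project (so that the GoogleScraperItem import resolves), it can then be run by name:

scrapy crawl GoogleScrapyBot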
