Python web crawler sometimes returns half of the source code, sometimes all of it... From the same website

Question

I have a spreadsheet of patent numbers that I'm getting extra data for by scraping Google Patents, the USPTO website, and a few others. I mostly have it running, but there's one thing I've been stuck on all day. When I go for the USPTO site and get the source code it will sometimes give me the whole thing and work wonderfully, but other times it only gives me about the second half (and what I'm looking for is in the first).
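One quick way to tell whether the response itself is arriving truncated is to compare the server's advertised Content-Length with the number of bytes actually received. The sketch below is a minimal diagnostic along those lines, not part of the original scraper; the timeout value is an illustrative assumption:

import requests

def looks_truncated(url):
    """Report whether a response body is shorter than the server promised.

    Only meaningful when the server sends a Content-Length header;
    chunked responses won't carry one, so this returns False for them.
    """
    resp = requests.get(url, timeout=30)  # timeout is an illustrative choice
    expected = resp.headers.get("Content-Length")
    if expected is None:
        return False  # no declared length, so we can't tell
    return len(resp.content) < int(expected)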

I've searched around here quite a bit, and I haven't seen anyone with this exact issue. Here's the relevant piece of code (it's got some redundancies since I've been trying to figure this out for a while now, but I'm sure that's the least of its problems):

from bs4 import BeautifulSoup  # the "html5lib" parser below also requires the html5lib package
import csv
import urllib.request
import requests

# Base URLs for Google Patents and for the USPTO full-text search (Query=PN/ searches by patent number)
gpatbase = "https://www.google.com/patents/US"
ptobase = "http://patft.uspto.gov/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/search-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN/"

# Bring in the patent numbers and define the writer we'll use to add the new info we get
# (note: the file is opened read-only here, so any writer.writerow() call would fail;
# results are better written to a separate output file)
with open(r'C:\Users\Filepathblahblahblah\Patent Data\scrapeThese.csv', newline='') as csvfile:
    patreader = csv.reader(csvfile)
    writer = csv.writer(csvfile)

    for row in patreader:
        patnum = row[0]
        print(patnum)

        # Append each patent number to the base URLs to get the actual pages
        gpaturl = gpatbase + patnum
        ptourl = ptobase + patnum

        # Fetch and parse the Google Patents page
        gpatreq = requests.get(gpaturl)
        gpatsource = gpatreq.text
        soup = BeautifulSoup(gpatsource, "html5lib")

        # (Still to do: find the number of academic citations on that patent)

        # From the Google Patents page, find the link labeled USPTO and extract its URL;
        # fall back to the URL built from ptobase if no such link is found
        uspto_link = ptourl
        for tag in soup.find_all("a"):
            if tag.next_element == "USPTO":
                uspto_link = tag.get('href')

        requested = urllib.request.urlopen(uspto_link)
        source = requested.read()

        pto_soup = BeautifulSoup(source, "html5lib")
        print(uspto_link)

        # From the USPTO page, find the examiner's name and save it.
        # prim is initialized before the loop so that a match from an
        # earlier <i> tag isn't overwritten on later iterations.
        prim = "Not found"
        for italics in pto_soup.find_all("i"):
            if italics.next_element == "Primary Examiner:":
                prim = italics.next_element

        if prim != "Not found":
            examiner = prim.next_element
        else:
            examiner = "Not found"

        print(examiner)

As of now, it's about 50-50 on whether I'll get the examiner name or "Not found," and I don't see anything that the members of either group have in common with each other, so I'm all out of ideas.
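Since the failure is intermittent, one pragmatic mitigation is simply to re-request the page until the text being searched for shows up. Below is a minimal sketch of that idea, not code from the original post; the marker string and the cap of five attempts are assumptions:

import requests
from bs4 import BeautifulSoup

def fetch_until_marker(url, marker="Primary Examiner", max_tries=5):
    """Re-request a flaky page until the marker text appears."""
    for attempt in range(max_tries):
        text = requests.get(url, timeout=30).text
        if marker in text:
            return BeautifulSoup(text, "html5lib")
    return None  # page still looked truncated after max_tries attempts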

Answer

I still don't know what's causing the issue, but in case someone has a similar problem, I was able to figure out a workaround: if you send the source code to a text file instead of trying to work with it directly, it won't be cut off. I guess the issue comes after the data is downloaded, but before it's imported into the 'workspace'. Here's the piece of code I wrote into the scraper:

import sys

console_out = sys.stdout  # keep a handle on the real console so it can be restored

if examiner == "Examiner not found":
    # Redirect stdout into a log file and dump the full page source there;
    # written to disk this way, the source doesn't come through cut off
    filename = r'C:\Users\pathblahblahblah\Code and Output\Scraped Source Code\scraper_errors_' + patnum + '.html'
    sys.stdout = open(filename, 'w')
    print(patnum)
    print(pto_soup.prettify())
    sys.stdout.close()
    sys.stdout = console_out

    # Take that logged code and find the examiner name
    sec = "Not found"
    prim = "Not found"
    # Re-open the logged file for parsing (using the same .html filename it was written to)
    scraped_code = open(filename)

    scrapedsoup = BeautifulSoup(scraped_code.read(), 'html5lib')
    # Walk every italics (<i>) tag
    for italics in scrapedsoup.find_all("i"):
        for desc in italics.descendants:
            # Check whether any of them carry the words "Primary Examiner"
            if "Primary Examiner:" in desc:
                prim = desc.next_element.strip()
                #print("Primary found: ", prim)
            # Same for "Assistant Examiner"
            if "Assistant Examiner:" in desc:
                sec = desc.next_element.strip()
                #print("Assistant found: ", sec)

    # If an assistant examiner is listed, use that name;
    # otherwise fall back to the primary examiner
    if sec != "Not found":
        examiner = sec
    elif prim != "Not found":
        examiner = prim
    else:
        examiner = "Examiner not found"
    # Show the new result in the console
    print(examiner)
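As a design note, the same workaround can be written without swapping sys.stdout: open the file directly, write the prettified source, and read it back, which also guarantees the handles get closed. A minimal equivalent sketch, reusing the same placeholder path and the patnum and pto_soup variables from the scraper above:

from bs4 import BeautifulSoup

filename = r'C:\Users\pathblahblahblah\Code and Output\Scraped Source Code\scraper_errors_' + patnum + '.html'

# Dump the page source to disk without touching sys.stdout
with open(filename, 'w', encoding='utf-8') as f:
    f.write(pto_soup.prettify())

# Read it back and parse it as before
with open(filename, encoding='utf-8') as f:
    scrapedsoup = BeautifulSoup(f.read(), 'html5lib')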
