Making a try-except workaround apply to many single-line statements by creating a separate function


Problem description

I am scraping dictionary data from the https://www.dictionary.com/ website. The goal is to remove unwanted elements from the dictionary pages and save the pages offline for further processing. Because the pages are somewhat unstructured, the elements that the code below tries to remove may or may not be present, and a missing element raises an exception (in snippet 2). The actual code removes many such elements, each of which may be present or absent, so wrapping every statement in its own try-except would increase the number of lines drastically.

I am therefore working on a workaround: a separate function that wraps the try-except (in snippet 3), an idea I took from an existing solution. However, I cannot get snippet 3 to work, because a command such as soup.find_all('style') returns None, whereas it should return the list of all style tags, as in snippet 2. I cannot apply the referenced solution directly, because sometimes I have to reach the element to remove indirectly, through its parent or a sibling, as in soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent

Snippet 1 sets up the environment for code execution.

It would be great if you could suggest how to get snippet 3 working.

Snippet 1 (sets up the environment for code execution):

import urllib.request
import requests
from bs4 import BeautifulSoup
import re

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',}

folder = "dictionary_com"

Snippet 2 (working):

def makedefinition(url):
    success = False
    while success==False:
        try:
            request=urllib.request.Request(url,headers=headers)
            final_url = urllib.request.urlopen(request, timeout=5).geturl()
            r = requests.get(final_url, headers=headers, timeout=5)
            success=True
        except:
            success=False

    soup = BeautifulSoup(r.text, 'lxml')

    soup = soup.find("section",{'class':'css-1f2po4u e1hj943x0'})

    # there are many more elements to remove. mentioned only 2 for shortness
    remove = soup.find_all("style") # style tags
    remove.extend(soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent) # related content in the page; raises AttributeError if the h2 is absent

    for x in remove: x.decompose()

    return(soup)

# testing code on multiple urls
#url = "https://www.dictionary.com/browse/a"
#url = "https://www.dictionary.com/browse/a--christmas--carol"
#url = "https://www.dictionary.com/brdivowse/affection"
#url = "https://www.dictionary.com/browse/hot"
#url = "https://www.dictionary.com/browse/move--on"
url = "https://www.dictionary.com/browse/cuckold"
#url = "https://www.dictionary.com/browse/fear"
maggi = makedefinition(url)

with open(folder+"/demo.html", "w") as file:
    file.write(str(maggi))

Snippet 3 (not working):

soup = None

def safe_execute(command):
    global soup
    try:
        print(soup) # correct soup is printed
        print(exec(command)) # this should print the list of style tags but printing None, and for related content this should throw some exception
        return exec(command) # None is being returned for style
    except Exception:
        print(Exception.with_traceback())
        return []

def makedefinition(url):
    global soup
    success = False
    while success==False:
        try:
            request=urllib.request.Request(url,headers=headers)
            final_url = urllib.request.urlopen(request, timeout=5).geturl()
            r = requests.get(final_url, headers=headers, timeout=5)
            success=True
        except:
            success=False

    soup = BeautifulSoup(r.text, 'lxml')

    soup = soup.find("section",{'class':'css-1f2po4u e1hj943x0'})

    # there are many more elements to remove. mentioned only 2 for shortness
    remove = safe_execute("soup.find_all('style')") # style tags
    remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent")) # related content in the page

    for x in remove: x.decompose()

    return(soup)

# testing code on multiple urls
#url = "https://www.dictionary.com/browse/a"
#url = "https://www.dictionary.com/browse/a--christmas--carol"
#url = "https://www.dictionary.com/brdivowse/affection"
#url = "https://www.dictionary.com/browse/hot"
#url = "https://www.dictionary.com/browse/move--on"
url = "https://www.dictionary.com/browse/cuckold"
#url = "https://www.dictionary.com/browse/fear"
maggi = makedefinition(url)

with open(folder+"/demo.html", "w") as file:
    file.write(str(maggi))

Recommended answer

Snippet 3 uses the exec built-in function, which returns None regardless of what it does with its argument. For details, see the related SO thread.
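A minimal sketch of that difference, assuming a plain list in place of the soup object (the name `numbers` is hypothetical and used only for illustration):

```python
# Stand-in data; `numbers` is a hypothetical name used only for this illustration.
numbers = [3, 1, 2]

# exec() runs the code but always returns None.
exec_result = exec("sorted(numbers)")

# eval() evaluates a single expression and returns its value.
eval_result = eval("sorted(numbers)")

print(exec_result)  # None
print(eval_result)  # [1, 2, 3]
```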

Remedy:

Use exec to assign the result to a variable and return that variable, instead of returning the output of exec itself.

import traceback

def safe_execute(command):
    d = {}
    try:
        exec(command, d)
        return d['output']
    except Exception:
        traceback.print_exc()  # print the traceback of the caught exception
        return []

Then call it like this:

remove = safe_execute("output = soup.find_all('style')")

Upon execution of this code, the expected list is again not returned. While debugging, print(soup) inside the try block prints the correct soup value, but exec(command, d) raises NameError: name 'soup' is not defined. This happens because exec is given d as its global namespace, and d does not contain soup.
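One way to see the namespace issue in isolation is to hand the needed names to exec explicitly. The sketch below is not the fix used in this thread; the `namespace` parameter and the `items` list (standing in for soup) are assumptions made for illustration:

```python
def safe_execute(command, namespace):
    """Run `command` with exec() and return whatever it binds to `output`.

    `namespace` is a hypothetical extra parameter: a dict of the names
    the command is allowed to see (for example {"soup": soup}).
    """
    scope = dict(namespace)  # copy so the caller's dict is not modified
    try:
        exec(command, scope)
        return scope["output"]
    except Exception:
        return []


# A plain list stands in for the soup object here.
items = ["style-1", "style-2"]

# With an empty namespace, `items` is unknown inside exec and the NameError is caught.
print(safe_execute("output = list(reversed(items))", {}))                # []

# With `items` supplied, the command can see it.
print(safe_execute("output = list(reversed(items))", {"items": items}))  # ['style-2', 'style-1']
```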

This discrepancy was overcome by using eval() instead of exec(). The function is defined as:

def safe_execute(command):
    global soup
    try:
        output = eval(command)
        return(output)
    except Exception:
        return []

The calls look like:

remove = safe_execute("soup.find_all('style')")
remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent"))
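As an alternative to passing code as a string, the expression can be deferred with a lambda, so neither eval() nor a global soup is needed. This is a different technique from the one adopted above, shown only as a sketch; `safe_call` and `FakeTag` are hypothetical names, and `FakeTag` merely stands in for a parsed element:

```python
def safe_call(func):
    """Call `func()` and return its result, or an empty list if it raises."""
    try:
        return func()
    except Exception:
        return []


class FakeTag:
    """Tiny stand-in for a parsed element; not part of BeautifulSoup."""

    def __init__(self, parent=None):
        self.parent = parent


present = FakeTag(parent="wrapper")
missing = None  # what a failed lookup would return

print(safe_call(lambda: [present.parent]))  # ['wrapper']
print(safe_call(lambda: [missing.parent]))  # [] (AttributeError caught)
```

With the real soup object, a call could then be written as safe_call(lambda: soup.find_all('style')), keeping each removal on a single line.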
