Error extracting text from website: AttributeError 'NoneType' object has no attribute 'get_text'


Question

I am scraping this website and getting the "title" and "category" as text using .get_text().strip().

I have a problem using the same approach to extract the "author" as text.

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

data2 = {
    'url' : [],
    'title' : [],
    'category': [],
    'author': [],
} 

url_pattern = "https://www.nature.com/nature/articles?searchType=journalSearch&sort=PubDate&year=2018&page={}"
count_min = 1
count_max = 3

while count_min <= count_max: 
    print (count_min)
    url = url_pattern.format(count_min)
    r = requests.get(url)
    try: 
        soup = BeautifulSoup(r.content, 'lxml')
        for links in soup.find_all('article'):
            data2['url'].append(links.a.attrs['href']) 
            data2['title'].append(links.h3.get_text().strip())
            data2["category"].append(links.span.get_text().strip()) 
            data2["author"].append(links.find('span', {"itemprop": "name"}).get_text().strip()) #??????

    except Exception as exc:
        print(exc.__class__.__name__, exc)

    time.sleep(0.1)
    count_min = count_min + 1

print ("Fertig.")
df = pd.DataFrame( data2 )
df

df is supposed to display a table with the columns "author", "category", "title", and "url". The printed exception gives me the following hint: AttributeError 'NoneType' object has no attribute 'get_text'. Instead of the table, though, I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-34-9bfb92af1135> in <module>()
     29 
     30 print ("Fertig.")
---> 31 df = pd.DataFrame( data2 )
     32 df

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    328                                  dtype=dtype, copy=copy)
    329         elif isinstance(data, dict):
--> 330             mgr = self._init_dict(data, index, columns, dtype=dtype)
    331         elif isinstance(data, ma.MaskedArray):
    332             import numpy.ma.mrecords as mrecords

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
    459             arrays = [data[k] for k in keys]
    460 
--> 461         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    462 
    463     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   6161     # figure out the index, if necessary
   6162     if index is None:
-> 6163         index = extract_index(arrays)
   6164     else:
   6165         index = _ensure_index(index)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in extract_index(data)
   6209             lengths = list(set(raw_lengths))
   6210             if len(lengths) > 1:
-> 6211                 raise ValueError('arrays must all be same length')
   6212 
   6213             if have_dicts:

ValueError: arrays must all be same length 

How can I improve my code so that the "author" names are extracted?

Recommended answer

You're very close. There are a couple of things I'd recommend. First, take a closer look at the HTML: the author names are actually in a ul, where each li contains a span whose itemprop is 'name'. However, not all articles have author names. For those articles, with your current code, the call to links.find('span', {'itemprop': 'name'}) returns None, and None, of course, has no attribute get_text. That line therefore raises an error. By then the url, title, and category have already been appended, and the exception ends the loop for that page, so the data2 'author' list ends up shorter than the others. That is why pd.DataFrame complains that the arrays must all be the same length. I'd recommend storing the author(s) in a list like so:

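authors = []
ul = links.find('ul', itemprop='creator')
if ul is not None:  # some articles have no author list at all
    for author in ul.find_all('span', itemprop='name'):
        authors.append(author.get_text().strip())
# Use the existing 'author' key so every column grows by one entry per article
data2['author'].append(authors)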

This handles articles with no authors the way we would expect: "authors" is simply an empty list.
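To try the None guard without hitting the network, here is a minimal sketch that runs against a small hand-written HTML fragment. The markup is an assumption modeled on the structure described above, not a copy of the real page, and extract_authors is a hypothetical helper name. It uses Python's built-in html.parser, so lxml is not required.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the listing page (an assumption, not the real markup).
SAMPLE_HTML = """
<article>
  <h3><a href="/articles/a1">First title</a></h3>
  <ul itemprop="creator">
    <li><span itemprop="name">Ada Example</span></li>
    <li><span itemprop="name">Bo Sample</span></li>
  </ul>
</article>
<article>
  <h3><a href="/articles/a2">Second title</a></h3>
</article>
"""


def extract_authors(article):
    """Return the author names of one <article>, or [] if none are listed."""
    ul = article.find('ul', itemprop='creator')
    if ul is None:  # find() returns None when nothing matches
        return []
    return [span.get_text().strip() for span in ul.find_all('span', itemprop='name')]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Build one complete row per article so every column ends up the same length.
rows = []
for article in soup.find_all('article'):
    rows.append({
        'url': article.a.attrs['href'],
        'title': article.h3.get_text().strip(),
        'author': extract_authors(article),
    })

for row in rows:
    print(row)
```

A list of dicts like rows can be passed straight to pd.DataFrame(rows). Collecting one row per article, rather than appending to four separate lists, means a missing field cannot leave the columns out of step.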

As a side note, putting your code inside a

try:
    ...
except:
    pass

construct is generally considered poor practice, for exactly the reason you're seeing now. Silently ignoring errors can make your program appear to run properly while any number of things are going wrong. At the very least, it's rarely a bad idea to print error information to stdout. Even something like this is better than nothing:

try:
    ...
except Exception as exc:
    print(exc.__class__.__name__, exc)

For debugging, however, it is often useful to have the full traceback as well. For this you can use the traceback module:

import traceback

try:
    ...
except Exception:
    # Print the full stack trace of the exception being handled.
    traceback.print_exc()
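Putting these pieces together, one option is to move the try/except inside the loop, so that a single bad record is reported and skipped rather than aborting the rest of the page. The following is a minimal standard-library sketch. parse_record and the sample records are hypothetical stand-ins for the per-article scraping code, not part of the original question.

```python
import traceback

# Hypothetical stand-ins for scraped articles; the second one has no author.
records = [
    {'title': ' First ', 'author': ' Ada Example '},
    {'title': ' Second ', 'author': None},
    {'title': ' Third ', 'author': ' Bo Sample '},
]


def parse_record(record):
    """Clean one record; raises AttributeError if a field is None."""
    return {
        'title': record['title'].strip(),
        'author': record['author'].strip(),
    }


rows = []
failures = []
for index, record in enumerate(records):
    try:
        rows.append(parse_record(record))
    except Exception as exc:
        # Keep the short summary and the full traceback for later inspection.
        print(exc.__class__.__name__, exc)
        failures.append((index, traceback.format_exc()))

print(len(rows), "parsed,", len(failures), "failed")
```

Because each row is built in full before it is appended, a failure never leaves the columns with different lengths, and the saved tracebacks show exactly which records need attention.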
