Python - BeautifulSoup - how to check if ResultSet contains an element
Problem description
I am doing some web scraping and have run into something I can't figure out. Basically, I need to check whether the first element (index 0) of my ResultSet, releaseDate, has a 'content' attribute, as in
[<meta content="1992-09-11" itemprop="datePublished"/>]
But when 'content' is not in the tag, I get an error like
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "imdbQuestion.py", line 18, in <module>
    if releaseDate[0]['content']:
  File "build/bdist.macosx-10.8-intel/egg/bs4/element.py", line 879, in __getitem__
KeyError: 'content'
How can I check if 'content' is in releaseDate without causing an error?
Additionally, how can I extract whatever I want out of ResultSet objects?
The full code is:
    import codecs
    import requests
    from bs4 import BeautifulSoup

    file = codecs.open('imdb.txt', 'w', encoding='utf-8')

    # iterate through last value
    for increment in range(7, 10):
        imdbNum = '015008' + str(increment)
        url = 'http://www.imdb.com/title/tt' + imdbNum
        urlCode = requests.get(url)
        soup = BeautifulSoup(urlCode.content)

        # get release date
        releaseDate = soup.findAll(attrs={'itemprop': 'datePublished'})
        abc = releaseDate
        # error checking - assign '.' to releaseDate if releaseDate[0] is blank
        # if not blank, check if 'content' is in releaseDate[0]. if so, we are good. if not, assign 'CHECK' to releaseDate[0]
        if releaseDate:
            if releaseDate[0]['content']:
                releaseDate = releaseDate[0]['content']
            else:
                releaseDate = 'CHECK'
        else:
            releaseDate = '.'

        print(releaseDate)

    file.close()
Test against the Tag.attrs dictionary:
    if releaseDate:
        if 'content' in releaseDate[0].attrs:
            releaseDate = releaseDate[0]['content']
        else:
            releaseDate = 'CHECK'
or use the dict.get() method on that attribute:
    if releaseDate:
        releaseDate = releaseDate[0].attrs.get('content', 'CHECK')
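To try this check without hitting the network, here is a minimal sketch that parses small inline HTML strings instead of an IMDb page. The HTML snippets and the helper name release_date_from are made up for illustration. It uses html.parser (Python's built-in parser) and find_all, the current bs4 spelling of findAll.

```python
from bs4 import BeautifulSoup


def release_date_from(html):
    """Hypothetical helper: return the 'content' value of the first
    datePublished tag, 'CHECK' if that tag has no 'content' attribute,
    or '.' if no matching tag exists."""
    soup = BeautifulSoup(html, 'html.parser')
    matches = soup.find_all(attrs={'itemprop': 'datePublished'})
    if not matches:
        # an empty ResultSet is falsy, just like an empty list
        return '.'
    # Tag.attrs is a dictionary, so .get() returns the default
    # instead of raising KeyError when the key is missing
    return matches[0].attrs.get('content', 'CHECK')


# Illustrative inputs (not taken from a real page)
with_content = '<meta content="1992-09-11" itemprop="datePublished"/>'
without_content = '<span itemprop="datePublished">11 September 1992</span>'
no_match = '<p>nothing here</p>'

print(release_date_from(with_content))
print(release_date_from(without_content))
print(release_date_from(no_match))
```

The three calls cover the three branches of the original code: a tag with 'content', a matching tag without it, and no matching tag at all.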
Quick demo:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> imdbNum = '0150087'
>>> url = 'http://www.imdb.com/title/tt' + imdbNum
>>> urlCode = requests.get(url)
>>> soup = BeautifulSoup(urlCode.content)
>>> releaseDate = soup.findAll(attrs={'itemprop':'datePublished'})
>>> releaseDate[0]
<meta content="1966-04" itemprop="datePublished"/>
>>> releaseDate[0].attrs
{'content': '1966-04', 'itemprop': 'datePublished'}
>>> 'content' in releaseDate[0].attrs
True
>>> releaseDate[0].attrs.get('content', 'CHECK')
'1966-04'
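As for pulling values out of a ResultSet in general: it behaves like a list of Tag objects, so you can loop over it and read whatever you need from each tag. The sketch below is one possible way to do that, again on made-up inline HTML rather than a live page. Tag.get() returns None when an attribute is absent, and get_text() returns the text inside the tag.

```python
from bs4 import BeautifulSoup

# Illustrative HTML (an assumption, not a real IMDb page)
html = """
<meta content="1966-04" itemprop="datePublished"/>
<span itemprop="datePublished">April 1966</span>
"""

soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all(attrs={'itemprop': 'datePublished'})

# Iterate over the ResultSet like a list and collect fields from each Tag
extracted = []
for tag in results:
    extracted.append({
        'name': tag.name,                   # tag name, e.g. 'meta'
        'content': tag.get('content'),      # None when the attribute is absent
        'text': tag.get_text(strip=True),   # text inside the tag, '' if none
    })

for row in extracted:
    print(row)
```

Collecting plain dictionaries like this keeps the extraction step separate from whatever you do next, such as writing the values to a file.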