Webscraping in python: BS, selenium, and None error
Question
I wanted to use Python web scraping to feed an ML application I wrote that makes a summary of summaries, to ease my daily research work.
I seem to be running into some difficulties: while I have followed a lot of suggestions from the web, such as this one:
Python Selenium accessing HTML source
I keep getting AttributeError: 'NoneType' object has no attribute 'page_source' (or 'content', depending on the attempt and the module used).
I need this source to feed Beautiful Soup, which scrapes the page for my ML script.
My first attempt was with requests:
from bs4 import BeautifulSoup as BS
import requests
import time
import datetime
print ('start!')
print(datetime.datetime.now())
page="http://www.genecards.org/cgi-bin/carddisp.pl?gene=COL1A1&keywords=COL1A1"
This is my target page. I usually do about 20 requests a day, so it's not as if I wanted to vampirize the website, and since I need them all at the same moment I wanted to automate the retrieval task: the longest part is getting the URL, loading it, and copying and pasting the summaries. I am also reasonable in that I respect some delay before loading another page. I tried posing as a regular browser, since the site doesn't like robots (its robots.txt disallows /ProductRedirect and something with a number I could not find on Google?).
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
current_page = requests.get(page, headers=headers)
print(current_page)
print(current_page.content)
soup=BS(current_page.content,"lxml")
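As an aside, the parsing step itself can be checked offline by feeding Beautiful Soup a literal HTML string, independent of whatever the site returns (a minimal sketch; the markup below is invented for illustration, and html.parser is used so nothing beyond bs4 is required):

```python
from bs4 import BeautifulSoup as BS

# A stand-in for the page content, so the soup logic can be verified
# without any network access (this markup is made up for the demo).
html = '<html><body><section id="_summaries"><p>demo summary</p></section></body></html>'

# "lxml" (as in the question) works the same if it is installed.
soup = BS(html, "html.parser")

section = soup.find('section', attrs={'id': '_summaries'})
print(section.get_text())  # → demo summary
```

If this prints correctly but the real page does not, the problem is in fetching the page, not in the soup code.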
I always end up getting no content, while requests gets a 200 code and I can load this page myself in Firefox. So I tried Selenium:
from bs4 import BeautifulSoup as BS
from selenium import webdriver
import time
import datetime
print ('start!')
print(datetime.datetime.now())
browser = webdriver.Firefox()
current_page = browser.get(page)
time.sleep(10)
This works and loads the page. I added the delay to be sure not to spam the host and to be sure the page fully loads. But then neither:
html=current_page.content
nor
html=current_page.page_source
nor
html=current_page
works as input to:
soup=BS(html,"lxml")
It always ends up saying that it doesn't have the page_source attribute (while it should, since the page loads correctly in the Selenium-invoked browser window).
I don't know what to try next. It's as if the User-Agent header was not working for requests, and it is very strange that the page returned by Selenium has no source.
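For what it's worth, the first Selenium error comes from the assignment itself: webdriver's get() navigates the browser but returns None, so current_page ends up holding None rather than the driver. A toy stand-in class (no Selenium needed; the class is invented purely to reproduce the pattern) shows it:

```python
class FakeDriver:
    """Toy stand-in mimicking the relevant shape of a selenium webdriver."""
    page_source = "<html>loaded</html>"

    def get(self, url):
        # Like selenium's WebDriver.get(): navigates, returns nothing.
        return None

browser = FakeDriver()
current_page = browser.get("http://example.com")
print(current_page)         # → None  (hence the AttributeError later)
print(browser.page_source)  # the HTML lives on the driver object itself
```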
What could I try next? Thanks.
Note that I also tried:
browser.get(page)
time.sleep(8)
print(browser)
print(browser.page_source)
html=browser.page_source
soup=BS(html,"lxml")
for summary in soup.find('section', attrs={'id':'_summaries'}):
    print(summary)
but while it can get the source, it just fails at the BS stage with: "AttributeError: 'NoneType' object has no attribute 'find'".
Answer
The problem is that you are trying to iterate over the result of .find(). Instead, you need .find_all():
for summary in soup.find_all('section', attrs={'id':'_summaries'}):
    print(summary)
或者,如果只有一个元素,不要使用循环:
Or, if there is a single element, don't use a loop:
summary = soup.find('section', attrs={'id':'_summaries'})
print(summary)
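On a small made-up document, the difference is easy to see: .find() returns a single Tag (or None when nothing matches), while .find_all() returns a list that is always safe to loop over (a minimal sketch; the markup is invented, and html.parser keeps it self-contained):

```python
from bs4 import BeautifulSoup as BS

# Invented markup with one matching and one non-matching section.
html = ('<section id="_summaries"><p>one</p></section>'
        '<section class="other"><p>two</p></section>')
soup = BS(html, "html.parser")

# find() -> first matching Tag, or None if nothing matches
single = soup.find('section', attrs={'id': '_summaries'})
print(type(single).__name__)  # → Tag

# find_all() -> a list of matches (possibly empty), safe to iterate
matches = soup.find_all('section', attrs={'id': '_summaries'})
print(len(matches))           # → 1

# A failed find() returns None, which is what triggers the
# "'NoneType' object has no attribute ..." errors in the question.
missing = soup.find('section', attrs={'id': 'nope'})
print(missing)                # → None
```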