Webscraping in Python: BS, Selenium, and None error


Question


I wanted to use Python web scraping to feed an ML application I wrote that produces a summary of summaries, to ease my daily research work. I seem to have run into some difficulties: although I have tried a lot of suggestions from the web, such as this one:
Python Selenium accessing HTML source, I keep getting AttributeError: 'NoneType' object has no attribute 'page_source' (or 'content', depending on the attempt and the module used). I need this source to feed Beautiful Soup, which scrapes the page and finds the input for my ML script. My first attempt was to use requests:

from bs4 import BeautifulSoup as BS
import requests
import time
import datetime
print ('start!')
print(datetime.datetime.now())

page="http://www.genecards.org/cgi-bin/carddisp.pl?gene=COL1A1&keywords=COL1A1"


This is my target page. I usually do about 20 requests a day, so it's not like I want to vampirize the website, and since I need the summaries at the same moment, I wanted to automate the retrieval task: the longest part is getting the URL, loading it, and copying and pasting the summaries. I am also reasonable in that I respect some delays before loading another page. I tried passing as a regular browser, since the site doesn't like robots (its robots.txt disallows /ProductRedirect and something with a number I could not find on Google?):

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
current_page = requests.get(page,  headers=headers)
print(current_page)
print(current_page.content)
soup=BS(current_page.content,"lxml")


I always end up getting no content, while requests gets code 200 and I can load this page myself in Firefox. So I tried with Selenium:

from bs4 import BeautifulSoup as BS
from selenium import webdriver
import time
import datetime
print ('start!')
print(datetime.datetime.now())

browser = webdriver.Firefox()
current_page =browser.get(page)
time.sleep(10)


This works and loads a page. I added the delay to be sure not to spam the host and to be sure the page fully loads. Then neither:

html=current_page.content

nor

html=current_page.page_source

nor

html=current_page

works as input to:

soup=BS(html,"lxml")


It always ends up saying that it doesn't have the page_source attribute (while it should, since the page loads correctly in the Selenium-invoked browser window).


I don't know what to try next. It's as if the user-agent header was not working for requests, and it is very strange that the page returned by Selenium has no source.


What could I try next? Thanks.

Note that I also tried:

browser.get(page)
time.sleep(8)
print(browser)
print(browser.page_source)
html=browser.page_source
soup=BS(html,"lxml")
for summary in soup.find('section', attrs={'id':'_summaries'}):
    print(summary)


But while this can get the source, it just fails at the BS stage with: "AttributeError: 'NoneType' object has no attribute 'find'"

Answer


The problem is that you are trying to iterate over the result of .find(). Instead you need .find_all():

for summary in soup.find_all('section', attrs={'id':'_summaries'}):
    print(summary)


Or, if there is a single element, don't use a loop:

summary = soup.find('section', attrs={'id':'_summaries'})
print(summary)
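To illustrate the difference, here is a minimal, self-contained sketch using a tiny stand-in HTML fragment rather than the live GeneCards page (in the real case, html would come from browser.page_source):

```python
from bs4 import BeautifulSoup as BS

# Stand-in document; the real one would be browser.page_source
html = """
<html><body>
  <section id="_summaries"><p>Summary text</p></section>
</body></html>
"""
soup = BS(html, "html.parser")  # or "lxml" if installed

# .find() returns a single Tag, or None when nothing matches,
# so guard against None before calling methods on the result
section = soup.find('section', attrs={'id': '_summaries'})
if section is not None:
    print(section.get_text(strip=True))  # Summary text

# .find_all() always returns a list, safe to iterate even when empty
for section in soup.find_all('section', attrs={'id': '_summaries'}):
    print(section.get_text(strip=True))  # Summary text
```

The None check also covers the case where the section is injected by JavaScript after page load and is therefore absent from the fetched HTML, which would otherwise raise exactly the AttributeError seen in the question.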

