Beautifulsoup doesn't reach a child element
Problem description
I wrote the following code trying to scrape a google scholar page
import requests as req
from bs4 import BeautifulSoup as soup
url = r'https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections'
session = req.Session()
content = session.get(url)
html2bs = soup(content.content, 'lxml')
gs_cit = html2bs.select('#gs_cit')
gs_citd = html2bs.find('div', {'id':"gs_citd"})
gs_cit1 = html2bs.find('div', {'id':"gs_cit1"})
but gs_citd gives me only this line, <div aria-live="assertive" id="gs_citd"></div>, and doesn't reach any level beneath it. Also, gs_cit1 returns None.
As shown in this image, I want to reach the highlighted class to be able to grab the BibTeX citation.
Please help!
Recommended answer
Ok, so I figured it out. I used the selenium module for Python, which creates a virtual browser, if you will, that lets you perform actions like clicking links and getting the resulting HTML. There was another issue I ran into while solving this: the page had to finish loading, otherwise the pop-up div just contained "Loading...", so I used the Python time module to time.sleep(2) for 2 seconds, which allowed the content to load in. Then I parsed the resulting HTML output using BeautifulSoup to find the anchor tag with the class "gs_citi", pulled the href from that anchor, and requested it with the "requests" Python module. Finally, I wrote the decoded response to a local file - scholar.bib.
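To see why the original requests-only attempt failed: the static HTML already contains the gs_citd div, but with no children, because Google Scholar fills the pop-up in with JavaScript only after "Cite" is clicked. A minimal sketch of this, using a hardcoded excerpt (as reported in the question) rather than a live request:

```python
from bs4 import BeautifulSoup

# Simplified excerpt of the static response Google Scholar returns,
# based on what the question reports seeing.
static_html = '<body><div aria-live="assertive" id="gs_citd"></div></body>'

page = BeautifulSoup(static_html, 'html.parser')
gs_citd = page.find('div', {'id': 'gs_citd'})

print(gs_citd)                 # the div itself is present in the static HTML...
print(list(gs_citd.children))  # ...but it has no children: []
```

Any parser working on the static response, BeautifulSoup included, will see the same empty div; only a real browser (or something that executes the page's JavaScript, like selenium) ever sees the citation links inside it.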
I installed chromedriver and selenium on my Mac using these instructions here: https://gist.github.com/guylaor/3eb9e7ff2ac91b7559625262b8a6dd5f
Then I signed my Python file to stop firewall issues, using these instructions: Add Python to OS X Firewall Options?
The following is the code I used to produce the output file "scholar.bib":
import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import requests as req
# Setup Selenium Chrome Web Driver
chromedriver = "/usr/local/bin/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
# Navigate in Chrome to specified page.
driver.get("https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections")
# Find "Cite" links by looking for anchors that contain "Cite"; select the second match with [1]
link = driver.find_elements_by_xpath('//a[contains(text(), "Cite")]')[1]
# Click the link
link.click()
print("Waiting for page to load...")
time.sleep(2) # Sleep for 2 seconds
# Get Page source after waiting for 2 seconds of current page in Chrome
source = driver.page_source
# We are done with the driver so quit.
driver.quit()
# Use BeautifulSoup to parse the html source and use "html.parser" as the Parser
soupify = soup(source, 'html.parser')
# Find anchors with the class "gs_citi"
gs_citt = soupify.find('a',{"class":"gs_citi"})
# Get the href attribute of the first anchor found
href = gs_citt['href']
print("Fetching: ", href)
# Instantiate a new requests session
session = req.Session()
# Get the response object of href
content = session.get(href)
# Get the content and then decode() it.
bibtex_html = content.content.decode()
# Write the decoded data to a file named scholar.bib
with open("scholar.bib", "w") as file:
    file.writelines(bibtex_html)
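A fixed time.sleep(2) works, but it is fragile: on a slow connection the pop-up may still say "Loading..." after 2 seconds. The waiting idea can be sketched generically as a poll loop (this helper and the simulated responses are illustrative, not part of the original code; with selenium itself, WebDriverWait plays the same role):

```python
import time

def wait_for_content(get_text, timeout=10.0, interval=0.5):
    """Poll get_text() until it returns something other than the
    'Loading...' placeholder, or raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        text = get_text()
        if text and text != "Loading...":
            return text
        time.sleep(interval)
    raise TimeoutError("pop-up never finished loading")

# Simulated page: returns the placeholder twice, then the real content.
responses = iter(["Loading...", "Loading...", '<a class="gs_citi">BibTeX</a>'])
print(wait_for_content(lambda: next(responses), interval=0.01))
```

With a real driver, get_text would be something like `lambda: driver.find_element_by_id("gs_citd").text`, so the script proceeds as soon as the pop-up is populated instead of always paying the full 2 seconds.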
Hope this helps anyone looking for a solution to this out.
Scholar.bib file:
@article{arrow2013sustainability,
title={Sustainability and the measurement of wealth: further reflections},
author={Arrow, Kenneth J and Dasgupta, Partha and Goulder, Lawrence H and Mumford, Kevin J and Oleson, Kirsten},
journal={Environment and Development Economics},
volume={18},
number={4},
pages={504--516},
year={2013},
publisher={Cambridge University Press}
}
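Once the entry is on disk, its fields can be pulled out with the standard library alone. A minimal sketch, using a trimmed copy of the entry above (the naive regex assumes no nested braces inside values, which holds for this entry but not for all BibTeX):

```python
import re

# Trimmed copy of the downloaded scholar.bib entry shown above.
bibtex = """@article{arrow2013sustainability,
  title={Sustainability and the measurement of wealth: further reflections},
  year={2013},
}"""

# Grab simple `key={value}` pairs; values with nested braces would need
# a real BibTeX parser instead of this pattern.
fields = dict(re.findall(r'(\w+)\s*=\s*\{([^{}]*)\}', bibtex))
print(fields["title"])
print(fields["year"])
```

For anything beyond quick extraction, a dedicated BibTeX library would be the safer choice.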