Python web scraping for JavaScript-generated content


Question

I am trying to use Python 3 to retrieve the BibTeX citation generated by http://www.doi2bib.org/. The URLs are predictable, so the script can work out the URL without having to interact with the web page. I have tried selenium, bs4, etc., but can't get the text inside the box.

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9"
# Name the parser explicitly to avoid bs4's "no parser specified" warning
text = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
print(text)

Can anyone suggest a way of returning the BibTeX citation as a string (or in any other form) in Python?

Recommended answer

You don't need BeautifulSoup here. The page sends an additional XHR request to the server to fill in the BibTeX citation. Simulate that request, for example with requests:

import requests

bibtex_id = '10.1007/s00425-007-0544-9'

url = "http://www.doi2bib.org/#/doi/{id}".format(id=bibtex_id)
xhr_url = 'http://www.doi2bib.org/doi2bib'

with requests.Session() as session:
    session.get(url)

    response = session.get(xhr_url, params={'id': bibtex_id})
    # .text is the decoded str; .content would print a bytes literal in Python 3
    print(response.text)

Prints:

@article{Burgert_2007,
    doi = {10.1007/s00425-007-0544-9},
    url = {http://dx.doi.org/10.1007/s00425-007-0544-9},
    year = 2007,
    month = {jun},
    publisher = {Springer Science $\mathplus$ Business Media},
    volume = {226},
    number = {4},
    pages = {981--987},
    author = {Ingo Burgert and Michaela Eder and Notburga Gierlinger and Peter Fratzl},
    title = {Tensile and compressive stresses in tracheids are induced by swelling based on geometrical constraints of the wood cell},
    journal = {Planta}
}
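The reason the plain urllib attempt never sees the citation is that everything after the # in the page URL is a fragment, which the browser keeps to itself and never sends to the server. The following is a minimal standard-library sketch of that point, plus the same background request built without requests. The helper names (doi_from_page_url, xhr_url_for, fetch_bibtex) are hypothetical, the endpoint is the one quoted in the answer and may have changed since, and UTF-8 decoding of the response is an assumption.

```python
import urllib.parse
import urllib.request

PAGE_URL = "http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9"
XHR_URL = "http://www.doi2bib.org/doi2bib"  # endpoint quoted in the answer; may be outdated


def doi_from_page_url(page_url):
    """Extract the DOI from the '#/doi/<doi>' fragment of a page URL."""
    fragment = urllib.parse.urlsplit(page_url).fragment
    prefix = "/doi/"
    if not fragment.startswith(prefix):
        raise ValueError("no DOI fragment in %r" % page_url)
    return fragment[len(prefix):]


def xhr_url_for(doi):
    """Build the URL of the background request for one DOI."""
    return XHR_URL + "?" + urllib.parse.urlencode({"id": doi})


def fetch_bibtex(doi, opener=urllib.request.urlopen):
    """Return the BibTeX entry as a str.

    `opener` is injectable so the function can be exercised offline.
    Decoding as UTF-8 is an assumption about the server's response.
    """
    with opener(xhr_url_for(doi)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    parts = urllib.parse.urlsplit(PAGE_URL)
    # Only the path reaches the server; the fragment stays in the browser.
    print(parts.path)      # /
    print(parts.fragment)  # /doi/10.1007/s00425-007-0544-9
    print(xhr_url_for(doi_from_page_url(PAGE_URL)))
```

Calling fetch_bibtex('10.1007/s00425-007-0544-9') would then perform the actual network request, provided the endpoint still responds the way the answer describes.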






You can also solve it with selenium. The key trick here is to use an explicit wait with visibility_of_element_located (http://selenium-python.readthedocs.org/en/latest/api.html#selenium.webdriver.support.expected_conditions.visibility_of_element_located) to wait for the citation to become visible:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

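driver = webdriver.Firefox()
try:
    driver.get('http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9')

    # Wait up to 10 seconds for the citation block to become visible
    element = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.XPATH, '//pre[@ng-show="bib"]'))
    )
    print(element.text)
finally:
    # quit() ends the whole browser session; close() only closes the current window
    driver.quit()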

This prints the same output as the solution above.
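For readers unfamiliar with explicit waits, the idea is simple polling: the condition is checked repeatedly (WebDriverWait does this every 0.5 seconds by default) until it returns a truthy value or the timeout expires. Below is a simplified pure-Python stand-in for that loop. It is not Selenium's implementation, which among other things ignores NoSuchElementException while polling; the names wait_until and WaitTimeout are hypothetical.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Stand-in for a page whose citation only appears on the third check.
    attempts = []

    def citation_ready():
        attempts.append(1)
        return "@article{...}" if len(attempts) >= 3 else None

    print(wait_until(citation_ready, timeout=2.0, poll_interval=0.01))
```

This is why the Selenium version succeeds where a single immediate lookup would not: the element exists only after the page's JavaScript has finished its background request.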
