Python web scraping for JavaScript-generated content

Views: 206 | Published: 2019/4/24 13:50:03 | Tags: javascript, python, web-scraping, scrape


Question

I am trying to use Python 3 to return the BibTeX citation generated by http://www.doi2bib.org/. The URLs are predictable, so the script can work out the URL without having to interact with the web page. I have tried using selenium, bs4, etc., but can't get the text inside the box.

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9"
# Name the parser explicitly; this attempt still does not return the citation text.
text = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
print(text)

Can anyone suggest a way of returning the BibTeX citation as a string (or whatever) in Python?

Recommended answer

You don't need BeautifulSoup here. The page sends an additional XHR request to the server to fill in the BibTeX citation. You can simulate that request, for example, with requests:

import requests

bibtex_id = '10.1007/s00425-007-0544-9'

url = "http://www.doi2bib.org/#/doi/{id}".format(id=bibtex_id)
xhr_url = 'http://www.doi2bib.org/doi2bib'

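with requests.Session() as session:
    # Load the page first, as a browser would, then repeat the XHR call.
    session.get(url)

    response = session.get(xhr_url, params={'id': bibtex_id})
    # response.text is the decoded string; response.content would print bytes in Python 3.
    print(response.text)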

Prints:

@article{Burgert_2007,
    doi = {10.1007/s00425-007-0544-9},
    url = {http://dx.doi.org/10.1007/s00425-007-0544-9},
    year = 2007,
    month = {jun},
    publisher = {Springer Science $\mathplus$ Business Media},
    volume = {226},
    number = {4},
    pages = {981--987},
    author = {Ingo Burgert and Michaela Eder and Notburga Gierlinger and Peter Fratzl},
    title = {Tensile and compressive stresses in tracheids are induced by swelling based on geometrical constraints of the wood cell},
    journal = {Planta}
}
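One likely reason the plain urllib attempt in the question never sees the citation is that everything after the `#` in the page URL is a fragment, which stays in the browser and is not part of the HTTP request. The sketch below uses only the standard library to show what the server is actually asked for, and how the DOI in the fragment maps onto the XHR endpoint used in the answer above. The variable names are illustrative, and whether the endpoint still responds in this form is an assumption.

```python
from urllib.parse import urlsplit, urlencode

page_url = "http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9"
parts = urlsplit(page_url)

# The path is what the server is asked for; the fragment is handled client-side.
print(parts.path)      # /
print(parts.fragment)  # /doi/10.1007/s00425-007-0544-9

# Recover the DOI from the fragment by dropping the "/doi/" prefix.
prefix = "/doi/"
doi = parts.fragment[len(prefix):]

# Build the XHR URL by hand; urlencode percent-encodes the "/" inside the DOI.
xhr_url = "http://www.doi2bib.org/doi2bib?" + urlencode({"id": doi})
print(xhr_url)
```

Passing `params={'id': ...}` to requests, as in the answer, takes care of this encoding for you, so the manual URL is only useful for seeing what goes over the wire.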

You can also solve it with selenium. The key trick here is to use an Explicit Wait to wait for the citation to become visible (see visibility_of_element_located: http://selenium-python.readthedocs.org/en/latest/api.html#selenium.webdriver.support.expected_conditions.visibility_of_element_located):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

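driver = webdriver.Firefox()
try:
    driver.get('http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9')

    # Wait up to 10 seconds for the <pre> holding the citation to become visible.
    element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//pre[@ng-show="bib"]')))
    print(element.text)
finally:
    # quit() ends the whole browser session, even if the wait times out.
    driver.quit()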

Prints the same as the above solution.
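For readers unfamiliar with explicit waits, the idea is simply to poll a condition until it holds or a timeout expires. The following is a conceptual sketch in plain Python, not selenium's actual implementation: `wait_until` and `citation_ready` are hypothetical names, and the "page" is simulated by a counter so the example runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly; return its first truthy result or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.2f seconds" % timeout)
        time.sleep(poll_interval)


# Simulated page: the citation only "appears" on the third check.
attempts = {"count": 0}


def citation_ready():
    attempts["count"] += 1
    if attempts["count"] >= 3:
        return "@article{Burgert_2007}"
    return None


text = wait_until(citation_ready, timeout=2.0, poll_interval=0.01)
print(text, attempts["count"])
```

In the selenium version above, `WebDriverWait(driver, 10).until(...)` plays the role of `wait_until`, and `EC.visibility_of_element_located(...)` is the condition being polled.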
