Python web scraping for JavaScript-generated content

Views: 27 Published: 2021/12/17 13:49:22 javascript python web-scraping scrape


Question

I am trying to use Python 3 to return the BibTeX citation generated by http://www.doi2bib.org/. The URLs are predictable, so the script can work out the URL without having to interact with the web page. I have tried using selenium, bs4, etc., but can't get the text inside the box.

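import urllib.request
from bs4 import BeautifulSoup

url = "http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9"
# Fetch the raw HTML and parse it; naming the parser avoids a bs4 warning.
html = urllib.request.urlopen(url).read()
text = BeautifulSoup(html, "html.parser")
print(text)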

Can anyone suggest a way of returning the BibTeX citation as a string (or whatever) in Python?

Recommended answer

You don't need BeautifulSoup here. The page sends an additional XHR request to the server to fill in the BibTeX citation; you can simulate that request, for example, with requests:

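import requests

bibtex_id = '10.1007/s00425-007-0544-9'

url = "http://www.doi2bib.org/#/doi/{id}".format(id=bibtex_id)
xhr_url = 'http://www.doi2bib.org/doi2bib'

with requests.Session() as session:
    # Load the page first, in case the server sets cookies for the session.
    session.get(url)

    # Repeat the XHR request that the page's JavaScript makes.
    response = session.get(xhr_url, params={'id': bibtex_id})
    response.raise_for_status()
    # .text is the decoded str; .content would print a bytes literal on Python 3.
    print(response.text)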

Prints:

@article{Burgert_2007,
    doi = {10.1007/s00425-007-0544-9},
    url = {http://dx.doi.org/10.1007/s00425-007-0544-9},
    year = 2007,
    month = {jun},
    publisher = {Springer Science $\mathplus$ Business Media},
    volume = {226},
    number = {4},
    pages = {981--987},
    author = {Ingo Burgert and Michaela Eder and Notburga Gierlinger and Peter Fratzl},
    title = {Tensile and compressive stresses in tracheids are induced by swelling based on geometrical constraints of the wood cell},
    journal = {Planta}
}
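
As a side note on why the urllib attempt in the question could not see the citation: everything after `#` in a URL is a fragment, which HTTP clients keep to themselves and never send to the server, so that request only ever asked for `/`. The sketch below uses only the standard library to show the split and to derive the data URL from the page URL. The endpoint is the one quoted in this answer and may have changed since; `doi_from_page_url` and `xhr_url_for` are hypothetical helper names, not part of any library.

```python
from urllib.parse import urlencode, urlsplit

# Endpoint taken from the answer above (an assumption: it may have changed).
XHR_ENDPOINT = "http://www.doi2bib.org/doi2bib"


def doi_from_page_url(page_url: str) -> str:
    """Extract the DOI from a page URL of the form .../#/doi/<DOI>."""
    fragment = urlsplit(page_url).fragment  # everything after '#'
    prefix = "/doi/"
    if not fragment.startswith(prefix):
        raise ValueError(f"no DOI found in fragment: {fragment!r}")
    return fragment[len(prefix):]


def xhr_url_for(page_url: str) -> str:
    """Build the data URL that the page's JavaScript would request."""
    query = urlencode({"id": doi_from_page_url(page_url)})
    return f"{XHR_ENDPOINT}?{query}"


page_url = "http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9"
parts = urlsplit(page_url)
print(parts.path)             # the only path the server is asked for
print(parts.fragment)         # stays on the client side
print(xhr_url_for(page_url))  # the request worth making instead
```

The resulting URL can be passed to any HTTP client; the `/` inside the DOI is percent-encoded in the query string, which is also what passing `params=` to requests does.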


You can also solve it with selenium. The key trick here is to use an Explicit Wait to wait for the citation to become visible:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get('http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9')

    # Wait up to 10 seconds for the citation box to become visible.
    citation_box = (By.XPATH, '//pre[@ng-show="bib"]')
    element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(citation_box))
    print(element.text)
finally:
    # quit() ends the whole browser session, even if the wait timed out.
    driver.quit()

Prints the same as the above solution.
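
To make the Explicit Wait idea concrete: conceptually it polls a condition until the condition returns something truthy or a timeout elapses. The following is a simplified stand-in written without Selenium, not Selenium's own implementation; `wait_until` and `citation_if_rendered` are hypothetical names, and the delayed "rendering" is simulated with a timer rather than a real browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Stand-in for a page whose content appears only after a short delay.
ready_at = time.monotonic() + 0.2


def citation_if_rendered():
    # Falsy until the simulated rendering has finished.
    return "rendered citation" if time.monotonic() >= ready_at else None


print(wait_until(citation_if_rendered, timeout=2.0, poll_interval=0.05))
```

This is why the Selenium solution works where a plain page fetch does not: the script keeps checking until the JavaScript has filled in the box, instead of reading the page once before the content exists.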
