How to extract this content rendered by JavaScript?

Question

I'm using requests_html to extract the element <div id="TranslationsHead">...</div> from this URL. Inside that element, <span id="LangBar"> ... </span> is rendered by JavaScript.

from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = 'https://www.thefreedictionary.com/love'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

session = HTMLSession()
r = session.get(url, headers=headers)
# r.content is the raw server response; no JavaScript has run at this point
soup = BeautifulSoup(r.content, 'html.parser')

soup.select_one('#TranslationsHead')

Its result is <div id="TranslationsHead"><span id="TranslationsTitle">Translations</span></div>. Unfortunately, it still does not capture <span id="LangBar"> ... </span>.

Could you please elaborate on how to capture such content?

Thank you so much for your help!

Recommended answer

You need to call r.html.render() so that the page's JavaScript is executed before you query it:

from requests_html import HTMLSession

url = 'https://www.thefreedictionary.com/love'
session = HTMLSession()
r = session.get(url)
r.html.render()
lang_bar = r.html.find('#LangBar', first=True)
print(lang_bar.html)
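
To illustrate why the attempt in the question came back without #LangBar, here is a minimal sketch that uses only the standard library. The page source in STATIC_HTML is hypothetical (it is not copied from the real site, and the real page may build the language bar differently); it only shows that a static parser sees script code as text and never runs it.

```python
from html.parser import HTMLParser

# Hypothetical page source (an assumption for illustration): the server sends
# only the title span, and a script would add #LangBar later in a browser.
STATIC_HTML = """
<div id="TranslationsHead"><span id="TranslationsTitle">Translations</span></div>
<script>
  var bar = document.createElement("span");
  bar.id = "LangBar";
  document.getElementById("TranslationsHead").appendChild(bar);
</script>
"""


class IdCollector(HTMLParser):
    """Record the id attribute of every start tag found in the markup."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id:
            self.ids.append(element_id)


collector = IdCollector()
collector.feed(STATIC_HTML)
collector.close()

found_ids = collector.ids
# The script body is never executed, so "LangBar" is not among the ids.
print(found_ids)
```

A static parser such as BeautifulSoup behaves the same way on r.content, which is why the rendering step is needed.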

If you want to prettify the output, import BeautifulSoup and use:

from bs4 import BeautifulSoup

soup = BeautifulSoup(lang_bar.html, 'html.parser')
print(soup.prettify())

If you want the list of languages:

for lcd in lang_bar.find('div.lcd'):
    print(lcd.text)

Output:

Afrikaans / Afrikaans
Arabic / العربية
Bulgarian / Български
Chinese Simplified / 中文简体
Chinese Traditional / 中文繁體
Croatian / Hrvatski
Czech / Česky
Danish / Dansk
Dutch / Nederlands
Esperanto / Esperanto
Estonian / eesti keel
Farsi / فارسی
Finnish / Suomi
etc

If you want to get all the translations (note that es is the one displayed by default):

from requests_html import HTMLSession

url = 'https://www.thefreedictionary.com/love'
session = HTMLSession()
r = session.get(url)
r.html.render()
for span in r.html.find('span.trans'):
    print(span, span.text)

Output:

<Element 'span' class=('trans',) lang='af' style='display: none;'> liefde
<Element 'span' class=('trans',) lang='ar' style='display: none;'> حُب
<Element 'span' class=('trans',) lang='bg' style='display: none;'> любов
<Element 'span' class=('trans',) lang='br' style='display: none;'> amor
<Element 'span' class=('trans',) lang='cs' style='display: none;'> láska
<Element 'span' class=('trans',) lang='de' style='display: none;'> die Liebe
<Element 'span' class=('trans',) lang='da' style='display: none;'> kærlighed
<Element 'span' class=('trans',) lang='el' style='display: none;'> αγάπη
<Element 'span' class=('trans',) lang='es' style='display: inline;'> amor
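
The lang and style attributes in this output can be used to group the translations by language and to see which one is currently displayed. The sketch below is only an illustration under stated assumptions: SAMPLE is a hard-coded snippet that mimics three of the spans above, and TransCollector is a hypothetical helper built on the standard library rather than on requests_html.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hard-coded sample mimicking the rendered markup (an assumption, not the real page).
SAMPLE = (
    '<span class="trans" lang="af" style="display: none;">liefde</span>'
    '<span class="trans" lang="de" style="display: none;">die Liebe</span>'
    '<span class="trans" lang="es" style="display: inline;">amor</span>'
)


class TransCollector(HTMLParser):
    """Collect the text of each <span class="trans">, keyed by its lang attribute."""

    def __init__(self):
        super().__init__()
        self.by_lang = defaultdict(list)
        self.visible_langs = set()
        self._current_lang = None  # lang of the span being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "trans" in classes:
            self._current_lang = attrs.get("lang")
            # A span styled "display: inline" is the one shown on the page.
            if "display: inline" in (attrs.get("style") or ""):
                self.visible_langs.add(self._current_lang)

    def handle_data(self, data):
        if self._current_lang is not None and data.strip():
            self.by_lang[self._current_lang].append(data.strip())

    def handle_endtag(self, tag):
        # Simplification: assumes translation spans are not nested.
        if tag == "span":
            self._current_lang = None


collector = TransCollector()
collector.feed(SAMPLE)
collector.close()

translations = dict(collector.by_lang)
visible = collector.visible_langs
print(translations)
print(visible)
```

With requests_html itself, the same attributes are available on each element through span.attrs, so the grouping can be done directly in the loop above without a second parser.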

If you want to simulate a click on one language and display the results:

from requests_html import HTMLSession

url = 'https://www.thefreedictionary.com/love'
session = HTMLSession()
r = session.get(url)
# JavaScript run in the page: click the element with class "fl_ko" once loading has finished
script = """
() => {
    if (document.readyState === "complete") {
        document.getElementsByClassName("fl_ko")[0].click();
    }
}
"""
r.html.render(script=script, timeout=10, sleep=2)
for span in r.html.find('span.trans[style="display: inline;"]'):
    print(span, span.text)

Output:

<Element 'span' class=('trans',) lang='ko' style='display: inline;'> 애정
<Element 'span' class=('trans',) lang='ko' style='display: inline;'> 연애
<Element 'span' class=('trans',) lang='ko' style='display: inline;'> 사랑하는 사람
<Element 'span' class=('trans',) lang='ko' style='display: inline;'> (테니스) 영점
<Element 'span' class=('trans',) lang='ko' style='display: inline;'> 사랑하다

Updated in response to comments

Jupyter, Spyder, etc. use an event loop under the hood, and requests-html calls loop.run_until_complete, which raises that exception when the loop is already running. Have you tried using AsyncHTMLSession?

from requests_html import AsyncHTMLSession

url = 'https://www.thefreedictionary.com/love'

asession = AsyncHTMLSession()

async def get_results():
    r = await asession.get(url)
    await r.html.arender()
    return r

r = asession.run(get_results)
lang_bar = r[0].html.find('#LangBar', first=True)
print(lang_bar.html)

Or, with the simulated click:

from requests_html import AsyncHTMLSession

url = 'https://www.thefreedictionary.com/love'

asession = AsyncHTMLSession()
# JavaScript run in the page: click the element with class "fl_ko" once loading has finished
script = """
() => {
    if (document.readyState === "complete") {
        document.getElementsByClassName("fl_ko")[0].click();
    }
}
"""

async def get_results():
    r = await asession.get(url)
    await r.html.arender(script=script, timeout=10, sleep=2)
    return r

r = asession.run(get_results)
for span in r[0].html.find('span.trans[style="display: inline;"]'):
    print(span, span.text)
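
The event-loop conflict described in the update can be reproduced with asyncio alone. This is a minimal sketch, not the code path requests-html actually takes: inner and outer are hypothetical coroutines, and outer stands in for an environment such as a notebook where a loop is already running.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are inside a running loop here, as code in a notebook cell would be.
    loop = asyncio.get_running_loop()
    coro = inner()
    try:
        # Asking an already running loop to run something to completion fails.
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Inside a coroutine the usual fix is to await the work instead of calling run_until_complete, which is what the await r.html.arender() calls above do.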
