Web Scraping JavaScript in Python / R
Question
I'm doing some personal data science projects, and one of them is to see how often certain songs are played on the radio.
http://www.iheart.com/live/radio-1045-3401/
When I view the page source for the URL above, none of the values I'm interested in are there. I'm not sure why, but when I use Inspect Element in Chrome on the "Now Playing" header, I can see the song and artist that are currently playing.
Example:
<a class="player-song" href="/artist/rem-3610/songs/-2450662/" title="Losing My Religion" data-reactid=".1hpdfx1l4ow.a.1.0.1.1">Losing My Religion</a>
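My two questions are:
- Why doesn't this appear in the page source, even though I can see it under Inspect Element?
- How can I scrape this information, given that it doesn't appear in the page source?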
Recommended answer
Most web pages with dynamic elements have page elements that are generated and inserted by JavaScript, which the browser parses and executes for you. Judging by the question title, I suspect you already guessed this.
What you see in the page source is the raw HTML before JavaScript kicks in and updates it. Inspect Element, by contrast, shows the live DOM after those updates.
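One way to confirm this is to fetch the raw HTML without a browser and check whether the element is present. The following is a minimal sketch using only the Python standard library. The helper names `fetch_raw_html` and `has_marker` are hypothetical, and the `class="player-song"` marker is taken from the snippet in the question, so it may not match the site's current markup.

```python
from urllib.request import Request, urlopen


def has_marker(html: str, marker: str = 'class="player-song"') -> bool:
    """Return True if the marker text appears anywhere in the HTML."""
    return marker in html


def fetch_raw_html(url: str, timeout: float = 10.0) -> str:
    """Download the page as the server sends it; no JavaScript is executed."""
    # Some sites reject requests without a browser-like User-Agent (assumption).
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `has_marker(fetch_raw_html("http://www.iheart.com/live/radio-1045-3401/"))` would be expected to return `False` if the song link is only inserted by JavaScript, which matches what the question describes.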
You want a headless browser: a browser without a graphical user interface. It will parse and execute the JavaScript for you and update the page HTML accordingly.
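As an illustration, here is a rough Python sketch of that approach. It assumes Selenium and a Chrome installation are available (the answer does not name a specific tool, so this is one option among many), and it assumes the page still uses the `a.player-song` element from the question. The function names `fetch_rendered_html` and `extract_now_playing` are hypothetical.

```python
from html.parser import HTMLParser


class PlayerSongParser(HTMLParser):
    """Collect <a> elements whose class list includes "player-song"."""

    def __init__(self):
        super().__init__()
        self.songs = []
        self._current = None  # the song link currently being read, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "player-song" in classes:
            self._current = {
                "title": attributes.get("title"),
                "href": attributes.get("href"),
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.songs.append(self._current)
            self._current = None


def extract_now_playing(html):
    """Return a list of dicts describing each player-song link in the HTML."""
    parser = PlayerSongParser()
    parser.feed(html)
    return parser.songs


def fetch_rendered_html(url, wait_seconds=10):
    """Load the page in headless Chrome and return the HTML after JavaScript runs."""
    # Imported here so the parsing helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the script has inserted the song link (selector is an assumption).
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "a.player-song"))
        )
        return driver.page_source
    finally:
        driver.quit()
```

With this in place, `extract_now_playing(fetch_rendered_html("http://www.iheart.com/live/radio-1045-3401/"))` would return the song title and link if the page still renders that element. Polling this on a schedule and logging the results is one way to count how often a song is played.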
Here is a full list of headless browsers. Note that you can do this task in any language.