Python selenium get contents of a webpage added by javascript

Problem description

I use an online music player called "Netease Cloud Music". My account has multiple playlists holding thousands of tracks; they are poorly organized and categorized and contain duplicate entries, so I want to export them into an SQL table to organize them.

I have found a way to view the playlists without the client software: click the share button at the top of the playlist page, then click "copy link".

But when the link is opened in any browser other than the client, the playlist is limited to 1000 tracks.

I found a way around that: I installed Tampermonkey and then installed this script.

Now I can view full playlists in a browser.

This is a sample playlist.

The playlists look like this:

The first column holds the song title, the second column holds the duration, the third column holds the artist, and the last column holds the album.

The texts in the first, third and fourth columns are hyperlinks to the song, artist and album pages respectively.

I don't know much about HTML, but I managed to work out its data structure.

What we need is the table located at XPath //table/tbody; each row is a child node of the table named tr (XPath //table/tbody/tr).

This is a sample row:

<td class="left">
    <div class="hd "><span data-res-id="5221710" data-res-type="18" data-res-action="play" data-res-from="13" data-res-data="158624364" class="ply ">&nbsp;</span><span class="num">1</span></div>
</td>
<td>
    <div class="f-cb">
        <div class="tt">
            <div class="ttc">
                <span class="txt">
                    <a href="#/song?id=5221710"><b title="Axel F">Axel F</b></a>
                    
                    
                </span>
            </div>
        </div>
    </div>
</td>
<td class=" s-fc3">
    <span class="u-dur candel">03:00</span>
    <div class="opt hshow">
        <a class="u-icn u-icn-81 icn-add" href="javascript:;" title="添加到播放列表" hidefocus="true" data-res-type="18" data-res-id="5221710" data-res-action="addto" data-res-from="13" data-res-data="158624364"></a>
        <span data-res-id="5221710" data-res-type="18" data-res-action="fav" class="icn icn-fav" title="收藏"></span>
        <span data-res-id="5221710" data-res-type="18" data-res-action="share" data-res-name="Greatest Hits Of The Millennium 80's Vol.2" data-res-author="Harold Faltermeyer" data-res-pic="https://p2.music.126.net/tOa6Tizqy755OZE7ITsw_g==/775155697626111.jpg" class="icn icn-share" title="分享">分享</span>
        <span data-res-id="5221710" data-res-type="18" data-res-action="download" class="icn icn-dl" title="下载"></span>
        <span data-res-id="5221710" data-res-type="18" data-res-from="13" data-res-data="158624364" data-res-action="delete" class="icn icn-del" title="删除">删除</span>
    </div>
</td>
<td>
    <div class="text" title="Harold Faltermeyer">
        <span title="Harold Faltermeyer">
            <a href="#/artist?id=34854" hidefocus="true">Harold Faltermeyer</a>
        </span>
    </div>
</td>
<td>
    <div class="text">
        <a href="#/album?id=509819" title="Greatest Hits Of The Millennium 80's Vol.2">Greatest Hits Of The Millennium 80's Vol.2</a>
    </div>
</td>

The columns are child nodes of the tr element.

I have managed to get the xpaths corresponding to the columns:

/td[2]/div/div/div/span/a/b -->  title
/td[2]/div/div/div/span/a/@href -->  song link
/td[3]/span -->  duration
/td[4]/div/span/a -->  artist
/td[4]/div/span/a/@href -->  artist link
/td[5]/div/a -->  album
/td[5]/div/a/@href -->  album link

We should add the address music.163.com/ in front of the links to get full addresses.
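A minimal sketch of that step, assuming the links are the fragment-style hrefs shown in the sample row and that the site is served over https as in the playlist URL used in this post. `BASE_URL` and `full_url` are names made up for illustration:

```python
from urllib.parse import urljoin

# Assumed base address (scheme taken from the playlist URL used in this post).
BASE_URL = "https://music.163.com/"


def full_url(href):
    """Turn a relative link such as '#/song?id=5221710' into a full address."""
    return urljoin(BASE_URL, href)


print(full_url("#/song?id=5221710"))
# https://music.163.com/#/song?id=5221710
```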

I was thinking about using Selenium to get the elements: find the rows by XPath, loop through the rows, get the columns by their XPaths inside each row, and add the values to a list of namedtuples.

From here it is trivial to add the elements to an SQL table.
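As a rough sketch of that last step, assuming SQLite (through the standard library `sqlite3` module) as the SQL database, which the post does not specify. The `Track` field names, the `tracks` table and the `save_tracks` helper are all hypothetical. Using the song link as the primary key is one possible way to drop the duplicate entries mentioned earlier:

```python
import sqlite3
from collections import namedtuple

# Hypothetical record layout mirroring the columns listed above.
Track = namedtuple(
    "Track", "title song_link duration artist artist_link album album_link"
)


def save_tracks(conn, tracks):
    """Create the table if needed and insert tracks, skipping duplicates."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tracks (
               title TEXT, song_link TEXT PRIMARY KEY, duration TEXT,
               artist TEXT, artist_link TEXT, album TEXT, album_link TEXT)"""
    )
    # INSERT OR IGNORE skips any row whose song_link is already stored.
    conn.executemany(
        "INSERT OR IGNORE INTO tracks VALUES (?, ?, ?, ?, ?, ?, ?)", tracks
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
track = Track(
    "Axel F",
    "https://music.163.com/#/song?id=5221710",
    "03:00",
    "Harold Faltermeyer",
    "https://music.163.com/#/artist?id=34854",
    "Greatest Hits Of The Millennium 80's Vol.2",
    "https://music.163.com/#/album?id=509819",
)
save_tracks(conn, [track, track])  # the same track twice
print(conn.execute("SELECT COUNT(*) FROM tracks").fetchone()[0])  # 1
```

Whether the song link is the right key for "duplicate" depends on the data; the same song released on two albums has two different ids and would be kept twice.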

But I just can't get it to work.

I have managed to open a Firefox Selenium window and install Tampermonkey and the script to access the full playlist (both installations were done manually). I then went to the playlist page and tried to get the elements:

from selenium import webdriver
Firefox = webdriver.Firefox()
Firefox.get('https://music.163.com/#/playlist?id=158624364&userid=126762751')
Firefox.find_elements_by_xpath('//table/tbody/tr')

The result is an empty list.

I don't know what went wrong. I can view the table element in the developer tools just fine, but when I viewed the page source I realized that the table isn't in it.

I have even managed to obtain the full table with developer tools, and I uploaded it here.

But it is invisible to Selenium. Apparently browsers can display content that is not in the original HTML source code and Selenium can't. That's when I realized that browsers execute JavaScript, and the additional content is probably added by a script somewhere; the code I used doesn't involve JavaScript and can only get the original source code without the additional content.

I tried Googling "python selenium get contents of a webpage added by javascript", but it didn't help.

So I have two questions, first, in the short term, how can I use some html parsing library to parse a piece of html code locally stored in a txt file?

And second, in the long term, how can I use selenium or any other Python html library to get complete source code of a webpage with additional contents added by javascript instead of only the original source code without the additional contents, so that I don't need to export the elements manually every time?

Solution

The simplest answer is that you have to add some delay after opening the page with Firefox.get('https://music.163.com/#/playlist?id=158624364&userid=126762751') and before getting the elements with Firefox.find_elements_by_xpath('//table/tbody/tr'), so that the elements on the page have time to load. It takes a few moments.
So, you can simply add something like time.sleep(5) there.
The better approach is to use an explicit wait with expected conditions instead.
Something like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Firefox = webdriver.Firefox()

# Explicit wait with a timeout of 20 seconds
wait = WebDriverWait(Firefox, 20)

Firefox.get('https://music.163.com/#/playlist?id=158624364&userid=126762751')

# Block until the first table row is visible (or the timeout expires)
wait.until(EC.visibility_of_element_located((By.XPATH, '//table/tbody/tr')))

# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which newer Selenium releases removed
rows = Firefox.find_elements(By.XPATH, '//table/tbody/tr')

UPD
There is an iframe there, so you need to switch to that iframe first, as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Firefox = webdriver.Firefox()

# Explicit wait with a timeout of 20 seconds
wait = WebDriverWait(Firefox, 20)

Firefox.get('https://music.163.com/#/playlist?id=158624364&userid=126762751')

# The table lives inside an iframe, so switch into it before searching
iframe = Firefox.find_element(By.XPATH, '//iframe[@id="g_iframe"]')
Firefox.switch_to.frame(iframe)

# Block until the first table row is visible (or the timeout expires)
wait.until(EC.visibility_of_element_located((By.XPATH, '//table/tbody/tr')))

rows = Firefox.find_elements(By.XPATH, '//table/tbody/tr')
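
The answer above does not cover the first question, parsing HTML that is already saved locally. The following is only a sketch, assuming the saved markup has the same shape as the sample row in the question, with each row wrapped in a tr element. It uses the standard library `html.parser` so that nothing needs installing; a library such as lxml or BeautifulSoup would likely be shorter. `PlaylistParser`, `parse_playlist` and `BASE_URL` are hypothetical names, and the parser recognizes links by their `#/song?`, `#/artist?` and `#/album?` prefixes rather than by column position:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://music.163.com/"  # assumed base address for relative links


class PlaylistParser(HTMLParser):
    """Collect one dict per <tr> with song, duration, artist and album fields."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # dict for the <tr> being read, or None outside a row
        self._field = None  # field that the next text node should fill

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self._row = {}
        elif self._row is None:
            return
        elif tag == "a":
            href = attrs.get("href") or ""
            for kind in ("song", "artist", "album"):
                # Only the first link of each kind in a row is kept.
                if href.startswith(f"#/{kind}?") and kind not in self._row:
                    self._row[kind + "_link"] = urljoin(BASE_URL, href)
                    self._field = kind
        elif tag == "span" and "u-dur" in (attrs.get("class") or "").split():
            self._field = "duration"

    def handle_data(self, data):
        text = data.strip()
        if self._row is not None and self._field and text:
            self._row[self._field] = text
            self._field = None

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None  # stop waiting for text once the element closes
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def parse_playlist(html):
    """Return a list of dicts, one per table row found in the HTML text."""
    parser = PlaylistParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

For the short-term case this could be called as parse_playlist(open('playlist.txt', encoding='utf-8').read()), where the file name is hypothetical. For the long-term case, Firefox.page_source read after switching into the iframe should return the rendered markup of that frame, which can be passed to the same function.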
