HTML tag appears empty when parsing it with BeautifulSoup but has content when opened in browser


Problem description

I have an issue when parsing an HTML page with BS4. The page contains a hidden div whose content I want to read using BeautifulSoup. That content is generated dynamically by a JavaScript function which is triggered via body onload.

The problem is: when I call the page in my browser, the tag has the content it is supposed to have. When I parse the same page via BS4, the tag is empty.

I could not find any information about BS4 being unable to handle content generated by onload JavaScript, so I am not sure what the issue may be here.

Python script:

import urllib.request
from bs4 import BeautifulSoup

import time
import datetime
eT = time.time()

version = 1
vNum = str(version)

t = datetime.datetime.now()

d = "0" + str(t.day)
#d = d.rstrip()
d = d[-2:]
m = "0" + str(t.month)
#m = m.rstrip()
m = m[-2:]
y = str(t.year)

dStr = y + m + d

resultFile = 'output/classAndIdList-' + dStr + '-v' + vNum + '.txt'
pageListFile = 'input/quickListFR.txt'
f = open(pageListFile, mode='r', encoding='utf-8')

urlRoot = 'http://dev.example.com/'

fOut = open(resultFile, 'w')
ciList = []

# for url in urls.split('\n'):
for l in f:
    u = l.rstrip()  
    url = urlRoot + u
    html_content = urllib.request.urlopen(url)
    time.sleep(1)
    html_text = html_content.read()
    soup = BeautifulSoup(html_text, 'html.parser')  # name the parser explicitly
    ciTag = soup.find(id="testDivCSS")
    print(ciTag)
    ciString = ciTag.get_text()
    # print(ciString)
    ciArray = ciString.split(',')
    # print(ciArray)
    for c in ciArray:
        if c not in ciList:
            ciList.append(c)
            fOut.write(c + '\n')
            print(c)
    print(u + '... DONE')       
fOut.close()

Example result page via BeautifulSoup:

Example-page-1.html... DONE
<div id="testDivCSS" style="display: none;"> </div>

And the div in the browser (indicating that the php and javascript parts work fine):

<div id="testDivCSS" style="display: none;">div#menu_rightup,div#social,div#sidebar,div#specific,div#menu_rightdown,div#footer</div>

Solution

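BeautifulSoup cannot handle content that is generated dynamically by JavaScript. It only parses the HTML text it is given and never executes scripts, so the div is still empty in what urllib downloads. You can use a browser automation tool (such as Selenium) to load the whole page first, including the dynamic part, and then use BeautifulSoup to parse the result.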

Refer to this question: How to retrieve the values of dynamic html content using Python
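As a rough illustration of that approach, here is a minimal sketch, not a definitive implementation. It assumes Selenium 4 with Firefox and its driver installed; the helper names `extract_selectors` and `fetch_rendered_html` and the 10-second timeout are made up for this example. Because the div is hidden (`display: none`), the wait checks the element's `textContent` rather than its visible text.

```python
from bs4 import BeautifulSoup


def extract_selectors(html, div_id="testDivCSS"):
    """Return the comma-separated entries of the div, or [] if it is missing or empty."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(id=div_id)
    if tag is None:
        return []
    return [item.strip() for item in tag.get_text().split(",") if item.strip()]


def fetch_rendered_html(url, div_id="testDivCSS", timeout=10):
    """Load url in a real browser, wait until the div has content, return the DOM as HTML."""
    # Imported lazily so extract_selectors() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and geckodriver are available
    try:
        driver.get(url)
        # The div is hidden, so read textContent instead of the visible text.
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_element(By.ID, div_id).get_attribute("textContent").strip()
        )
        return driver.page_source
    finally:
        driver.quit()
```

In the loop from the question, `extract_selectors(fetch_rendered_html(url))` would then take the place of the `urlopen` / `BeautifulSoup` / `get_text` / `split` steps. Passing the raw HTML from `urllib` to `extract_selectors` still gives an empty list, since the script has not run at that point.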
