Beautiful Soup Can't Find Tags
Question
I am currently practicing with the requests and BeautifulSoup modules in Python 3.6 and have run into an issue that I can't find any information on in other questions and answers.
It seems that at some point in the page, Beautiful Soup stops recognizing tags and ids. I am trying to pull play-by-play data from a page like this:
http://www.pro-football-reference.com/boxscores/201609080den.htm
import requests, bs4

source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
if '404' in res.url:
    raise Exception('No data found for this link: ' + source_url)

soup = bs4.BeautifulSoup(res.text, 'html.parser')

# this works
all_pbp = soup.findAll('div', {'id': 'all_pbp'})
print(len(all_pbp))

# this doesn't
table = soup.findAll('table', {'id': 'pbp'})
print(len(table))
Using the inspector in Chrome, I can see that the table definitely exists. I have also tried searching for 'div' and 'tr' tags in the later half of the HTML, and that doesn't work either. I have tried the standard 'html.parser' as well as lxml and html5lib, but none of them finds the table.
Am I doing something wrong here, or is there something in the HTML or its formatting that prevents BeautifulSoup from correctly finding the later tags? I have run into the same issue with similar pages run by this company (hockey-reference.com, basketball-reference.com), but have been able to use these tools properly on other sites.
If it is something in the HTML, is there a better tool or library for extracting this information?
Thanks for your help.
Recommended answer
BS4 cannot execute a web page's JavaScript; it only parses the HTML returned by the GET request for the URL. I think the table in question is loaded asynchronously by client-side JavaScript.
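A minimal offline sketch of this behavior, using a made-up HTML string rather than the real page. The markup below is an illustrative assumption, not a copy of the actual page source: the wrapper div is in the static HTML, while the table would only be created by a script that a browser runs. Beautiful Soup parses the text it is given and never runs the script, so it finds the div but no table.

```python
import bs4

# Hypothetical page: the wrapper div is static, but the table is only
# created when a browser executes the script below.
html = """
<html><body>
<div id="all_pbp"></div>
<script>
  var t = document.createElement("table");
  t.id = "pbp";
  document.getElementById("all_pbp").appendChild(t);
</script>
</body></html>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Same lookups as in the question, on the unrendered HTML.
wrapper_count = len(soup.find_all("div", {"id": "all_pbp"}))
table_count = len(soup.find_all("table", {"id": "pbp"}))

print(wrapper_count)  # the static div is found
print(table_count)    # the script-generated table is not
```

A quick way to check a real page for the same situation is to search the raw response text (`res.text` in the question) for the id you expect; if it is missing there, no choice of parser will recover it.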
As a result, the client-side JavaScript needs to run before you scrape the HTML. This post describes how to do so!
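One common way to let the scripts run first is to load the page in a real browser and hand the rendered HTML to Beautiful Soup. The sketch below assumes Selenium 4 with a local Chrome install; this is one possible approach and not necessarily the one the linked post describes. The helper names `fetch_rendered_html` and `find_table` are hypothetical and introduced only for illustration.

```python
import bs4


def fetch_rendered_html(url, wait_for_id, timeout=15):
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported lazily so find_table() below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # assumes a recent Chrome
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the element exists in the live DOM; raises
        # TimeoutException if it never appears.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, wait_for_id))
        )
        return driver.page_source
    finally:
        driver.quit()


def find_table(html, table_id):
    """Return the <table> with the given id, or None if it is absent."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup.find("table", {"id": table_id})
```

With the question's variables, the call would look like `find_table(fetch_rendered_html(source_url, 'pbp'), 'pbp')`. Whether the table actually appears after rendering depends on how the site builds the page.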