Beautiful Soup Can't Find Tags

Problem description

I am currently practicing with the requests and BeautifulSoup modules in Python 3.6 and have run into an issue that I can't find any information on in other questions and answers.

It seems that at some point in the page, Beautiful Soup stops recognizing tags and ids. I am trying to pull play-by-play data from a page like this:

http://www.pro-football-reference.com/boxscores/201609080den.htm

import requests
import bs4

source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
# Treat a final URL containing '404' as a missing page
if '404' in res.url:
    raise Exception('No data found for this link: ' + source_url)

soup = bs4.BeautifulSoup(res.text, 'html.parser')

# This works: the div with id "all_pbp" is found
all_pbp = soup.find_all('div', {'id': 'all_pbp'})
print(len(all_pbp))

# This doesn't: the table with id "pbp" is not found
table = soup.find_all('table', {'id': 'pbp'})
print(len(table))

Using the inspector in Chrome, I can see that the table definitely exists. I have also tried searching for 'div' and 'tr' elements in the later half of the HTML, and that does not work either. I have tried the standard 'html.parser' as well as lxml and html5lib, but none of them finds the table.

Am I doing something wrong here, or is there something in the HTML or its formatting that prevents BeautifulSoup from finding the later tags? I have run into the same issue with similar pages run by this company (hockey-reference.com, basketball-reference.com), but I have been able to use these tools properly on other sites.

If the problem is in the HTML, is there a better tool or library for extracting this information?

Thanks for your help.

Recommended answer

BS4 cannot execute a page's JavaScript; it only parses the HTML returned by the GET request for the URL. I think the table in question is loaded asynchronously by client-side JavaScript.
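A quick way to test that suspicion is to look at the raw response text before any parsing. If the id never appears there, the element is most likely built by a script after the page loads; if it does appear, the markup was served but is hidden from the parser in some way. The sketch below is only an illustration: the helper name `raw_html_mentions_id` and both sample documents are hypothetical stand-ins, not the real page.

```python
import re


def raw_html_mentions_id(html: str, element_id: str) -> bool:
    """Return True if an id attribute with exactly this value appears in the raw text."""
    pattern = r'id\s*=\s*["\']' + re.escape(element_id) + r'["\']'
    return re.search(pattern, html) is not None


# Hypothetical stand-ins for a response body (assumptions, not the real page).
served_but_hidden = '<div id="all_pbp"><!-- <table id="pbp"></table> --></div>'
built_by_script = '<div id="all_pbp"></div><script>loadTable()</script>'

# The first document contains the table markup in its text, the second does not.
print(raw_html_mentions_id(served_but_hidden, "pbp"))
print(raw_html_mentions_id(built_by_script, "pbp"))
```

With a live page, the same function could be called on `res.text` from the question's code.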

As a result, the client-side JavaScript needs to run before the HTML is scraped. This post describes how to do so!
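Rendering the page in a browser is one route. A lighter alternative sometimes applies when the table markup is already in the response but wrapped in an HTML comment that a script later unwraps. Whether this particular page does that is an assumption, not something the answer confirms. The sketch below uses a made-up sample document and a hypothetical helper, `find_table_in_comments`, to show how Beautiful Soup's `Comment` type could be used in that case.

```python
from bs4 import BeautifulSoup, Comment


def find_table_in_comments(html, table_id):
    """Return the table with this id, looking inside HTML comments if needed."""
    soup = BeautifulSoup(html, "html.parser")

    # First try the normal parse tree.
    table = soup.find("table", id=table_id)
    if table is not None:
        return table

    # Otherwise re-parse each comment that mentions the id.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if table_id not in comment:
            continue
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


# Hypothetical sample: the table is served inside a comment.
sample_html = """<div id="all_pbp"><!--
<table id="pbp"><tr><td>Kickoff</td></tr></table>
--></div>"""

table = find_table_in_comments(sample_html, "pbp")
print(table.find("td").get_text())
```

If the function returns `None` for the real response, the table is not in the served HTML at all, and running the JavaScript in a browser is the remaining option.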
