BeautifulSoup and AJAX-table problem


Problem description

I am making a script that scrapes the games from the Team Liquid database of international StarCraft 2 games (http://www.teamliquid.net/tlpd/sc2-international/games).

However, I have come across a problem. My script loops through all the pages, but the Team Liquid site seems to use some kind of AJAX to update the table, and when I use BeautifulSoup I can't get the right data.

So I loop through these pages:

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-1-1-DESC

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-2-1-DESC

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-3-1-DESC

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-4-1-DESC

and so on...

When you open these pages yourself you see different results, but my script keeps getting the same first page every time. I think this is because when you open the other pages you briefly see a loading indicator while the table of games is updated to the correct page. So I guess BeautifulSoup is too fast and needs to wait for the loading and updating of the table to finish.

So my question is: how do I make sure it gets the updated table?

I currently use this code to get the contents of the table, after which I write the contents to a .csv:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(url).read()
bs = BeautifulSoup(html, "html.parser")
# Find the table that holds the game list by its id attribute
table = bs.find("table", id="tblt_table")
rows = table.find_all("tr")
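
For completeness, here is a minimal sketch of writing those rows to a .csv; the file name and the idea of keeping one cell per column are assumptions, not part of the original question:

import csv

# "games.csv" is a hypothetical output file name
with open("games.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        # Collect the text of every cell (th or td) in this row
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if cells:
            writer.writerow(cells)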

Recommended answer

When you try to scrape a site that uses AJAX, it's best to look at what the JavaScript code actually does. In many cases it simply retrieves XML or HTML, which is even easier to scrape than the non-AJAXy content. It just requires looking at some source code.

In your case, the site retrieves the HTML for the table control by itself (instead of refreshing the whole page) from a special URL and dynamically replaces it in the browser DOM. This is also why your script keeps getting the first page: the part after the '#' in your URLs is a fragment that is interpreted by the JavaScript in the browser and is never sent to the server. Looking at http://www.teamliquid.net/tlpd/tabulator/ajax.js, you'll see that the special URL is formatted like this:

http://www.teamliquid.net/tlpd/tabulator/update.php?tabulator_id=1811&tabulator_page=1&tabulator_order_col=1&tabulator_order_desc=1&tabulator_Search&tabulator_search=

So all you need to do is scrape this URL directly with BeautifulSoup and advance the tabulator_page counter each time you want the next page.
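
As a rough sketch, the loop could look like the following. The query parameters are copied from the example URL above; the page count (MAX_PAGE) and the use of an empty tabulator_search are assumptions and may need adjusting against the live site:

from urllib.request import urlopen
from urllib.parse import urlencode
from bs4 import BeautifulSoup

BASE = "http://www.teamliquid.net/tlpd/tabulator/update.php"
MAX_PAGE = 5  # assumed number of pages; adjust or stop once a page returns no rows

for page in range(1, MAX_PAGE + 1):
    params = urlencode({
        "tabulator_id": 1811,       # value taken from the example URL above
        "tabulator_page": page,     # the counter to advance for each page
        "tabulator_order_col": 1,
        "tabulator_order_desc": 1,
        "tabulator_search": "",
    })
    html = urlopen(BASE + "?" + params).read()
    bs = BeautifulSoup(html, "html.parser")
    # update.php returns the table markup itself, so the rows can be read directly
    for row in bs.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        print(cells)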
