Beautifulsoup and AJAX-table problem


Problem description


I am making a script that scrapes the games of the Team Liquid database of international StarCraft 2 games. (http://www.teamliquid.net/tlpd/sc2-international/games)

However, I have come across a problem. My script loops through all the pages, but the Team Liquid site seems to use some kind of AJAX to update the table. As a result, when I use BeautifulSoup I can't get the right data.

So I loop through these pages:

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-1-1-DESC

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-2-1-DESC

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-3-1-DESC

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-4-1-DESC

etc.

When you open these yourself you see different pages, but my script keeps getting the same first page every time. I think this is because, when you open one of the other pages, a loading indicator appears briefly while the table is updated with the games for the correct page. So I guess BeautifulSoup is too fast and needs to wait until the table has finished loading and updating.

So my question is: how can I make sure it takes the updated table?

I now use this code to get the contents of the table, after which I put the contents in a .csv:

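from urllib.request import urlopen
from bs4 import BeautifulSoup

# url is one of the page addresses listed above
html = urlopen(url).read().lower()
bs = BeautifulSoup(html, "html.parser")
# Locate the games table by its id, then collect all of its rows
table = bs.find("table", id="tblt_table")
rows = table.find_all("tr")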

Solution

When you try to scrape a site that uses AJAX, it's best to see what the JavaScript code actually does. In many cases it simply retrieves XML or HTML, which would be even easier to scrape than the non-AJAXy content. It just requires looking at some source code.
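As background on why the loop in the question always returns the first page: everything after the `#` in those URLs is a fragment, which the browser hands to the page's JavaScript and which is not part of the HTTP request for the page. A minimal sketch using only the standard library shows that the four URLs from the question reduce to the same resource once the fragment is split off (the variable names here are illustrative only):

```python
from urllib.parse import urldefrag

# The four page URLs from the question; only the fragment differs.
page_urls = [
    "http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-%d-1-DESC" % page
    for page in range(1, 5)
]

# urldefrag splits each URL into the part that identifies the resource
# and the fragment that only client-side code gets to see.
base_urls = {urldefrag(u).url for u in page_urls}
fragments = [urldefrag(u).fragment for u in page_urls]

print(base_urls)   # a single URL: every request targets the same page
print(fragments)   # the page number lives only in the fragment
```

So waiting longer would not help: the page number never reaches the server through these URLs.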

In your case, the site retrieves only the HTML code for the table control (instead of refreshing the whole page) from a special URL and dynamically replaces it in the browser DOM. Looking at http://www.teamliquid.net/tlpd/tabulator/ajax.js, you'd see this URL is formatted like this:

http://www.teamliquid.net/tlpd/tabulator/update.php?tabulator_id=1811&tabulator_page=1&tabulator_order_col=1&tabulator_order_desc=1&tabulator_Search&tabulator_search=

So all you need to do is to scrape this URL directly with BeautifulSoup and advance the tabulator_page counter each time you want the next page.
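A rough sketch of that approach follows. It is only an illustration under several assumptions: the helper names (`build_page_url`, `extract_rows`, `scrape_pages`), the output file `games.csv`, and the page range 1 to 4 are made up for this example; the query string is copied unchanged from the URL above with only `tabulator_page` varied; and the response is assumed to contain ordinary `<tr>` rows, which should be checked against what the server actually returns.

```python
import csv
from urllib.request import urlopen

# Copied from the URL in the answer; only the page number is a placeholder.
URL_TEMPLATE = (
    "http://www.teamliquid.net/tlpd/tabulator/update.php"
    "?tabulator_id=1811&tabulator_page={page}"
    "&tabulator_order_col=1&tabulator_order_desc=1"
    "&tabulator_Search&tabulator_search="
)


def build_page_url(page):
    """Return the update.php URL for the given page number."""
    return URL_TEMPLATE.format(page=page)


def extract_rows(html):
    """Return the text of every table row as a list of cell strings.

    Assumes the response contains plain <tr> rows with <td>/<th> cells.
    """
    # Imported here so build_page_url stays usable without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)
    return rows


def scrape_pages(first_page, last_page):
    """Yield the rows of each page in turn, advancing tabulator_page."""
    for page in range(first_page, last_page + 1):
        html = urlopen(build_page_url(page)).read()
        for row in extract_rows(html):
            yield row


if __name__ == "__main__":
    # Hypothetical output file and page range, chosen for illustration.
    with open("games.csv", "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(scrape_pages(1, 4))
```

The `tabulator_id` value may differ from table to table, so it is worth confirming it in ajax.js or in the browser's network inspector before relying on it.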
