Scraping JavaScript-generated data using Python

Problem description

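I want to scrape some data from the following URL using Python: http://www.hankyung.com/stockplus/main.php?module=stock&mode=stock_analysis_infomation&itemcode=078340

The page is a summary of company information.

The data I want is not shown on the first page. Clicking the tab named "재무제표" opens the financial statements, and then clicking the tab named "현금흐름표" opens the cash flow statement.

I want to scrape the cash flow data.

However, the cash flow data is generated by JavaScript from a different URL. The hidden URL is: http://stock.kisline.com/compinfo/financial/main.action?vhead=N&vfoot=N&vstay=&omit=&vwidth=

The cash flow data is generated by submitting some option values and a cookie to this URL.

As you can see, itemcode=078340 in the first link is the stock code, and there are as many as 1,680 stocks whose cash flow data I want to collect, so I want to do this in a loop.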
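The loop over stock codes can be sketched roughly as follows. This is only a minimal outline, not a complete scraper: `build_url` and `scrape_all` are hypothetical helper names, and `fetch` stands for whatever function actually retrieves and parses a page (it is passed in, so the existing scraping code can be reused). Stock codes are kept as strings so that leading zeros such as in 078340 survive.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.hankyung.com/stockplus/main.php"


def build_url(itemcode):
    """Return the company summary URL for one stock code (passed as a string)."""
    query = urlencode({
        "module": "stock",
        "mode": "stock_analysis_infomation",  # spelled as in the original URL
        "itemcode": itemcode,
    })
    return f"{BASE_URL}?{query}"


def scrape_all(itemcodes, fetch):
    """Call fetch(url) for every stock code and collect results and failures."""
    results, failures = {}, {}
    for code in itemcodes:
        try:
            results[code] = fetch(build_url(code))
        except Exception as exc:
            # Record the error and continue, so one bad code does not stop the run.
            failures[code] = exc
    return results, failures
```

With a list of all stock codes, `results, failures = scrape_all(codes, fetch)` would then give one entry per stock, and the failed codes can be retried separately.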

Is there a good way to scrape the cash flow data? I tried Scrapy, but it is difficult to integrate with the other scraping code I am already using.

Solution

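There's also dryscrape (a library written by me, so the recommendation is a bit biased, obviously :)), which uses a fast WebKit-based in-memory browser to navigate around. It understands JavaScript too, but is a lot more lightweight than Selenium.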
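As a rough illustration of that approach, the sketch below visits the summary page, clicks the two tabs named in the question, and returns the rendered HTML. Several things here are assumptions rather than facts about the site: that the tabs are `<a>` elements whose text contains the tab label, that the table is part of the main document rather than a separate frame, and that no extra waiting is needed after a click. `tab_xpath`, `fetch_cash_flow_html`, and `make_session` are hypothetical helper names; the session calls used (`visit`, `at_xpath`, `click`, `body`) are dryscrape's.

```python
START_URL = ("http://www.hankyung.com/stockplus/main.php"
             "?module=stock&mode=stock_analysis_infomation&itemcode={code}")
TAB_LABELS = ["재무제표", "현금흐름표"]  # financial statements, then cash flow


def tab_xpath(label):
    """XPath for a link whose text contains the tab label (assumed markup)."""
    return f'//a[contains(normalize-space(.), "{label}")]'


def fetch_cash_flow_html(session, code):
    """Visit the summary page, click through the tabs, return the rendered HTML."""
    session.visit(START_URL.format(code=code))
    for label in TAB_LABELS:
        tab = session.at_xpath(tab_xpath(label))  # None when nothing matches
        if tab is None:
            raise LookupError(f"tab {label!r} not found for stock {code}")
        tab.click()
    return session.body()


def make_session():
    """Create a dryscrape session; imported lazily so the helpers load without it."""
    import dryscrape
    return dryscrape.Session()
```

Something like `html = fetch_cash_flow_html(make_session(), "078340")` would then hand the rendered page to whatever parser is already in use. If the cash flow table turns out to live in a frame loaded from the hidden URL, this sketch would need to be adapted.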
