Is there a Python library that allows you to screen-scrape a web site that relies heavily on JavaScript?


Question


    Possible Duplicate:
    What's a good tool to screen-scrape with Javascript support?

    I’m trying to do some screen-scraping of my bank’s website. (I know, I’m probably onto a loser, but bear with me.)

    The site seems to be setting several cookies, with varying session-related values, via JavaScript, and then redirecting to the home page if it can’t find those values.

    I’ve been trying to figure out a way to spot the values of those cookies by searching the HTML/JavaScript code of the pages, but the relevant code looks very obfuscated, so I’m having a hard time doing it.

    Is there a Python library that simulates a web browser with JavaScript enabled? I was thinking something like mechanize that also:

    • parses the HTML page returned (e.g. with something like lxml)
    • parses any JavaScript on the HTML page
    • sets any cookies set by the JavaScript
    • amends the parsed HTML page with any DOM modifications made by the JavaScript

    Basically a web browser that’s programmable in Python. Failing that, a solution in any other language.

    Solution

    I answered a similar question: "Click on a javascript link within python?"
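
The linked answer is not reproduced on this page. As a rough illustration of the kind of tool the question asks for, here is a minimal sketch that assumes Selenium WebDriver driving Firefox; that choice is an assumption and is not named anywhere above. `fetch_rendered_page` and `cookies_to_dict` are hypothetical helper names. The idea is to let a real browser run the page's JavaScript, then read back the resulting HTML and cookies from Python.

```python
from typing import Dict, List, Tuple


def cookies_to_dict(cookies: List[dict]) -> Dict[str, str]:
    """Flatten WebDriver-style cookie records into a name -> value mapping."""
    return {c["name"]: c["value"] for c in cookies}


def fetch_rendered_page(url: str) -> Tuple[str, Dict[str, str]]:
    """Load `url` in a real browser; return (html, cookies) after scripts have run."""
    # Imported lazily so cookies_to_dict works without Selenium installed.
    from selenium import webdriver

    # Assumption: Firefox and its driver (geckodriver) are available locally.
    driver = webdriver.Firefox()
    try:
        # get() returns once the page has loaded; a page that redirects or
        # mutates the DOM later may need an explicit wait before reading it.
        driver.get(url)
        html = driver.page_source  # serialized DOM as the browser currently sees it
        cookies = cookies_to_dict(driver.get_cookies())  # includes cookies set by JavaScript
    finally:
        driver.quit()  # always close the browser, even on errors
    return html, cookies
```

The returned HTML string could then be handed to a parser such as lxml, which covers the first and last items on the wish list; the cookie mapping covers the third.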
