Extracting and parsing HTML from a secure website with Python?


Question

Let's dive right in, shall we?

I need to write a script. I don't care what language; I'd prefer something like Python or JavaScript, but I will take the time to learn whatever works. The script will access multiple URLs, extract the text from each site, and store it in a folder on my PC. (From there I manipulate the data with Python, which I know how to do.)

Currently I am using Python's NLTK module. Here is a simplified version of my code:

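import nltk
from urllib2 import urlopen  # Python 2; on Python 3 use urllib.request

url  = "<URL HERE>"
html = urlopen(url).read()
raw = nltk.clean_html(html)  # strips markup; NLTK 3 removed this (raises NotImplementedError)
print(raw)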

This code works fine for both http and https URLs, but not for pages that require authentication.

Is there a Python module that handles secure authentication?

Thanks in advance for any help! And to the mods who will view this as a bad question: please just tell me how to make it better. I need ideas from people, not Google.

Recommended answer

Mechanize is one option; the other is to do it with plain urllib2.
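As a rough illustration of the urllib2 route, here is a minimal sketch. It makes several assumptions that are not in the answer: that the site uses HTTP Basic authentication (a site with a login form would need a different approach, such as Mechanize), that you are on Python 3, where urllib2's functionality lives in urllib.request, and that the helper names make_auth_opener and fetch_html, the example.com URLs, and the credentials are all placeholders.

```python
import urllib.request


def make_auth_opener(top_level_url, username, password):
    """Build an opener that answers HTTP Basic auth challenges.

    Hypothetical helper: credentials are offered for any URL at or
    below top_level_url.
    """
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None means "use these credentials whatever realm the server names"
    password_mgr.add_password(None, top_level_url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


def fetch_html(opener, url):
    """Download url through the given opener and return the raw bytes."""
    with opener.open(url) as response:
        return response.read()


# Example usage (placeholder URL and credentials, needs network access):
# opener = make_auth_opener("https://example.com/", "user", "secret")
# html = fetch_html(opener, "https://example.com/protected/page.html")
```

The returned bytes can then be passed to whatever text-extraction step you already use. On Python 2 the same classes are available under the urllib2 module name.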
