Extracting and parsing HTML from a secure website with Python?
Question
Let's dive right in, shall we?
OK, I need to write a script. I don't care what language; I'd prefer something like Python or JavaScript, but I will take the time to learn whatever works. The script will access multiple URLs, extract the text from each site, and store it in a folder on my PC. (From there I manipulate the data with Python, which I know how to do.)
Currently I am using Python's NLTK module. Here is a simple version of my code:
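import nltk
from urllib import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen

url = "<URL HERE>"
html = urlopen(url).read()   # fetch the raw page bytes
raw = nltk.clean_html(html)  # strip markup; NLTK 2.x only, removed in NLTK 3
print(raw)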
This code works fine for both http and https URLs, but not for pages that require authentication.
Is there a Python module that handles secure authentication?
Thanks in advance for the help! And to the mods who will view this as a bad question: please just give me ways to make it better. I need ideas from people, not from Google.
Recommended answer
Mechanize is one option; the other is to do it with plain urllib2.
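As a rough illustration of the urllib2 route, here is a minimal sketch. It assumes the site uses HTTP Basic authentication, which the question does not confirm; a site with a login form would need a different approach, such as Mechanize or cookie handling. It is written against urllib.request, the Python 3 successor to urllib2. The helper names build_auth_opener and fetch_html, the example URL, and the credentials are all hypothetical placeholders.

```python
import urllib.request


def build_auth_opener(top_level_url, username, password):
    """Return an opener that answers HTTP Basic auth challenges.

    Assumption: the target site uses HTTP Basic authentication.
    """
    # Passing None as the realm makes these credentials the default
    # for any realm under top_level_url.
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, top_level_url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


def fetch_html(opener, url):
    """Fetch a page through the authenticated opener and return raw bytes."""
    with opener.open(url) as response:
        return response.read()


if __name__ == "__main__":
    # Hypothetical URL and credentials; replace with real values.
    opener = build_auth_opener("https://example.com/", "user", "secret")
    html = fetch_html(opener, "https://example.com/protected/page.html")
    print(html[:200])
```

The returned bytes can then be passed to whatever text-extraction step is already in use, in place of the urlopen(url).read() call.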