Python Web Crawlers and "getting" html source code


Problem description


So my brother wanted me to write a web crawler in Python (self-taught), and I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library docs, but I have a few problems:

1. The httplib.HTTPConnection and request concept is new to me, and I don't understand whether it downloads an HTML script like a cookie or an instance. If you do both of those, do you get the source for a website page? And what are some terms I would need to know to modify the page and return the modified page?

Just for background, I need to download a page and replace any img with ones I have.

And it would be nice if you guys could tell me your opinion of 2.7 and 3.1.

Solution

Use Python 2.7; it has more third-party libraries at the moment. (Edit: see below.)

I recommend using the stdlib module urllib2; it lets you comfortably fetch web resources. Example:

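import urllib2  # Python 2 only; in Python 3 this lives in urllib.request

# Open the URL and read the raw HTML of the response.
response = urllib2.urlopen("http://google.de")
page_source = response.read()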

To parse the HTML, have a look at BeautifulSoup.

BTW: what exactly do you want to do:

Just for background, I need to download a page and replace any img with ones I have
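If that is the goal, here is one rough way it could look. This is only a sketch, written for Python 3, and it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs. The names ImgSrcRewriter and replace_img_sources are made up for illustration. It assumes every replacement image comes from a single URL, and it only rewrites src attributes that already exist.

```python
from html import escape
from html.parser import HTMLParser


class ImgSrcRewriter(HTMLParser):
    """Re-emit parsed HTML, pointing every <img> src at a replacement URL."""

    def __init__(self, new_src):
        # convert_charrefs=False keeps entities such as &amp; as separate
        # events, so they can be written back out unchanged.
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.parts = []

    def _emit_start(self, tag, attrs, self_closing):
        if tag != "img":
            # Not an image: copy the tag exactly as it appeared in the input.
            self.parts.append(self.get_starttag_text())
            return
        pieces = ["<img"]
        for name, value in attrs:
            if name == "src":
                value = self.new_src
            if value is None:
                pieces.append(" " + name)  # attribute without a value
            else:
                pieces.append(' %s="%s"' % (name, escape(value, quote=True)))
        pieces.append(" />" if self_closing else ">")
        self.parts.append("".join(pieces))

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs, False)

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, True)

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)


def replace_img_sources(page_source, new_src):
    """Return page_source with every existing <img> src set to new_src."""
    parser = ImgSrcRewriter(new_src)
    parser.feed(page_source)
    parser.close()
    return "".join(parser.parts)
```

You would call it as replace_img_sources(page_source, "my_image.png"), where page_source is the decoded HTML text. BeautifulSoup would probably be shorter and more forgiving with messy real-world pages; this version just avoids the dependency.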

Edit: It's 2014 now and most of the important libraries have been ported, so you should definitely use Python 3 if you can. python-requests is a very nice high-level library that is easier to use than urllib2.
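For reference, the urllib2 example above might translate to Python 3 roughly as follows. This sketch stays with the standard library (urllib.request, which replaced urllib2 in Python 3) rather than python-requests, so nothing has to be installed. The function name fetch_page_source, the 10-second timeout, and the UTF-8 fallback are all assumptions made for illustration.

```python
from urllib.request import urlopen


def fetch_page_source(url, timeout=10):
    """Fetch `url` and return the response body decoded to text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header if there is one;
        # falling back to UTF-8 is an assumption, not something the server promised.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_page_source("http://google.de") should give you the page's HTML as a str. In Python 3, read() returns bytes, which is why the sketch adds a decoding step.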
