Print web page source code in Python

Question

I want to print a web page's source code, but Python's print command just prints empty space, and I think it's because of the page's large size. Is there any way to print the page source in the shell, or at least to a file? I've tried writing it to a file, but this error occurred:

UnicodeEncodeError: 'charmap' codec can't encode character '\u06cc' in position 11826: character maps to <undefined>

How can I fix it?

import urllib.request
response = urllib.request.urlopen('http://www.farsnews.com')
html = response.read()

print(html)  # prints empty space!

hf=open('test.txt','w')
a=str(html,'utf-8')
hf.write(a)
hf.close()

Python easily prints a[0:1000], but for a[0:len(a)] I get, as I said, empty space!

Solution

I've just tried the same on Win7 using Python 3.2.5, and here's what I got:

Python 3.2.5 (default, May 15 2013, 23:07:10) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib import request
>>> r = request.urlopen("http://www.farsnews.com")
>>> bytecode = r.read()
>>> htmlstr = bytecode.decode()
>>> print(bytecode)

Printing bytecode works well because print shows the bytes object's repr, in which every non-ASCII byte appears as an ASCII escape such as \xdb. Printing htmlstr, however, raises a UnicodeEncodeError on Windows, because some characters cannot be encoded in the console's default encoding (Windows' cmd.exe does not use a Unicode encoding by default).
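The difference can be reproduced without any network access. The following is a minimal sketch, not the answerer's code: the sample string is made up for illustration, and 'cp866' is assumed as an example of a legacy console code page.

```python
# Hypothetical sample: UTF-8 bytes for a string containing U+06CC,
# the character named in the question's error message.
raw = "\u06cc".encode("utf-8")   # stands in for response.read()
text = raw.decode("utf-8")       # bytes -> str succeeds

# The repr of a bytes object is pure ASCII, so printing it always works.
print(repr(raw))

# Printing a str makes Python encode it for the console. With a legacy
# code page that lacks the character, that encoding step fails.
# Assumption: the console encoding is 'cp866'.
try:
    text.encode("cp866")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("encode failed:", exc.reason)
```

Note that the failure happens while encoding the already decoded string for output, not while decoding the downloaded bytes.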

In my case the encoding in use was 'cp866', as I saw in the traceback.

By default, decode() in Python 3 assumes the 'utf-8' encoding when converting bytes to a string. If you want to override that, you should explicitly specify the encoding to use for decoding.
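Specifying the encoding explicitly also addresses the file error from the question: open('test.txt', 'w') without an encoding argument falls back to the locale's default encoding, which on Windows is typically a legacy code page handled by the 'charmap' codec. A minimal sketch, assuming a made-up page string and a temporary file in place of the question's test.txt:

```python
import os
import tempfile

# Hypothetical stand-in for the decoded page; it contains U+06CC.
page = "<html>\u06cc</html>"

# Assumption: a temporary directory is used instead of the question's test.txt.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Passing encoding= makes the file UTF-8 regardless of the locale default,
# so characters outside the Windows code page no longer raise an error.
with open(path, "w", encoding="utf-8") as hf:
    hf.write(page)

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8") as hf:
    round_trip = hf.read()
```

Using a with block also closes the file automatically, so no separate hf.close() call is needed.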

So here's a possible workaround:

>>> safe_str = bytecode.decode(encoding='cp866', errors='ignore')
>>> print(safe_str)

Actually, it's equivalent to

>>> safe_str = str(bytecode, encoding='cp866', errors='ignore')
>>> print(safe_str)

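The errors parameter controls what happens when the encoding you're trying to use cannot decode a particular byte sequence: 'strict' (the default) raises a UnicodeDecodeError, 'ignore' silently drops the offending bytes, and 'replace' substitutes the replacement character U+FFFD.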
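The effect of the errors parameter can be sketched as follows. The sample data is made up, and 'ascii' is used for decoding only because it guarantees that the non-ASCII bytes are invalid. The last line shows an alternative that is not part of the answer above: decoding correctly first, then replacing characters the console cannot show, again assuming a 'cp866' console.

```python
data = "a\u06ccb".encode("utf-8")  # b'a\xdb\x8cb'

# 'strict' (the default): raise on bytes that are invalid for the codec.
try:
    data.decode("ascii")
    strict_result = "no error"
except UnicodeDecodeError:
    strict_result = "UnicodeDecodeError"

# 'ignore': drop the invalid bytes.
ignored = data.decode("ascii", errors="ignore")

# 'replace': insert U+FFFD for each invalid byte.
replaced = data.decode("ascii", errors="replace")

# Alternative for console output (assumption: the console uses 'cp866'):
# decode as UTF-8, then replace unencodable characters with '?'.
printable = data.decode("utf-8").encode("cp866", errors="replace").decode("cp866")

print(strict_result, repr(ignored), repr(replaced), printable)
```

With 'ignore' and 'replace' some information is lost, so they are best kept for display purposes rather than for data that will be saved or processed further.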
