python - Chinese characters are garbled after saving a page fetched with selenium and PhantomJS as .html?

Views: 718 Published: 2017/9/6 6:15:01 Tags: python selenium phantomjs garbled Chinese text


Problem description

Problem

The saved HTML document contains garbled Chinese characters:
<meta name="keywords" content="鈽呯敤閽㈢惔璇犻噴鍛ㄦ澃浼︹櫔鏃犱笌浼︽瘮涓嶉€濈粡鍏革紝姊︽兂瀹禯eDragon锛岄挗鐞达紝缁忓吀锛岃交闊充箰">
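The garbled keywords above begin with 鈽呯敤閽㈢惔, which is the pattern produced when the UTF-8 bytes of ★用钢琴 are decoded as GBK. That suggests the file itself may be valid UTF-8 and is only being read with the wrong encoding by whatever opens it. A minimal sketch of that round trip, assuming the original text started with ★用钢琴 (an inference from the sample, not something stated in the question):

```python
# Assumed original text; inferred from the garbled sample, not given in the question.
original = "★用钢琴"

# Encode as UTF-8 (what the script writes to disk), then decode as GBK
# (what a viewer on a Chinese-locale Windows system may assume by default).
garbled = original.encode("utf-8").decode("gbk")

# The mistake is reversible as long as no bytes were lost along the way.
recovered = garbled.encode("gbk").decode("utf-8")
assert recovered == original
```

If the saved file shows this pattern, checking which encoding the browser or editor used to open it is a reasonable first step before changing the scraping code.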

Code:

from selenium import webdriver
browser = webdriver.PhantomJS( )
url = 'http://music.163.com/#/playlist?id=11362719'
browser.get(url)  # open the page
browser.switch_to.frame(browser.find_element_by_xpath("//iframe"))
#title = browser.find_elements_by_xpath('//*[@id="play-count"]')
#title = browser.find_elements_by_xpath('//*tr/@class')  
#print(browser.page_source.encoding('utf-8'))
print(browser.page_source,file=open('C:/Users/welwel/Desktop/source.html','w',encoding='utf-8'))
browser.quit()

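1. type(browser.page_source) shows that the value is a str, so .decode cannot be used to convert it.
2. I am using IDLE with Python 3.5 on Windows 7; sys.getdefaultencoding() reports the default encoding as 'utf-8'.
3. Calling print(browser.page_source) directly raises an error: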

Traceback (most recent call last):
  File "C:\Users\welwel\Desktop\wangyi.py", line 8, in <module>
    print(browser.page_source)
  File "C:\Python35-32\lib\idlelib\PyShell.py", line 1344, in write
    return self.shell.write(s, self.tags)
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 55288-55288: Non-BMP character not supported in Tk
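The traceback says that the Tk widget behind this IDLE shell cannot display characters outside the Basic Multilingual Plane (code points above U+FFFF, such as many emoji). A common workaround is to replace those characters before printing. The helper name bmp_safe below is hypothetical, and the sample string is a stand-in for browser.page_source:

```python
def bmp_safe(text: str, placeholder: str = "\ufffd") -> str:
    """Replace every character above U+FFFF with a placeholder."""
    return "".join(ch if ord(ch) <= 0xFFFF else placeholder for ch in text)


# Stand-in for browser.page_source; contains one non-BMP character (an emoji).
sample = "钢琴 \U0001F3B9 经典"
safe = bmp_safe(sample)

# Only the emoji was replaced; the Chinese text is untouched.
assert safe == "钢琴 \ufffd 经典"
# print(safe) should then avoid the UCS-2 error in that IDLE shell.
```

This only affects what is shown in the shell. Writing the unmodified string to a file opened with encoding='utf-8' does not need this step.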

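The first time a page is scraped, print(browser.page_source,file=open('C:/Users/welwel/Desktop/source.html','w',encoding='utf-8'))
works without error. Inside a for loop, however, the Chinese text comes out garbled from the second page onward. I am not sure whether this is a bug. Has anyone else run into it?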

Solution

Try this:

print(browser.page_source.encode('utf-8').decode(), file=open("xxx.html","w", encoding='utf-8'))
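In Python 3, encoding a str as UTF-8 and decoding it again gives back an equal str, so it may also be worth checking how the file is written and closed. Here is a minimal sketch of saving several pages in a loop with a context manager and then verifying the bytes on disk. The helper save_page, the sample strings (stand-ins for browser.page_source), and the temporary directory are all assumptions for illustration:

```python
import os
import tempfile


def save_page(html: str, path: str) -> None:
    """Write html to path as UTF-8 and close the file immediately."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(html)


# Stand-ins for browser.page_source on successive loop iterations.
pages = [
    '<meta name="keywords" content="钢琴，经典，轻音乐">',
    '<meta name="keywords" content="★用钢琴诠释周杰伦♪">',
]

out_dir = tempfile.mkdtemp()
saved_paths = []
for index, html in enumerate(pages):
    path = os.path.join(out_dir, "source_{}.html".format(index))
    save_page(html, path)
    saved_paths.append(path)

# Read the raw bytes back and decode them as UTF-8 to check each file.
for path, html in zip(saved_paths, pages):
    with open(path, "rb") as fh:
        assert fh.read().decode("utf-8") == html
```

If this check passes but the file still looks garbled in a browser or editor, the viewer is probably guessing a different encoding, and opening the file explicitly as UTF-8 should display it correctly.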
