Scraping Chinese characters in Python


Problem description

I learned how to scrape websites from https://automatetheboringstuff.com. I wanted to scrape http://www.piaotian.net/html/3/3028/1473227.html, where the content is in Chinese, and write that content to a .txt file. However, the .txt file contains random symbols, which I assume is an encoding/decoding problem.

I've read the thread "how to decode and encode web page with python?" and figured the encoding for my site is "gb2312" or "windows-1252". I tried decoding with both of those encodings but failed.

Can someone kindly explain the problem with my code? I'm very new to programming, so please point out my misconceptions as well!

Also, when I remove "html.parser" from the code, the .txt file turns out to be empty instead of at least containing symbols. Why is that?

import bs4, requests, sys

reload(sys)
sys.setdefaultencoding("utf-8")

novel = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
novel.raise_for_status()

novelSoup = bs4.BeautifulSoup(novel.text, "html.parser")

content = novelSoup.select("br")

novelFile = open("novel.txt", "w")
for i in range(len(content)):
    novelFile.write(str(content[i].getText()))

Answer

novel = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
novel.raise_for_status()
# Tell Requests the page is GBK-encoded before accessing novel.text,
# so the body is decoded correctly instead of with the guessed charset
novel.encoding = "GBK"
novelSoup = bs4.BeautifulSoup(novel.text, "html.parser")

Output:

<br>
    一元宗,坐落在青峰山上,绵延极长,现在是盛夏时节,天空之中,太阳慢慢落了下去,夕阳将影子拉的很长。<br/>
<br/>
    一片不是很大的小湖泊边上,一个约莫着十七八岁的青衣少年坐在湖边,抓起湖边的一块石头扔出,顿时在湖边打出几朵浪花。<br/>
<br/>
    叶希文有些茫然,他没想到,他居然穿越了,原本叶希文只是二十一世纪的地球上一个普通的大学生罢了,一个月了,他才后知后觉的反应过来,这不是有人和他进行恶作剧,而是,他真的穿越了。<br/>
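To see why the explicit novel.encoding = "GBK" line is needed, it helps to compare what Requests guessed with what the page actually uses. Below is a minimal sketch (assuming the page still serves the same headers); encoding is the charset Requests inferred from the HTTP headers, while apparent_encoding is its guess from the raw bytes of the body:

import requests

r = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
# Charset inferred from the Content-Type header; may be wrong or missing
print(r.encoding)
# Charset detected from the body bytes; for this page typically a GB variant
print(r.apparent_encoding)
# Use the body-based detection (or set "GBK" explicitly) before reading r.text
r.encoding = r.apparent_encoding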

Requests will automatically decode content from the server. Most unicode charsets are seamlessly decoded.

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property:

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'

If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text.
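
Putting the fix together with the original goal of writing the chapter to a .txt file, here is a minimal Python 3 sketch. The select("br") extraction and the novel.txt file name simply mirror the question's code and are kept for illustration only:

import bs4, requests

novel = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
novel.raise_for_status()
novel.encoding = "GBK"  # decode the response as GBK before reading novel.text

novelSoup = bs4.BeautifulSoup(novel.text, "html.parser")
content = novelSoup.select("br")  # same tag selection as in the question

# Write the extracted text out as UTF-8 so the Chinese characters survive in the file
with open("novel.txt", "w", encoding="utf-8") as novelFile:
    for tag in content:
        novelFile.write(tag.getText())

Opening the file with encoding="utf-8" in Python 3 also makes the reload(sys) / sys.setdefaultencoding("utf-8") workaround from the question unnecessary; that hack only exists in Python 2.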
