How to remove non-ASCII characters in Python

Problem description

This is my code:

#!C:/Python27/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io

url = "http://www.dlib.org/dlib/november14/beel/11beel.html"
#url = "http://eqa.unibo.it/article/view/4554"
#r = requests.get(url)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
#soup = BeautifulSoup(r.text,'lxml')

if url.find("http://www.dlib.org") != -1:
    div = soup.find('td', valign='top')
else:
    div = soup.find('div',id='content')

f = open('path/file_name.html', 'w')
f.write(str(div))
f.close()

While scraping these web pages, I found that some non-ASCII characters end up in the HTML file written by this script. I need to either remove them or convert them into readable characters. Any advice? Thanks.

Recommended answer

A single-byte character is 8 bits wide (values 0-255), while ASCII characters need only 7 bits (values 0-127), so you can simply keep the characters whose ord value is below 128 and drop everything at 128 or above.

chr converts an integer to a character; ord converts a character to an integer.
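
To make the two functions concrete, here is a tiny sketch. The variable names are only illustrative, and the u'' literal is used so the same lines behave the same on Python 2.7 and Python 3.

```python
# ord maps a one-character string to its integer code; chr maps a code back.
code_a = ord('A')               # 65, inside the ASCII range
char_a = chr(65)                # 'A'
code_e_acute = ord(u'\u00e9')   # 233, the code point of "e with acute accent"

# The filter in this answer boils down to this comparison.
is_ascii = code_e_acute < 128   # False, so this character would be dropped
```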

text = ''.join(c for c in str(div) if ord(c) < 128)
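
If you want the filter as something reusable, a minimal sketch could look like the following. The function names strip_non_ascii and strip_non_ascii_codec are made up for illustration, and the second variant assumes the input is a Unicode string rather than a raw byte string.

```python
def strip_non_ascii(text):
    """Keep only the characters whose ord value is below 128."""
    return ''.join(c for c in text if ord(c) < 128)


def strip_non_ascii_codec(text):
    """Same idea for Unicode text: let the ASCII codec skip what it cannot encode."""
    return text.encode('ascii', 'ignore').decode('ascii')


# Sample input: "cafe" with an accented e, an en dash, "naive" with a diaeresis.
sample = u"caf\u00e9 \u2013 na\u00efve"
print(strip_non_ascii(sample))
print(strip_non_ascii_codec(sample))
```

Both variants delete the offending characters outright, so accented letters disappear instead of being replaced by their plain counterparts.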

This should be your final code:

#!C:/Python27/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io

url = "http://www.dlib.org/dlib/november14/beel/11beel.html"
#url = "http://eqa.unibo.it/article/view/4554"
#r = requests.get(url)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
#soup = BeautifulSoup(r.text,'lxml')

if url.find("http://www.dlib.org") != -1:
    div = soup.find('td', valign='top')
else:
    div = soup.find('div',id='content')

f = open('path/file_name.html', 'w')
text = ''.join(c for c in str(div) if ord(c) < 128)  # keep only ASCII characters
f.write(text)
f.close()
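
The question also mentions turning the characters into something readable instead of deleting them. One possible approach, not part of the answer above, is to decompose accented letters with the standard unicodedata module before dropping what is left. This is only a sketch: to_readable_ascii is a hypothetical helper name, and it assumes the text is a Unicode string (on Python 2.7 that would probably mean passing something like unicode(div) rather than str(div)).

```python
import unicodedata


def to_readable_ascii(text):
    """Replace accented letters with their base letters, then drop the rest."""
    # NFKD splits a character such as U+00E9 into 'e' plus a combining accent.
    decomposed = unicodedata.normalize('NFKD', text)
    # The combining marks and any other non-ASCII characters are then ignored.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_readable_ascii(u"caf\u00e9 na\u00efve"))
```

Characters without an ASCII decomposition, such as an en dash, are still removed, so this helps mostly with accented Latin letters.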
