Warning: Some characters could not be decoded, and were replaced by REPLACEMENT CHARACTER

Views: 760 | Published: 2016/8/5 19:18:40 | Tags: python unicode encoding web-scraping beautifulsoup


Problem description

I'm creating a script to download some mp3 podcasts from a site and write them to a certain location. I'm nearly finished, and the files are being downloaded and created. However, I'm running into a problem where the binary data can't be fully decoded and the mp3 files won't play.

Here's my code:

import re
import os
import urllib2
from bs4 import BeautifulSoup
import time

def getHTMLstring(url):
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    soupString = soup.encode('utf-8')
    return soupString

def getList(html_string):
    urlList = re.findall('(http://podcast\.travelsinamathematicalworld\.co\.uk\/mp3/.*\.mp3)', html_string)
    firstUrl = urlList[0]
    finalList = [firstUrl]

    for url in urlList:
        if url != finalList[0]:
            finalList.insert(0,url)

    return finalList

def getBinary(netLocation):
    req = urllib2.urlopen(netLocation)
    reqSoup = BeautifulSoup(req)
    reqString = reqSoup.encode('utf-8')
    return reqString

def getFilename(string):
    splitTerms = string.split('/')
    fileName = splitTerms[-1]
    return fileName

def writeFile(sourceBinary, fileName):
    with open(fileName, 'wb') as fp:
        fp.write(sourceBinary)



def main():
    htmlString = getHTMLstring('http://www.travelsinamathematicalworld.co.uk')
    urlList = getList(htmlString)

    fileFolder = 'D:\\Dropbox\\Mathematics\\Travels in a Mathematical World\\Podcasts'
    os.chdir(fileFolder)

    for url in urlList:
        name = getFilename(url)
        binary = getBinary(url)
        writeFile(binary, name)
        time.sleep(2)



if __name__ == '__main__':
    main()

When I run the code, I get the following warning in my console:

WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

I'm thinking that it has to do with the fact that the data that I'm using is encoded in UTF-8, and maybe the write method expects a different encoding? I'm new to Python (and really to programming in general), and I'm stuck.

Solution

I assume you want to download some mp3 files from a list of URLs.
You can use BeautifulSoup to extract those URLs from the page, but you don't need BeautifulSoup for the mp3 files themselves: they are binary data, not HTML. Just save the response body directly.
For example:

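import urllib2

url = 'http://acl.ldc.upenn.edu/P/P96/P96-1004.pdf'
fileName = url.split('/')[-1]  # 'P96-1004.pdf', same idea as getFilename() in the question
res = urllib2.urlopen(url)
# Write the raw response bytes; no parsing or re-encoding
with open(fileName, 'wb') as fp:
    fp.write(res.read())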
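
The same idea can be applied to the question's getBinary/writeFile pair. What follows is only a minimal sketch, assuming Python 3, where `urllib.request.urlopen` takes the place of `urllib2.urlopen`. The helper names `save_stream` and `download` and the `chunk_size` default are hypothetical and do not come from the original post. The point it illustrates is that the bytes go from the response to the file without ever being decoded as text.

```python
import shutil
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen


def save_stream(source, file_name, chunk_size=64 * 1024):
    """Copy raw bytes from a binary file-like object to disk, chunk by chunk.

    Hypothetical helper: nothing is decoded or re-encoded along the way.
    """
    with open(file_name, 'wb') as fp:
        shutil.copyfileobj(source, fp, chunk_size)


def download(url, file_name):
    """Fetch url and store the response body unchanged in file_name."""
    with urlopen(url) as res:
        save_stream(res, file_name)
```

In the question's main loop, a call such as `download(url, getFilename(url))` could stand in for the getBinary and writeFile calls. Copying in chunks rather than calling `read()` once is a design choice that keeps memory use flat for large podcast files.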

If I instead hand the same PDF URL to BeautifulSoup:

reqSoup = BeautifulSoup('http://acl.ldc.upenn.edu/P/P96/P96-1004.pdf')

reqSoup is not the PDF file. BeautifulSoup does not fetch the URL; it parses the string itself as markup and wraps it in an HTML document. The result is actually:

<html><body><p>http://acl.ldc.upenn.edu/P/P96/P96-1004.pdf</p></body></html>
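
This also suggests why the mp3 files in the question would not play. BeautifulSoup treats whatever it is given as text, so binary audio bytes are decoded as characters, and any byte sequence that is not valid text is presumably swapped for U+FFFD, the REPLACEMENT CHARACTER named in the warning. The sketch below is a rough illustration in Python 3 using only the built-in `bytes.decode` and `str.encode`. It is not BeautifulSoup's own decoding logic, and the sample bytes and the `round_trip` name are made up for the demonstration.

```python
def round_trip(data):
    """Decode bytes as UTF-8 text (lossily), then encode the text back to bytes."""
    as_text = data.decode('utf-8', errors='replace')  # invalid bytes become U+FFFD
    return as_text.encode('utf-8')


# Made-up sample: bytes resembling the start of an MP3 frame, not valid UTF-8.
audio_like = b'\xff\xfb\x90\x00'

# Plain ASCII text survives the round trip; the binary data does not.
print(round_trip(b'hello') == b'hello')
print(round_trip(audio_like) == audio_like)
```

Once the original bytes have been replaced, encoding the text back to UTF-8 cannot restore them, so the `write` call was never the problem: the data was already damaged before it reached the file.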
