What is the Unicode encoding error in the code below?


Problem description

When I run the program below, I get a Unicode encoding error.

import bs4
import requests
from xhtml2pdf import pisa  # import python module
from xhtml2pdf.config.httpconfig import httpConfig

res = requests.get("https://www.insightsonindia.com/2018/06/04/insights-daily-current-affairs-04-june-2018/")
soup = bs4.BeautifulSoup(res.text, 'lxml')
pf = soup.find("div", class_="pf-content")

sourceHtml = str(pf)
outputFilename = "test.pdf"

def convertHtmlToPdf(sourceHtml, outputFilename):
    # skip SSL certificate verification for resources xhtml2pdf fetches
    httpConfig.save_keys('nosslcheck', True)

    # open output file for writing (truncated binary)
    resultFile = open(outputFilename, "w+b")

    # convert HTML to PDF
    pisaStatus = pisa.CreatePDF(sourceHtml, dest=resultFile, encoding="utf-8")

    # close output file
    resultFile.close()

    # return True on success and False on errors
    return pisaStatus.err

# Main program
if __name__ == "__main__":
    pisa.showLogging()
    convertHtmlToPdf(sourceHtml, outputFilename)

The error is given below:

self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 37: ordinal not in range(128)

I'm trying to save a portion of a website as a PDF. To do that, I scrape the page with bs4, keep the part I need, and then convert it to PDF with xhtml2pdf. Most of the time this works like a charm, but for this page it raises the error above. The full code is available on GitHub.

xhtml2pdf is encoding with ASCII, and since my HTML contains non-ASCII characters, it raises this error. I don't know how to change the encoding that xhtml2pdf uses. Omitting the non-ASCII characters is not an option: if I drop them, the links to the images will be corrupted and the images will not show in the PDF.

Full traceback

Traceback (most recent call last):
  File "test3.py", line 80, in <module>
    convertHtmlToPdf(sourceHtml, outputFilename)
  File "test3.py", line 68, in convertHtmlToPdf
    pisaStatus = pisa.CreatePDF(sourceHtml, dest=resultFile, encoding= 'utf-8')
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\document.py", line 97, in pisaDocument
    encoding, context=context, xml_output=xml_output)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\document.py", line 59, in pisaStory
    pisaParser(src, context, default_css, xhtml, encoding, xml_output)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 759, in pisaParser
    pisaLoop(document, context)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 700, in pisaLoop
    pisaLoop(node, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  [Previous line repeated 2 more times]
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 514, in pisaLoop
    attr = pisaGetAttributes(context, node.tagName, node.attributes)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 124, in pisaGetAttributes
    nv = c.getFile(nv)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\context.py", line 818, in getFile
    return getFile(name, relative or self.pathDirectory)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\util.py", line 738, in getFile
    file = pisaFileObject(*a, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\util.py", line 644, in __init__
    conn.request("GET", path)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1240, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1107, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 37: ordinal not in range(128)

Recommended answer

The problem is that the retrieved HTML contains img tags, and some of their src attributes are URLs that contain the '\u2019' (RIGHT SINGLE QUOTATION MARK) character.

xhtml2pdf passes these URLs to Python's http.client module without escaping them first. http.client tries to encode the request line, which includes the URL path, as ASCII before sending it, and that is where the error is raised.
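The failure can be reproduced without xhtml2pdf or any network access. The sketch below only mimics the `request.encode('ascii')` call shown in the traceback; the request line and image path are made up for illustration and are not taken from the original page.

```python
# Hypothetical request line, similar in shape to what http.client builds.
# The path is invented; it only needs to contain U+2019.
request_line = "GET /wp-content/uploads/india\u2019s-map.jpg HTTP/1.1"

error = None
try:
    # Same kind of call as in http/client.py in the traceback above
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

Any URL path containing a character outside the ASCII range would fail in the same way, which is why the URLs have to be percent-encoded before xhtml2pdf sees them.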

This can be worked around by escaping the URLs in the retrieved HTML before generating the PDF.

urllib.parse provides the tools to do this.

from urllib import parse
...
res = requests.get("https://www.insightsonindia.com/2018/06/04/insights-daily-current-affairs-04-june-2018/")
soup = bs4.BeautifulSoup(res.text, 'lxml')
pf = soup.find("div", class_="pf-content")

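# only consider img tags that actually have a src attribute
imgs = pf.find_all('img', src=True)
for img in imgs:
    url = img['src']
    scheme, netloc, path, params, query, fragment = parse.urlparse(url)
    # percent-encode non-ASCII characters in the path, such as '\u2019'
    new_path = parse.quote(path)
    new_url = parse.urlunparse((scheme, netloc, new_path, params, query, fragment))
    img['src'] = new_url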

sourceHtml = str(pf)
outputFilename = "test.pdf"
...
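The same escaping step could also be pulled out into a small helper so it can be checked on its own. This is a minimal sketch, not part of the original answer: the name `quote_url_path` and the example.com URLs are hypothetical, and keeping `%` in the safe set is an added assumption so that paths which are already percent-encoded are not encoded a second time. Like the loop above, it only escapes the path, not the query string.

```python
from urllib import parse


def quote_url_path(url):
    """Percent-encode non-ASCII characters in the path of a URL.

    Hypothetical helper for illustration. '%' is treated as safe so that
    an already-escaped path such as '/a%20b.png' is left unchanged.
    """
    parts = parse.urlsplit(url)
    new_path = parse.quote(parts.path, safe="/%")
    return parse.urlunsplit(
        (parts.scheme, parts.netloc, new_path, parts.query, parts.fragment)
    )


# Made-up URL containing U+2019 in the path
escaped = quote_url_path("https://example.com/img/india\u2019s-map.jpg?x=1")
print(escaped)
```

With a helper like this, the body of the loop would reduce to `img['src'] = quote_url_path(img['src'])`.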

The answers to this question provide some background information on Unicode and URLs.
