What is the Unicode encoding error in the code below?
Question
When I run the program below, I get a Unicode encoding error.
import bs4
import requests
from xhtml2pdf import pisa  # import python module
from xhtml2pdf.config.httpconfig import httpConfig

res = requests.get("https://www.insightsonindia.com/2018/06/04/insights-daily-current-affairs-04-june-2018/")
soup = bs4.BeautifulSoup(res.text, 'lxml')
pf = soup.find("div", class_="pf-content")
sourceHtml = str(pf)
outputFilename = "test.pdf"

def convertHtmlToPdf(sourceHtml, outputFilename):
    httpConfig.save_keys('nosslcheck', True)
    # open output file for writing (truncated binary)
    resultFile = open(outputFilename, "w+b")
    # convert HTML to PDF
    pisaStatus = pisa.CreatePDF(sourceHtml, dest=resultFile, encoding="utf-8")
    # close output file
    resultFile.close()
    # return True on success and False on errors
    return pisaStatus.err

# Main program
if __name__ == "__main__":
    pisa.showLogging()
    convertHtmlToPdf(sourceHtml, outputFilename)
The error is:
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 37: ordinal not in range(128)
I'm trying to download a portion of a website using xhtml2pdf. To do that, I scrape the page with bs4, store the result, and then save it as a PDF with xhtml2pdf. Most of the time this works like a charm, but for this page it fails with the error above. The full code is available on GitHub.
xhtml2pdf appears to be encoding with ASCII, and since my HTML contains non-ASCII characters, it raises an error. I don't know how to change the encoder in xhtml2pdf. Omitting the non-ASCII characters is not an option: if I drop them, the links to the images will be corrupted and the images will not show up in the PDF.
Full traceback
```
Traceback (most recent call last):
  File "test3.py", line 80, in <module>
    convertHtmlToPdf(sourceHtml, outputFilename)
  File "test3.py", line 68, in convertHtmlToPdf
    pisaStatus = pisa.CreatePDF(sourceHtml, dest=resultFile, encoding= 'utf-8')
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\document.py", line 97, in pisaDocument
    encoding, context=context, xml_output=xml_output)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\document.py", line 59, in pisaStory
    pisaParser(src, context, default_css, xhtml, encoding, xml_output)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 759, in pisaParser
    pisaLoop(document, context)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 700, in pisaLoop
    pisaLoop(node, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  [Previous line repeated 2 more times]
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 514, in pisaLoop
    attr = pisaGetAttributes(context, node.tagName, node.attributes)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 124, in pisaGetAttributes
    nv = c.getFile(nv)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\context.py", line 818, in getFile
    return getFile(name, relative or self.pathDirectory)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\util.py", line 738, in getFile
    file = pisaFileObject(*a, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\util.py", line 644, in __init__
    conn.request("GET", path)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1240, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1107, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 37: ordinal not in range(128)
```
Recommended answer
The problem is that the retrieved HTML contains img tags, some of whose src attributes are URLs containing the '\u2019' (RIGHT SINGLE QUOTATION MARK) character.
xhtml2pdf passes these URLs to Python's http.client module without escaping them first. http.client tries to encode the request, including the URL path, as ASCII before sending it, and that is where the error is raised.
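The failure can be reproduced without xhtml2pdf or any network access. The sketch below only mimics the `request.encode('ascii')` call shown at the bottom of the traceback; the request line and image path are made up for illustration and are not taken from the actual page.

```python
# Hypothetical request line, similar in shape to what http.client builds.
# The path contains U+2019 (RIGHT SINGLE QUOTATION MARK), which is not ASCII.
request_line = "GET /uploads/india\u2019s-map.jpg HTTP/1.1"

error = None
try:
    # Same kind of call as in http/client.py putrequest()
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

The `encoding="utf-8"` argument passed to `pisa.CreatePDF` does not come into play here: as the traceback shows, the exception is raised in `http.client` while the image is being fetched.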
This can be worked around by escaping the URLs in the retrieved HTML before generating the PDF.
urllib.parse provides the tools to do this.
from urllib import parse
...
res = requests.get("https://www.insightsonindia.com/2018/06/04/insights-daily-current-affairs-04-june-2018/")
soup = bs4.BeautifulSoup(res.text, 'lxml')
pf = soup.find("div", class_="pf-content")

# Percent-encode the path of every image URL so that it is pure ASCII
imgs = pf.find_all('img')
for img in imgs:
    url = img['src']
    scheme, netloc, path, params, query, fragment = parse.urlparse(url)
    new_path = parse.quote(path)
    new_url = parse.urlunparse((scheme, netloc, new_path, params, query, fragment))
    img['src'] = new_url

sourceHtml = str(pf)
outputFilename = "test.pdf"
...
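The escaping step can also be pulled out into a small function that is easy to check on its own. This is a minimal sketch rather than part of the original answer: the name `escape_url_path` and the example.com URLs are hypothetical, and passing `safe="/%"` is an assumption made here so that URLs that are already percent-encoded are not encoded a second time.

```python
from urllib import parse


def escape_url_path(url):
    """Return url with non-ASCII and unsafe path characters percent-encoded.

    Hypothetical helper for illustration. Only the path is touched;
    scheme, host, query and fragment are passed through unchanged.
    """
    parts = parse.urlsplit(url)
    # Keep "/" and "%" as-is so existing escapes such as %20 survive intact.
    new_path = parse.quote(parts.path, safe="/%")
    return parse.urlunsplit(
        (parts.scheme, parts.netloc, new_path, parts.query, parts.fragment)
    )


# Example URLs are made up for illustration.
escaped = escape_url_path("https://example.com/img/india\u2019s map.jpg?x=1")
unchanged = escape_url_path("https://example.com/img/a%20b.jpg")
print(escaped)
print(unchanged)
```

In the loop above, this would be used as `img['src'] = escape_url_path(img['src'])`. Note that it leaves the host name and query string alone, so a URL with non-ASCII characters in those parts would need additional handling.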
The answers to this question provide some background information on Unicode and URLs.