Error: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte in Google Colab

Views: 59  Published: 2021/9/5 19:52:32  Tags: python tensorflow compiler-errors

Problem description

import PyPDF4
from google.colab import files
files.upload()
fileReader = PyPDF4.PdfFileReader('ITC-1.pdf')
s=""
for i in range(2, fileReader.numPages):
    s+=fileReader.getPage(i).extractText()


sentences = []
while s.find('.') != -1:
    index = s.find('.')
    sentences.append(s[:index])
    s = s[index+1:]

text_ds = tf.data.TextLineDataset('ITC-1.pdf').filter(lambda x: tf.cast(tf.strings.length(x), bool))
vectorize_layer.adapt(text_ds.batch(1024))
inverse_vocab = vectorize_layer.get_vocabulary()

The last line of the code above raises the error. I have read several posts to understand what it means, but none of the suggested solutions work for me. I cannot use my local machine because I need access to GPUs. Please suggest a workaround. Thanks!

PS: I am following the code at https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/word2vec.ipynb#scrollTo=haJUNjSB60Kh; the only difference is the way I read the file. If there is a better way to do it, please let me know!

Recommended answer

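import pdfplumber
import tensorflow as tf
# On newer TensorFlow releases the layer is also available as tf.keras.layers.TextVectorization.
from tensorflow.keras.layers.experimental import preprocessing

# Extract the text of every page and write it to a plain UTF-8 text file.
with pdfplumber.open('test.pdf') as pdf, open('test.txt', 'w', encoding='utf-8') as f:
    for page in pdf.pages:
        # extract_text() can return None for pages without a text layer.
        text = page.extract_text() or ''
        f.write(text + '\n')

# Build the dataset from the text file instead of the binary PDF, skipping empty lines.
layer = preprocessing.TextVectorization()
text_ds = tf.data.TextLineDataset('test.txt').filter(lambda x: tf.cast(tf.strings.length(x), bool))

layer.adapt(text_ds.batch(1024))
inverse_vocab = layer.get_vocabulary()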

You can do it like this:

Read the PDF with pdfplumber.
Write the pages to a text file.
Then use that text file to create the dataset.
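
As a rough illustration of why these steps help: a PDF is a binary file, so its raw bytes generally do not decode as UTF-8, while the extracted text file does. The sketch below uses only the standard library; the helper name is_utf8_text and the sample file contents are made up for the example and are not part of the original question or answer.

```python
import os
import tempfile


def is_utf8_text(path):
    """Return True if the whole file decodes as UTF-8 (hypothetical helper)."""
    with open(path, "rb") as fh:
        data = fh.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Made-up stand-ins for the two files: PDF-like binary bytes vs. extracted text.
workdir = tempfile.mkdtemp()
pdf_like = os.path.join(workdir, "sample.pdf")
txt_like = os.path.join(workdir, "sample.txt")

with open(pdf_like, "wb") as fh:
    # 0xb0 is a continuation byte, so it cannot start a UTF-8 sequence.
    fh.write(b"%PDF-1.4\n\xb0\xb1\xb2\xb3\n")
with open(txt_like, "w", encoding="utf-8") as fh:
    fh.write("First line of extracted text.\nSecond line.\n")

pdf_ok = is_utf8_text(pdf_like)   # binary content fails the check
txt_ok = is_utf8_text(txt_like)   # plain text passes the check

# Decoding the offending byte directly gives the message from the title.
try:
    b"\xb0".decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)
```

Running a check like this on a file before handing it to tf.data.TextLineDataset is one way to confirm that the file really is plain text.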
