Read Contents of a Tarfile into Python - "seeking backwards is not allowed"


Question

I am new to Python and am having trouble reading the contents of a tarfile into Python.

The data are the contents of a journal article hosted at PubMed Central. The article information and a link to the tarfile I want to read into Python are below.

http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?id=PMC13901
ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz

I have a list of similar .tar.gz files that I will eventually want to read in as well. As far as I know, every one of these tarfiles contains a .nxml file, and it is the content of the .nxml files that I actually want to extract and read. I am open to any suggestions on the best way to do this.

Here is what I have if I save the tarfile to my PC. Everything runs as expected.

import tarfile

tarfile_name = "F:/PMC_OA_TextMining/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
tfile = tarfile.open(tarfile_name)

tfile_members = tfile.getmembers()

tfile_members1 = []
for i in range(len(tfile_members)):
    tfile_members_name = tfile_members[i].name
    tfile_members1.append(tfile_members_name)

tfile_members2 = []
for i in range(len(tfile_members1)):
    if tfile_members1[i].endswith('.nxml'):
        tfile_members2.append(tfile_members1[i])

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

I learned today that in order to access the tarfile directly from PubMed Central's FTP site, I have to set up a network request using urllib. Below is the revised code, based on the Stack Overflow answer I received to this earlier question:

Read contents of a .tar.gz file from a website into a Python 3.x object

import tarfile
import urllib.request

tarfile_name = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_name)
tfile = tarfile.open(fileobj=ftpstream, mode="r|gz")

However, when I run the remaining piece of the code (below), I get an error message: "seeking backwards is not allowed". Why?

tfile_members = tfile.getmembers()

tfile_members1 = []
for i in range(len(tfile_members)):
    tfile_members_name = tfile_members[i].name
    tfile_members1.append(tfile_members_name)

tfile_members2 = []
for i in range(len(tfile_members1)):
    if tfile_members1[i].endswith('.nxml'):
        tfile_members2.append(tfile_members1[i])

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

The code fails on the last line, where I try to read the .nxml content from my tarfile. The actual error message is below. What does it mean? What is my best workaround for reading the content of these .nxml files, all of which are embedded in tarfiles?

Traceback (most recent call last):
  File "F:\PMC_OA_TextMining\test2.py", line 135, in <module>
    tfile_extract1_text = tfile_extract1.read()
  File "C:\Python30\lib\tarfile.py", line 804, in read
    buf += self.fileobj.read()
  File "C:\Python30\lib\tarfile.py", line 715, in read
    return self.readnormal(size)
  File "C:\Python30\lib\tarfile.py", line 722, in readnormal
    self.fileobj.seek(self.offset + self.position)
  File "C:\Python30\lib\tarfile.py", line 531, in seek
    raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed

Thanks in advance for your help. Chris

Recommended answer

What's going wrong: Tar files are stored interleaved, in the order header, data, header, data, and so on. When you enumerated the files with getmembers(), you read through the entire archive to collect the headers. When you then asked the tarfile object to read a member's data, it tried to seek backward from the end of the archive to that member's data. But you can't seek backward in a network stream without closing and reopening the urllib request, and a tarfile opened in stream mode ("r|gz") refuses to seek backward at all.
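The failure can be reproduced without any network access. The following is a minimal sketch, not the asker's actual data: it builds a tiny .tar.gz in memory (the member names and payloads are made up for illustration) and runs the same "list everything, then read the first member" sequence in both random-access mode ("r:gz") and stream mode ("r|gz").

```python
import io
import tarfile


def build_sample_targz() -> bytes:
    """Build a small .tar.gz in memory; member names are hypothetical."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in [("article/a.nxml", b"<article>first</article>"),
                              ("article/b.txt", b"second member")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_first_member(data: bytes, mode: str) -> bytes:
    """List all members first, then read the first one (the pattern from the question)."""
    with tarfile.open(fileobj=io.BytesIO(data), mode=mode) as tar:
        members = tar.getmembers()  # walks every header, ending at the end of the archive
        return tar.extractfile(members[0]).read()  # must jump back to the first data block


data = build_sample_targz()

# Random-access mode: jumping back to the first member is allowed.
random_access_result = read_first_member(data, "r:gz")

# Stream mode: the same sequence of calls is rejected.
try:
    read_first_member(data, "r|gz")
    stream_error = None
except tarfile.StreamError as exc:
    stream_error = str(exc)

print(random_access_result)
print(stream_error)
```

The only difference between the two calls is the mode string, which suggests that the error comes from the order of operations in stream mode rather than from the archive itself.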

How to work around it: Download the whole file first, saving a temporary copy either to disk or to an in-memory BytesIO buffer. Then enumerate the members of that temporary copy and extract the ones you want.

#!/usr/bin/env python3
from io import BytesIO
import urllib.request
import tarfile

tarfile_url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_url)

# BytesIO creates an in-memory temporary file.
# See the Python manual: http://docs.python.org/3/library/io.html
tmpfile = BytesIO()
while True:
    # Download a piece of the file from the connection
    s = ftpstream.read(16384)

    # Once the entire file has been downloaded, read() returns b''
    # (the empty bytes object), which is a falsy value
    if not s:
        break
        break

    # Otherwise, write the piece of the file to the temporary file.
    tmpfile.write(s)
ftpstream.close()

# Now that the FTP stream has been downloaded to the temporary file,
# we can ditch the FTP stream and have the tarfile module work with
# the temporary file.  Begin by seeking back to the beginning of the
# temporary file.
tmpfile.seek(0)

# Now tell the tarfile module that you're using a file object
# that supports seeking backward.
# r|gz forbids seeking backward; r:gz allows seeking backward
tfile = tarfile.open(fileobj=tmpfile, mode="r:gz")

# You want to limit it to the .nxml files
tfile_members2 = [filename
                  for filename in tfile.getnames()
                  if filename.endswith('.nxml')]

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

# And when you're done extracting members:
tfile.close()
tmpfile.close()
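If holding the whole archive in memory is undesirable, one possible alternative (not part of the answer above) is to stay in stream mode and read each member as soon as its header is reached, so that nothing ever needs a backward seek. The sketch below assumes this single-pass pattern; the helper name read_nxml_streaming and the demo archive contents are hypothetical, and an in-memory archive stands in for the FTP stream so the example runs offline.

```python
import io
import tarfile


def read_nxml_streaming(fileobj) -> dict:
    """Return {member name: bytes} for every .nxml member, reading front to back."""
    contents = {}
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:  # yields members in archive order
            if member.isfile() and member.name.endswith(".nxml"):
                # Read the data now, before moving on to the next header.
                contents[member.name] = tar.extractfile(member).read()
    return contents


def build_demo_archive() -> io.BytesIO:
    """Stand-in for the network stream: a small .tar.gz with made-up members."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in [("pkg/figure.txt", b"not wanted"),
                              ("pkg/article.nxml", b"<article>body</article>"),
                              ("pkg/notes.txt", b"also not wanted")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


nxml_files = read_nxml_streaming(build_demo_archive())
print(nxml_files)
```

For the real archive, the object returned by urllib.request.urlopen(tarfile_url) would be passed in place of build_demo_archive(). The trade-off is that members can only be read in archive order and only once, so this suits the "grab the .nxml and move on" case better than repeated random access.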
