Processing a large (>32mb) xml file over appengine


Problem description


I'm trying to process large (~50MB) XML files to store in the datastore. I've tried using backends, sockets (to pull the file via urlfetch), and even uploading the file directly within my source code, but I keep running into limits (i.e. the 32MB limit).

So, I'm really confused (and a little angry/frustrated). Does App Engine really have no real way to process a large file? There does seem to be one potential workaround, which would involve remote_apis, Amazon (or Google Compute, I guess) and a security/setup nightmare...

HTTP ranges were another thing I considered, but it would be painful to somehow stitch the different split parts back together (unless I can manage to split the file at exact points).

This seems crazy, so I thought I'd ask Stack Overflow... am I missing something?

Update: I tried using range requests, and it looks like the server I'm trying to stream from doesn't support them. So right now I'm thinking of downloading the file, hosting it on another server, then using App Engine to access that via ranged HTTP requests on backends, AND then automating the entire process so I can run it as a cron job :/ (the craziness of having to do all this work for something so simple... sigh)
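For reference, a ranged fetch from App Engine would look roughly like the sketch below; the URL and byte offsets are whatever you choose, and it only helps if the remote server actually honors the Range header (which, per the update, this one did not):

from google.appengine.api import urlfetch

def fetch_chunk(url, start, end):
    # ask for bytes start..end inclusive; a cooperating server answers
    # 206 Partial Content, while others just send the whole file
    result = urlfetch.fetch(url, headers={'Range': 'bytes=%d-%d' % (start, end)})
    if result.status_code == 206:
        return result.content
    raise ValueError('server ignored the Range header')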

Solution

What about storing it in Cloud Storage and reading it incrementally? You can access it line by line (in Python, anyway), so it won't consume all your resources.
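A minimal sketch of that line-by-line pattern, assuming the cloudstorage client library; the GCS path and process_line are hypothetical placeholders for your own file and handling:

import cloudstorage as gcs

def scan_file(filename):
    # filename is a GCS path such as '/my-bucket/large-file.xml' (hypothetical)
    gcs_file = gcs.open(filename)        # file-like, read-only buffer
    line = gcs_file.readline()
    while line:                          # only one line in memory at a time
        process_line(line)               # placeholder for real handling
        line = gcs_file.readline()
    gcs_file.close()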

https://developers.google.com/appengine/docs/python/googlecloudstorageclient/

https://developers.google.com/storage/

The GCS client library lets your application read files from and write files to buckets in Google Cloud Storage (GCS). This library supports reading and writing large amounts of data to GCS, with internal error handling and retries, so you don't have to write your own code to do this. Moreover, it provides read buffering with prefetch so your app can be more efficient.

The GCS client library provides the following functionality:

An open method that returns a file-like buffer on which you can invoke standard Python file operations for reading and writing.
A listbucket method for listing the contents of a GCS bucket.
A stat method for obtaining metadata about a specific file.
A delete method for deleting files from GCS.
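As an illustration of the last three, here is a hedged sketch of how those calls look with the cloudstorage library; the bucket and file names are made up:

import cloudstorage as gcs

bucket = '/my-bucket'                          # hypothetical bucket name

for entry in gcs.listbucket(bucket):           # list the bucket's contents
    print entry.filename, entry.st_size        # GCSFileStat attributes

info = gcs.stat(bucket + '/large-file.xml')    # metadata for one file
print info.st_size, info.content_type

gcs.delete(bucket + '/large-file.xml')         # remove the file from GCS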

I've processed some very large CSV files in exactly this way - read as much as I need to, process, then read some more.

import os
import cloudstorage as gcs  # the GCS client library for App Engine

def read_file(self, filename):
    self.response.write('Truncated file content:\n')

    gcs_file = gcs.open(filename)              # file-like, read-only buffer
    self.response.write(gcs_file.readline())   # first line of the file
    gcs_file.seek(-1024, os.SEEK_END)          # jump to the last 1KB
    self.response.write(gcs_file.read())       # read from there to the end
    gcs_file.close()

Incremental reading with standard Python!
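Since the original problem was a ~50MB XML file, the same incremental idea can be combined with a streaming XML parser. This sketch is not part of the original answer: it assumes the cloudstorage library, and 'record' and handle_record are hypothetical names for the repeated element and your own processing:

import cloudstorage as gcs
import xml.etree.cElementTree as ET

def process_xml(filename):
    gcs_file = gcs.open(filename)
    # iterparse pulls from the file-like buffer in chunks and fires an
    # event as each element closes, so the whole document never has to
    # fit in memory at once
    for event, elem in ET.iterparse(gcs_file):
        if elem.tag == 'record':        # hypothetical element name
            handle_record(elem)         # e.g. build a datastore entity
            elem.clear()                # free the finished subtree
    gcs_file.close()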
