How to upload data in bulk to the appengine datastore? Older methods do not work


Question

This should be a fairly common requirement, and a simple process: upload data in bulk to the appengine datastore.

However, none of the older solutions mentioned on stackoverflow (links below*) seem to work anymore. The bulkloader method, which was the most reasonable solution when uploading to the datastore using the DB API, doesn't work with the NDB API.

And now the bulkloader method seems to have been deprecated, and the old links, which are still present in the docs, lead to the wrong page. Here's an example:

https://developers.google.com/appengine/docs/python/tools/uploadingdata

The above link is still present on this page: https://developers.google.com/appengine/docs/python/tools/uploadinganapp

What is the recommended method for bulk-loading data now?

The two feasible alternatives seem to be 1) using the remote_api or 2) writing a CSV file to a GCS bucket and reading from that. Does anybody have experience successfully using either method?

Any pointers will be greatly appreciated. Thanks!

[*The solutions offered at the links below are no longer valid]

[1] how does one upload data in bulk to a google appengine datastore? (https://stackoverflow.com/questions/741599/how-does-one-upload-data-in-bulk-to-a-google-appengine-datastore)

[2] How to insert bulk data in Google App Engine Datastore?

Solution

Method 1: Use remote_api

How to: write a bulkloader.yaml file and run it directly with the "appcfg.py upload_data" command from the terminal. I don't recommend this method for a couple of reasons: 1. huge latency, and 2. no support for NDB.
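For reference, a minimal, untested sketch of what this might look like; the kind and property names (DataStoreModel, attr1, link) are assumptions borrowed from the code further down, and the exact flags should be double-checked against the appcfg.py documentation:

    # bulkloader.yaml -- minimal sketch; kind/properties are assumed
    python_preamble:
    - import: google.appengine.ext.bulkload.transform

    transformers:
    - kind: DataStoreModel
      connector: csv
      property_map:
        - property: attr1
          external_name: attr1
        - property: link
          external_name: link

Then run it from the terminal, roughly:

    appcfg.py upload_data --config_file=bulkloader.yaml --filename=data.csv \
        --kind=DataStoreModel --url=http://<your-app-id>.appspot.com/_ah/remote_api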

Method 2: GCS and mapreduce

Uploading the data file to GCS:

Use the "storage-file-transfer-json-python" github project (chunked_transfer.py) to upload files to GCS from your local system. Make sure to generate a proper "client-secrets.json" file from the app engine admin console.

Mapreduce:

Use the "appengine-mapreduce" github project. Copy the "mapreduce" folder into your project's top-level folder.

Add the following lines to your app.yaml file:

    includes:
      - mapreduce/include.yaml

Below is your main.py file:

    import cgi
    import webapp2
    import logging
    import os, csv
    from models import DataStoreModel
    import StringIO
    from google.appengine.api import app_identity
    from mapreduce import base_handler
    from mapreduce import mapreduce_pipeline
    from mapreduce import operation as op
    from mapreduce.input_readers import InputReader

    def testmapperFunc(newRequest):
        # Each input record is one line of the CSV file; parse it and
        # yield a datastore Put operation for the mapreduce framework.
        f = StringIO.StringIO(newRequest)
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            newEntry = DataStoreModel(attr1=row[0], link=row[1])
            yield op.db.Put(newEntry)

    class TestGCSReaderPipeline(base_handler.PipelineBase):
        def run(self, filename):
            # The mapper and input reader are given as dotted-path strings;
            # "testgcs.testmapperFunc" must match the module that actually
            # contains testmapperFunc.
            yield mapreduce_pipeline.MapreducePipeline(
                    "test_gcs",
                    "testgcs.testmapperFunc",
                    "mapreduce.input_readers.FileInputReader",
                    mapper_params={
                        "files": [filename],
                        "format": 'lines'
                    },
                    shards=1)

    class tempTestRequestGCSUpload(webapp2.RequestHandler):
        def get(self):
            bucket_name = os.environ.get('BUCKET_NAME',
                                         app_identity.get_default_gcs_bucket_name())

            # FileInputReader expects the /gs/<bucket>/<object> path form
            bucket = '/gs/' + bucket_name
            filename = bucket + '/' + 'tempfile.csv'

            pipeline = TestGCSReaderPipeline(filename)
            pipeline.with_params(target="mapreducetestmodtest")
            pipeline.start()
            self.response.out.write('done')

    application = webapp2.WSGIApplication([
        ('/gcsupload', tempTestRequestGCSUpload),
    ], debug=True)
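
Both main.py samples import DataStoreModel from a models module that the answer never shows. A minimal sketch of what it could look like (property names taken from the mappers; note that op.db.Put in Method 2 works on old-style db entities, while Method 3 below uses ndb.put_multi, which needs ndb entities):

    # models.py -- hypothetical model matching the attr1/link fields above
    from google.appengine.ext import db, ndb

    # For Method 2 (mapreduce's op.db.Put expects a db entity):
    class DataStoreModel(db.Model):
        attr1 = db.StringProperty()
        link = db.StringProperty()

    # For Method 3 (ndb.put_multi expects ndb entities), use this instead:
    # class DataStoreModel(ndb.Model):
    #     attr1 = ndb.StringProperty()
    #     link = ndb.StringProperty()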
    

To remember:

1. The mapreduce project uses the now-deprecated "Google Cloud Storage Files API", so future support is not guaranteed.
2. Mapreduce adds a small overhead to datastore reads and writes.

Method 3: GCS and the GCS Client Library

1. Upload the csv/text file to GCS using the file-transfer method above.
2. Use the GCS client library (copy the 'cloudstorage' folder into your application's top-level folder).

Add the below code to the application's main.py file.

    import cgi
    import webapp2
    import logging
    import jinja2
    import os, csv
    import cloudstorage as gcs
    from google.appengine.ext import ndb
    from google.appengine.api import app_identity
    from models import DataStoreModel

    class UploadGCSData(webapp2.RequestHandler):
        def get(self):
            bucket_name = os.environ.get('BUCKET_NAME',
                                         app_identity.get_default_gcs_bucket_name())
            bucket = '/' + bucket_name
            filename = bucket + '/tempfile.csv'
            self.upload_file(filename)

        def upload_file(self, filename):
            # The GCS client library exposes the object as a readable
            # file-like handle, so csv.reader can consume it directly.
            gcs_file = gcs.open(filename)
            datareader = csv.reader(gcs_file)
            count = 0
            entities = []
            for row in datareader:
                count += 1
                newProd = DataStoreModel(attr1=row[0], link=row[1])
                entities.append(newProd)

                # Flush in batches of 50 to keep each put_multi call small
                if count % 50 == 0 and entities:
                    ndb.put_multi(entities)
                    entities = []

            # Write whatever remains from the final partial batch
            if entities:
                ndb.put_multi(entities)

    application = webapp2.WSGIApplication([
        ('/gcsupload', UploadGCSData),
    ], debug=True)
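
One caveat with Method 3: the whole import runs inside a single request handler, which is subject to App Engine's 60-second request deadline. For larger files it may be safer to push the work onto the task queue. A rough sketch using the deferred library (it assumes upload_file has been refactored into a module-level function, and that the deferred builtin is enabled in app.yaml):

    import webapp2
    from google.appengine.api import app_identity
    from google.appengine.ext import deferred

    class UploadGCSDataDeferred(webapp2.RequestHandler):
        def get(self):
            bucket = '/' + app_identity.get_default_gcs_bucket_name()
            # Queue the import as a task; tasks get a much longer
            # (10-minute) deadline than user-facing requests.
            deferred.defer(upload_file, bucket + '/tempfile.csv')
            self.response.out.write('import queued')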
    
