Is it possible to run Google Cloud Platform NLP-API entity sentiment analysis in a batch processing mode for a large number of documents?
Question
I am relatively new to Google Cloud Platform. I have a large dataset (18 Million articles). I need to do an entity-sentiment analysis using GCP's NLP-API. I am not sure if the way I have been conducting my analysis is the most optimal way in terms of the time it takes to get the entity-sentiment for all the articles. I wonder if there is a way to batch-process all these articles instead of iterating through each of them and making an API call. Here is a summary of the process I have been using.
- I have about 500 files, each containing about 30,000 articles.
- Using a Python script on my local server, I iterate through each file and call the function shown below for each article.
After this step, I don't require the Google API and perform my final analysis on the API output stored in the protobufs.
This worked well enough for a research project where I had about 1.5 Million articles and took a few days. Now that I have 18 Million articles, I wonder if there is a better way to go about this. The articles I have read about batch-processing are geared towards making an app or image processing tasks. There was something like what I wanted here but I am not sure if I can do this with NLP-API.
This is a snippet of my code and DF is a Pandas data frame where I have my articles.
import json
import os
import sys
import time

import six
from google.cloud import language
from google.cloud.language import enums, types

client = language.LanguageServiceClient()

def entity_sentiment_text(text):
    """Detects entity sentiment in the provided text."""
    if isinstance(text, six.binary_type):
        text = text.decode('utf-8')
    document = types.Document(
        content=text.encode('utf-8'),
        type=enums.Document.Type.PLAIN_TEXT)
    # Detect and send native Python encoding to receive correct word offsets.
    encoding = enums.EncodingType.UTF32
    if sys.maxunicode == 65535:
        encoding = enums.EncodingType.UTF16
    result = client.analyze_entity_sentiment(document, encoding)
    return result

# `folder`, `article_ids`, `DF`, and `log_exception` are defined elsewhere.
for i, id_val in enumerate(article_ids):
    loop_start = time.time()
    if i % 100 == 0:
        print i
    # create dynamic name, like "D:\Current Download\Attachment82673"
    dynamic_folder_name = os.path.join(folder, str(i))
    # create 'dynamic' dir, if it does not exist
    if not os.path.exists(dynamic_folder_name):
        os.makedirs(dynamic_folder_name)
    file_name = str(id_val) + ".txt"
    text = list(DF.loc[id_val])[1]
    try:
        text = unicode(text, errors='ignore')
        result = entity_sentiment_text(text)
        with open(os.path.join(dynamic_folder_name, str(id_val) + ".bin"), 'w') as result_file:
            result_file.write(result.SerializeToString())
    except Exception as e:
        print(e)
        with open("../article_id_error.log", "a") as error_file:
            error_file.write(json.dumps(str(id_val) + "\n"))
        log_exception(e, id_val)
Note that this is a one-time analysis for research and I am not building an app. I also know that I cannot reduce the number of calls to the API. In summary, if I am making 18 Million calls, what is the quickest way to make all these calls instead of going through each article and calling the function individually?
I feel like I should be doing some kind of parallel processing, but I am a bit wary about spending more time learning about Dataproc without knowing if that will help me with my problem.
Answer
You will need to manage merging documents to obtain a smaller total job count. You will also need to rate-limit your requests for both requests per minute and total requests per day.
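Since there is no batch endpoint for entity sentiment, throughput comes from parallelizing the per-document calls while staying under quota. A minimal sketch of a thread-pooled, rate-limited loop follows; the `analyze` body is a stand-in (it just returns the article id and its digit count), so swap in the real `entity_sentiment_text` call and set `max_per_minute` to your approved quota:

```python
import time
import threading
from concurrent.futures import ThreadPoolExecutor

class RateLimiter:
    """Spaces calls evenly so at most `max_per_minute` run per minute."""
    def __init__(self, max_per_minute):
        self.interval = 60.0 / max_per_minute
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def wait(self):
        # Reserve the next time slot under the lock, then sleep outside it.
        with self.lock:
            now = time.monotonic()
            if self.next_slot < now:
                self.next_slot = now
            wait_for = self.next_slot - now
            self.next_slot += self.interval
        if wait_for > 0:
            time.sleep(wait_for)

def analyze(article_id, limiter):
    limiter.wait()
    # Stand-in for the real API call, e.g. entity_sentiment_text(text).
    return article_id, len(str(article_id))

limiter = RateLimiter(max_per_minute=600)  # adjust to your quota
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda a: analyze(a, limiter), range(20)))
```

Because `pool.map` preserves input order, the serialized results line up with the article ids even though the calls overlap.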
Pricing is based on the number of characters, billed in units of 1,000 characters. If you plan to process 18 million articles (how many words per article?), I would contact Google Sales to discuss your project and arrange for credit approval. You will hit quota limits very quickly, and then your jobs will return API errors.
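As a rough capacity check (the 4,000-character average article length here is an assumption; substitute your own), you can estimate the billable unit count like this:

```python
import math

def billable_units(num_chars):
    # Each document bills at least one 1,000-character unit;
    # any remainder rounds up to the next full unit.
    return max(1, math.ceil(num_chars / 1000.0))

articles = 18000000
avg_chars = 4000  # assumed average article length
units_per_article = billable_units(avg_chars)
total_units = articles * units_per_article
```

With these assumptions the job bills 4 units per article, or 72 million units overall, which is well past the point where you want quota and billing sorted out in advance.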
I would start with reading this section of the documentation:
https://cloud.google.com/natural-language/docs/resources