Improving the performance of preprocessing a large set of documents


Question

I am working on a project related to a plagiarism detection framework in Java. My document set contains about 100 documents, and I have to preprocess them and store them in a suitable data structure. My question is how to process a large set of documents efficiently while avoiding bottlenecks. The main focus of my question is how to improve preprocessing performance.

Thanks

Regards, 女wan

Recommended answer

You're a bit short on specifics there. Appropriate optimizations are going to depend on things like the document format, the average document size, how you are processing the documents, and what sort of information you are storing in your data structure. Without knowing any of that, some general optimizations are:

  1. Assuming that the preprocessing of a given document is independent of the preprocessing of any other document, and assuming you are running on a multi-core CPU, your workload is a good candidate for multithreading. Allocate one thread per CPU core and farm out jobs to those threads. Then you can process multiple documents in parallel.
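As a rough illustration of the one-thread-per-core idea, here is a minimal sketch using a fixed-size `ExecutorService`. The class name `ParallelPreprocessor` and the `preprocess` step (lower-casing and splitting on non-word characters) are hypothetical stand-ins, since the question does not say what the real preprocessing does; it also assumes each document is already available as a `String`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelPreprocessor {

    // Hypothetical per-document step: lower-case the text and split it
    // into tokens on runs of non-word characters.
    static List<String> preprocess(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Runs preprocess() on every document using one worker thread per
    // available core. Results come back in the same order as the input.
    static List<List<String>> preprocessAll(List<String> documents)
            throws InterruptedException, ExecutionException {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String doc : documents) {
                futures.add(pool.submit(() -> preprocess(doc)));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get()); // blocks until that document is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = Arrays.asList(
                "The quick brown fox.",
                "Jumps over the lazy dog!");
        System.out.println(preprocessAll(docs));
    }
}
```

This only pays off if the documents really are independent; any shared data structure the workers write to would need to be thread-safe or be filled in afterwards from the collected results, as done here.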

More generally, do as much in memory as you can. Avoid reading from and writing to disk as much as possible. If you must write to disk, try to wait until you have all the data you want to write, and then write it all in a single batch.
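The batching advice could look something like the following sketch: results are accumulated in an in-memory map and written out once, through a single buffered writer, after all documents have been processed. The class name `BatchedIndexWriter`, the choice of storing a token count per document, and the tab-separated output format are all assumptions made for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

class BatchedIndexWriter {

    // In-memory store: document id -> token count (hypothetical payload).
    // LinkedHashMap keeps insertion order so the output is predictable.
    private final Map<String, Integer> tokenCounts = new LinkedHashMap<>();

    // Records a result in memory only; no disk access happens here.
    void add(String docId, int tokenCount) {
        tokenCounts.put(docId, tokenCount);
    }

    // Writes everything collected so far in one pass through a buffered writer.
    void flush(Path out) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Integer> entry : tokenCounts.entrySet()) {
                writer.write(entry.getKey() + "\t" + entry.getValue());
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        BatchedIndexWriter index = new BatchedIndexWriter();
        index.add("doc1", 120);
        index.add("doc2", 87);
        Path out = Files.createTempFile("index", ".tsv");
        index.flush(out);
        System.out.println("Wrote " + out);
    }
}
```

With only about 100 documents the whole result set should fit comfortably in memory on most machines, though that depends on document size and on what is stored per document.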
