How to improve the performance when working with Wikipedia data and a huge number of webpages?


Problem Description

I am supposed to extract representative terms from an organisation's website using Wikipedia's article-link data dump. To achieve this, I've:

1. Crawled & downloaded the organisation's webpages. (~110,000)
2. Created a dictionary of Wikipedia IDs and terms/titles. (~40 million records)

Now, I'm supposed to process each of the webpages using the dictionary to recognise terms and track their term IDs & frequencies.

For the dictionary to fit in memory, I've split the dictionary into smaller files. Based on my experiment with a small data set, the processing time for the above will be around 75 days.

And this is just for 1 organisation. I have to do the same for more than 40 of them.

Implementation -

• HashMap for storing the dictionary in memory.
• Looping through each map entry to search for the term in a webpage, using a Boyer-Moore search implementation (a sketch of this loop follows below).
• Repeating the above for each webpage, and storing the results in a HashMap.
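
Roughly, in Scala-style pseudocode, this is what the loop looks like (String.indexOf standing in for the Boyer-Moore search; pages and dictionary are just illustrative names). It makes the cost explicit: every page is scanned once per dictionary entry, on the order of 110,000 pages x 40 million terms.

    // Scan every page once per dictionary entry, counting occurrences of each term.
    def countTerms(pages: Map[String, String],        // url -> page text
                   dictionary: Map[Long, String])     // term ID -> title term
        : Map[String, Map[Long, Int]] =
      pages.map { case (url, text) =>
        var freq = Map[Long, Int]()
        for ((termId, term) <- dictionary) {          // ~40 million iterations per page
          var count = 0
          var idx = text.indexOf(term)                // stands in for Boyer-Moore search
          while (idx >= 0) {                          // count every occurrence of this term
            count += 1
            idx = text.indexOf(term, idx + term.length)
          }
          if (count > 0) freq += (termId -> count)
        }
        url -> freq                                   // term frequencies for this page
      }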

I've tried optimizing the code and tuning the JVM for better performance.

Can someone please advise on a more efficient way to implement the above, reducing the processing time to a few days?

Is Hadoop an option to consider?

Solution

Based on your question:

Number of documents = 110,000

Dictionary => list of [TermID, Title Terms] = 40 million entries

Size of documents = 110,000 * 256KB per document on average = 26.9GB
(256KB per document on average)

Size of dictionary = 40 million * 256 bytes = 9.5GB of raw data
(256 bytes per entry on average)
      

How did you arrive at the 75-day estimate?

There are a number of performance factors:

• How are you storing the documents?
• How are you storing/retrieving the dictionary? (Assuming not all of it is in memory, unless you can afford that.)
• How many machines are you running it on?
• Are you performing the dictionary lookups in parallel? (Of course, assuming the dictionary is immutable once you have processed the whole of Wikipedia.)

Here is an outline of what I believe you are doing:

    val dictionary = readWikipediaDictionary()   // the 40-million-entry term dictionary
    val documents  = readDocuments()             // the sequence of crawled pages

    documents.map { doc =>
      var docTermFreq = Map[String, Int]()
      for (term <- doc.terms if dictionary.contains(term)) {   // look up each token
        docTermFreq = docTermFreq + (term -> (docTermFreq.getOrElse(term, 0) + 1))
      }
      docTermFreq   // store the docTermFreq map
    }
      

What this is essentially doing is breaking up each document into tokens and then looking each token up in the Wikipedia dictionary to check whether it exists there.

This is exactly what a Lucene Analyzer does.

A Lucene Tokenizer will convert a document into tokens. This happens before the terms are indexed into Lucene. So all you have to do is implement an Analyzer which can look up the Wikipedia dictionary and check whether or not a token is in it.
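
As a rough sketch, such a dictionary-aware Analyzer could look like this in Scala, assuming a recent Lucene release (package locations for classes such as CharArraySet and LowerCaseFilter move between major Lucene versions) and a hypothetical one-term-per-line dictionary file:

    import scala.io.Source
    import org.apache.lucene.analysis.{Analyzer, CharArraySet, LowerCaseFilter}
    import org.apache.lucene.analysis.miscellaneous.KeepWordFilter
    import org.apache.lucene.analysis.standard.StandardTokenizer

    // Keeps only tokens that appear in the Wikipedia dictionary; all other tokens
    // are dropped before indexing.
    class WikipediaTermAnalyzer(dictionary: CharArraySet) extends Analyzer {
      override protected def createComponents(fieldName: String): Analyzer.TokenStreamComponents = {
        val tokenizer  = new StandardTokenizer()                    // raw text -> tokens
        val lowercased = new LowerCaseFilter(tokenizer)             // normalise case before lookup
        val kept       = new KeepWordFilter(lowercased, dictionary) // drop tokens not in the dictionary
        new Analyzer.TokenStreamComponents(tokenizer, kept)
      }
    }

    object WikipediaTermAnalyzer {
      // Builds the in-memory dictionary from a (hypothetical) one-term-per-line file.
      def loadDictionary(path: String): CharArraySet = {
        val set = new CharArraySet(40000000, true)   // expected size, ignoreCase = true
        for (line <- Source.fromFile(path).getLines()) set.add(line.trim)
        set
      }
    }

KeepWordFilter keeps only the tokens present in the supplied CharArraySet, which is exactly the "is this token in the Wikipedia dictionary?" check.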

I would do it like this:

• Take every document and prepare a token stream from it (using the Analyzer described above).
• Index the document terms.
• At this point you will have Wikipedia terms only in the Lucene index (see the indexing sketch below).
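
As a rough sketch of those steps (the index directory, field names, and dictionary file below are made up, and the APIs again assume a recent Lucene release):

    import java.nio.file.Paths
    import org.apache.lucene.document.{Document, Field, FieldType, StringField}
    import org.apache.lucene.index.{IndexOptions, IndexWriter, IndexWriterConfig}
    import org.apache.lucene.store.FSDirectory

    object IndexPages {
      // Index every crawled page through the dictionary-aware analyzer, with term
      // vectors enabled so per-document frequencies can be read back later.
      val analyzer = new WikipediaTermAnalyzer(WikipediaTermAnalyzer.loadDictionary("wikipedia-terms.txt"))
      val writer   = new IndexWriter(FSDirectory.open(Paths.get("term-index")),
                                     new IndexWriterConfig(analyzer))

      val contentType = new FieldType()
      contentType.setTokenized(true)
      contentType.setIndexOptions(IndexOptions.DOCS_AND_FREQS)
      contentType.setStoreTermVectors(true)   // enables per-document term frequency vectors

      def indexPage(url: String, text: String): Unit = {
        val doc = new Document()
        doc.add(new StringField("url", url, Field.Store.YES))   // page identifier, not analyzed
        doc.add(new Field("content", text, contentType))        // analyzed: only dictionary terms survive
        writer.addDocument(doc)
      }

      def close(): Unit = writer.close()   // call once after all ~110,000 pages are indexed
    }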

When you do this, you will have ready-made statistics from the Lucene index, such as (a sketch of reading these back follows the list):

• Document Frequency
• TermFrequencyVector (exactly what you need)
• and a ready-to-use inverted index! (quick introduction to inverted index and retrieval)
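
For example, a rough sketch of pulling those statistics back out of the index built above (same hypothetical paths and field names; the term vector is available because it was stored during indexing):

    import java.nio.file.Paths
    import org.apache.lucene.index.{DirectoryReader, Term}
    import org.apache.lucene.store.FSDirectory

    object ReadStatistics {
      def main(args: Array[String]): Unit = {
        val reader = DirectoryReader.open(FSDirectory.open(Paths.get("term-index")))

        // Document frequency: in how many of the indexed pages a given term occurs.
        val df = reader.docFreq(new Term("content", "hadoop"))
        println(s"document frequency of 'hadoop': $df")

        // Term frequency vector of document 0: each kept Wikipedia term with its count.
        val vector = reader.getTermVector(0, "content")
        if (vector != null) {
          val termsEnum = vector.iterator()
          var term = termsEnum.next()
          while (term != null) {
            println(s"${term.utf8ToString()} -> ${termsEnum.totalTermFreq()}")
            term = termsEnum.next()
          }
        }
        reader.close()
      }
    }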

There are a lot of things you can do to improve the performance. For example:

• Parallelize the document stream processing (see the sketch after this list).
• You can store the dictionary in a key-value database such as BerkeleyDB or Kyoto Cabinet, or even in an in-memory key-value store such as Redis or Memcache.
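
For the first point, a minimal sketch using Scala parallel collections (provided by the separate scala-parallel-collections module on Scala 2.13+, built in before that); IndexPages.indexPage is the hypothetical helper from the sketch above, and Lucene's IndexWriter is safe to call from multiple threads:

    import scala.collection.parallel.CollectionConverters._   // Scala 2.13+; not needed on 2.12

    // Index the crawled (url, text) pairs concurrently instead of one at a time.
    def indexAllInParallel(pages: Seq[(String, String)]): Unit =
      pages.par.foreach { case (url, text) => IndexPages.indexPage(url, text) }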

I hope that helps.
