Is HBase meaningful if it's not running in a distributed environment?

Question

I'm building an index of data, which will entail storing lots of triplets in the form (document, term, weight). I will be storing up to a few million such rows. Currently I'm doing this in MySQL as a simple table. I'm storing the document and term identifiers as string values rather than as foreign keys to other tables. I'm rewriting the software and looking for better ways of storing the data.

Looking at the way HBase works, this seems to fit the schema rather well. Instead of storing lots of triplets, I could map document to {term => weight}.
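As a rough illustration of the reshaping described above, the sketch below groups (document, term, weight) triplets into a per-document {term => weight} map. It uses plain Python dictionaries rather than an HBase client, and the function name `group_triplets` and the sample data are invented for the example.

```python
from collections import defaultdict

def group_triplets(triplets):
    """Group (document, term, weight) triplets into {document: {term: weight}}."""
    by_document = defaultdict(dict)
    for document, term, weight in triplets:
        # If the same (document, term) appears twice, the later weight wins.
        by_document[document][term] = weight
    return dict(by_document)

# Hypothetical sample data, standing in for rows of the MySQL table.
triplets = [
    ("doc1", "hbase", 0.8),
    ("doc1", "mysql", 0.3),
    ("doc2", "hbase", 0.5),
]
index = group_triplets(triplets)
print(index["doc1"])
```

In a wide-column store, the outer key would presumably play the role of the row key and each term the role of a column within that row, but that mapping is an assumption about the intended schema, not something the post specifies.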

I'm doing this on a single node, so I don't care about distribution across nodes. Should I just stick with MySQL because it works, or would it be wise to try HBase? I see that Lucene uses it for full-text indexing (which is analogous to what I'm doing). My question is really: how would a single HBase node compare with a single MySQL node? I'm coming from Scala, so might a direct Java API have an edge over going through JDBC and having MySQL parse each query?

My primary concern is insertion speed, as that has been the bottleneck previously. After processing, I will probably end up putting the data back into MySQL for live-querying because I need to do some calculations which are better done within MySQL.

I will try prototyping both, but I'm sure the community can give me some valuable insight into this.

Answer

Use the right tool for the job.

There are a lot of anti-RDBMS or BASE systems (Basically Available, Soft state, Eventually consistent), as opposed to ACID systems (Atomicity, Consistency, Isolation, Durability), to choose from.

I've used traditional RDBMSs, and although you can store CLOBs/BLOBs in them, they do not have built-in indexes customized specifically for searching these objects.

You want to do most of the work (calculating the weighted frequency for each tuple found) when inserting a document.
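A minimal sketch of doing that work at insert time follows. The answer does not say how the weight is defined, so this assumes the simplest option, relative term frequency over whitespace-separated tokens; `term_weights` and `insert_document` are hypothetical names, and the index is an in-memory dictionary rather than any particular database.

```python
from collections import Counter

def term_weights(text):
    """Return {term: relative frequency} for a whitespace-tokenized document."""
    terms = text.lower().split()
    if not terms:
        return {}
    counts = Counter(terms)
    total = len(terms)
    # Assumed weighting: occurrences of the term divided by total term count.
    return {term: count / total for term, count in counts.items()}

def insert_document(index, document_id, text):
    """Compute the weights once, when the document is inserted, and store them."""
    index[document_id] = term_weights(text)
    return index

index = {}
insert_document(index, "doc1", "hbase stores rows and hbase stores columns")
print(index["doc1"]["hbase"])
```

Because the weights are computed when the document arrives, a later search only has to look them up.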

You might also want to do some work scoring the usefulness of each (documentId,searchWord) pair after each search.

That way you can give better and better searches each time.
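One way to picture this feedback loop is sketched below. The answer does not say how usefulness is measured, so the sketch assumes a yes/no signal per result (for example, whether the user opened it) and moves a stored score toward 1.0 or 0.0 by a fixed fraction each time. The function name `update_usefulness`, the neutral starting score of 0.5, and the `rate` parameter are all illustrative assumptions.

```python
def update_usefulness(scores, document_id, search_word, was_useful, rate=0.1):
    """Nudge the score for (document_id, search_word) toward 1.0 or 0.0."""
    key = (document_id, search_word)
    current = scores.get(key, 0.5)  # assumed neutral score for unseen pairs
    target = 1.0 if was_useful else 0.0
    # Move a fixed fraction of the remaining distance toward the target.
    scores[key] = current + rate * (target - current)
    return scores[key]

scores = {}
update_usefulness(scores, "doc1", "hbase", was_useful=True)
update_usefulness(scores, "doc1", "hbase", was_useful=True)
update_usefulness(scores, "doc2", "hbase", was_useful=False)
print(scores)
```

Pairs that keep proving useful drift toward 1.0 and can be ranked higher in later searches, while pairs that are repeatedly ignored drift toward 0.0.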

You also want to store a score or weight for each search and weighted scores for similarity to other searches.

It's likely that some searches are more common than others, and that users do not always phrase their query correctly even though they mean to do a common search.

Inserting a document should also cause some change to the search weight indexes.

The more I think about it, the more complex the solution becomes. You have to start with a good design first. The more factors your design anticipates, the better the outcome.
