Graph database (Neo4j) vs relational database: need help with design


Question

I have to work with an open source project (biojava), but I'm not satisfied with some aspects of its performance, and I'd like to spend some time improving it.

For example, I have a text database encoded like this:

chrX    Cufflinks   exon    65175856    65175971    .   .   .   gene_id "XLOC_002576"; transcript_id "TCONS_00004217"; exon_number "1"; gene_name "RP6-159A1.2"; oId "CUFF.3698.1"; nearest_ref "ENST00000456392"; class_code "p"; tss_id "TSS3873";    
chrX    Cufflinks   exon    128986006   128986088   .   .   .   gene_id "XLOC_002577"; transcript_id "TCONS_00004218"; exon_number "1"; oId "CUFF.3750.1"; class_code "u"; tss_id "TSS3874";

Not every field is mandatory. Each gene_id may be associated with multiple transcript_id values (1..n), and each transcript_id has one or more exons.
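To make the record layout concrete, here is a minimal sketch that pulls the `key "value";` pairs out of one line. It assumes the attributes always follow that quoted form, as in the two sample lines; the class and method names are hypothetical and are not part of biojava.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a biojava class.
class GtfAttributes {
    // Matches pairs such as: gene_id "XLOC_002576";
    private static final Pattern PAIR = Pattern.compile("(\\w+) \"([^\"]*)\";");

    /** Returns the key/value attributes found in one record, in file order. */
    static Map<String, String> parse(String line) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            attrs.put(m.group(1), m.group(2));
        }
        return attrs;
    }

    public static void main(String[] args) {
        String line = "chrX\tCufflinks\texon\t128986006\t128986088\t.\t.\t.\t"
            + "gene_id \"XLOC_002577\"; transcript_id \"TCONS_00004218\"; "
            + "exon_number \"1\"; oId \"CUFF.3750.1\"; class_code \"u\"; tss_id \"TSS3874\";";
        Map<String, String> attrs = parse(line);
        System.out.println(attrs.get("gene_id"));
        // Optional fields such as gene_name are simply absent from the map.
        System.out.println(attrs.containsKey("gene_name"));
    }
}
```

Because optional fields just do not appear in the map, callers can test for them with `containsKey` instead of relying on a fixed column count.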

The library loads the entire text file into an ArrayList, and every search has to iterate over the whole list. This works well for small lists, but in my case I have 10^10 queries against a very large list, and it takes a couple of days on a good computer.
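For comparison with the full scan per search, the sketch below builds a hash index once and then answers each lookup without touching the rest of the list. It assumes that a "search" means an exact match on gene_id, which the question does not actually state; `GeneIndex` and its methods are hypothetical names, not biojava API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical index over already-loaded records; assumes exact-match lookups by gene_id.
class GeneIndex {
    // gene_id -> all raw records that carry it
    private final Map<String, List<String>> byGene = new HashMap<>();

    /** Builds the index in a single pass over the records. */
    GeneIndex(List<String> records) {
        for (String record : records) {
            String geneId = extractGeneId(record);
            if (geneId != null) {
                byGene.computeIfAbsent(geneId, k -> new ArrayList<>()).add(record);
            }
        }
    }

    /** Hash lookup; does not iterate over unrelated records. */
    List<String> find(String geneId) {
        return byGene.getOrDefault(geneId, Collections.emptyList());
    }

    // Pulls VALUE out of: gene_id "VALUE";
    static String extractGeneId(String record) {
        String key = "gene_id \"";
        int start = record.indexOf(key);
        if (start < 0) {
            return null;
        }
        start += key.length();
        int end = record.indexOf('"', start);
        return end < 0 ? null : record.substring(start, end);
    }
}
```

If the real queries are range searches on coordinates or something else entirely, a different structure would be needed; this only illustrates the cost difference for keyed lookups.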

Would Neo4j be a good choice? What would be a good way to implement it? For example, is it bad to create entities that hold only a String and make relationships between them? Or would it be better to use HSQLDB with a single table?

Please note that I don't need persistence, but speed and synchronization are mandatory.

Edit: if you want, you can have a look at the project here.

Recommended answer

Neo4j works well when you want to find needles in haystacks, i.e. when you have a large dataset but each query is only interested in a small part of the data. For example, if you had a graph like:

(gene) -> (transcript) -> (exon)

then Neo4j would be good at running queries such as "Starting with gene XLOC_002576, give me all its transcripts, and give me all the other genes that are also related to those transcripts". (I have no idea what transcripts and exons are, so that query probably doesn't make sense, but you get the idea.)
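The shape of that needle-in-a-haystack query can be sketched without a database at all. The example below is not Neo4j; it uses plain Java maps as a stand-in for the gene -> transcript edges, and the class and method names are hypothetical. It only shows that the query touches the neighbours of one starting node rather than the whole dataset.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a gene <-> transcript graph (not Neo4j).
class GeneTranscriptGraph {
    private final Map<String, Set<String>> transcriptsOfGene = new HashMap<>();
    private final Map<String, Set<String>> genesOfTranscript = new HashMap<>();

    /** Records one gene -> transcript edge together with its reverse direction. */
    void link(String geneId, String transcriptId) {
        transcriptsOfGene.computeIfAbsent(geneId, k -> new TreeSet<>()).add(transcriptId);
        genesOfTranscript.computeIfAbsent(transcriptId, k -> new TreeSet<>()).add(geneId);
    }

    /** All transcripts attached to the given gene. */
    Set<String> transcripts(String geneId) {
        return transcriptsOfGene.getOrDefault(geneId, Collections.emptySet());
    }

    /** Other genes that share at least one transcript with the given gene. */
    Set<String> relatedGenes(String geneId) {
        Set<String> related = new TreeSet<>();
        for (String transcriptId : transcripts(geneId)) {
            related.addAll(genesOfTranscript.getOrDefault(transcriptId, Collections.emptySet()));
        }
        related.remove(geneId); // the starting gene is not its own neighbour
        return related;
    }
}
```

The work done by `relatedGenes` depends on how many transcripts the starting gene has, not on the total number of records, which is the property the answer attributes to graph traversals.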

If you are not looking for a needle in a haystack, and instead are processing the whole dataset for every query, then Neo4j is unlikely to be the right tool for the job. If the datasets are really huge (as in hundreds of gigabytes), you are reducing the whole dataset down to a small answer, and you don't mind distributing the processing across several machines, then using Hadoop MapReduce and uploading your large text files to HDFS could be an option.

If you provide a little more information about your query profile, it would help in giving a better answer. That is, what are you doing to the data? What do you mean by 'search'?
