制作具有“数百万和数百万”的树。动态节点 [英] Making a tree with "millions and millions" of dynamic nodes

查看:82
本文介绍了制作具有“数百万和数百万”的树。动态节点的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述




我有一堆非常大的地理数据我要导入数据库(选择的数据库是postgres,

虽然可以跳到oracle如有必要)。数据是严格分层的 - 每个节点都有一个,只有一个

父级。深度不应超过6或7级。最初的进口将有大约600万片叶子,以及
$ 300亿支。我希望叶子能够显着增长,数量很容易增加三倍。但是,

分支机构的数量可能会保持不变,但我预计会有一些地方转移一下

(影响多达数千名儿童)。


如需选择,对我来说至关重要:


1.路径生成速度

2.直接兄弟速度

3.儿童计数速度

我已经使用Joe Celko的嵌套集实现了第一次运行(w / a mod用于存储树级别的速度)。

更新预测问题是这种方法的致命弱点。我读过Vadim Tropashko Nested

Interval概念( http:/ /www.dbazine.com/tropashko4.html ),我的大脑痛苦地伸展到足以让b $ b得到一般的想法。我有一种感觉,我会遇到其他报道的算术问题。


所以看来物化路径是我唯一的选择,但是我关注的是正确的LIKE表现

树的手边,路径为8digits x 6 levels = 48 chars。我应该担心吗?我需要

瞬间实时性能,并且无法想象嵌套集算术方法会如此快。

我可以将导入变为平滑保证上面的树有最小的数字,然而,它将节省在路径上最多8个字符的




我一直在谷歌搜索USENET档案馆观看过去几年的大辩论。我已经放弃了很多

的知识,但现在我很好奇是否有人对这么大的数据集有任何想法/建议/战争故事。


(如果任何一位帖子粉丝对一张2000万行

树的实时表现有一些让人放心的话,我很乐意听到;-)

-

[\ /

[> X< Christian Fowler | http://www.steelsun.com/

[ / \



I have a VERY LARGE pile of geographic data that I am importing into a database (db of choice is postgres,
though may hop to oracle if necessary). The data is strictly hierarchical - each node has one, and only one
parent. The depth should not exceed 6 or 7 levels. The initial import will have about 6 million leaves, and
3 million branches. I would expect the leaves to grow significantly, in number easily tripling. However, the
branches will likely stay very constant in number, but I expect there locations to shift around somewhat
(affecting up to thousand of children).

For selection, it is crucial for me to get:

1. path generation speed
2. immediate sibling speed
3. children count speed
I have implemented a first run using Joe Celko''s Nested Sets (w/ a mod to store tree level for speed). The
update propigation issue is the achilles heel of this approach for me. I have read Vadim Tropashko Nested
Interval concept ( http://www.dbazine.com/tropashko4.html ) , and my brain is painfully stretched enough to
get the general idea. I have a feeling i will hit the arithmetic issues others of reported.

So it seems Materialized Path is my only option, however I am concerned about LIKE performance for the right
hand side of the tree, where the path is 8digits x 6 levels = 48 chars. Should I be concerned? I need
split-second real-time performance, and can''t imagine it will be as fast the Nested Set arithmatic approach.
I can flatten out the import to insure the upper tree has the smallest numbers, however, it will save at
most 8 chars on the path.

I''ve been googling through USENET archives watching the big debates of years gone by. I''ve grazed much
knowledge, but now am curious if anyone has any thoughts/advice/war stories about a data set this large.

(and if any fellow postgres fans have some reassuring words about real-time performance of a 20 million row
tree, i''d love to hear ;-)
--
[ \ /
[ >X< Christian Fowler | http://www.steelsun.com/
[ / \

推荐答案



" Christian Fowler" <去**** @ NOSPAM.gravesweeper.com>在留言中写道

news:6b ******************** @ speakeasy.net ...

"Christian Fowler" <go****@NOSPAM.gravesweeper.com> wrote in message
news:6b********************@speakeasy.net...
For选择,对我来说至关重要:

1.路径生成速度


根路径?它在物化路径中明确编码。所有你需要做的就是解析路径并生成一个你在查询中使用的前缀列表,如下所示:


从层次结构中选择*其中路径为(''1'','1.2'',''1.2.1'',''1.2.1.1'')


如果您有索引,并且您的rdbms优化器足够智能,则查询

应该快速执行。

2.直接兄弟速度


以下是kludge:


select * from hierarchy where path in(''1.2.1.1'',''1.2.1.2'',''1.2.1.3' ',

''1.2.1.4'',''1.2.1.5'')


再次,我假设您的查询将在此使用索引。


如果你得到5条记录,那么你就不能确定你的节点有多少孩子。你必须重复这样的查询:

select * from hierarchy in path in(

''1.2.1.1'',''1.2.1.2'','' 1.2.1.3'',''1.2.1.4'',''1.2.1.5''

,''1.2.1.6'',''1.2.1.7'',''1.2。 1.8'',''1.2.1.9'',''1.2.1.10'')

,''1.2.1.11'',''1.2.1.12'',''1.2.1.13 '',''1.2.1.4'',''1.2.1.15'')

,''1.2.1.16'',''1.2.1.17'',''1.2.1.18' ',''1.2.1.19'',''1.2.1.20'')

,''1.2.1.21'',''1.2.1.22'',''1.2.1.23'' ,''1.2.1.4'',''1.2.1.25'')


又一次,可能会有超过25个孩子,所以你必须要跑步

再次更多generos查询。


这里的缺陷是,在某些时候,优化者可能会厌倦去做b $ b或扩展,以便在某些时候你最终可能会进行全表扫描。

但是,这是依赖于实现的,所以你可能能够影响优化器的决定。


如你所见,我并不担心物化路径有多少字节,或者解析它需要比乘以2个整数花费更多时间。我担心的是,你的查询是否可以利用索引。

3.儿童计数速度
For selection, it is crucial for me to get:

1. path generation speed
Path to the root? It is coded explicitly in the materialized path. All you
need to do is parsing the path and generating a list of prefixes that you
use within your query like this:

select * from hierarchy where path in (''1'',''1.2'',''1.2.1'', ''1.2.1.1'')

If you have an index, and your rdbms optimiser is smart enough, the query
should execute fast.
2. immediate sibling speed
Here is the kludge:

select * from hierarchy where path in (''1.2.1.1'',''1.2.1.2'',''1.2.1.3'',
''1.2.1.4'',''1.2.1.5'')

Again, I assume that your query would use an index here.

If you get exactly 5 records, then you can''t be sure that your node has that
many children. You have to repeat query like this:
select * from hierarchy where path in (
''1.2.1.1'',''1.2.1.2'',''1.2.1.3'', ''1.2.1.4'',''1.2.1.5''
,''1.2.1.6'',''1.2.1.7'',''1.2.1.8'', ''1.2.1.9'',''1.2.1.10'')
,''1.2.1.11'',''1.2.1.12'',''1.2.1.13'', ''1.2.1.4'',''1.2.1.15'')
,''1.2.1.16'',''1.2.1.17'',''1.2.1.18'', ''1.2.1.19'',''1.2.1.20'')
,''1.2.1.21'',''1.2.1.22'',''1.2.1.23'', ''1.2.1.4'',''1.2.1.25'')

Yet again, there might be more than 25 children, so you''ll have to run yet
again more "generos" query.

The pitfall here is that at some point the optimiser may get tired to do
"or" expansion, so that at some point you might end up with full table scan.
But, this is implementation dependent, so that you might be able to
influence optimizer decision.

As you see, I''m not worried how many bytes materialized path has, or if
parsing it takes more time than multiplying 2 integers. My concern is if
your query can leverage index or not.
3. children count speed




儿童还是后代?我想没有办法可以击败Celko的后代'

公式

#descendants =(rgt-lft + 1)/ 2

计数隐式存储在根节点,因此即使对于层次结构,我们也会立即回答有多少节点。这也是Achiles'方法的脚跟:对层次结构的任何更新触发刷新

的计数器。然而,一个聚合是永远不够的:它通常也很有用

来了解总薪水:-)


如果你的意思不是后代,而是孩子,然后使用类似于

bullet#2的方法。




Children, or descendants? I guess no method can beat Celko''s descendant''s
formula
#descendants=(rgt-lft+1)/2
The count is implicitly stored at the root node, so that even for hierarchy
1M nodes we answer how many nodes are there instantly. This is also an
Achiles'' heel of the method: any update to the hierarchy triggers refresh
of the "counter". One aggregate is never enough, though: it''s often useful
to know total salary too:-)

If you meant not descendants, but children, then use a method similar to
bullet #2.




" Mikito Harakiri" ; < MI ********* @ iahu.com>写在消息< news:3k ************** @ news.oracle.com> ...
"Mikito Harakiri" <mi*********@iahu.com> wrote in message <news:3k**************@news.oracle.com>...
" Christian Fowler" <去**** @ NOSPAM.gravesweeper.com>在消息中写道
新闻:6b ******************** @ speakeasy.net ...
"Christian Fowler" <go****@NOSPAM.gravesweeper.com> wrote in message
news:6b********************@speakeasy.net...
如需选择,它是对我来说至关重要:

1.路径生成速度
For selection, it is crucial for me to get:

1. path generation speed



根路径?它在物化路径中明确编码。您需要做的就是解析路径并生成一个在查询中使用的前缀列表,如下所示:

从层次结构中选择*其中路径为('' 1'',''1.2'',''1.2.1'',''1.2.1.1'')

如果您有索引,并且您的rdbms优化器足够智能,那么查询
应该快速执行。



Path to the root? It is coded explicitly in the materialized path. All you
need to do is parsing the path and generating a list of prefixes that you
use within your query like this:

select * from hierarchy where path in (''1'',''1.2'',''1.2.1'', ''1.2.1.1'')

If you have an index, and your rdbms optimiser is smart enough, the query
should execute fast.

2.直接兄弟速度



这是kludge:

select * from hierarchy in path in(''1.2.1.1'',''1.2.1.2'',''1.2.1.3'',
''1.2.1.4'',''1.2.1.5' ')

再次,我假设你的查询会在这里使用索引。

如果你得到5条记录,那么你不能确定你的节点有很多孩子。你必须重复这样的查询:
select * from hierarchy where path in(
''1.2.1.1'',''1.2.1.2'',''1.2.1.3'','' 1.2.1.4'',''1.2.1.5''
,''1.2.1.6'',''1.2.1.7'',''1.2.1.8'',''1.2.1.9'', ''1.2.1.10'')
,''1.2.1.11'',''1.2.1.12'',''1.2.1.13'',''1.2.1.4'',''1.2.1.15 '')
,''1.2.1.16'',''1.2.1.17'',''1.2.1.18'',''1.2.1.19'',''1.2.1.20'')
,''1.2.1.21'',''1.2.1.22'',''1.2.1.23'',''1.2.1.4'',''1.2.1.25'')
然而,可能会有超过25个孩子,所以你必须再次运行
更多generos查询。



Here is the kludge:

select * from hierarchy where path in (''1.2.1.1'',''1.2.1.2'',''1.2.1.3'',
''1.2.1.4'',''1.2.1.5'')

Again, I assume that your query would use an index here.

If you get exactly 5 records, then you can''t be sure that your node has that
many children. You have to repeat query like this:
select * from hierarchy where path in (
''1.2.1.1'',''1.2.1.2'',''1.2.1.3'', ''1.2.1.4'',''1.2.1.5''
,''1.2.1.6'',''1.2.1.7'',''1.2.1.8'', ''1.2.1.9'',''1.2.1.10'')
,''1.2.1.11'',''1.2.1.12'',''1.2.1.13'', ''1.2.1.4'',''1.2.1.15'')
,''1.2.1.16'',''1.2.1.17'',''1.2.1.18'', ''1.2.1.19'',''1.2.1.20'')
,''1.2.1.21'',''1.2.1.22'',''1.2.1.23'', ''1.2.1.4'',''1.2.1.25'')

Yet again, there might be more than 25 children, so you''ll have to run yet
again more "generos" query.




这里有一个替代方案可能效果不是很好,但可能比b / b b $ b b b b b b b b b b b b ...


从层次结构中选择*

其中路径如''1.2.1。%''

和路径没有比如''1.2.1。%。%''


如果每个级别使用固定数量的字符怎么办?兄弟姐妹

和直接的孩子查询然后变成更像的东西,


从层次结构中选择*

其中路径如'0001.0002 .0001 .____''


从层次结构中选择计数(*)

其中路径如'0001.0002.0001.1234 .____''


由于每个级别的字符是固定的,你也可以摆脱

点,解析变成长度和子串。

查找所有后代可以使用BETWEEN在

实现物化路径方案:


select * from hierarchy

where path between' '1.2.1.1234。''和''1.2.1.1234~''


从层次结构中选择*

其中'0001.0002.0001.1234'之间的路径。 '和''0001.0002.0001.1234~''


从层次结构中选择*

其中''0001000200011234。''和''0001000200011234~''之间的路径


'。''和''〜''可能需要根据整理顺序进行更改。


-

Joe Foster< mailto:jlfoster%40znet。 COM> Spaace的DC8:< http://www.xenu.net/>

警告:我不能对以上内容负责他们会来

因为我的猫显然学会了打字。把我带走,哈哈!



Here''s an alternative which may not perform very well but may
still be better than risking a full table-scan...

select * from hierarchy
where path like ''1.2.1.%''
and path not like ''1.2.1.%.%''

What if you use a fixed number of chars per level? The sibling
and immediate children queries then become something more like,

select * from hierarchy
where path like ''0001.0002.0001.____''

select count(*) from hierarchy
where path like ''0001.0002.0001.1234.____''

Since the characters per level is fixed, you can get rid of the
dots too, and parsing becomes a matter of length and substring.
Finding all descendants could be implemented using BETWEEN in
either materialized path scheme:

select * from hierarchy
where path between ''1.2.1.1234.'' and ''1.2.1.1234~''

select * from hierarchy
where path between ''0001.0002.0001.1234.'' and ''0001.0002.0001.1234~''

select * from hierarchy
where path between ''0001000200011234.'' and ''0001000200011234~''

The ''.'' and ''~'' may need changing depending on collating order.

--
Joe Foster <mailto:jlfoster%40znet.com> DC8s in Spaace: <http://www.xenu.net/>
WARNING: I cannot be held responsible for the above They''re coming to
because my cats have apparently learned to type. take me away, ha ha!


-----开始PGP签名消息-----

哈希:SHA1





- - Joe \Nuke Me Xemu \"福斯特" < BF ******** @ news.hub.org>写道:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

- -- "Joe \"Nuke Me Xemu\" Foster" <bf********@news.hub.org> wrote:
这里有一个替代方案可能效果不是很好,但可能比冒一个完整的表扫描更好......
Here''s an alternative which may not perform very well but may
still be better than risking a full table-scan...




我在论坛软件中使用了这种方法,并且表现非常好。我用
测试了大约一百万行,而且非常快。甚至

有深树(1000个子分支)。


唯一的区别是我使用基数为255的编码文本列而不是

只有0-9。但注意:字符集必须是ASCII(排序!)...


我想将其更改为bytea以避免base255编码和字符

设置问题,但仍然有一些带有bytea等的错误

运算符。 :-(

Ciao

Alvar


- -

** Alvar CH Freude - - http://alvar.a-blast.org/ - http://odem.org/

** Berufsverbot? http://odem.org/aktuelles/staatsanwalt.de.html

** ODEM.org-旅游: http://tour.odem.org /

** Informationsgesellschaft: http:// www.wsis-koordinierungskreis.de/


-----开始PGP SIGNATURE -----

版本:GnuPG v1。 2.3(FreeBSD)


iD4DBQE / z59nOndlH63J86wRAsrEAJ9OjO5fXhnw2mmLoB7YNHJFYO / X8QCXc31M

FWdV8T92N3kzctSgOOkVMw ==

= Uwtm

-----结束PGP SIGNATURE -----

---------------------------(广播结束)------------- --------------

提示7:别忘了增加免费空间地图设置



I use exactly this method in a forum software, and it performs VERY good. I
tested it with about a million of rows, and it is really very fast. Even
with deep trees (1000 sub branches).

The only difference is that I use a base 255 encoded text column instead of
only 0-9. But attention: the character set must be ASCII (ordering!) ...

I want to change this to bytea to avoid base255 encoding and the character
set problems, but there are still some Bugs with bytea and the like
operator. :-(
Ciao
Alvar

- --
** Alvar C.H. Freude -- http://alvar.a-blast.org/ -- http://odem.org/
** Berufsverbot? http://odem.org/aktuelles/staatsanwalt.de.html
** ODEM.org-Tour: http://tour.odem.org/
** Informationsgesellschaft: http://www.wsis-koordinierungskreis.de/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (FreeBSD)

iD4DBQE/z59nOndlH63J86wRAsrEAJ9OjO5fXhnw2mmLoB7YNHJFYO/X8QCXc31M
FWdV8T92N3kzctSgOOkVMw==
=Uwtm
-----END PGP SIGNATURE-----
---------------------------(end of broadcast)---------------------------
TIP 7: don''t forget to increase your free space map settings


这篇关于制作具有“数百万和数百万”的树。动态节点的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆