Similarity join using Hadoop


Question


I'm new to Hadoop. I'd like to run a few approaches I came up with past you.

Problem:
Two datasets: A and B.
Both datasets represent songs: some top-level attributes, titles (1..), performers (1..).
I need to match these datasets using either equality or fuzzy algorithms (such as Levenshtein, Jaccard, Jaro-Winkler, etc.) based on titles and performers.
The dataset sizes are: A = 20-30M, B ≈ 1-6M.
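For reference, two of the measures named above can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and tokenizing titles on whitespace for Jaccard is an assumption, not something the problem specifies.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens (an assumed tokenization)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)
```

Either function costs at least time proportional to the string lengths per pair, which is why the number of pairs compared matters far more than the choice of measure at these dataset sizes.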

Here are the approaches I came up with:

  1. Load dataset B (the smaller one) into HDFS. Run MapReduce against dataset A (the larger one), where:
    map phase: for each record in A, access HDFS and pull records from B for matching;
    reduce phase: writes ID pairs

  2. Load dataset A into a distributed cache (e.g. JBoss Cache) in an optimized form to speed up searching. Run MapReduce against dataset B, where:
    map phase: for each record in B, query the distributed cache for matches
    reduce phase: writes ID pairs

  3. Use MapReduce to join both datasets, where:
    map phase: gets a record from set A and a record from set B, and does the matching
    reduce phase: same
    (I'm fuzzy about this one. First, the join would be a Cartesian product with trillions of records; second, I'm not sure how Hadoop can parallelize that across the cluster.)

  4. Use Hive (I'm looking at it right now, trying to figure out how to plug in custom functions that do string matching)

I'm looking for pointers: which approach would be the best candidate, or are there other approaches that I'm not seeing?
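To make approach 2 concrete, here is a rough single-process sketch of the idea: index one dataset by token, then stream the other dataset past that index so each record is compared only with records sharing at least one token. The names `build_index` and `map_record`, the whitespace tokenization, and the 0.5 threshold are all assumptions for illustration; in a real job the index would live in whatever cache each mapper can reach.

```python
from collections import defaultdict


def tokens(text):
    """Lowercase whitespace tokens of a title (an assumed tokenization)."""
    return set(text.lower().split())


def build_index(records):
    """records: iterable of (id, title). Returns (token -> ids, id -> tokens)."""
    index = defaultdict(set)
    by_id = {}
    for rid, title in records:
        toks = tokens(title)
        by_id[rid] = toks
        for t in toks:
            index[t].add(rid)
    return index, by_id


def map_record(rid, title, index, by_id, threshold=0.5):
    """Mapper body: yield (probe_id, indexed_id, jaccard) for sufficiently similar pairs."""
    toks = tokens(title)
    candidates = set()
    for t in toks:
        candidates |= index.get(t, set())  # only records sharing a token
    for cid in sorted(candidates):
        other = by_id[cid]
        score = len(toks & other) / len(toks | other)
        if score >= threshold:
            yield rid, cid, score


# Hypothetical sample data
indexed = [("a1", "Let It Be"), ("a2", "Hey Jude"), ("a3", "Let It Bleed")]
index, by_id = build_index(indexed)
matches = list(map_record("b1", "let it be", index, by_id))
```

The token index avoids the full Cartesian product, but very common tokens still produce long candidate lists, so this only scales if the indexed side fits in memory and the tokens are reasonably selective.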

Answer

You might find this paper and code useful:

Efficient Parallel Set-Similarity Joins Using MapReduce

I've personally implemented it in Cascading with good results. Unfortunately the code is too domain-specific to release.

The point of the above work is to reduce the number of joins to the candidate pairs that are very likely to be similar; the candidate pairs can then be compared directly (in an MR join) using whatever cocktail of algorithms is relevant. A good side effect is that this join can be performed evenly across the cluster without duplicate comparisons.
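One common way to get such a reduced candidate set is prefix filtering for a Jaccard threshold: sort each record's tokens from rarest to most common, keep only a short prefix, and treat two records as candidates only if their prefixes share a token. The sketch below is a simplified, single-process illustration of that idea, not the paper's code or the Cascading implementation; `candidate_pairs` and `similarity_join` are hypothetical names, and grouping by token stands in for the shuffle phase.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def candidate_pairs(records, threshold):
    """records: dict id -> set of tokens. Returns the set of candidate (id, id) pairs."""
    # Global token frequencies; rare tokens make the most selective prefix.
    freq = Counter(t for toks in records.values() for t in toks)
    buckets = defaultdict(list)  # token -> ids whose prefix contains it
    for rid, toks in records.items():
        ordered = sorted(toks, key=lambda t: (freq[t], t))
        n = len(ordered)
        # Prefix length for Jaccard: n - ceil(threshold * n) + 1.
        # The small epsilon guards against float round-up shortening the prefix.
        prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
        for t in ordered[:prefix_len]:
            buckets[t].append(rid)
    pairs = set()  # a set, so a pair sharing several prefix tokens is kept once
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def similarity_join(records, threshold):
    """Verify each candidate pair with exact Jaccard similarity."""
    out = []
    for x, y in sorted(candidate_pairs(records, threshold)):
        a, b = records[x], records[y]
        score = len(a & b) / len(a | b)
        if score >= threshold:
            out.append((x, y, score))
    return out


# Hypothetical sample data
songs = {
    "a1": {"let", "it", "be"},
    "b1": {"let", "it", "be", "live"},
    "a2": {"hey", "jude"},
    "b2": {"yesterday"},
}
result = similarity_join(songs, threshold=0.7)
```

On this sample only one of the six possible pairs is ever compared. The verification step here uses Jaccard, but any other measure could be applied to the surviving pairs. To join two separate datasets rather than one set against itself, one option (an assumption here) is to tag each record with its source and keep only cross-source pairs.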

Ultimately this is an optimization of a cross join between two independent sets or within the same set (the second case is implemented slightly differently from the first).

Disclosure: I'm the author of Cascading.
