Jaro-Winkler score calculation in Apache Spark

Problem description

We need to implement a Jaro-Winkler distance calculation across strings in an Apache Spark Dataset. We are new to Spark, and after searching the web we could not find much. It would be great if you could guide us. We thought of using flatMap, then realized it won't help; we then tried a couple of foreach loops but could not figure out how to go forward, since each string has to be compared against all the others, as in the dataset below.

RowFactory.create(0, "Hi I heard about Spark"),
RowFactory.create(1, "I wish Java could use case classes"),
RowFactory.create(2, "Logistic,regression,models,are,neat")

Example Jaro-Winkler scores across all strings in the above dataframe:

Distance score between label, 0,1 -> 0.56
Distance score between label, 0,2 -> 0.77
Distance score between label, 0,3 -> 0.45
Distance score between label, 1,2 -> 0.77
Distance score between label, 2,3 -> 0.79
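
For reference, the all-pairs comparison described above can be sketched locally, without Spark, using the info.debatty JaroWinkler class that the code below also imports. This is only a minimal sketch; the class name PairwiseJaroWinkler is made up for illustration, and the printed scores may differ from the illustrative numbers above.

    import info.debatty.java.stringsimilarity.JaroWinkler;

    public class PairwiseJaroWinkler {
        public static void main(String[] args) {
            String[] sentences = {
                    "Hi I heard about Spark",
                    "I wish Java could use case classes",
                    "Logistic,regression,models,are,neat" };

            JaroWinkler jw = new JaroWinkler();

            // Compare every sentence with every later one (i < j), so each
            // unordered pair is scored exactly once.
            for (int i = 0; i < sentences.length; i++) {
                for (int j = i + 1; j < sentences.length; j++) {
                    System.out.println("Distance score between label, " + i + "," + j
                            + " -> " + jw.similarity(sentences[i], sentences[j]));
                }
            }
        }
    }

Below is the Spark code we have so far: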

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    import info.debatty.java.stringsimilarity.JaroWinkler;

    public class JaroTestExample {

        public static void main(String[] args) {
            // Needed on Windows so Spark can locate winutils.exe.
            System.setProperty("hadoop.home.dir", "C:\\winutil");

            SparkSession spark = SparkSession.builder()
                    .appName("JavaTokenizerExample")
                    .master("local[*]")
                    .getOrCreate();

            JaroWinkler jw = new JaroWinkler();

            // Substitution of s and t.
            System.out.println(jw.similarity("My string", "My tsring"));

            // Substitution of s and n.
            System.out.println(jw.similarity("My string", "My ntrisg"));

            List<Row> data = Arrays.asList(
                    RowFactory.create(0, "Hi I heard about Spark"),
                    RowFactory.create(1, "I wish Java could use case classes"),
                    RowFactory.create(2, "Logistic,regression,models,are,neat"));

            StructType schema = new StructType(new StructField[] {
                    new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
                    new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) });

            Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

            // This is where we are stuck: every sentence has to be compared
            // against every other sentence, and a bare foreach() does not
            // produce the pairs we need.
            sentenceDataFrame.show();
        }

    }

Recommended answer

A cross join in Spark can be done with the code below: Dataset2Object = Dataset1Object.crossJoin(Dataset2Object). In Dataset2Object you then have every combination of record pairs, which is what you need here; in this case flatMap won't be helpful. Please remember to use spark-sql_2.11 version 2.1.0, since Dataset.crossJoin was only added in Spark 2.1.0.
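
As a minimal sketch of that approach, assuming the same sentenceDataFrame (columns label and sentence) built in the question and the info.debatty JaroWinkler class, the self cross join can be followed by a map that scores each pair; the columns are renamed first so the two sides of the join stay distinguishable:

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;

    import info.debatty.java.stringsimilarity.JaroWinkler;

    // Rename the columns on each side of the self join.
    Dataset<Row> left = sentenceDataFrame.toDF("label1", "sentence1");
    Dataset<Row> right = sentenceDataFrame.toDF("label2", "sentence2");

    // The cross join yields every (row, row) combination; the filter keeps
    // each unordered pair exactly once and drops self pairs.
    Dataset<Row> pairs = left.crossJoin(right)
            .filter(col("label1").lt(col("label2")));

    // Score each pair. The JaroWinkler instance is created inside the
    // lambda so it is instantiated on the executors rather than serialized.
    Dataset<String> scores = pairs.map(
            (MapFunction<Row, String>) row -> {
                JaroWinkler jw = new JaroWinkler();
                double score = jw.similarity(row.getString(1), row.getString(3));
                return "Distance score between label, " + row.getInt(0) + ","
                        + row.getInt(2) + " -> " + score;
            },
            Encoders.STRING());

    scores.show(false);

Calling crossJoin explicitly, rather than join with no condition, is also what keeps the Cartesian product legal under the default value of spark.sql.crossJoin.enabled in Spark 2.1.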
