Jaro-Winkler score calculation in Apache Spark

Problem Description

We need to implement a Jaro-Winkler distance calculation across the strings in an Apache Spark Dataset. We are new to Spark, and after searching the web we were not able to find much; it would be great if you could guide us. We thought of using flatMap, then realized it won't help; we then tried a couple of foreach loops but could not figure out how to proceed, since each string has to be compared against all the others, as in the dataset below.

    RowFactory.create(0, "Hi I heard about Spark"),
    RowFactory.create(1, "I wish Java could use case classes"),
    RowFactory.create(2, "Logistic,regression,models,are,neat"));

Example Jaro-Winkler scores across all strings found in the above dataframe:


Distance score between labels 0,1 -> 0.56
Distance score between labels 0,2 -> 0.77
Distance score between labels 0,3 -> 0.45
Distance score between labels 1,2 -> 0.77
Distance score between labels 2,3 -> 0.79

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    import info.debatty.java.stringsimilarity.JaroWinkler;

    public class JaroTestExample {

        public static void main(String[] args) {
            System.setProperty("hadoop.home.dir", "C:\\winutil");
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
            SQLContext sqlContext = new SQLContext(sc);
            SparkSession spark = SparkSession.builder()
                    .appName("JavaTokenizerExample").getOrCreate();

            JaroWinkler jw = new JaroWinkler();

            // Sanity check: s and t swapped
            System.out.println(jw.similarity("My string", "My tsring"));
            // Sanity check: s and n swapped
            System.out.println(jw.similarity("My string", "My ntrisg"));

            List<Row> data = Arrays.asList(
                    RowFactory.create(0, "Hi I heard about Spark"),
                    RowFactory.create(1, "I wish Java could use case classes"),
                    RowFactory.create(2, "Logistic,regression,models,are,neat"));

            StructType schema = new StructType(new StructField[] {
                    new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
                    new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) });

            Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

            // Stuck here: this doesn't compile (foreach needs a function), and it
            // still wouldn't give us pairwise comparisons across all rows
            sentenceDataFrame.foreach();
        }
    }

Answer

A cross join in Spark can be done using the code below:

    Dataset2Object = Dataset1Object.crossJoin(Dataset2Object)

In Dataset2Object you get every combination of record pairs, which is what you need here; in this case flatMap won't be helpful. Please remember to use spark-sql_2.11 version 2.1.0, since Dataset.crossJoin is only available from Spark 2.1.0 onward.
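
A minimal sketch of that approach, building on the sentenceDataFrame from the question; the UDF name jw_similarity and the column aliases label1/label2/sentence1/sentence2 are illustrative choices, not part of the original answer:

    // Extra imports needed at the top of the file:
    // import static org.apache.spark.sql.functions.callUDF;
    // import static org.apache.spark.sql.functions.col;
    // import org.apache.spark.sql.api.java.UDF2;

    // Register a UDF wrapping JaroWinkler.similarity(); creating the
    // JaroWinkler instance inside the lambda avoids serializing it to executors
    spark.udf().register("jw_similarity",
            (UDF2<String, String, Double>) (a, b) -> new JaroWinkler().similarity(a, b),
            DataTypes.DoubleType);

    // Alias both sides so the columns stay distinguishable after the join
    Dataset<Row> left = sentenceDataFrame
            .select(col("label").alias("label1"), col("sentence").alias("sentence1"));
    Dataset<Row> right = sentenceDataFrame
            .select(col("label").alias("label2"), col("sentence").alias("sentence2"));

    // crossJoin produces every combination of record pairs; the filter keeps
    // each unordered pair once and drops self-comparisons
    Dataset<Row> scored = left.crossJoin(right)
            .filter(col("label1").lt(col("label2")))
            .withColumn("score",
                    callUDF("jw_similarity", col("sentence1"), col("sentence2")));

    scored.show(false);

With the three rows from the question, this prints one similarity score per unordered pair of labels: (0,1), (0,2) and (1,2).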
