Jaro-Winkler score calculation in Apache Spark

Problem description

We need to implement a Jaro-Winkler distance calculation across strings in an Apache Spark Dataset. We are new to Spark, and after searching the web we could not find much. It would be great if you could guide us. We thought of using flatMap, then realized it won't help; we then tried a couple of foreach loops but could not figure out how to go forward, since each string has to be compared against all the others, as in the dataset below.

RowFactory.create(0, "Hi I heard about Spark"),
RowFactory.create(1, "I wish Java could use case classes"),
RowFactory.create(2, "Logistic,regression,models,are,neat")

Example Jaro-Winkler scores across all strings in the above dataframe:

Distance score between label, 0,1 -> 0.56
Distance score between label, 0,2 -> 0.77
Distance score between label, 0,3 -> 0.45
Distance score between label, 1,2 -> 0.77
Distance score between label, 2,3 -> 0.79
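
For reference, the all-pairs comparison described above can be sketched locally, without Spark, using the info.debatty JaroWinkler class that the code below also imports. This is only a minimal sketch; the class name PairwiseJaroWinkler is made up for illustration, and the printed scores may differ from the illustrative numbers above.

    import info.debatty.java.stringsimilarity.JaroWinkler;

    public class PairwiseJaroWinkler {
        public static void main(String[] args) {
            String[] sentences = {
                    "Hi I heard about Spark",
                    "I wish Java could use case classes",
                    "Logistic,regression,models,are,neat" };

            JaroWinkler jw = new JaroWinkler();

            // Compare every sentence with every later one (i < j), so each
            // unordered pair is scored exactly once.
            for (int i = 0; i < sentences.length; i++) {
                for (int j = i + 1; j < sentences.length; j++) {
                    System.out.println("Distance score between label, " + i + "," + j
                            + " -> " + jw.similarity(sentences[i], sentences[j]));
                }
            }
        }
    }

Below is the Spark code we have so far: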

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    import info.debatty.java.stringsimilarity.JaroWinkler;

    public class JaroTestExample {

        public static void main(String[] args) {
            // Needed on Windows so Spark can locate winutils.exe.
            System.setProperty("hadoop.home.dir", "C:\\winutil");

            SparkSession spark = SparkSession.builder()
                    .appName("JavaTokenizerExample")
                    .master("local[*]")
                    .getOrCreate();

            JaroWinkler jw = new JaroWinkler();

            // Substitution of s and t.
            System.out.println(jw.similarity("My string", "My tsring"));

            // Substitution of s and n.
            System.out.println(jw.similarity("My string", "My ntrisg"));

            List<Row> data = Arrays.asList(
                    RowFactory.create(0, "Hi I heard about Spark"),
                    RowFactory.create(1, "I wish Java could use case classes"),
                    RowFactory.create(2, "Logistic,regression,models,are,neat"));

            StructType schema = new StructType(new StructField[] {
                    new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
                    new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) });

            Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

            // This is where we are stuck: every sentence has to be compared
            // against every other sentence, and a bare foreach() does not
            // produce the pairs we need.
            sentenceDataFrame.show();
        }

    }

Recommended answer

A cross join in Spark can be done with the code below: Dataset2Object = Dataset1Object.crossJoin(Dataset2Object). In Dataset2Object you then have every combination of record pairs, which is what you need here; in this case flatMap won't be helpful. Please remember to use spark-sql_2.11 version 2.1.0, since Dataset.crossJoin was only added in Spark 2.1.0.
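
As a minimal sketch of that approach, assuming the same sentenceDataFrame (columns label and sentence) built in the question and the info.debatty JaroWinkler class, the self cross join can be followed by a map that scores each pair; the columns are renamed first so the two sides of the join stay distinguishable:

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;

    import info.debatty.java.stringsimilarity.JaroWinkler;

    // Rename the columns on each side of the self join.
    Dataset<Row> left = sentenceDataFrame.toDF("label1", "sentence1");
    Dataset<Row> right = sentenceDataFrame.toDF("label2", "sentence2");

    // The cross join yields every (row, row) combination; the filter keeps
    // each unordered pair exactly once and drops self pairs.
    Dataset<Row> pairs = left.crossJoin(right)
            .filter(col("label1").lt(col("label2")));

    // Score each pair. The JaroWinkler instance is created inside the
    // lambda so it is instantiated on the executors rather than serialized.
    Dataset<String> scores = pairs.map(
            (MapFunction<Row, String>) row -> {
                JaroWinkler jw = new JaroWinkler();
                double score = jw.similarity(row.getString(1), row.getString(3));
                return "Distance score between label, " + row.getInt(0) + ","
                        + row.getInt(2) + " -> " + score;
            },
            Encoders.STRING());

    scores.show(false);

Calling crossJoin explicitly, rather than join with no condition, is also what keeps the Cartesian product legal under the default value of spark.sql.crossJoin.enabled in Spark 2.1.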
