Jaro-Winkler score calculation in Apache Spark

Problem Description

We need to implement a Jaro-Winkler distance calculation across the strings in an Apache Spark Dataset. We are new to Spark, and after searching the web we were not able to find much; it would be great if you could guide us. We thought of using flatMap, then realized it won't help; we then tried a couple of foreach loops but could not figure out how to proceed, since each string has to be compared against all the others, as in the dataset below.

    RowFactory.create(0, "Hi I heard about Spark"),
    RowFactory.create(1, "I wish Java could use case classes"),
    RowFactory.create(2, "Logistic,regression,models,are,neat"));

Example Jaro-Winkler scores across all strings found in the above dataframe:


Distance score between labels 0,1 -> 0.56
Distance score between labels 0,2 -> 0.77
Distance score between labels 0,3 -> 0.45
Distance score between labels 1,2 -> 0.77
Distance score between labels 2,3 -> 0.79

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    import info.debatty.java.stringsimilarity.JaroWinkler;

    public class JaroTestExample {

        public static void main(String[] args) {
            System.setProperty("hadoop.home.dir", "C:\\winutil");
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
            SQLContext sqlContext = new SQLContext(sc);
            SparkSession spark = SparkSession.builder()
                    .appName("JavaTokenizerExample").getOrCreate();

            JaroWinkler jw = new JaroWinkler();

            // Sanity check: s and t swapped
            System.out.println(jw.similarity("My string", "My tsring"));
            // Sanity check: s and n swapped
            System.out.println(jw.similarity("My string", "My ntrisg"));

            List<Row> data = Arrays.asList(
                    RowFactory.create(0, "Hi I heard about Spark"),
                    RowFactory.create(1, "I wish Java could use case classes"),
                    RowFactory.create(2, "Logistic,regression,models,are,neat"));

            StructType schema = new StructType(new StructField[] {
                    new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
                    new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) });

            Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

            // Stuck here: this doesn't compile (foreach needs a function), and it
            // still wouldn't give us pairwise comparisons across all rows
            sentenceDataFrame.foreach();
        }
    }

Answer

A cross join in Spark can be done using the code below:

    Dataset2Object = Dataset1Object.crossJoin(Dataset2Object)

In Dataset2Object you get every combination of record pairs, which is what you need here; in this case flatMap won't be helpful. Please remember to use spark-sql_2.11 version 2.1.0, since Dataset.crossJoin is only available from Spark 2.1.0 onward.
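
A minimal sketch of that approach, building on the sentenceDataFrame from the question; the UDF name jw_similarity and the column aliases label1/label2/sentence1/sentence2 are illustrative choices, not part of the original answer:

    // Extra imports needed at the top of the file:
    // import static org.apache.spark.sql.functions.callUDF;
    // import static org.apache.spark.sql.functions.col;
    // import org.apache.spark.sql.api.java.UDF2;

    // Register a UDF wrapping JaroWinkler.similarity(); creating the
    // JaroWinkler instance inside the lambda avoids serializing it to executors
    spark.udf().register("jw_similarity",
            (UDF2<String, String, Double>) (a, b) -> new JaroWinkler().similarity(a, b),
            DataTypes.DoubleType);

    // Alias both sides so the columns stay distinguishable after the join
    Dataset<Row> left = sentenceDataFrame
            .select(col("label").alias("label1"), col("sentence").alias("sentence1"));
    Dataset<Row> right = sentenceDataFrame
            .select(col("label").alias("label2"), col("sentence").alias("sentence2"));

    // crossJoin produces every combination of record pairs; the filter keeps
    // each unordered pair once and drops self-comparisons
    Dataset<Row> scored = left.crossJoin(right)
            .filter(col("label1").lt(col("label2")))
            .withColumn("score",
                    callUDF("jw_similarity", col("sentence1"), col("sentence2")));

    scored.show(false);

With the three rows from the question, this prints one similarity score per unordered pair of labels: (0,1), (0,2) and (1,2).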
