Splitting .ttl or .nt file - Spark Scala


Problem description

I'm new to Scala and I need to read a ttl file line by line, split on a particular delimiter, and extract the values to put into the respective columns of a dataframe.

< http://website/Jimmy_Carter> <http://web/name> "James Earl Carter, Jr."@ko .
< http://website/Jimmy_Car> <http://web/country> <http://website/United_States> .
< http://website/Jimmy_Car> <http://web/birthPlace> <http://web/Georgia_(US)> .

I want this output:

+------------------------+---------------------+-----------------------+
|S                       |P                    |O                      |
+------------------------+---------------------+-----------------------+
|http://website/Jimmy_Car|http://web/name      |"James Earl Carter     |
|http://website/Jimmy_Car|http://web/country   |http://web/country     |
|http://website/Jimmy_Car|http://web/birthPlace|http://web/Georgia_(US)|
+------------------------+---------------------+-----------------------+

I tried this code:

import scala.util.Try

case class T(S: Option[String], P: Option[String], O: Option[String])

val triples = sc.textFile("triples_test.ttl")
  .map(_.split(" |\\< |\\> |\\ . "))
  .map(p => T(Try(p(0)).toOption, Try(p(1)).toOption, Try(p(2)).toOption))
  .toDF()

I got this result:

+-------------------------+----------------------+------------------------+
|S                        |P                     |O                       |
+-------------------------+----------------------+------------------------+
|<http://website/Jimmy_Car|<http://web/name      |"James                  |
|<http://website/Jimmy_Car|<http://web/country   |<http://web/country     |
|<http://website/Jimmy_Car|<http://web/birthPlace|<http://web/Georgia_(US)|
+-------------------------+----------------------+------------------------+

To remove the separator "<" at the beginning of each triple, I added "|<" to the split:

val triples = sc.textFile("triples_test.ttl")
  .map(_.split(" |\\< |\\> |\\ . |<"))
  .map(p => T(Try(p(0)).toOption, Try(p(1)).toOption, Try(p(2)).toOption))
  .toDF()

I got this result:

+---+---------------------+---+
|S  |P                    |O  |
+---+---------------------+---+
|   |http://web/name      |   |
|   |http://web/country   |   |
|   |http://web/birthPlace|   |
+---+---------------------+---+

How can I fix this?

Answer

Please find below the answer, in case it is not clear how to replace your code with the built-in regex functionality in Spark. You need to be sure you understand how the regex works before using this approach.
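First, on why the second attempt produced empty columns. Here is a plain-Scala reproduction (outside Spark, and assuming the subjects in the file are written as `<http://...>` without the stray space shown in the sample): adding the bare `<` alternative turns every `<` into a delimiter, and `split` keeps the empty token that sits just before each one, so `p(0)` and `p(2)` land on those empty strings.

```scala
object SplitDemo extends App {
  val line = "<http://website/Jimmy_Car> <http://web/country> <http://website/United_States> ."

  // Same pattern as the second attempt: " ", "< ", "> ", " . ", or a bare "<"
  val parts = line.split(" |\\< |\\> |\\ . |<")

  parts.zipWithIndex.foreach { case (p, i) => println(s"p($i) = [$p]") }
  // p(0) and p(2) are empty: the bare "<" delimiter at the start of each
  // angle-bracketed term leaves an empty token just before it
}
```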

import org.apache.spark.sql.functions.regexp_extract

val df = Seq(
        ("< http://website/Jimmy_Carter>", "<http://web/name>", "\"James Earl Carter, Jr.\"@ko ."),
        ("< http://website/Jimmy_Car>", "<http://web/country>", "<http://website/United_States> ."),
        ("< http://website/Jimmy_Car>", "<http://web/birthPlace>", "<http://web/Georgia_(US)> .")
    ).toDF("S", "P", "O")

val url_regex = """^(?:"|<{1}\s?)(.*)(?:>(?:\s\.)?|,\s.*)$"""
val dfA = df.withColumn("S", regexp_extract($"S", url_regex, 1))
            .withColumn("P", regexp_extract($"P", url_regex, 1))
            .withColumn("O", regexp_extract($"O", url_regex, 1))

This outputs:

+---------------------------+---------------------+----------------------------+
|S                          |P                    |O                           |
+---------------------------+---------------------+----------------------------+
|http://website/Jimmy_Carter|http://web/name      |James Earl Carter           |
|http://website/Jimmy_Car   |http://web/country   |http://website/United_States|
|http://website/Jimmy_Car   |http://web/birthPlace|http://web/Georgia_(US)     |
+---------------------------+---------------------+----------------------------+
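The answer starts from a hand-built DataFrame, while in the question the triples come from a file. As a sketch of the same idea outside Spark (the whitespace-after-`>` term split below is an assumption of mine, not part of the answer), each line can first be cut into its three terms and each term then cleaned with `url_regex`:

```scala
object TtlSketch extends App {
  val url_regex = """^(?:"|<{1}\s?)(.*)(?:>(?:\s\.)?|,\s.*)$""".r

  // In the real job these lines would come from sc.textFile("triples_test.ttl")
  val lines = Seq(
    "< http://website/Jimmy_Carter> <http://web/name> \"James Earl Carter, Jr.\"@ko .",
    "< http://website/Jimmy_Car> <http://web/country> <http://website/United_States> ."
  )

  // Split on whitespace that follows '>', keeping at most 3 pieces so the
  // quoted literal object keeps its internal spaces
  def terms(line: String): Array[String] = line.split("(?<=>)\\s+", 3)

  // Strip the surrounding <...>, quotes, and trailing " ." with the regex
  def clean(term: String): String =
    url_regex.findFirstMatchIn(term).map(_.group(1)).getOrElse(term)

  for (Array(s, p, o) <- lines.map(terms))
    println((clean(s), clean(p), clean(o)))
}
```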

A short explanation of how the regex works, even if this is not the subject of the post:

  1. (?:"||< {1} \ s?)标识以" < <
  2. (.*)将匹配的内容提取到第一组中
  3. (?:>(?:\ s \.)?|,\ s.*)标识以> 结尾的行>.,\ s.* 詹姆斯·厄尔案的最后一个案子
  1. (?:"|<{1}\s?) Identify rows that start with " or < or <
  2. (.*) extract content of the matches into the 1st group
  3. (?:>(?:\s\.)?|,\s.*) Identify rows that end either with > or > . or ,\s.* the last for the James Earl case
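The three cases can be checked directly with Scala's `Regex` class, a small sanity check that is not part of the original answer:

```scala
object RegexCheck extends App {
  val url_regex = """^(?:"|<{1}\s?)(.*)(?:>(?:\s\.)?|,\s.*)$""".r

  def extract(s: String): Option[String] =
    url_regex.findFirstMatchIn(s).map(_.group(1))

  println(extract("< http://website/Jimmy_Carter>"))   // "< " start, ">" end
  println(extract("<http://website/United_States> .")) // "<" start, "> ." end
  println(extract("\"James Earl Carter, Jr.\"@ko "))   // opening quote, ",\s.*" end
}
```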
