Splitting .ttl or .nt file - Spark Scala
Question
I'm new to Scala and I need to read a .ttl file line by line, split on particular delimiters, and extract the values into the respective columns of a DataFrame.
< http://website/Jimmy_Carter> <http://web/name> "James Earl Carter, Jr."@ko .
< http://website/Jimmy_Car> <http://web/country> <http://website/United_States> .
< http://website/Jimmy_Car> <http://web/birthPlace> <http://web/Georgia_(US)> .
I want this output:
+---------------------------+---------------------+----------------------------+
|S                          |P                    |O                           |
+---------------------------+---------------------+----------------------------+
|http://website/Jimmy_Carter|http://web/name      |James Earl Carter           |
|http://website/Jimmy_Car   |http://web/country   |http://website/United_States|
|http://website/Jimmy_Car   |http://web/birthPlace|http://web/Georgia_(US)     |
+---------------------------+---------------------+----------------------------+
I tried this code:
case class T(S: Option[String], P: Option[String],O:Option[String])
val triples = sc.textFile("triples_test.ttl").map(_.split(" |\\< |\\> |\\ . ")).map(p =>
T(Try(p(0).toString()).toOption,Try(p(1).toString()).toOption,Try(p(2).toString()).toOption)).toDF()
I got this result:
+-------------------------+----------------------+------------------------+
|S                        |P                     |O                       |
+-------------------------+----------------------+------------------------+
|<http://website/Jimmy_Car|<http://web/name      |"James                  |
|<http://website/Jimmy_Car|<http://web/country   |<http://web/country     |
|<http://website/Jimmy_Car|<http://web/birthPlace|<http://web/Georgia_(US)|
+-------------------------+----------------------+------------------------+
To remove the separator "<" at the beginning of each triple, I added "|<" to the split:
val triples = sc.textFile("triples_test.ttl").map(_.split(" |\\< |\\> |\\ . |<")).map(p =>
T(Try(p(0).toString()).toOption,Try(p(1).toString()).toOption,Try(p(2).toString()).toOption)).toDF()
I got this result:
+---+---------------------+---+
|S  |P                    |O  |
+---+---------------------+---+
|   |http://web/name      |   |
|   |http://web/country   |   |
|   |http://web/birthPlace|   |
+---+---------------------+---+
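To see what the split is actually producing, the failing call can be reproduced outside Spark (a plain-Scala sketch; the sample line is taken from the question). The bare "<" alternative matches immediately before each bracketed term, which inserts empty tokens into the array, so p(0) and p(2) land on those empties:

```scala
// Reproducing the question's second split outside Spark to inspect the tokens.
val line = "< http://website/Jimmy_Car> <http://web/country> <http://website/United_States> ."
val parts = line.split(" |\\< |\\> |\\ . |<")
parts.zipWithIndex.foreach { case (t, i) => println(s"p($i) = '$t'") }
// p(0) and p(2) are empty strings: the bare "<" alternative matches right
// before each URI, inserting an empty token, so S and O come out blank
```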
How can I fix this?
Answer
Please find the answer below, in case it is not clear how to replace your code with the built-in regex functionality in Spark. Be sure you understand how the regex works before using this approach.
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._

val df = Seq(
  ("< http://website/Jimmy_Carter>", "<http://web/name>", "\"James Earl Carter, Jr.\"@ko ."),
  ("< http://website/Jimmy_Car>", "<http://web/country>", "<http://website/United_States> ."),
  ("< http://website/Jimmy_Car>", "<http://web/birthPlace>", "<http://web/Georgia_(US)> .")
).toDF("S", "P", "O")

val url_regex = """^(?:"|<{1}\s?)(.*)(?:>(?:\s\.)?|,\s.*)$"""

val dfA = df
  .withColumn("S", regexp_extract($"S", url_regex, 1))
  .withColumn("P", regexp_extract($"P", url_regex, 1))
  .withColumn("O", regexp_extract($"O", url_regex, 1))
This outputs:
+---------------------------+---------------------+----------------------------+
|S |P |O |
+---------------------------+---------------------+----------------------------+
|http://website/Jimmy_Carter|http://web/name |James Earl Carter |
|http://website/Jimmy_Car |http://web/country |http://website/United_States|
|http://website/Jimmy_Car |http://web/birthPlace|http://web/Georgia_(US) |
+---------------------------+---------------------+----------------------------+
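As a complementary sketch (not part of the original answer): instead of cleaning each column separately, one whole-line regex can split a raw triple line into S, P and O in a single pass. Plain Scala is shown here so the pattern is easy to test; in Spark the same pattern can be used with regexp_extract and the corresponding group index. Names (lineRegex, parseTriple) are hypothetical, and note that unlike the answer's per-column regex this keeps the full literal, including the part after the comma:

```scala
// Hypothetical one-pass parser for the question's triple lines.
// Group 1: subject URI, group 2: predicate URI,
// group 3: object URI  OR  group 4: quoted literal (optionally tagged @lang).
val lineRegex = """^<\s*(\S+)>\s+<(\S+)>\s+(?:<(\S+)>|"([^"]*)"\S*)\s*\.\s*$""".r

def parseTriple(line: String): Option[(String, String, String)] = line match {
  // With Scala regex pattern matching, unmatched groups bind to null,
  // so the object is whichever of (uri, lit) actually matched.
  case lineRegex(s, p, uri, lit) => Some((s, p, Option(uri).getOrElse(lit)))
  case _                         => None
}

println(parseTriple("""< http://website/Jimmy_Carter> <http://web/name> "James Earl Carter, Jr."@ko ."""))
println(parseTriple("< http://website/Jimmy_Car> <http://web/country> <http://website/United_States> ."))
```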
A little explanation of how the regex works, even if this is not the subject of the post:
- (?:"|<{1}\s?) identifies values that start with ", with <, or with "< " (a < followed by a space)
- (.*) extracts the content of the match into the 1st group
- (?:>(?:\s\.)?|,\s.*) identifies values that end with >, with "> .", or with a comma followed by the rest of the string — the last case handles the "James Earl Carter, Jr." literal