How can we parse logs in Spark using a DataFrame?

Problem description

How can the log below be parsed into a DataFrame/Spark SQL table that can be queried later?

66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Recommended answer

You can basically split the string on a valid delimiter and then keep creating new columns, like this (assuming you are looking for something along these lines):

scala> val DF = spark.sparkContext.textFile("/Users/goldie/code/files/sampleStackOverflow.txt").toDF
DF: org.apache.spark.sql.DataFrame = [value: string]

scala> DF.show
+--------------------+
|               value|
+--------------------+
|66.249.69.97 - - ...|
|71.19.157.174 - -...|
|71.19.157.174 - -...|
|71.19.157.174 - -...|
|71.19.157.174 - -...|
+--------------------+

scala> DF.withColumn("IP Address",split(col("value")," - - ")(0)).show
+--------------------+-------------+
|               value|   IP Address|
+--------------------+-------------+
|66.249.69.97 - - ...| 66.249.69.97|
|71.19.157.174 - -...|71.19.157.174|
|71.19.157.174 - -...|71.19.157.174|
|71.19.157.174 - -...|71.19.157.174|
|71.19.157.174 - -...|71.19.157.174|
+--------------------+-------------+

Added four columns (as per your previous file):

scala> :paste
// Entering paste mode (ctrl-D to finish)

DF.withColumn("IP Address",split(col("value")," - - ")(0)).
withColumn("temp1",split(col("value")," - - ")(1)).
withColumn("Time",concat(split(col("temp1")," ")(0),split(col("temp1")," ")(1))).
withColumn("col3",substring(split(col("temp1")," ")(2),2,3)).
withColumn("Col4",split(col("temp1")," ")(3)).
select(col("IP Address"),col("Time"),col("col3"),col("col4")).show

// Exiting paste mode, now interpreting.

+-------------+--------------------+----+-------------------+
|   IP Address|                Time|col3|               col4|
+-------------+--------------------+----+-------------------+
| 66.249.69.97|[24/Sep/2014:22:2...| GET|     /071300/242153|
|71.19.157.174|[24/Sep/2014:22:2...| GET|             /error|
|71.19.157.174|[24/Sep/2014:22:2...| GET|       /favicon.ico|
|71.19.157.174|[24/Sep/2014:22:2...| GET|                  /|
|71.19.157.174|[24/Sep/2014:22:2...| GET|/jobmineimg.php?q=m|
+-------------+--------------------+----+-------------------+
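
The split-based approach above works for this sample, but it is positional: substring(...,2,3) always takes exactly three characters, so a POST request would come out as "POS", and a line that deviates from the expected layout shifts every column. As a sketch of a sturdier alternative (not part of the original answer), the whole line can be matched against a regular expression for the Apache access-log format with regexp_extract, and the result registered as a temporary view so it can be queried with Spark SQL, which is what the question asks for. The file path and view name here are placeholders.

import org.apache.spark.sql.functions.{col, regexp_extract}

// One capture group per field: IP, timestamp, method, path, status, bytes
val logPattern = "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+)[^\"]*\" (\\d+) (\\S+)"

val raw = spark.read.text("/path/to/access.log")   // placeholder path

val logs = raw.select(
  regexp_extract(col("value"), logPattern, 1).as("ip"),
  regexp_extract(col("value"), logPattern, 2).as("time"),
  regexp_extract(col("value"), logPattern, 3).as("method"),
  regexp_extract(col("value"), logPattern, 4).as("path"),
  regexp_extract(col("value"), logPattern, 5).cast("int").as("status"),
  regexp_extract(col("value"), logPattern, 6).as("bytes"))

// Register the result as a table so it can be queried later, as the question asks
logs.createOrReplaceTempView("access_logs")
spark.sql("SELECT ip, path, status FROM access_logs WHERE status = 404").show

Lines that do not match the pattern come back from regexp_extract as empty strings (and a null status after the cast), which makes malformed rows easy to filter out.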
