inferSchema in spark csv package


Problem Description


I am trying to read a CSV file as a Spark DataFrame with inferSchema enabled, but I am then unable to get fv_df.columns. Below is the error message:

>>> fv_df = spark.read.option("header", "true").option("delimiter", "\t").csv('/home/h212957/FacilityView/datapoints_FV.csv', inferSchema=True)
>>> fv_df.columns
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/h212957/spark/python/pyspark/sql/dataframe.py", line 687, in columns
    return [f.name for f in self.schema.fields]
  File "/home/h212957/spark/python/pyspark/sql/dataframe.py", line 227, in schema
    self._schema = _parse_datatype_json_string(self._jdf.schema().json())
  File "/home/h212957/spark/python/pyspark/sql/types.py", line 894, in _parse_datatype_json_string
    return _parse_datatype_json_value(json.loads(json_string))
  File "/home/h212957/spark/python/pyspark/sql/types.py", line 911, in _parse_datatype_json_value
    return _all_complex_types[tpe].fromJson(json_value)
  File "/home/h212957/spark/python/pyspark/sql/types.py", line 562, in fromJson
    return StructType([StructField.fromJson(f) for f in json["fields"]])
  File "/home/h212957/spark/python/pyspark/sql/types.py", line 428, in fromJson
    _parse_datatype_json_value(json["type"]),
  File "/home/h212957/spark/python/pyspark/sql/types.py", line 907, in _parse_datatype_json_value
    raise ValueError("Could not parse datatype: %s" % json_value)
ValueError: Could not parse datatype: decimal(7,-31)

However, if I don't infer the schema, I am able to fetch the columns and do further operations. I can't understand why it works this way. Can anyone please explain?

Solution

I suggest you use the function '.load' rather than '.csv', something like this:

data = sc.read.load(path_to_file,
                    format='com.databricks.spark.csv', 
                    header='true', 
                    inferSchema='true').cache()

Of course you can add more options. Then you can simply get what you want:

data.columns
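For context, the error in the question most likely comes from the inferred type decimal(7,-31): values written in scientific notation can make Spark infer a decimal with a negative scale, which the fixed-decimal parser in that PySpark version cannot read back. If inference keeps failing this way, you can declare the schema yourself instead of inferring it. A minimal sketch; the column names below are hypothetical, so substitute the actual header of datapoints_FV.csv:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical columns -- replace with the real header of your file.
schema = StructType([
    StructField("facility", StringType(), True),
    StructField("value", DoubleType(), True),
])

# No type inference happens, so the problematic decimal type never appears.
fv_df = spark.read.option("header", "true") \
    .option("delimiter", "\t") \
    .csv('/home/h212957/FacilityView/datapoints_FV.csv', schema=schema)
fv_df.columns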

Another way of getting the columns is to read the file as plain text:

data = spark.sparkContext.textFile(path_to_file)  # textFile lives on the SparkContext, not the SQLContext

And to get the headers (columns) just use

data.first()
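Splitting that first line then gives the column names. A small sketch, assuming the same tab delimiter as in the question:

header = data.first()          # the first line of the file
columns = header.split("\t")   # column names, assuming a tab-delimited header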

It looks like you are trying to get your schema from your CSV file without being able to open it! The approaches above should help you get the columns and then manipulate whatever you like.

Note: to use '.columns' your 'sc' should be configured as:

spark = SparkSession.builder \
            .master("yarn") \
            .appName("experiment-airbnb") \
            .enableHiveSupport() \
            .getOrCreate()
sc = SQLContext(spark)
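As an aside, on PySpark 2.x and later the SQLContext wrapper is optional, since the SparkSession exposes the same reader directly. A sketch, not part of the original answer; leaving inferSchema off here also sidesteps the decimal(7,-31) error:

# Built-in csv source (Spark 2.0+); no SQLContext needed.
data = spark.read.load(path_to_file,
                       format='csv',
                       sep='\t',
                       header='true')
data.columns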

Good luck!
