pandas returns incorrect values for integer columns when reading a SAS file


Question

I have a SAS dataset, and when I print it in SAS I get the following output:

I also have the following Python code, which reads the .sas7bdat file and prints the first five observations.

import pandas as pd

file_name = "cars.sas7bdat"
my_df = pd.read_sas(file_name)  # load the SAS dataset into a DataFrame
my_df = my_df.head()            # keep only the first five observations
print(my_df)

As you can see, it doesn't work correctly for integer data. The CYL and WGT variables are integers, but their values are not displayed correctly when I use pandas' read_sas function.

Any idea what is going on here?

Recommended answer

SAS represents all numbers as 64-bit (8-byte) floating-point values. You can save disk space, however, by telling it to store fewer than 8 bytes for a variable. The dataset you posted does this for CYL and WGT.

When SAS reads such a dataset back from disk, it sets the missing least significant bytes to binary zeros. Apparently read_sas does not account for this: instead of zero-filling the missing bytes, it fills them with something else, hence the seemingly random digits.
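The effect of the padding can be illustrated with a minimal sketch. It is not the code of SAS or of read_sas; the helper name truncate_and_pad, the 3-byte storage length, and the 0x5A filler byte are assumptions made for illustration, and the bytes are handled in big-endian order to match the hex notation used in this answer.

```python
import struct


def truncate_and_pad(value: float, stored_bytes: int, filler: bytes = b"\x00") -> float:
    """Keep the most significant `stored_bytes` bytes of a double and
    fill the dropped low-order bytes with `filler` before decoding.

    Hypothetical helper: it only mimics the idea of a shortened numeric
    variable, not the actual sas7bdat on-disk layout.
    """
    if not 3 <= stored_bytes <= 8:
        raise ValueError("stored_bytes must be between 3 and 8")
    full = struct.pack(">d", value)             # 8-byte big-endian IEEE 754 double
    kept = full[:stored_bytes]                  # bytes that would be stored on disk
    padding = (filler * 8)[: 8 - stored_bytes]  # stand-in for the dropped bytes
    return struct.unpack(">d", kept + padding)[0]


# Zero-filling the dropped bytes restores the integer exactly.
zero_padded = truncate_and_pad(8.0, 3)

# Filling them with arbitrary non-zero bytes gives a value slightly above 8.
garbage_padded = truncate_and_pad(8.0, 3, filler=b"\x5a")

print(zero_padded)
print(garbage_padded)
```

Small integers such as 8 survive the truncation because their low-order mantissa bytes are already zero, so any non-zero filler shows up as spurious fractional digits.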

The first value of CYL is 8, which as an IEEE 754 double has the following byte pattern in hexadecimal:

40 20 00 00 00 00 00 00

The value you displayed, 8.000046, corresponds to this pattern instead:

40 20 00 06 07 80 FD C1
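Both byte patterns can be checked with Python's standard struct module. This is a small verification sketch; the helper name decode_double is hypothetical, and the bytes are interpreted in big-endian order, as they are written above.

```python
import struct


def decode_double(hex_bytes: str) -> float:
    """Decode 8 hex-encoded bytes (big-endian) as an IEEE 754 double."""
    raw = bytes.fromhex(hex_bytes)  # fromhex skips the spaces between byte pairs
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    return struct.unpack(">d", raw)[0]


expected = decode_double("40 20 00 00 00 00 00 00")  # low-order bytes are zero
observed = decode_double("40 20 00 06 07 80 FD C1")  # low-order bytes are non-zero

print(expected)
print(observed)
```

The first pattern decodes to exactly 8.0 and the second to approximately 8.000046. The two differ only in the low-order bytes, which is consistent with the explanation that the truncated bytes were not zero-filled.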
