Reading csv to array, performing linear regression on array and writing to csv in Python depending on gradient

Problem description

I have to tackle a problem that far exceeds my current Python programming skills. I am having difficulty combining different modules (csv reader, numpy, etc.) into a single script. My data contains a large list of weather variables over time (at one-minute resolution) for many days. My objective is to determine the trend of the wind speed between 9am and 12pm for every day in the list. If the gradient of the wind speed is positive, I wish to write the date on which this occurred to a new csv file, along with the wind direction.

The data extends for thousands of rows and looks like this:

hd,Station Number,Year Month Day Hours Minutes in YYYY,MM,DD,HH24,MI format in Local time,Year Month Day Hours Minutes in YYYY,MM,DD,HH24,MI format in Local standard time,Year Month Day Hours Minutes in YYYY,MM,DD,HH24,MI format in Universal coordinated time,Precipitation since last (AWS) observation in mm,Quality of precipitation since last (AWS) observation value,Air Temperature in degrees Celsius,Quality of air temperature,Air temperature (1-minute maximum) in degrees Celsius,Quality of air temperature (1-minute maximum),Air temperature (1-minute minimum) in degrees Celsius,Quality of air temperature (1-minute minimum),Wet bulb temperature in degrees Celsius,Quality of Wet bulb temperature,Wet bulb temperature (1 minute maximum) in degrees Celsius,Quality of wet bulb temperature (1 minute maximum),Wet bulb temperature (1 minute minimum) in degrees Celsius,Quality of wet bulb temperature (1 minute minimum),Dew point temperature in degrees Celsius,Quality of dew point temperature,Dew point temperature (1-minute maximum) in degrees Celsius,Quality of Dew point Temperature (1-minute maximum),Dew point temperature (1 minute minimum) in degrees Celsius,Quality of Dew point Temperature (1 minute minimum),Relative humidity in percentage %,Quality of relative humidity,Relative humidity (1 minute maximum) in percentage %,Quality of relative humidity (1 minute maximum),Relative humidity (1 minute minimum) in percentage %,Quality of Relative humidity (1 minute minimum),Wind (1 minute) speed in km/h,Wind (1 minute) speed quality,Minimum wind speed (over 1 minute) in km/h,Minimum wind speed (over 1 minute) quality,Wind (1 minute) direction in degrees true,Wind (1 minute) direction quality,Standard deviation of wind (1 minute),Standard deviation of wind (1 minute) direction quality,Maximum wind gust (over 1 minute) in km/h,Maximum wind gust (over 1 minute) quality,Visibility (automatic - one minute data) in km,Quality of visibility (automatic - one minute data),Mean sea level pressure in hPa,Quality of mean sea level pressure,Station level pressure in hPa,Quality of station level pressure,QNH pressure in hPa,Quality of QNH pressure,#
hd, 40842,2000,03,20,10,50,2000,03,20,10,50,2000,03,20,00,50,      ,N, 25.7,N, 25.7,N, 25.6,N, 21.5,N, 21.5,N, 21.4,N, 19.2,N, 19.2,N, 19.0,N, 67,N, 68,N, 66,N, 13,N,  9,N,100,N,  4,N, 15,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,51,2000,03,20,10,51,2000,03,20,00,51,   0.0,N, 25.6,N, 25.8,N, 25.6,N, 21.5,N, 21.6,N, 21.5,N, 19.2,N, 19.4,N, 19.2,N, 68,N, 68,N, 66,N, 11,N,  9,N,107,N, 11,N, 13,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,52,2000,03,20,10,52,2000,03,20,00,52,   0.0,N, 25.8,N, 25.8,N, 25.6,N, 21.7,N, 21.7,N, 21.5,N, 19.5,N, 19.5,N, 19.2,N, 68,N, 69,N, 66,N, 11,N,  9,N, 83,N, 13,N, 13,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,53,2000,03,20,10,53,2000,03,20,00,53,   0.0,N, 25.8,N, 25.9,N, 25.8,N, 21.6,N, 21.8,N, 21.6,N, 19.3,N, 19.6,N, 19.3,N, 67,N, 68,N, 66,N,  9,N,  8,N, 87,N, 14,N, 11,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,54,2000,03,20,10,54,2000,03,20,00,54,   0.0,N, 25.8,N, 25.8,N, 25.8,N, 21.6,N, 21.6,N, 21.6,N, 19.3,N, 19.3,N, 19.2,N, 67,N, 67,N, 67,N,  8,N,  4,N, 98,N, 23,N,  9,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,55,2000,03,20,10,55,2000,03,20,00,55,   0.0,N, 25.7,N, 25.8,N, 25.7,N, 21.5,N, 21.6,N, 21.5,N, 19.2,N, 19.3,N, 19.2,N, 67,N, 68,N, 66,N,  8,N,  4,N, 68,N, 15,N,  9,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,56,2000,03,20,10,56,2000,03,20,00,56,   0.0,N, 25.9,N, 25.9,N, 25.7,N, 21.7,N, 21.7,N, 21.5,N, 19.4,N, 19.4,N, 19.2,N, 67,N, 68,N, 66,N,  8,N,  5,N, 69,N, 16,N,  9,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,57,2000,03,20,10,57,2000,03,20,00,57,   0.0,N, 26.0,N, 26.0,N, 25.9,N, 21.8,N, 21.8,N, 21.7,N, 19.5,N, 19.5,N, 19.4,N, 67,N, 68,N, 66,N,  9,N,  5,N, 72,N, 10,N, 11,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,58,2000,03,20,10,58,2000,03,20,00,58,   0.0,N, 26.0,N, 26.1,N, 26.0,N, 21.7,N, 21.8,N, 21.7,N, 19.4,N, 19.5,N, 19.3,N, 66,N, 67,N, 66,N,  8,N,  5,N, 69,N, 13,N, 11,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#

The completed file, which contains only the dates on which the wind speed increased from 9am to 12pm, will hopefully be of the form below:

date,wind direction,gradient_of_wind_speed,
2000/3/25,108,0.7,
2000/4/17,67,0.4,
...

The exact value of the gradient is not of importance, only whether it is positive, so it would be fine to construct a second array of the form (1,2,3,4,5...) to use as the second dimension of the array for the linear regression. The challenge lies in the fact that many days have missing data, so although the array should have length 180 (i.e. 180 minutes between 9am and 12pm) it will in actuality have a varying length.
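As a rough illustration of the point above, here is a minimal sketch of fitting a line to a morning with gaps. Because only the sign of the slope matters, the x values can be a plain counter or the actual minute offsets since 9am; using the real offsets means missing minutes do not distort the fit. The helper name `morning_slope` and the sample numbers are made up for illustration and are not part of the original data.

```python
import numpy as np

def morning_slope(minutes, speeds):
    """Least-squares slope of wind speed against minutes since 9am.

    Both inputs may hold fewer than 180 entries when readings are missing.
    Returns None when fewer than two usable points remain.
    """
    minutes = np.asarray(minutes, dtype=float)
    speeds = np.asarray(speeds, dtype=float)
    valid = ~np.isnan(speeds)  # keep only minutes that have a reading
    if valid.sum() < 2:
        return None
    slope, _intercept = np.polyfit(minutes[valid], speeds[valid], 1)
    return slope

# Hypothetical morning: only 6 of the 180 minutes were recorded,
# and one of those has no wind speed value.
minutes = [0, 1, 5, 60, 120, 179]
speeds = [8.0, 9.0, np.nan, 11.0, 13.0, 15.0]
slope = morning_slope(minutes, speeds)
print(slope > 0)
```

The same call works for any number of points from 2 up to 180, so days with gaps need no special handling beyond the "fewer than two points" check.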

Is this challenge more easily tackled through multiple scripts (bearing in mind I have to do this for 100+ files) or is there some easy way of tackling this challenge in a single script?

Attempted code:

import glob
import pandas as pd
import numpy as np

for file in glob.glob('X:/brisbaneweatherdata/*.txt'):
    df = pd.read_csv(file)
    for date, group in df.groupby(['Year Month Day Hours Minutes in YYYY','MM','DD']):
        morning_data = group[group.HH24.between('09','12')]
        # calculate your linear regression here
        gradient, intercept = np.polyfit(morning_data.HH24,morning_data['Wind (1 minute) speed in km/h'], 1)
        wind_direction= np.average(morning_data.HH24,morning_data['Wind (1 minute) direction in degrees true'])
        if gradient > 0 :
            print(date + "," + gradient + "," + wind_direction)

Error message received:

runfile('X:/python/linearregression.py', wdir='X:/python')
X:/python/linearregression.py:1: DtypeWarning: Columns (17,25,27,29,31,33,35,37,55,57,59) have mixed types. Specify dtype option on import or set low_memory=False.
  import glob
Traceback (most recent call last):

  File "<ipython-input-26-ace8af14da2c>", line 1, in <module>
    runfile('X:/python/linearregression.py', wdir='X:/python')

  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)

  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "X:/python/linearregression.py", line 8, in <module>
    morning_data = group[group.HH24.between('09','12')]

  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\series.py", line 2486, in between
    lmask = self >= left

  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\ops.py", line 761, in wrapper
    res = na_op(values, other)

  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\ops.py", line 716, in na_op
    raise TypeError("invalid type comparison")

TypeError: invalid type comparison

Solution

I think you should be able to do this in a fairly simple script, using glob to iterate through your files and pandas to read in your data. Here is a basic outline of how I would structure it:

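import glob
import numpy as np
import pandas as pd

for file in glob.glob('data/*'):
    df = pd.read_csv(file)
    # 'year', 'month', 'day', 'wind speed' and 'wind direction' are placeholder column names
    for date, group in df.groupby(['year', 'month', 'day']):
        # compare against integers (assuming HH24 is parsed as numbers); hours 9-11 cover 09:00 to 11:59
        morning_data = group[group.HH24.between(9, 11)]
        # calculate your linear regression here
        gradient, intercept = np.polyfit(morning_data.HH24, morning_data['wind speed'], 1)
        wind_direction = morning_data['wind direction'].mean()
        if gradient > 0:
            # date is a (year, month, day) tuple, so join it before printing
            print("/".join(map(str, date)), wind_direction, gradient, sep=",")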
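The `TypeError: invalid type comparison` in the question most likely comes from comparing a numeric `HH24` column with the strings `'09'` and `'12'`, and the `DtypeWarning` suggests that some columns mix numbers with blanks. The sketch below puts the pieces together under some loudly stated assumptions: the columns have been renamed to the short hypothetical names `year`, `month`, `day`, `hour`, `minute`, `wind_speed` and `wind_dir`, and the wind direction is summarised with a plain arithmetic mean (which ignores the wrap-around at 360 degrees). The function name `rising_wind_days` and the sample rows are illustrative only.

```python
import io
import numpy as np
import pandas as pd

def rising_wind_days(df):
    """Return one row per day whose 9am-12pm wind speed has a positive slope."""
    cols = ["hour", "minute", "wind_speed", "wind_dir"]
    df = df.copy()
    # Coerce to numbers so blanks or stray text become NaN instead of
    # triggering a type error in later comparisons.
    df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")
    # Hours 9, 10 and 11 cover 09:00 through 11:59.
    morning = df[df["hour"].between(9, 11)].dropna(subset=cols)
    rows = []
    for (year, month, day), group in morning.groupby(["year", "month", "day"]):
        if len(group) < 2:
            continue  # not enough points to fit a line
        x = (group["hour"] - 9) * 60 + group["minute"]  # minutes since 9am
        gradient, _intercept = np.polyfit(x, group["wind_speed"], 1)
        if gradient > 0:
            rows.append({
                "date": f"{year}/{month}/{day}",
                "wind direction": round(group["wind_dir"].mean(), 1),
                "gradient_of_wind_speed": round(gradient, 3),
            })
    return pd.DataFrame(
        rows, columns=["date", "wind direction", "gradient_of_wind_speed"]
    )

# Made-up sample: the first day rises, the second falls and has a blank reading.
sample = io.StringIO(
    "year,month,day,hour,minute,wind_speed,wind_dir\n"
    "2000,03,25,09,00, 8,100\n"
    "2000,03,25,10,30,11,110\n"
    "2000,03,25,11,59,14,114\n"
    "2000,03,26,09,00,15, 90\n"
    "2000,03,26,10,00,   , 95\n"
    "2000,03,26,11,00, 9, 80\n"
)
df = pd.read_csv(sample, skipinitialspace=True)
result = rising_wind_days(df)
csv_text = result.to_csv(index=False)  # pass a file path instead to write to disk
print(csv_text)
```

For the 100+ files mentioned in the question, one option is to call the function inside the `glob` loop and collect the per-file results with `pd.concat` before writing a single output file, so a single script should be enough.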
