Trying to parse text files in python for data analysis


Question

I do a lot of data analysis in perl and I am trying to replicate this work in python using pandas, numpy, matplotlib, etc.

The general workflow is as follows (a rough Python sketch of these steps follows the list):

1) glob all the files in a directory

2) parse the files because they have metadata

3) use regex to isolate relevant lines in a given file (they usually begin with a tag such as 'LOOPS')

4) split the lines that match the tag and load the data into hashes

5) do some data analysis

6) plot some graphs
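
A rough sketch of how those six steps might look in Python with glob, re, and pandas; the directory pattern, field positions, and column names here are placeholders inferred from the data sample further down, not a drop-in solution:

import glob
import re
import pandas as pd

records = []
# 1) glob all the files in a directory (placeholder pattern)
for filename in glob.glob('somedir/BlockedWflow_low_*'):
    # 2) the metadata lives in the file name; split it apart here if needed
    with open(filename) as f:
        for line in f:
            # 3) isolate the relevant lines by their leading tag
            if re.match(r'^LOOPS', line):
                # 4) split the matched line and keep the fields of interest
                fields = line.split()
                records.append((filename, int(fields[1]), float(fields[6])))

# 5) load everything into a DataFrame and do some analysis
df = pd.DataFrame(records, columns=['file', 'smearing_time', 'value'])
averages = df.groupby('smearing_time')['value'].mean()

# 6) plot (requires matplotlib; call plt.show() in a script)
averages.plot()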

Here is a sample of what I typically do in perl:

print"Reading File:\n";                              # gets data
foreach my $vol ($SmallV, $LargeV) {
  my $base_name = "${NF}flav_${vol}/BlockedWflow_low_${vol}_[0-9].[0-9]_-0.25_$Mass{$vol}.";
  my @files = <$base_name*>;                         # globs for file names
  foreach my $f (@files) {                           # loops through matching files
    print"... $f\n";
    my @split = split(/_/, $f);
    my $beta = $split[4];
    if (!grep{$_ eq $beta} @{$Beta{$vol}}) {         # constructs Beta hash
      push(@{$Beta{$vol}}, $split[4]);
    }
    open(IN, "<", "$f") or die "cannot open < $f: $!"; # reads in the file
    chomp(my @in = <IN>);
    close IN;
    my @lines = grep{$_=~/^LOOPS/} @in;       # greps for lines with the header LOOPS
    foreach my $l (@lines) {                  # loops through matched lines
      my @split = split(/\s+/, $l);           # splits matched lines
      push(@{$val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}}, $split[6]);# reads data into hash
      if (!grep{$_ eq $split[1]} @smearingt) {# fills the smearing time array
        push(@smearingt, $split[1]);
      }
      if (!grep{$_ eq $split[4]} @{$block{$vol}}) {# fills the number of blockings
        push(@{$block{$vol}}, $split[4]);
      }
    }
  }
  foreach my $beta (@{$Beta{$vol}}) {
    foreach my $loop (0,1,2,3,4) {         # loops over observables
      foreach my $b (@{$block{$vol}}) {    # beta values
        foreach my $t (@smearingt) {       # and smearing times
          $avg{$vol}{$beta}{$t}{$loop}{$b} = stat_mod::avg(@{$val{$vol}{$beta}{$t}{$loop}{$b}});     # to find statistics
          $err{$vol}{$beta}{$t}{$loop}{$b} = stat_mod::stdev(@{$val{$vol}{$beta}{$t}{$loop}{$b}});
        }
      }
    }
  }
}
print"File Read in Complete!\n";

My hope is to load this data into a hierarchically indexed data structure, with the indices of the perl hash becoming the indices of my python data structure. Every example of pandas data structures I have come across so far has been highly contrived, with the whole structure (indices and values) assigned manually in one command and then manipulated to demonstrate all the features of the data structure. Unfortunately I cannot assign the data all at once, because I don't know what masses, betas, sizes, etc. are in the data that is going to be analyzed. Am I doing this the wrong way? Does anyone know a better way? The data files are immutable, and I will have to parse through them using regex, which I understand how to do. What I need help with is putting the data into an appropriate data structure so that I can take averages and standard deviations, perform mathematical operations, and plot the data.

Typical data has a header that is an unknown number of lines long, but the stuff I care about looks like this:

Alpha 0.5 0.5 0.4
Alpha 0.5 0.5 0.4
LOOPS 0 0 0 2 0.5 1.7800178
LOOPS 0 1 0 2 0.5 0.84488326
LOOPS 0 2 0 2 0.5 0.98365135  
LOOPS 0 3 0 2 0.5 1.1638834
LOOPS 0 4 0 2 0.5 1.0438407
LOOPS 0 5 0 2 0.5 0.19081102
POLYA NHYP 0 2 0.5 -0.0200002 0.119196 -0.0788721 -0.170488 
BLOCKING COMPLETED
Blocking time 1.474 seconds
WFLOW 0.01 1.57689 2.30146 0.000230146 0.000230146 0.00170773 -0.0336667
WFLOW 0.02 1.66552 2.28275 0.000913101 0.00136591 0.00640552 -0.0271222
WFLOW 0.03 1.75 2.25841 0.00203257 0.00335839 0.0135 -0.0205722
WFLOW 0.04 1.83017 2.22891 0.00356625 0.00613473 0.0224607 -0.0141664
WFLOW 0.05 1.90594 2.19478 0.00548695 0.00960351 0.0328218 -0.00803792
WFLOW 0.06 1.9773 2.15659 0.00776372 0.0136606 0.0441807 -0.00229793
WFLOW 0.07 2.0443 2.1149 0.010363 0.018195 0.0561953 0.00296648
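
As far as I can tell from the Perl hash keys above, the whitespace-split fields of a LOOPS line map as follows; a minimal sketch of the same split in Python (the field names are my own labels, not anything defined by the file format):

# One LOOPS line from the sample above.
line = "LOOPS 0 2 0 2 0.5 0.98365135"
fields = line.split()          # same effect as Perl's split(/\s+/, $l)

tag        = fields[0]         # 'LOOPS'
smearing_t = fields[1]         # Perl $split[1], the smearing time
observable = fields[2]         # Perl $split[2], the loop/observable index
blocking   = fields[4]         # Perl $split[4], the blocking level
value      = float(fields[6])  # Perl $split[6], the measured value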

What I think I want (I preface this with 'think' because I am new to python and an expert may know a better data structure) is a hierarchically indexed Series that would look like this:

volume   mass   beta   observable   t   value

1224     0.0    5.6    0            0   1.234
                                    1   1.490
                                    2   1.222
                       1            0   1.234
                                    1   1.234
2448     0.0    5.7    0            1   1.234

and so on, as described in http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-hierarchical
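
One way to build such a structure without knowing the volumes, masses, betas, etc. in advance is to collect plain tuples while parsing and construct the MultiIndex only at the end; a minimal sketch with made-up numbers, using the column names from the table above:

import pandas as pd

# Records accumulated during parsing; the values here are invented for illustration.
# Each tuple is (volume, mass, beta, observable, t, value).
records = [
    (1224, 0.0, 5.6, 0, 0, 1.234),
    (1224, 0.0, 5.6, 0, 1, 1.490),
    (1224, 0.0, 5.6, 0, 1, 1.510),   # repeated measurement at the same index
    (2448, 0.0, 5.7, 0, 1, 1.234),
]

index = pd.MultiIndex.from_tuples([r[:5] for r in records],
                                  names=['volume', 'mass', 'beta', 'observable', 't'])
data = pd.Series([r[5] for r in records], index=index)

# Rough equivalents of stat_mod::avg and stat_mod::stdev: group repeated
# measurements that share the same index labels and reduce each group.
levels = ['volume', 'mass', 'beta', 'observable', 't']
avg = data.groupby(level=levels).mean()
err = data.groupby(level=levels).std()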

For those of you who don't understand the perl:

The meat and potatoes of what I need is this:

push(@{$val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}}, $split[6]);# reads data into hash

What I have here is a hash called 'val'. This is a hash of arrays; I believe in python speak this would be a dict of lists. Each thing that looks like '{$something}' is a key in the hash 'val', and I am appending the value stored in the variable $split[6] to the end of the array that is the hash element specified by all 5 keys. This is the fundamental issue with my data: there are a lot of keys for each quantity that I am interested in.
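
For a literal translation of that line, the closest Python idiom I know of is a defaultdict of lists keyed by a tuple of the same five values; a minimal sketch (the helper name and argument names are mine):

from collections import defaultdict

# val maps (vol, beta, smearing_t, observable, blocking) -> list of measurements,
# mirroring the Perl push into the nested hash of arrays above.
val = defaultdict(list)

def add_measurement(vol, beta, smearing_t, observable, blocking, measurement):
    # the equivalent of: push(@{$val{$vol}{$beta}{...}{...}{...}}, $split[6]);
    val[(vol, beta, smearing_t, observable, blocking)].append(measurement)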

==========

I have come up with the following code, which results in this error:

Traceback (most recent call last):
  File "wflow_2lattice_matching.py", line 39, in <module>
    index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time, smearing_time'])
NameError: name 'MultiIndex' is not defined

Code:

#!/usr/bin/python

from pandas import Series, DataFrame
import pandas as pd
import glob
import re
import numpy

flavor = 4
mass = 0.0

vol = []
b = []
m_t = []
w_t = []
val = []

#tup_vol = (1224, 1632, 2448)
tup_vol = 1224, 1632
for v in tup_vol:
  filelist = glob.glob(str(flavor)+'flav_'+str(v)+'/BlockedWflow_low_'+str(v)+'_*_0.0.*')
  for filename in filelist:
    print 'Reading filename:  '+filename
    f = open(filename, 'r')
    junk, start, vv, beta, junk, mass, mont_t = re.split('_', filename)
    ftext = f.readlines()
    for line in ftext:
      if re.match('^WFLOW.*', line):
        line=line.strip()
        junk, smear_t, junk, junk, wilson_flow, junk, junk, junk = re.split('\s+', line)
        vol.append(v)
        b.append(beta)
        m_t.append(mont_t)
        w_t.append(smear_t)
        val.append(wilson_flow)
zipped = zip(vol, beta, m_t, w_t)
index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time, smearing_time'])
data = Series(val, index=index)

Answer

You are getting the following:

NameError: name 'MultiIndex' is not defined

because you are not importing MultiIndex directly when you import Series and DataFrame.

You have -

from pandas import Series, DataFrame

You need -

from pandas import Series, DataFrame, MultiIndex

Or you can refer to MultiIndex as pd.MultiIndex instead, since you are importing pandas as pd.
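
Applied to the code in the question, a minimal sketch of the corrected tail end, keeping the question's variable names; the two extra comments flag issues noticed in passing, separate from the NameError itself:

from pandas import Series, DataFrame, MultiIndex  # MultiIndex now imported explicitly
import pandas as pd

# ... the globbing/parsing loop from the question, unchanged ...

# presumably the collected list b is wanted here, not the per-file string beta
zipped = zip(vol, b, m_t, w_t)

# pd.MultiIndex.from_tuples(...) works equally well here; note that the original
# names list had 'montecarlo_time, smearing_time' fused into a single string
index = MultiIndex.from_tuples(zipped,
                               names=['volume', 'beta', 'montecarlo_time', 'smearing_time'])
data = Series(val, index=index)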
