Use Python to select rows with a particular range of values in one column

Problem description

I know this is simple, but I'm a new user to Python so I'm having a bit of trouble here. I'm using Python 3 by the way.

I have multiple files that look something like this:

Name Date Age Sex Color
Ray  May  25.1 M  Gray
Alex Apr  22.3 F  Green
Ann  Jun  15.7 F  Blue

(Pretend this is tab delimited. I should add that the real file will have about 3,000 rows and 17-18 columns)

What I want to do is select all the rows which have a value in the age column which is less than 23.

In this example, the output would be:

Name Date Age Sex Color
Alex Apr  22.3 F  Green
Ann  Jun  15.7 F  Blue

Here's what I tried to do:

f = open("addressbook1.txt",'r')
line = f.readlines()
file_data =[line.split("\t")]
f.close()

for name, date, age, sex, color in file_data:
    if age in line_data < 23:
        g = open("college_age.txt",'a')
        g.write(line)
    else:
        h = open("adult_age.txt",'a')
        h.write(line)

Ideally, I have 20-30 of these "addressbook" input files, and I want this script to loop through them all and add every entry with an age under 23 to the same output file ("college_age.txt"). I really don't need to keep the other lines, but I didn't know what else to do with them.

This script, when I run it, generates an error.

AttributeError: 'list' object has no attribute 'split'

Then I change the third line to:

file_data=[line.split("\t") for line in f.readlines()]

And it no longer gives me an error, but simply does nothing at all. It just starts and then stops.

Any help? :) Remember I'm dumb with Python.

I should have added that my actual data has decimals, not integers. I have edited the data above to reflect that.

Solution

The issue here is that you are calling readlines() twice: the first call reads all of the data, so nothing is left for the second call to return.
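A minimal sketch of that behaviour. It uses an in-memory io.StringIO object as a stand-in for the real file (an assumption made only so the snippet runs on its own); an object returned by open() is exhausted in the same way.

```python
import io

# Stand-in for an open file; the sample content is made up for illustration.
f = io.StringIO("Name\tDate\tAge\nRay\tMay\t25.1\n")

first = f.readlines()   # consumes the whole stream
second = f.readlines()  # the position is already at the end, so nothing is left

print(len(first))  # 2
print(second)      # []
```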

You can iterate directly over the file without using readlines() - in fact, this is the better way, as it doesn't read the whole file in at once.
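As a rough sketch of iterating over the file object line by line, the snippet below filters on the third tab-separated field. The helper name rows_under, the column position, and the in-memory sample data are assumptions for illustration, not part of the original script.

```python
import io

def rows_under(lines, limit=23.0):
    """Yield each line whose third tab-separated field (the age) is below limit."""
    lines = iter(lines)
    next(lines, None)  # skip the header row, if there is one
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if float(fields[2]) < limit:
            yield line

# io.StringIO stands in for open("addressbook1.txt"); both yield one line at a time.
sample = io.StringIO(
    "Name\tDate\tAge\tSex\tColor\n"
    "Ray\tMay\t25.1\tM\tGray\n"
    "Alex\tApr\t22.3\tF\tGreen\n"
    "Ann\tJun\t15.7\tF\tBlue\n"
)
selected = list(rows_under(sample))
```

Because the lines are consumed one at a time, only the current line needs to be held in memory.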

While you could do what you are trying to do by using str.split() as you have, the better option is to use the csv module, which is designed for the task.

import csv

# newline="" lets the csv module handle line endings itself.
with open("addressbook1.txt", newline="") as infile, \
     open("college_age.txt", "w", newline="") as college, \
     open("adult_age.txt", "w", newline="") as adult:
    reader = csv.DictReader(infile, dialect="excel-tab")
    fieldnames = reader.fieldnames  # column names taken from the header row
    writer_college = csv.DictWriter(college, fieldnames, dialect="excel-tab")
    writer_adult = csv.DictWriter(adult, fieldnames, dialect="excel-tab")
    writer_college.writeheader()
    writer_adult.writeheader()
    for row in reader:
        # float(), not int(): the ages contain decimals such as 22.3
        if float(row["Age"]) < 23:
            writer_college.writerow(row)
        else:
            writer_adult.writerow(row)

So what are we doing here? First of all we use the with statement for opening files. It's not only more pythonic and readable but handles closing for you, even when exceptions occur.

Next we create a DictReader that reads rows from the file as dictionaries, automatically using the first row as the field names. We then make writers to write back to our split files, and write the headers in. Using the DictReader is a matter of preference. It's generally used more where you access the data a lot (and when you don't know the order of the columns), but it makes the code nice and readable here. You could, however, just use a standard csv.reader().
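For comparison, here is a sketch of the same filter written with a plain csv.reader(). The in-memory io.StringIO objects replace the real files only to keep the example self-contained, and looking up the column position with header.index("Age") is one possible choice, not something the original answer prescribes.

```python
import csv
import io

# In-memory stand-ins for the input file and for college_age.txt.
source = io.StringIO(
    "Name\tDate\tAge\tSex\tColor\n"
    "Ray\tMay\t25.1\tM\tGray\n"
    "Alex\tApr\t22.3\tF\tGreen\n"
    "Ann\tJun\t15.7\tF\tBlue\n"
)
college = io.StringIO()

reader = csv.reader(source, dialect="excel-tab")
writer = csv.writer(college, dialect="excel-tab")

header = next(reader)            # the first row holds the column names
age_index = header.index("Age")  # find the Age column by name
writer.writerow(header)

for row in reader:               # each row is a list of strings
    if float(row[age_index]) < 23:
        writer.writerow(row)
```

With csv.reader() the rows are lists, so the column has to be addressed by position; DictReader trades a little overhead for access by name.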

Next we loop through the rows in the file, checking the age (which we convert to a number so we can do a numerical comparison; the ages contain decimals, so this needs float() rather than int()) to know which file to write to. The with statement closes our files for us.
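A quick check of why the choice of conversion function matters for decimal ages like those in the question. The variable names are illustrative only.

```python
age_text = "22.3"  # a value as it comes out of the file: always a string

try:
    int(age_text)
    int_failed = False
except ValueError:
    # int() rejects strings that contain a decimal point
    int_failed = True

# float() accepts both "22.3" and "22", so it is the safer conversion here
is_college = float(age_text) < 23
```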

For multiple input files:

import csv

fieldnames = ["Name", "Date", "Age", "Sex", "Color"]
filenames = ["addressbook1.txt", "addressbook2.txt", ...]

# newline="" lets the csv module handle line endings itself.
with open("college_age.txt", "w", newline="") as college, \
     open("adult_age.txt", "w", newline="") as adult:
    writer_college = csv.DictWriter(college, fieldnames, dialect="excel-tab")
    writer_adult = csv.DictWriter(adult, fieldnames, dialect="excel-tab")
    writer_college.writeheader()
    writer_adult.writeheader()
    for filename in filenames:
        with open(filename, newline="") as infile:
            reader = csv.DictReader(infile, dialect="excel-tab")
            for row in reader:
                # float(), not int(): the ages contain decimals such as 22.3
                if float(row["Age"]) < 23:
                    writer_college.writerow(row)
                else:
                    writer_adult.writerow(row)

We just add a loop to work over multiple files. Please note that I also added a list of field names. Previously I just used the fields and their order from the file, but as we now have multiple files, I figured it would be more sensible to list them explicitly here. An alternative would be to use the first file to get the field names.
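A sketch of that alternative, which also writes only the under-23 rows to a single output file, as the question asked. The function name collect_under and the temporary demo files are hypothetical. It assumes every input file has a header row with the same columns, since DictWriter raises ValueError for keys it does not know.

```python
import csv
import os
import tempfile

def collect_under(filenames, out_path, limit=23.0):
    """Write rows with Age < limit from every input file to out_path."""
    with open(out_path, "w", newline="") as out:
        writer = None
        for filename in filenames:
            with open(filename, newline="") as infile:
                reader = csv.DictReader(infile, dialect="excel-tab")
                if writer is None:
                    # Take the column names from the first file only.
                    writer = csv.DictWriter(out, reader.fieldnames, dialect="excel-tab")
                    writer.writeheader()
                for row in reader:
                    if float(row["Age"]) < limit:
                        writer.writerow(row)

# Demo with two small made-up files in a temporary directory.
tmp = tempfile.mkdtemp()
paths = []
for i, body in enumerate(["Ray\tMay\t25.1\tM\tGray\n", "Ann\tJun\t15.7\tF\tBlue\n"]):
    path = os.path.join(tmp, f"addressbook{i + 1}.txt")
    with open(path, "w", newline="") as fh:
        fh.write("Name\tDate\tAge\tSex\tColor\n" + body)
    paths.append(path)

out_path = os.path.join(tmp, "college_age.txt")
collect_under(paths, out_path)
```

The header is written once, when the first file is opened, so the output stays a single well-formed table however many inputs are processed.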
