Python Extract data from file

Question

I have a text file that contains, say:

text1 text2 text text
text text text text

I want to first count the number of strings in the file (all delimited by spaces) and then output the first two of them (text1 text2).

Any ideas?

Thanks in advance for the help

Edit: This is what I have so far:

>>> f = open('test.txt')
>>> for line in f:
...     print line
...
text1 text2 text text text text hello
>>> words = line.split()
>>> words
['\xef\xbb\xbftext1', 'text2', 'text', 'text', 'text', 'text', 'hello']
>>> len(words)
7
>>> if len(words) > 2:
...     print "there are more than 2 words"

The first problem I have is that my text file contains: text1 text2 text text text

But when I print words, I get: ['\xef\xbb\xbftext1', 'text2', 'text', 'text', 'text', 'text', 'hello']

Where does the '\xef\xbb\xbf' come from?

Solution

To read a file line by line, just loop over the open file object in a for loop:

with open(filename) as f:
    for line in f:
        pass  # do something with line

To split a line by whitespace into a list of separate words, use str.split():

words = line.split()

To count the number of items in a Python list, use len(yourlist):

count = len(words)

To select the first two items from a Python list, use slicing:

firsttwo = words[:2]

I'll leave constructing the complete program to you, but you won't need much more than the above, plus an if statement to see if you already have your two words.
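For readers who want to check their own attempt, the pieces above could be combined roughly as follows. This is only a sketch, written for Python 3 (the session in the question is Python 2); the function name `summarize_words` and the sample lines are made up for illustration and are not part of the original answer.

```python
def summarize_words(lines):
    """Return (total word count, first two words) for an iterable of lines."""
    count = 0
    first_two = []
    for line in lines:
        words = line.split()          # split on any run of whitespace
        count += len(words)
        if len(first_two) < 2:        # still missing one or both words
            first_two.extend(words[:2 - len(first_two)])
    return count, first_two


if __name__ == "__main__":
    # Hypothetical sample input mirroring the two-line file in the question.
    sample = ["text1 text2 text text\n", "text text text text\n"]
    count, first_two = summarize_words(sample)
    print(count, first_two)
```

An open file object can be passed in place of the sample list, since iterating over it yields one line at a time.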

The three extra bytes you see at the start of your file are the UTF-8 BOM (Byte Order Mark). It marks the file as UTF-8 encoded, but UTF-8 has no byte-order ambiguity, so the mark is redundant; in practice it is mostly written by Windows tools.

You can remove it with:

import codecs
if line.startswith(codecs.BOM_UTF8):
    line = line[3:]
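The check above compares against a byte string, which is what Python 2 gives you for each line. In Python 3 the same idea applies only to data read in binary mode. A small sketch under that assumption (the helper name `strip_utf8_bom` is hypothetical):

```python
import codecs


def strip_utf8_bom(data):
    """Return data (bytes) without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        # Slice by the BOM's actual length instead of hard-coding 3.
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    raw = codecs.BOM_UTF8 + b"text1 text2 text\n"
    print(strip_utf8_bom(raw).split())
```

Only the first line of a file can carry the BOM, so the check is needed once, not on every line.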

You may want to decode your strings to unicode using that encoding:

line = line.decode('utf-8')

You could also open the file using codecs.open():

f = codecs.open(filename, encoding='utf-8')

Note that codecs.open() will not strip the BOM for you; the easiest way to do that is to use .lstrip():

import codecs
BOM = codecs.BOM_UTF8.decode('utf8')
with codecs.open(filename, encoding='utf-8') as f:
    for line in f:
        line = line.lstrip(BOM)
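An alternative not mentioned in the answer is Python's built-in `utf-8-sig` codec, which decodes UTF-8 and silently drops a BOM at the start of the stream if one is present. The sketch below assumes Python 3 (`io.open` is also available in Python 2.6+); the helper name `read_words` and the temporary file are illustrative only.

```python
import io
import os
import tempfile


def read_words(path):
    """Read a UTF-8 text file and return its words, ignoring a leading BOM."""
    # 'utf-8-sig' decodes UTF-8 and drops a BOM at the start of the file, if any.
    with io.open(path, encoding="utf-8-sig") as f:
        return f.read().split()


if __name__ == "__main__":
    # Write a throwaway file that starts with a BOM, then read it back.
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\xef\xbb\xbftext1 text2 text\n")
        print(read_words(path))
    finally:
        os.remove(path)
```

With this codec there is no need to call `.lstrip()` on each line, and files without a BOM are read unchanged.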
