Parse a plain text file into a CSV file using Python


Problem description

I have a series of HTML files that are parsed into a single text file using Beautiful Soup. The HTML files are formatted such that their output is always three lines within the text file, so the output will look something like:

Hello!
How are you?
Well, Bye!

But it could just as easily be

83957
And I ain't coming back!
hgu39hgd

In other words, the contents of the HTML files are not really standard across each of them, but they do always produce three lines.

So, I was wondering where I should start if I want to then take the text file that is produced from Beautiful Soup and parse that into a CSV file with columns such as (using the above examples):

Title   Intro   Tagline
Hello!    How are you?    Well, Bye!
83957    And I ain't coming back!    hgu39hgd

The Python code for stripping the HTML from the text files is this:

import os
import glob
import codecs
import csv
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(codecs.open(markup, "r", "utf-8").read())
    with open("extracted.txt", "a") as myfile:
        myfile.write(soup.get_text())

And I gather I can use this to set up the columns in my CSV file:

csv.put_HasColumnNames(True)

csv.SetColumnName(0,"title")
csv.SetColumnName(1,"intro")
csv.SetColumnName(2,"tagline")

Where I'm drawing a blank is how to iterate through the text file (extracted.txt) one line at a time and, as I get to a new line, set it to the correct cell in the CSV file. The first several lines of the file are blank, and there are many blank lines between each grouping of text. So, first I would need to open the file and read it:

file = open("extracted.txt")

for line in file:
    pass # csv.SetCell(0,0 X) (obviously, I don't know what to put in X)

Also, I don't know how to tell Python to just keep reading the file and adding to the CSV file until it's finished. In other words, there's no way to know exactly how many total lines will be in the HTML files, so I can't just go from csv.SetCell(0,0) to csv.SetCell(999,999).

Solution

I'm not entirely sure what CSV library you're using, but it doesn't look like Python's built-in one. Anyway, here's how I'd do it:

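import csv

with open('extracted.txt', 'r') as in_file:
    # Strip leading/trailing whitespace (including the newline) from each line.
    stripped = (line.strip() for line in in_file)
    # Drop lines that are empty after stripping.
    lines = (line for line in stripped if line)
    # Group every three consecutive lines. zip is lazy in Python 3;
    # on Python 2, use itertools.izip instead.
    grouped = zip(*[lines] * 3)
    # newline='' lets the csv module control line endings (Python 3).
    with open('extracted.csv', 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('title', 'intro', 'tagline'))
        writer.writerows(grouped)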

This forms a pipeline of sorts. It reads lines from the file, strips leading and trailing whitespace from each line, drops any lines that are then empty, groups the remainder into threes, and then (after writing the CSV header) writes those groups to the CSV file.
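
The grouping step works because the same iterator object is handed to the zip function three times, so each output tuple consumes three consecutive lines. Here is a minimal sketch of that idea on in-memory data, assuming Python 3; the helper name group_lines and the sample text (including the stray "leftover" line) are made up for illustration:

```python
import io

def group_lines(source, size=3):
    """Return an iterator of tuples of `size` consecutive non-blank lines."""
    stripped = (line.strip() for line in source)
    non_blank = (line for line in stripped if line)
    # The list holds the SAME generator `size` times, so zip pulls
    # `size` consecutive items from it for every tuple it builds.
    return zip(*[non_blank] * size)

# Hypothetical sample: blank lines scattered around two full groups,
# plus one trailing line that does not fill a group.
sample = io.StringIO(
    "\n\nHello!\n\nHow are you?\nWell, Bye!\n\n\n"
    "83957\nAnd I ain't coming back!\nhgu39hgd\n\nleftover\n"
)
rows = list(group_lines(sample))
print(rows)
```

One thing to be aware of: zip stops as soon as it cannot fill a complete tuple, so a trailing group with fewer than three lines is silently dropped. If that matters, itertools.zip_longest pads the last group instead.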

To combine the last two columns as you mentioned in the comments, change the writerow call so the header has two column names instead of three, and change the writerows call to:

writer.writerows((title, intro + tagline) for title, intro, tagline in grouped)
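
Note that intro + tagline joins the two strings with nothing between them. If a separator is wanted, something along these lines might do, assuming Python 3; the function name merge_last_two, the single-space separator, and the "text" header are illustrative choices rather than part of the answer above:

```python
import csv
import io

def merge_last_two(grouped, sep=" "):
    """Yield (title, intro + sep + tagline) for each three-item group."""
    for title, intro, tagline in grouped:
        yield (title, intro + sep + tagline)

# Sample groups taken from the question's example output.
grouped = [
    ("Hello!", "How are you?", "Well, Bye!"),
    ("83957", "And I ain't coming back!", "hgu39hgd"),
]

out = io.StringIO()  # stands in for the real output file
writer = csv.writer(out)
writer.writerow(("title", "text"))  # hypothetical name for the merged column
writer.writerows(merge_last_two(grouped))
print(out.getvalue())
```

The csv writer quotes any field that contains a comma, so merged text such as "How are you? Well, Bye!" still lands in a single column.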
