Write series of strings (plus a number) to a line of csv


Question


It's not pretty code, but I have some code that grabs a series of strings out of an HTML file and gives me a series of strings: author, title, date, length, text. I have 2000+ html files and I want to go through all of them and write this data to a single csv file. I know all of this will have to be wrapped into a for loop eventually, but before then I am having a hard time understanding how to go from getting these values to writing them to a csv file. My thinking was to create a list or a tuple first and then write that to a line in a csv file:

import csv
import re
from bs4 import BeautifulSoup as soup

the_file = "/Users/john/Code/tedtalks/test/transcript?language=en.0"
holding = soup(open(the_file).read(), "lxml")
at = holding.find("title").text
author = at[0:at.find(':')]
title = at[at.find(":")+1 : at.find("|")]
date = re.sub('[^a-zA-Z0-9]', ' ', holding.select_one("span.meta__val").text)
length_data = holding.find_all('data', {'class': 'talk-transcript__para__time'})
(m, s) = ([x.get_text().strip("\n\r")
           for x in length_data if re.search(r"(?s)\d{2}:\d{2}",
                                             x.get_text().strip("\n\r"))][-1]).split(':')
length = int(m) * 60 + int(s)
firstpass = re.sub(r'\([^)]*\)', '', holding.find('div', class_='talk-transcript__body').text)
text = re.sub('[^a-zA-Z\.\']', ' ', firstpass)
data = ([author].join() + [title] + [date] + [length] + [text])
with open("./output.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for line in data:
        writer.writerow(line)


I can't for the life of me figure out how to get Python to respect the fact that these are strings and should be stored as strings and not as lists of letters. (The .join() above is me trying to figure this out.)
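The behavior can be seen directly: csv.writer.writerow expects an iterable of fields, and a Python string is an iterable of its characters, which is why a bare string gets exploded into one-letter columns. A minimal demonstration:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# A bare string is iterated character by character: three fields a,b,c.
writer.writerow("abc")
# A list of values gives one field per element.
writer.writerow(["abc", 42])
print(buf.getvalue())
```

So the fix for the snippet above is to pass a flat list of values, e.g. `writer.writerow([author, title, date, length, text])`, rather than joining or concatenating the strings first.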


Looking ahead: is it better/more efficient to handle 2000 files this way, stripping them down to what I want and writing one line of the CSV at a time, or is it better to build a data frame in pandas and then write that to CSV? (All 2000 files = 160MB, so stripped down the eventual data can't be more than 100MB; size is no great concern here, but it may eventually become an issue.)
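For comparison, here is a sketch of the pandas route: collect one dict per parsed file, then write everything with a single to_csv call. The tuple below is a placeholder standing in for the values one parsed file would yield, not real data. At this size (well under 100MB) either approach works; writing row by row keeps memory flat, while the DataFrame version keeps everything in memory but is convenient for later analysis.

```python
import pandas as pd

records = []
# Placeholder record standing in for one parsed HTML file.
for author, title, date, length, text in [
        ("Jane Doe", "A Sample Talk", "Jan 2016", 754, "Hello world.")]:
    records.append({"author": author, "title": title, "date": date,
                    "length": length, "text": text})

df = pd.DataFrame(records, columns=["author", "title", "date", "length", "text"])
df.to_csv("output.csv", index=False)
```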

Answer


This will grab all the files and put the data into a csv; you just need to pass the path to the folder that contains the html files and the name of your output file:

import re
import csv
from bs4 import BeautifulSoup
from glob import iglob


def parse(soup):
    # title and author can be parsed from separate tags.
    author = soup.select_one("h4.h12.talk-link__speaker").text
    title = soup.select_one("h4.h9.m5").text
    # just need to strip the text from the date string, no regex needed.
    date = soup.select_one("span.meta__val").text.strip()
    # we want the last time which is the talk-transcript__para__time previous to the footer.
    mn, sec = map(int, soup.select_one("footer.footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text.split(":"))
    length = (mn * 60 + sec)
    # to ignore time etc.. we can just pull from the actual text fragment and remove noise i.e (Applause).
    text = re.sub(r'\([^)]*\)',"", " ".join(d.text for d in soup.select("span.talk-transcript__fragment")))
    return author.strip(), title.strip(), date, length, re.sub(r'[^a-zA-Z\.\']', ' ', text)

def to_csv(patt, out):
    # open file to write to.
    with open(out, "w") as out:
        # create csv.writer.
        wr = csv.writer(out)
        # write our headers.
        wr.writerow(["author", "title", "date", "length", "text"])
        # get all our html files.
        for html in iglob(patt):
            with open(html) as f:
                # parse the file and write the data as a row.
                wr.writerow(parse(BeautifulSoup(f, "lxml")))

to_csv("./test/*.html","output.csv")
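One detail worth adding (not part of the original answer): on Python 3, the output file should be opened with newline="" so the csv module controls the row terminators itself; otherwise, on Windows in particular, each row can be followed by a blank line. A minimal sketch of the adjusted open call:

```python
import csv

# newline="" lets csv.writer manage line endings itself; without it,
# Windows can end up with an extra blank line after every row.
with open("output.csv", "w", newline="", encoding="utf-8") as out:
    wr = csv.writer(out)
    wr.writerow(["author", "title", "date", "length", "text"])
```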

