Autodetect Presence of CSV Headers in a File

Question

Short question: How do I automatically detect whether a CSV file has headers in the first row?

Details: I've written a small CSV parsing engine that places the data into an object that I can access as (approximately) an in-memory database. The original code was written to parse third-party CSV with a predictable format, but I'd like to be able to use this code more generally.

I'm trying to figure out a reliable way to automatically detect the presence of CSV headers, so the script can decide whether to use the first row of the CSV file as keys / column names or start parsing data immediately. Since all I need is a boolean test, I could easily specify an argument after inspecting the CSV file myself, but I'd rather not have to (go go automation).

I imagine I'd have to parse the first 3 to ? rows of the CSV file and look for a pattern of some sort to compare against the headers. I'm having nightmares of three particularly bad cases in which:

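  1. The headers include numeric data for some reason,
  2. The first few rows (or large portions of the CSV) are empty, or
  3. The headers and the data look too similar to tell apart.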

If I can get a "best guess" and have the parser fail with an error or spit out a warning if it can't decide, that's OK. If this is something that's going to be tremendously expensive in terms of time or computation (and take more time than it's supposed to save me) I'll happily scrap the idea and go back to working on "important things".

I'm working with PHP, but this strikes me as more of an algorithmic / computational question than something that's implementation-specific. If there's a simple algorithm I can use, great. If you can point me to some relevant theory / discussion, that'd be great, too. If there's a giant library that does natural language processing or 300 different kinds of parsing, I'm not interested.

Recommended answer

As others have pointed out, you can't do this with 100% reliability. There are cases where getting it 'mostly right' is useful, however - for example, spreadsheet tools with CSV import functionality often try to figure this out on their own. Here are a few heuristics that would tend to indicate the first line isn't a header:

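  • The first row has columns that are not strings or are empty
  • The first row's columns are not all unique
  • The first row appears to contain dates or other common data formats (e.g., xx-xx-xx)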
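
As a rough illustration, the three heuristics above could be sketched as follows. This is only one possible reading, written in Python for brevity (the logic should carry over to PHP's fgetcsv with little change). The function names, the DATE_LIKE pattern, and the choice to treat any numeric-looking cell as "not a string" are assumptions made for this example, not part of the original answer.

```python
import csv
import io
import re

# Assumed pattern for "common data formats": dates such as 12-31-99 or 2024/01/31.
DATE_LIKE = re.compile(r"^\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}$")


def looks_numeric(cell):
    """Return True if float() can parse the cell (this also accepts 'nan' and 'inf')."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def first_row_is_probably_data(row):
    """Return the reasons (possibly none) why the row does not look like a header."""
    cells = [cell.strip() for cell in row]
    reasons = []
    # Heuristic 1: a column that is empty or is not a plain string.
    if any(cell == "" or looks_numeric(cell) for cell in cells):
        reasons.append("empty or non-string column")
    # Heuristic 2: header names are normally unique.
    if len(set(cells)) != len(cells):
        reasons.append("duplicate column values")
    # Heuristic 3: a cell that looks like a date.
    if any(DATE_LIKE.match(cell) for cell in cells):
        reasons.append("date-like value")
    return reasons


def has_header(text):
    """Best guess: True when no heuristic flags the first row of the CSV text."""
    first_row = next(csv.reader(io.StringIO(text)), None)
    if first_row is None:
        raise ValueError("empty CSV input")
    return not first_row_is_probably_data(first_row)
```

Returning the list of reasons rather than a bare boolean makes it easy for a parser to emit a warning when the guess is shaky, which matches the "best guess, or warn" behavior the question asks for. Note that none of this helps with the third nightmare case, where headers and data are genuinely indistinguishable.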
