Indexing columns in a CSV file


Problem description

I have a large CSV file in which each row has several columns, such as ID, username, email, job position, etc.

I want to search for rows by exact match (username == David) or by wildcard (jobPosition == %admin).

I want to index columns in this file to make searches faster, but I don't know which algorithm I should choose (especially for wildcards).

Recommended answer

You can index the file, but you need to read it as a binary file instead of a text file. Use a block size of 128 or 256 bytes. To build the index, scan the file looking for the beginning of each record and then create an index file like this:

  key, 0, 0
   ........
   ........
  key, block, offset

key is the key you are indexing on; it can be a composite key. block is the number of the block in which the record starts (be aware that a record can span more than one block), and offset is a number between 0 and block-size-1 giving the record's position within that block, so the record begins at byte block * block-size + offset. To retrieve a record, look up the key in the index file (perhaps using binary search) and then use the block and offset to access the record directly.
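As a rough illustration of the scheme above, here is a minimal Python sketch. The names `build_index`, `lookup` and `BLOCK_SIZE` are made up for this example, and it makes several assumptions that are not in the answer: the file is UTF-8, it has a header row, no quoted field contains a line break, and the index is kept in memory as a sorted list rather than written to an index file.

```python
import bisect
import csv

BLOCK_SIZE = 256  # assumed block size; the answer suggests 128 or 256


def build_index(path, key_column):
    """Return a sorted list of (key, block, offset), one entry per data record."""
    entries = []
    with open(path, "rb") as f:      # binary mode, so positions are byte offsets
        f.readline()                 # assumption: the first line is a header row
        while True:
            pos = f.tell()           # byte position where this record starts
            line = f.readline()
            if not line:
                break                # end of file
            if not line.strip():
                continue             # skip blank lines
            fields = next(csv.reader([line.decode("utf-8")]))
            block, offset = divmod(pos, BLOCK_SIZE)
            entries.append((fields[key_column], block, offset))
    entries.sort()                   # sort by key so binary search works
    return entries


def lookup(path, entries, key):
    """Return every record whose indexed column equals key exactly."""
    # (key,) sorts before any (key, block, offset), so this finds the first match.
    i = bisect.bisect_left(entries, (key,))
    results = []
    with open(path, "rb") as f:
        while i < len(entries) and entries[i][0] == key:
            _, block, offset = entries[i]
            f.seek(block * BLOCK_SIZE + offset)   # jump straight to the record
            line = f.readline().decode("utf-8")
            results.append(next(csv.reader([line])))
            i += 1
    return results
```

Calling `build_index("users.csv", 1)` once and then `lookup("users.csv", entries, "David")` would read only the matching records instead of scanning the whole file (`users.csv` is a hypothetical file name).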

You can also create multiple index files at the same time if you need to search on different criteria.

Having a distinct end-of-line character would help, but CR-LF will do. If you use CR-LF, be aware that the CR can fall at the exact end of one block while the LF falls at the very beginning of the next. Once you have created the index file (or files), sort it by key and you are good to go.
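The block-boundary caveat can be handled by remembering whether the previous byte was a CR when the next block is read. The sketch below is only one way to do it; `record_starts` is a hypothetical name, and it assumes records are terminated strictly by CR-LF (a lone CR or lone LF is treated as ordinary data).

```python
def record_starts(path, block_size=256):
    """Yield (block, offset) for the first byte of every record in the file."""
    with open(path, "rb") as f:
        block_no = 0
        prev_was_cr = False      # carried over from one block to the next
        at_record_start = True   # the first byte of the file starts a record
        while True:
            block = f.read(block_size)
            if not block:
                break
            for offset, byte in enumerate(block):
                if prev_was_cr:
                    prev_was_cr = False
                    if byte == 0x0A:
                        # LF completing a CR-LF, even if the CR ended the previous block
                        at_record_start = True
                        continue
                if at_record_start:
                    yield (block_no, offset)
                    at_record_start = False
                if byte == 0x0D:
                    prev_was_cr = True
            block_no += 1
```

Because `prev_was_cr` lives outside the per-block loop, a CR at the last byte of one block and an LF at the first byte of the next are still recognized as a single line ending.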

Alternatively, if your software allows fast memory block moves (like C++ memmove), you can use insertion sort in combination with binary search. That way, after you finish building your index(es), they are already sorted. This is particularly efficient if the index entries are being added from a file that is being captured from a slow input device (e.g. a keyboard). If you are managing a large number of records, consider using a B-Tree structure for your index(es).
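In Python, the insertion-with-binary-search idea is roughly what the standard `bisect.insort` function does: it finds the position by binary search and then inserts, shifting the later elements. A small sketch, where `add_entry` is a made-up helper and the entries follow the (key, block, offset) layout described earlier:

```python
import bisect


def add_entry(index, key, block, offset):
    """Insert one entry so that the list stays sorted by key at all times."""
    bisect.insort(index, (key, block, offset))


index = []
add_entry(index, "David", 0, 19)
add_entry(index, "Alice", 0, 42)
add_entry(index, "Carol", 1, 7)
# index is already in key order here; no separate sort step is needed
```

Each insert still has to shift elements, so this suits indexes that grow gradually; for very large indexes the B-Tree suggestion is the better fit.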

This scheme allows your CSV database to accept record additions, deletions and updates. Additions are made at the end of the file. To delete a record, overwrite its first character with a unique marker character such as 0x0 and, of course, delete its entry from the index file. Updates can be achieved by deleting the record and then adding the updated version at the end of the file.
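The three operations might look like the following sketch. The function names are hypothetical, and for brevity the index here is a plain dict mapping each key to the record's byte position (block * block-size + offset) rather than a sorted index file; keys are assumed to be unique.

```python
import os

TOMBSTONE = b"\x00"  # marker written over the first byte of a deleted record


def add_record(path, index, key, line):
    """Append a record (bytes, without line ending) and index its position."""
    with open(path, "ab") as f:
        f.seek(0, os.SEEK_END)       # make sure tell() reports the end of file
        pos = f.tell()
        f.write(line + b"\r\n")
    index[key] = pos
    return pos


def delete_record(path, index, key):
    """Mark the record as deleted in place and drop it from the index."""
    pos = index.pop(key)
    with open(path, "r+b") as f:     # read/write without truncating
        f.seek(pos)
        f.write(TOMBSTONE)           # overwrite only the first byte


def update_record(path, index, key, new_line):
    """Update = delete the old record, then append the new version."""
    delete_record(path, index, key)
    return add_record(path, index, key, new_line)
```

Nothing is ever moved inside the file, so the positions stored for all other records stay valid after any of these operations.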

This will create some need for garbage collection in your database, but most DBMSs, if not all, need it too. Periodically rebuild your index and get rid of the deleted records.
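A periodic cleanup pass could be sketched as below: copy every live record to a new file, skip records whose first byte is the 0x0 marker, and rebuild the index as you go. `compact` is a made-up name, and the sketch assumes UTF-8 text, no header row, no quoted commas, and an index held as a dict from key to byte position.

```python
import os


def compact(path, key_column=1):
    """Rewrite the file without deleted records and return a fresh index."""
    tmp_path = path + ".tmp"         # assumption: this temporary name is free
    index = {}
    with open(path, "rb") as src, open(tmp_path, "wb") as dst:
        for line in src:
            if line.startswith(b"\x00") or not line.strip():
                continue             # skip deleted records and blank lines
            pos = dst.tell()         # position of the record in the new file
            dst.write(line)
            fields = line.decode("utf-8").rstrip("\r\n").split(",")
            index[fields[key_column]] = pos
    os.replace(tmp_path, path)       # swap the compacted file into place
    return index
```

Writing to a temporary file and swapping it in at the end means the original file is left untouched if the pass is interrupted.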

Hope this helps.
