Looking for a non-cloud RDBMS to import partitioned tables (in CSV format) with their directory structure


Problem description

Context: I have been working with Cloudera/Impala in order to use a big database and create more manageable "aggregate" tables which contain substantially less information. These more manageable tables are on the order of tens to hundreds of gigabytes, and there are about two dozen of them. In total I am looking at about 500 gigabytes of data, which will fit on a single computer in my lab.

Question: I wish to use a non-cloud RDBMS in order to keep working on these tables locally in my lab. The original Impala tables, most of them partitioned by date, have been exported to CSV in such a way that each "table" folder contains one subfolder per date, and each subfolder contains a single CSV file (in which the partitioned "date" column is absent, since its value is encoded in the dated subfolder's name). Which RDBMS would be adequate, and how would I import these tables?

What I've found so far: there seem to be several GUIs or commands for MySQL that simplify importing, e.g.:

However, these do not address my specific situation, since (1) I only have access to Impala on the cluster, i.e. I cannot add any tools there, so the heavy lifting must be done on the lab computer, and (2) they say nothing about importing an already partitioned table with its existing directory/partition structure.

Constraints:

  • The lab computer runs Ubuntu 20.04.
  • Ideally, I would like to avoid loading each CSV/partition by hand, since I have tens of thousands of dates. I am hoping for an RDBMS that recognizes the partitioned directory structure out of the box...
  • The RDBMS itself should have a reasonably recent set of functions available, including the lead/lag/first/last window functions (a hypothetical example of the kind of query I mean follows this list). Beyond that, it does not have to be too fancy.
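To make that last requirement concrete, here is the kind of window-function query the destination system would have to support; the table and column names (table1, thedate, price) are made up for illustration:

```sql
-- Illustrative only: table1, thedate and price are hypothetical names.
SELECT
  thedate,
  price,
  LAG(price)  OVER (ORDER BY thedate) AS prev_price,
  LEAD(price) OVER (ORDER BY thedate) AS next_price,
  FIRST_VALUE(price) OVER (ORDER BY thedate) AS first_price,
  LAST_VALUE(price)  OVER (ORDER BY thedate
                           ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND UNBOUNDED FOLLOWING) AS last_price
FROM table1;
```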

I'm open to using Spark as an "overkill SQL engine" if that's the best way; I'm just not too sure whether that is the best approach for a single computer (not a cluster). Also, if need be (though I would ideally like to avoid this), I can export my Impala tables in another format in order to ease the import phase, e.g. a different format for text-based tables, parquet, etc.

Edit 1: As suggested in the comments, I am currently looking at Apache Drill. It is correctly installed, and I have successfully run the basic queries from the documentation/tutorials. However, I am now stuck on how to actually "import" my tables (really, I only need to "use" them, since Drill seems able to run queries directly on the filesystem). To clarify:

  • I currently have two "tables" in the directories /data/table1 and /data/table2.
  • These directories contain subdirectories corresponding to the different partitions, e.g. /data/table1/thedate=1995, /data/table1/thedate=1996, etc., and likewise for table2.
  • In each subdirectory there is a single file (with no extension) containing CSV data, without a header.

My understanding (I'm still new to Apache Drill) is that I need to create a file-system storage plugin so that Drill knows where to look and what it is looking at, so I created a fairly basic plugin (a quasi copy/paste from this one) using the web interface's Plugin Management page. The net result is that I can now type use data; and Drill understands it. I can then run show files in data and it correctly lists table1 and table2 as my two directories. Unfortunately, I am still missing two key things to be able to query these tables successfully:

  1. Running select * from data.table1 fails with an error. I have also tried table1 and dfs.data.table1, and I get a different error for each command ("object 'data' not found", "object 'table1' not found", and "schema [[dfs,data]] is not valid with respect to either root schema or current default schema", respectively). I suspect this is because there are sub-directories within table1?
  2. I still have not said anything about the structure of the CSV files, and that structure would need to incorporate the fact that the "thedate" field and its value live in the sub-directory name... (see the sketch after this list).
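For reference, a minimal sketch of how Drill can usually be pointed at header-less delimited files and partition directories. It assumes the dfs.data workspace points at /data and that its defaultInputFormat is set to a comma-delimited text format (since the files have no extension); the column positions and names are hypothetical:

```sql
-- Hedged sketch: header-less text files are exposed through the `columns` array,
-- and dir0 holds the first directory level below the table path, e.g. 'thedate=1995'.
SELECT
  CAST(SUBSTR(dir0, 9) AS INT) AS thedate,   -- strip the 'thedate=' prefix
  columns[0]                   AS some_field,
  CAST(columns[1] AS DOUBLE)   AS some_value
FROM dfs.data.`table1`;
```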

Edit 2: After trying a bunch of things, still no luck using text-based files; however, using parquet files worked:

  • I can query a parquet file

  • I can query a directory containing a partitioned table, each subdirectory being named in the format thedate=1995, thedate=1996, etc., as stated earlier (an illustrative query follows this list).
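For instance, a query of roughly this shape runs directly against the partitioned directory (the path is illustrative), with dir0 exposing the partition folder name:

```sql
-- Illustrative path: the directory contains thedate=1995, thedate=1996, ... subfolders
-- of parquet files; dir0 returns the first-level directory name for each row.
SELECT dir0, COUNT(*) AS n_rows
FROM dfs.`/data/table1_parquet_partitioned`
GROUP BY dir0;
```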

I used the advice here in order to be able to query a table the usual way, i.e. using thedate rather than dir0. Essentially, I created a view:

create view drill.test as select dir0 as thedate, * from dfs.data/table1_parquet_partitioned

Unfortunately, thedate is now text that reads thedate=1994, rather than just 1994 (int). So I renamed the directories to contain only the date; however, this was not a good solution either, as the type of thedate was still not an int and therefore I could not use it to join with table2 (which has thedate as a regular column). So finally, what I did was cast thedate to an int in the view (a sketch follows).
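A minimal sketch of what that final view looks like, assuming (as described above) that the subdirectories were renamed to hold only the date value; the view name and path mirror the statement above but should be treated as illustrative:

```sql
-- Hedged sketch: the subfolders are now named 1994, 1995, ... so dir0 holds the bare
-- date and can be cast directly. 'drill' is assumed to be a writable workspace
-- (dfs.tmp is the usual out-of-the-box choice).
CREATE VIEW drill.test AS
SELECT CAST(dir0 AS INT) AS thedate, *
FROM dfs.`/data/table1_parquet_partitioned`;
```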

=> This is all fine: although these are not CSV files, this alternative is doable for me. However, I am wondering whether, by using such a view with a cast inside, I will still benefit from partition pruning. The answer in the referenced Stack Overflow link suggests that partition pruning is preserved by the view, but I am unsure whether that holds when the column is used in a formula... Finally, given that the only way I can make this work is via parquet, it begs the question: is Drill the best solution for this in terms of performance? So far I like it, but migrating the database will be time-consuming, and I would like to choose the best destination without too much trial and error...

Answer

I ended up using Spark. The only alternative I currently know about, which was brought to my attention by Simon Darr (whom I wish to thank again!), is Apache Drill. Pros and cons of each solution, as far as I could test:

  • Neither solution offers a simple way to import the existing schema when the database is exported as text (in my case, CSV files).
  • Both solutions import the schema correctly from parquet files, so I have decided I must recreate my tables in parquet format from my source cluster (which uses Impala).
  • The remaining problem concerns the partitioning: I was at long last able to figure out how to import partitioned files on Spark, and the process of adding that partition dimension is seamless (I got help from here and here for that part), whereas I was not able to find a way to do this convincingly using Drill (although creating a view, as suggested here, does help somewhat):
    • On Spark, I used: spark.sql("select * from parquet.`file:///mnt/data/SGDATA/sliced_liquidity_parq_part/`"). Note that it is important not to use the * wildcard, as I first did, because with the wildcard each parquet file is read without looking at the directory it belongs to, so the directory structure is not taken into account for the partitioning and those fields are not added to the schema. Without the wildcard, a directory name with the syntax field_name=value is correctly added to the schema, and the value types themselves are correctly inferred (in my case int, because I use the thedate=intvalue syntax). A short sketch follows this list.
    • On Drill, the trick of creating a view is a bit messy, since it involves, first, taking a substring of dir0 in order to extract the field name and value, and second, a cast in order to give that field the correct type in the schema. I am really not certain this sort of view would enable partition pruning for queries run thereafter, so I was not fond of this hack. NB: there is likely another way to do this properly; I simply haven't found it.
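A minimal sketch of the Spark side, expressed in Spark SQL; the second table's path is hypothetical, and only the thedate partition column is assumed to exist in both tables:

```sql
-- Hedged sketch: reading the partitioned directory (no wildcard) makes Spark discover
-- thedate from the thedate=... subdirectory names and add it, typed as int, to the schema.
CREATE OR REPLACE TEMPORARY VIEW table1 AS
SELECT * FROM parquet.`file:///mnt/data/SGDATA/sliced_liquidity_parq_part/`;

-- Hypothetical path for the second table.
CREATE OR REPLACE TEMPORARY VIEW table2 AS
SELECT * FROM parquet.`file:///mnt/data/SGDATA/table2_parq_part/`;

-- A filter on the partition column should let Spark prune to the matching subfolders.
SELECT t1.thedate, COUNT(*) AS n
FROM table1 t1
JOIN table2 t2 ON t1.thedate = t2.thedate
WHERE t1.thedate BETWEEN 1995 AND 2000
GROUP BY t1.thedate;
```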

Along the way I learned about Drill (which seems great for logs and data that don't have a known structure), and learned that Spark can do a lot of what Drill does when the data is structured (I had no idea it could read CSV or parquet files directly without an underlying database system). I also did not know that Spark was so easy to install on a standalone machine: after following the steps here, I simply created a script in my bashrc that launches the master, a worker, and the shell all in one go (although I cannot comment on the performance of using a standalone computer for this; perhaps Spark is bad at it). Having used Spark a bit in the past, this solution still seems best for me given my options. If there are any other solutions out there, keep them coming, as I won't accept my own answer just yet (I need a few days to convert all my tables to parquet anyway).
