Looking for a non-cloud RDBMS to import partitioned tables (in CSV format) with their directory structure


Problem description

Context: I have been working with Cloudera/Impala in order to use a big database and create more manageable "aggregate" tables which contain substantially less information. These more manageable tables are on the order of tens to hundreds of gigabytes, and there are about two dozen of them. In total I am looking at about 500 gigabytes of data, which will fit on a single computer in my lab.

Question: I wish to use a non-cloud RDBMS in order to keep working on these tables locally in my lab. The original Impala tables, most of them partitioned by date, have been exported to CSV in such a way that each "table" folder contains one subfolder per date, and each subfolder contains a single CSV file (in which the partitioned "date" column is absent, since its value is encoded in the dated subfolder's name). Which RDBMS would be adequate, and how would I import these tables?

What I've found so far: there seem to be several GUIs or commands for MySQL that simplify importing, e.g.:

However, these do not address my specific situation, since (1) I only have access to Impala on the cluster, i.e. I cannot add any tools there, so the heavy lifting must be done on the lab computer, and (2) they say nothing about importing an already partitioned table with its existing directory/partition structure.

Constraints:

  • The lab computer runs Ubuntu 20.04.
  • Ideally, I would like to avoid loading each CSV/partition by hand, since I have tens of thousands of dates. I am hoping for an RDBMS that recognizes the partitioned directory structure out of the box...
  • The RDBMS itself should have a reasonably recent set of functions available, including the lead/lag/first/last window functions (a hypothetical example of the kind of query I mean follows this list). Beyond that, it does not have to be too fancy.
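To make that last requirement concrete, here is the kind of window-function query the destination system would have to support; the table and column names (table1, thedate, price) are made up for illustration:

```sql
-- Illustrative only: table1, thedate and price are hypothetical names.
SELECT
  thedate,
  price,
  LAG(price)  OVER (ORDER BY thedate) AS prev_price,
  LEAD(price) OVER (ORDER BY thedate) AS next_price,
  FIRST_VALUE(price) OVER (ORDER BY thedate) AS first_price,
  LAST_VALUE(price)  OVER (ORDER BY thedate
                           ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND UNBOUNDED FOLLOWING) AS last_price
FROM table1;
```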

I'm open to using Spark as an "overkill SQL engine" if that's the best way; I'm just not too sure whether that is the best approach for a single computer (not a cluster). Also, if need be (though I would ideally like to avoid this), I can export my Impala tables in another format in order to ease the import phase, e.g. a different format for text-based tables, parquet, etc.

Edit 1: As suggested in the comments, I am currently looking at Apache Drill. It is correctly installed, and I have successfully run the basic queries from the documentation/tutorials. However, I am now stuck on how to actually "import" my tables (really, I only need to "use" them, since Drill seems able to run queries directly on the filesystem). To clarify:

  • I currently have two "tables" in the directories /data/table1 and /data/table2.
  • These directories contain subdirectories corresponding to the different partitions, e.g. /data/table1/thedate=1995, /data/table1/thedate=1996, etc., and likewise for table2.
  • In each subdirectory there is a single file (with no extension) containing CSV data, without a header.

My understanding (I'm still new to Apache Drill) is that I need to create a file-system storage plugin so that Drill knows where to look and what it is looking at, so I created a fairly basic plugin (a quasi copy/paste from this one) using the web interface's Plugin Management page. The net result is that I can now type use data; and Drill understands it. I can then run show files in data and it correctly lists table1 and table2 as my two directories. Unfortunately, I am still missing two key things to be able to query these tables successfully:

  1. Running select * from data.table1 fails with an error. I have also tried table1 and dfs.data.table1, and I get a different error for each command ("object 'data' not found", "object 'table1' not found", and "schema [[dfs,data]] is not valid with respect to either root schema or current default schema", respectively). I suspect this is because there are sub-directories within table1?
  2. I still have not said anything about the structure of the CSV files, and that structure would need to incorporate the fact that the "thedate" field and its value live in the sub-directory name... (see the sketch after this list).
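For reference, a minimal sketch of how Drill can usually be pointed at header-less delimited files and partition directories. It assumes the dfs.data workspace points at /data and that its defaultInputFormat is set to a comma-delimited text format (since the files have no extension); the column positions and names are hypothetical:

```sql
-- Hedged sketch: header-less text files are exposed through the `columns` array,
-- and dir0 holds the first directory level below the table path, e.g. 'thedate=1995'.
SELECT
  CAST(SUBSTR(dir0, 9) AS INT) AS thedate,   -- strip the 'thedate=' prefix
  columns[0]                   AS some_field,
  CAST(columns[1] AS DOUBLE)   AS some_value
FROM dfs.data.`table1`;
```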

Edit 2: After trying a bunch of things, still no luck using text-based files; however, using parquet files worked:

  • I can query a parquet file

  • I can query a directory containing a partitioned table, each subdirectory being named in the format thedate=1995, thedate=1996, etc., as stated earlier (an illustrative query follows this list).
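For instance, a query of roughly this shape runs directly against the partitioned directory (the path is illustrative), with dir0 exposing the partition folder name:

```sql
-- Illustrative path: the directory contains thedate=1995, thedate=1996, ... subfolders
-- of parquet files; dir0 returns the first-level directory name for each row.
SELECT dir0, COUNT(*) AS n_rows
FROM dfs.`/data/table1_parquet_partitioned`
GROUP BY dir0;
```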

I used the advice here in order to be able to query a table the usual way, i.e. using thedate rather than dir0. Essentially, I created a view:

create view drill.test as select dir0 as thedate, * from dfs.data/table1_parquet_partitioned

Unfortunately, thedate is now text that reads thedate=1994, rather than just 1994 (int). So I renamed the directories to contain only the date; however, this was not a good solution either, as the type of thedate was still not an int and therefore I could not use it to join with table2 (which has thedate as a regular column). So finally, what I did was cast thedate to an int in the view (a sketch follows).
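A minimal sketch of what that final view looks like, assuming (as described above) that the subdirectories were renamed to hold only the date value; the view name and path mirror the statement above but should be treated as illustrative:

```sql
-- Hedged sketch: the subfolders are now named 1994, 1995, ... so dir0 holds the bare
-- date and can be cast directly. 'drill' is assumed to be a writable workspace
-- (dfs.tmp is the usual out-of-the-box choice).
CREATE VIEW drill.test AS
SELECT CAST(dir0 AS INT) AS thedate, *
FROM dfs.`/data/table1_parquet_partitioned`;
```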

=> This is all fine: although these are not CSV files, this alternative is doable for me. However, I am wondering whether, by using such a view with a cast inside, I will still benefit from partition pruning. The answer in the referenced Stack Overflow link suggests that partition pruning is preserved by the view, but I am unsure whether that holds when the column is used in a formula... Finally, given that the only way I can make this work is via parquet, it begs the question: is Drill the best solution for this in terms of performance? So far I like it, but migrating the database will be time-consuming, and I would like to choose the best destination without too much trial and error...

Answer

I ended up using Spark. The only alternative I currently know about, which was brought to my attention by Simon Darr (whom I wish to thank again!), is Apache Drill. Pros and cons of each solution, as far as I could test:

  • Neither solution offers a simple way to import the existing schema when the database is exported as text (in my case, CSV files).
  • Both solutions import the schema correctly from parquet files, so I have decided I must recreate my tables in parquet format from my source cluster (which uses Impala).
  • The remaining problem concerns the partitioning: I was at long last able to figure out how to import partitioned files on Spark, and the process of adding that partition dimension is seamless (I got help from here and here for that part), whereas I was not able to find a way to do this convincingly using Drill (although creating a view, as suggested here, does help somewhat):
    • On Spark, I used: spark.sql("select * from parquet.`file:///mnt/data/SGDATA/sliced_liquidity_parq_part/`"). Note that it is important not to use the * wildcard, as I first did, because with the wildcard each parquet file is read without looking at the directory it belongs to, so the directory structure is not taken into account for the partitioning and those fields are not added to the schema. Without the wildcard, a directory name with the syntax field_name=value is correctly added to the schema, and the value types themselves are correctly inferred (in my case int, because I use the thedate=intvalue syntax). A short sketch follows this list.
    • On Drill, the trick of creating a view is a bit messy, since it involves, first, taking a substring of dir0 in order to extract the field name and value, and second, a cast in order to give that field the correct type in the schema. I am really not certain this sort of view would enable partition pruning for queries run thereafter, so I was not fond of this hack. NB: there is likely another way to do this properly; I simply haven't found it.
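A minimal sketch of the Spark side, expressed in Spark SQL; the second table's path is hypothetical, and only the thedate partition column is assumed to exist in both tables:

```sql
-- Hedged sketch: reading the partitioned directory (no wildcard) makes Spark discover
-- thedate from the thedate=... subdirectory names and add it, typed as int, to the schema.
CREATE OR REPLACE TEMPORARY VIEW table1 AS
SELECT * FROM parquet.`file:///mnt/data/SGDATA/sliced_liquidity_parq_part/`;

-- Hypothetical path for the second table.
CREATE OR REPLACE TEMPORARY VIEW table2 AS
SELECT * FROM parquet.`file:///mnt/data/SGDATA/table2_parq_part/`;

-- A filter on the partition column should let Spark prune to the matching subfolders.
SELECT t1.thedate, COUNT(*) AS n
FROM table1 t1
JOIN table2 t2 ON t1.thedate = t2.thedate
WHERE t1.thedate BETWEEN 1995 AND 2000
GROUP BY t1.thedate;
```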

Along the way I learned about Drill (which seems great for logs and data that don't have a known structure), and learned that Spark can do a lot of what Drill does when the data is structured (I had no idea it could read CSV or parquet files directly without an underlying database system). I also did not know that Spark was so easy to install on a standalone machine: after following the steps here, I simply created a script in my bashrc that launches the master, a worker, and the shell all in one go (although I cannot comment on the performance of using a standalone computer for this; perhaps Spark is bad at it). Having used Spark a bit in the past, this solution still seems best for me given my options. If there are any other solutions out there, keep them coming, as I won't accept my own answer just yet (I need a few days to convert all my tables to parquet anyway).
