SQLite:从路径列表创建目录结构表 [英] SQLite: Create Directory Structure Table from A List Of Paths

查看:44
本文介绍了SQLite:从路径列表创建目录结构表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我要创建一个目录结构表,如此 Directory_Parent =外键";id字段,它指向同一表中另一个目录的ID
=包含目录/文件夹名称的字符串

鉴于树/水果/苹果/

I want to create a directory structure table as described in this question where:

Directory = "Primary Key" id field, typically an integer
Directory_Parent = "Foreign Key" id field, which points to the id of another Directory in the same table
Value = string containing the directory/folder name

Given Tree/Fruit/Apples/

Directory | Directory_Parent | Value
0           null               Root
1           0                  Tree
2           1                  Fruit
3           2                  Apples

已在主键0上创建了一个 Root 文件夹,其父级为空.

A Root folder has been created at Primary Key 0 with a null parent.

我的路径是从CSV导入的,当前位于具有2列的表格中:

My paths are being imported from CSV and currently are in a table with 2 columns:

 FileID  Path
      1  videos/gopro/father/mov001.mp4
      2  videos/gopro/father/mov002.mp4
      3  pictures/family/father/Oldman.jpg
      4  pictures/family/father/Oldman2.jpg
      5  documents/legal/father/estate/will.doc
      6  documents/legal/father/estate/will2.doc
      7  documents/legal/father/estate/newyork/albany/will.doc
      8  video/gopro/father/newyork/albany/holiday/christmas/2002/mov001.mp4
      9  pictures/family/father/newyork/albany/holiday/christmas/2002/july/Oldman.jpg
      10 pictures/family/father/newyork/albany/holiday/christmas/2002/june/Oldman2.jpg

此表包含100万个文件条目.
什么是快速&解析数据并将文件夹结构移到如上所述的新表中的最佳方法?

在此
演示中,文件夹之间用"/"分隔.并移至新列(如果有帮助).

This table contains 1 million file entries.
What's a fast & optimized way to parse this data and move the folder structure into a new table as described above?

In this demo the folders are separated by "/" and moved into a new column, if that helps.

推荐答案

SQL缺乏编程语言的灵活性和工具,这将为该问题提供快速且优化的解决方案.

SQL lacks the flexibility and tools of a programming language, which would provide a fast and optimized solution to this problem.

此外,在字符串操作方面,SQLite是数据库中最差的,因为它不支持SQL Server的

Moreover SQLite is the poorest of the databases when it comes to string manipulation because it does not support a function like SQL Server's STRING_SPLIT() or MySql's SUBSTRING_INDEX() which would be very helpful.

尽管如此,问题还是很有趣,我试了一下.

Nevertheless the problem is interesting and I gave it a shot.

我使用以下语句创建表 dir_struct :

I create the table dir_struct with this statement:

CREATE TABLE dir_struct (
  Directory INTEGER PRIMARY KEY, 
  Directory_Parent INTEGER REFERENCES dir_struct(Directory), 
  Value TEXT
);

,然后插入'root'行:

INSERT INTO dir_struct (Directory, Directory_Parent, Value) VALUES (0, null, 'root');

此外,我通过以下方法关闭 OFF 外键强制执行:

Also, I turn OFF foreign key enforcement with:

PRAGMA foreign_keys = OFF;

尽管默认情况下已关闭,以防万一.

although it is by default turned off, just in case.

首先,您需要一个递归CTE,该CTE可以将路径拆分到各个目录(很像您上一个问题的答案).
然后在第二CTE中,通过条件聚合,每个目录进入其自己的列(限制为最多10个目录).
3d CTE删除重复项,第4 CTE使用 ROW_NUMBER()窗口函数将唯一的ID分配给目录.
最后,通过对第4个CTE的结果进行自我连接,将行插入到表中:

First you need a recursive CTE that splits the paths to individual directories (much like the answer on your previous question).
Then in the 2nd CTE, with conditional aggregation, each directory goes in its own column (with the limitation of max 10 directories).
The 3d CTE removes duplicates and the 4th CTE assigns, with ROW_NUMBER() window function, unique ids to the directories.
Finally with a self join of the results of the 4th CTE, the rows are inserted in the table:

WITH 
  split AS (
    SELECT 0 idx,
           FileDataID,
           SUBSTR(SUBSTR(Path, 1), 1, INSTR(SUBSTR(Path, 1), '/') - 1) item,
           SUBSTR(SUBSTR(Path, 1), INSTR(SUBSTR(Path, 1), '/') + 1) value
    FROM listfile
    UNION ALL
    SELECT idx + 1,
           FileDataID,
           SUBSTR(value, 1, INSTR(value, '/') - 1),
           SUBSTR(value, INSTR(value, '/') + 1)
    FROM split
    WHERE value LIKE '%_/_%' 
  ),
  cols AS (
    SELECT DISTINCT
           MAX(CASE WHEN idx = 0 THEN item END) path0,
           MAX(CASE WHEN idx = 1 THEN item END) path1,
           MAX(CASE WHEN idx = 2 THEN item END) path2,
           MAX(CASE WHEN idx = 3 THEN item END) path3,
           MAX(CASE WHEN idx = 4 THEN item END) path4,
           MAX(CASE WHEN idx = 5 THEN item END) path5,
           MAX(CASE WHEN idx = 6 THEN item END) path6,
           MAX(CASE WHEN idx = 7 THEN item END) path7,
           MAX(CASE WHEN idx = 8 THEN item END) path8,
           MAX(CASE WHEN idx = 9 THEN item END) path9
    FROM split
    GROUP BY FileDataID
  ),
  paths AS (
    SELECT path0, path1, path2, path3, path4, path5, path6, path7, path8, path9 FROM cols UNION
    SELECT path0, path1, path2, path3, path4, path5, path6, path7, path8, null FROM cols UNION
    SELECT path0, path1, path2, path3, path4, path5, path6, path7, null, null FROM cols UNION
    SELECT path0, path1, path2, path3, path4, path5, path6, null, null, null FROM cols UNION
    SELECT path0, path1, path2, path3, path4, path5, null, null, null, null FROM cols UNION
    SELECT path0, path1, path2, path3, path4, null, null, null, null, null FROM cols UNION
    SELECT path0, path1, path2, path3, null, null, null, null, null, null FROM cols UNION
    SELECT path0, path1, path2, null, null, null, null, null, null, null FROM cols UNION
    SELECT path0, path1, null, null, null, null, null, null, null, null FROM cols UNION
    SELECT path0, null, null, null, null, null, null, null, null, null FROM cols
  ), 
  ids AS (
    SELECT *, 
           ROW_NUMBER() OVER (ORDER BY path0, path1, path2, path3, path4, path5, path6, path7, path8, path9) nr,
           COALESCE(path9, path8, path7, path6, path5, path4, path3, path2, path1, path0) last_child,
           path0 || COALESCE('/' || path1, '') ||
                    COALESCE('/' || path2, '') ||
                    COALESCE('/' || path3, '') ||
                    COALESCE('/' || path4, '') ||
                    COALESCE('/' || path5, '') ||
                    COALESCE('/' || path6, '') ||
                    COALESCE('/' || path7, '') ||
                    COALESCE('/' || path8, '') ||
                    COALESCE('/' || path9, '') full_path
    FROM paths       
  )
INSERT INTO dir_struct(Directory, Directory_Parent, Value)
SELECT i1.nr, COALESCE(i2.nr, 0), i1.last_child
FROM ids i1 LEFT JOIN ids i2
ON i1.full_path = i2.full_path || '/' || i1.last_child

在我的由187365行组成的测试数据集中,这些行平均插入时间为9.5-10分钟,这对于较大的数据集来说会更长.

In my test dataset which consists of 187365 rows, the rows were inserted in (average) 9.5-10 minutes, which will be much longer for your larger dataset.

请参见演示.

更有趣的是,代码越简单,性能越差(但您也可以对其进行测试):

What is more interesting is that with simpler code, the performance is worse (but you can also test it):

WITH 
  split AS (
    SELECT Path,
           0 parent_len,
           SUBSTR(SUBSTR(Path, 1), 1, INSTR(SUBSTR(Path, 1), '/') - 1) item,
           SUBSTR(SUBSTR(Path, 1), INSTR(SUBSTR(Path, 1), '/') + 1) value
    FROM listfile
    UNION ALL
    SELECT Path,
           parent_len + LENGTH(item) + 1, 
           SUBSTR(value, 1, INSTR(value, '/') - 1),
           SUBSTR(value, INSTR(value, '/') + 1)
    FROM split
    WHERE value LIKE '%_/_%' 
  ), 
  row_numbers AS (
    SELECT parent_path, item, 
           ROW_NUMBER() OVER (ORDER BY parent_path, item) rn
    FROM (SELECT DISTINCT SUBSTR(Path, 1, parent_len) parent_path, item FROM split)       
  )
INSERT INTO dir_struct(Directory, Directory_Parent, Value)  
SELECT r1.rn, COALESCE(r2.rn, 0) rn_parent, r1.item 
FROM row_numbers r1 LEFT JOIN row_numbers r2
ON r1.parent_path = r2.parent_path || r2.item || '/'

此查询分配给目录的ID与第一个解决方案分配的ID不同,但它们是正确且唯一的.

The ids assigned to the directories by this query are different than the ones assigned by the first solution, but they are correct and unique.

(平均)运行14到15分钟.
请参见演示.

This runs in (average) 14-15 minutes.
See the demo.

结论是,如果这是一次性的事情,也许您可​​以使用它,但我不建议您将其作为对此要求的解决方案.

The conclusion is that if this is a one time thing, maybe you can use it, but I would not recommend it as a solution to this requirement.

这篇关于SQLite:从路径列表创建目录结构表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆