将大型数据集拆分为更小的部分以供 Web 查看 [英] Splitting up large data sets into smaller pieces for web viewing

查看:62
本文介绍了将大型数据集拆分为更小的部分以供 Web 查看的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

背景

我正在为 Web 应用程序开发 Web 平台.该平台的关键功能之一是用户管理.因此,我实现了一个登录系统,使注册用户能够登录平台和应用程序.这是使用 MySQL 数据库实现的,其中包含登录凭据、用户 ID、联系信息等...

I am developing a web platform for web applications. One of the key functions of this platform is user management. Therefore I have implemented a login system that enables registered users to log into the platform and the application. This is implemented using a MySQL database which contains login credentials, userids, contact information, etc...

问题

我遇到的问题是用户表.这张桌子可能有很大的潜力.因此,如果管理员想要编辑特定用户的信息,根据用户在列表中的位置,管理员将不得不滚动可能的数千条记录以找到正确的记录.我正在寻找一种算法,它将把它分成大小为 N 的组,并将这些组显示为管理员可以点击的范围:#0-B、CF、GL、MR 等....

The issue that I am running in to is with the users table. This table can have the potential to be quite large. So if an admin wants to edit the information for a specific user, depending on where the user is in the list, the admin will have to scroll through potentially thousands of records to find the correct one. I am looking for is an algorithm that will split this up into groups of about size N and display those groups as ranges that the admin can click on: #0-B, C-F, G-L, M-R, etc....

研究

这是我迄今为止进行的研究:

This is the research that I have conducted so far:

数据库分区 - 用 PHP 还是 MySQL 更好?

1 个非常大的表或3 大桌子?MySQL 性能

关于分区,我在网上找到了很多关于对数据库表进行分区的文章,但都没有解决我想要做的事情.

With partitioning, there are many articles I found on the web to partition a database table, but nothing addresses what I'm trying to do.

到目前为止我所做的

我编写了一个从 Web 服务器离线运行的程序,该程序返回每个桶 0-9 和 A-Z 的计数.对于小于阈值的计数,它将桶组合成一个范围并将其保存到数据库中.因此,如果桶 0-9 和 A、B 的总计数小于阈值,则范围变为 #0-B.如果桶的大小大于阈值,我在试图弄清楚如何从桶中获取范围时遇到问题.

I have written a program that runs offline from the web server that returns the counts for each bucket 0-9 and A-Z. For counts less than the threshold, it combines the buckets into a range and saves that to the database. So if total count of buckets 0-9 and A, B are less than the threshold, then the range becomes #0-B. I'm having problems trying to figure out how to get a range from a bucket if the size of the bucket is greater than the threshold.

我考虑了以下解决方案:

I have considered the following solutions:

  1. 一个递归函数,不断深入用户名,直到它在阈值范围内.如果我可以让它工作,这将是理想的.
  2. 由于它是离线运行的,我喜欢的另一个解决方案是将整个存储桶加载到内存中并从那里拆分.这样做的缺点是潜在的内存使用量.
  3. 另一个让我感到有趣的想法是 #2 的变体,即将存储桶的一部分加载到内存中并对其进行处理.这将或多或少地将内存使用量减少到固定数量,但需要更多时间来处理.

编辑 5-11/2020:

EDIT 5-11/2020:

根据一个答案和评论,我可以有一个搜索字段,当条目数缩小到低于阈值时,该字段将使用 JSON 填充列表(这是个好主意).但是,我确实有代码可以与您分享:

Based on the one answer and the comments, I could have a search field that would use JSON to populate the list when the number of entries is narrowed down to something below the threshold value (That is a good idea). But, I do have code which I will share with you:

此代码有效,但对于测试数据,它在分区表中创建了约 17,500 个条目,并且有相当多的条目为零计数.

This code works, but with the test data, it creates ~17,500 entries in the partition table, and there's quite a few entries that have zero counts.

function part_recurse2($prefix)
{
    global $CONFIGVAR;
    global $charList;
    global $charCount;

    $list = dbCountScan($prefix);
    for ($i = 0; $i < count($list); $i++)
    {
        $char = substr($charList, $i, 1);
        if ($list[$i] >= $CONFIGVAR['user_partition_max_size']['value'])
        {
            part_recurse2($prefix . $char);
        }
        else
        {
            writeRange($prefix . $char, $prefix . $char, $list[$i],
                substr($prefix . $char, 0, 1));
        }
    }
}

以及来自数据库的结果...

And the results from the database...

mysql> select sum(`count`) from userpart;
+--------------+
| sum(`count`) |
+--------------+
|       100004 |
+--------------+
1 row in set (0.01 sec)

mysql> select count(*) from userpart where count = 0;
+----------+
| count(*) |
+----------+
|     1139 |
+----------+
1 row in set (0.01 sec)

此代码部分有效,但计数不相加,并且其中也有零.正确的用户计数是 100,004,但下面的函数产生的总计数为 105,234,这比用户表中的实际条目多 5,230.这个版本的好处是它会合并范围,直到达到阈值.这就是我想要的工作.

This code partially works, but the counts do not add up, and there are zeros in it as well. The correct user count is 100,004 but the below function produces a total count that is 105,234 which is 5,230 more entries in the users table than there actually is. The nice thing about this version is that it will combine ranges until the threshold is met. This is what I would like to get working.

function part_recurse($prefix)
{
    global $charCount;
    global $charList;
    global $CONFIGVAR;

    $list = dbCountScan($prefix);
    $acc = 0;
    $start = substr($charList, 0, 1);
    for ($i = 0; $i < count($list); $i++)
    {
        if ($list[$i] == 0) continue;
        if ($list[$i] > $CONFIGVAR['user_partition_max_size']['value'])
        {
            // If the number of entries > threshold...
            if ($i == 0)
            {
                // Only for the first bucket.
                $start = substr($charList, 1, 1);
            }
            else
            {
                // All other entries.
                if ($i >= $charCount - 1)
                {
                    // Last entry
                    $end = substr($charList, $i - 1, 1);
                    $acc += $list[$i];
                    writeRange($prefix . $start, $prefix . $end, $acc,
                        substr($prefix . substr($charList, $i, 1), 0, 1));
                    $acc = 0;
                }
                else
                {
                    $end = substr($charList, $i - 1, 1);
                    writeRange($prefix . $start, $prefix . $end, $acc,
                        substr($prefix . substr($charList, $i + 1, 1), 0, 1));
                    $acc = 0;
                    $start = substr($charList, $i, 1);
                }
            }
            part_recurse($prefix . substr($charList, $i, 1));
        }
        else
        {
            if (($acc + $list[$i]) >= $CONFIGVAR['user_partition_max_size']['value'])
            {
                $end = substr($charList, $i - 1, 1);
                writeRange($prefix . $start, $prefix . $end, $acc,
                    substr($prefix . substr($charList, $i, 1), 0, 1));
                $start = substr($charList, $i, 1);
                $acc = $list[$i];
            }
            else
            {
                $acc += $list[$i];
            }
        }
    }

    // Write the final entry.
    if ($acc > 0)
    {
        $end = substr($charList, $charCount - 1, 1);
        $bucket = substr($prefix . substr($charList, $i, 1), 0, 1);
        writeRange($prefix . $start, $prefix . $end, $acc, $bucket);
    }
}

以及它的数据库结果:

mysql> select sum(`count`) from userpart;
+--------------+
| sum(`count`) |
+--------------+
|       105234 |
+--------------+
1 row in set (0.00 sec)
mysql> select count(*) from userpart where count = 0;
+----------+
| count(*) |
+----------+
|      316 |
+----------+
1 row in set (0.00 sec)

正确的条目数是 100,004,没有零计数.我会不断完善代码,但如果有人发现我做错了什么,请赐教.书外函数具有以下属性:

The correct number of entries is 100,004 with no zero counts. I will keep refining the code, but if someone sees something that I am doing wrong, please enlighten me. The off book functions have the following properties:

dbCountScan($prefix): 此函数使用 SELECT COUNT(*) FROM USERS WHERE username 像 '?%';在哪里 ?是 $prefix 和 for 循环中当前字母的串联.

dbCountScan($prefix): This function steps through the characters 0-9 and A-Z using SELECT COUNT(*) FROM USERS WHERE username like '?%'; where ? is a concatenation of $prefix and the current letter in the for loop.

WriteRange($start, $end, $count, $bucket): 此函数将范围写入数据库.$start 和 $end 是范围的开始和结束.$count 是范围内的条目数.而 $bucket 是这个范围所属的顶级bucket(0-9,A-Z).

WriteRange($start, $end, $count, $bucket): This function writes the range to the database. $start and $end is the start and end of the range. $count is the number of entries in the range. And $bucket is the top level bucket (0-9, A-Z) that this range belongs to.

这里是数据库表用户部分:

And here's the database table userpart:

CREATE TABLE `userpart` (
  `rangest` varchar(16) NOT NULL COMMENT 'Database query starting range.',
  `rangeed` varchar(16) NOT NULL COMMENT 'Database query ending range.',
  `count` int unsigned NOT NULL COMMENT 'Count in range.',
  `bucket` varchar(5) DEFAULT NULL COMMENT 'Primary bucket.'
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Table that is used to partition usernames.'

推荐答案

滚动可能有数千条记录

scroll through potentially thousands of records

编写查询,可能是临时的,以过滤到更少的记录.如果管理员需要滚动浏览超过几十条记录,那么您(和数据库)就会让他失望.

Write queries, possibly ad hoc, to filter down to fewer records. If the admin needs to scroll through more than a few dozen records, you (and the database) are failing him.

分页——不要使用OFFSET;记住你离开的地方":http://mysql.rjweb.org/doc.php/分页

Pagination -- Don't use OFFSET; "remember where you left off": http://mysql.rjweb.org/doc.php/pagination

分区——否;你没有提出一个可以从中受益的案例.您可以简单地说 WHERE name LIKE 'J%'WHERE name >= 'P' AND name <;'T'(对于 p,q,r,s)并且有一个以 name 开头的 INDEX;但我怀疑这是否真的对管理员有帮助.

Partitioning -- No; you have not presented a case where it would benefit. You can simply say WHERE name LIKE 'J%' or WHERE name >= 'P' AND name < 'T' (for p,q,r,s) and have an INDEX starting with name; but I doubt if that really helps the admin.

一个不断向下钻取的递归函数"——这就是 BTree 索引已经为你做的事情.只需执行 WHERE name >$leftoff ORDER BY name LIMIT 20.即使用 LIMIT 作为桶大小;同时不要费心预先定义桶边界.

"A recursive function that keeps drilling down" -- That's sort of what a BTree index already does for you. Simply do WHERE name > $leftoff ORDER BY name LIMIT 20. That is, use LIMIT for the bucket size; meanwhile don't bother predefining the bucket boundaries.

潜在内存使用"——通常最好让数据库拥有大部分可用内存.

"potential memory usage" -- It is usually better to simply let the database have most of the available memory.

1 个表 vs 3 个 -- 请详细说明这些表中的内容.

1 table vs 3 -- Please elaborate on what would be in those tables.

搜索

正如其他人所说,使用某种搜索机制可能是获得所需记录的最快方式.不要让管理员滚动数千行.

As others have said, using some search mechanism is likely to be the fastest way to get to the desired record(s). Do not make the admin scroll through thousands of rows.

提供一个表单,其中包含一项或多项内容供管理员填写

Provide a form with one or more things for the admin to fill in

  • 完整的用户名.(问题:拼写错误/不知道确切的拼写)
  • 通配符部分名称.示例:'丹%';然后显示所有以 Dan 开头的用户名.
  • 大量超链接,每个用户一个.这是可行的高达几千;我建议不要这样做.
  • 用户的一个属性——开始日期、没有活动等等.然后搜索该属性.

对于这些部分情况,拒绝显示超过100个用户名;如果超过,要求管理员提供更多详细信息.不要打扰增加的分页复杂性.

For these partial situations, refuse to show more than 100 usernames; if more than that, demand that the admin provide more details. Don't bother the added complexity of pagination.

我已经实施了各种这些和其他机制.回想起来,我唯一一次对几十个项目使用分页是当我需要对所有项目采取行动时.示例:从旅行中挑选一千张照片中的哪一张移动到相册"中.这涉及查看每张图片足够长的时间来挑选或拒绝每一张.此外,我使用 AJAX 只需单击一下即可完成所有需要的操作.

I have implemented a variety of those, and other, mechanisms. Thinking back, the only time I used pagination for more than a few dozen items was when I need to take action on all items. Example: picking which of a thousand pictures from a trip to move into an "album". This involved looking at each picture long enough to pick or reject each one. Also, I used AJAX to make a single click all the action needed.

这篇关于将大型数据集拆分为更小的部分以供 Web 查看的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆