Load large dataset into crossfilter/dc.js


Problem description

I built a crossfilter with several dimensions and groups to display the data visually using dc.js. The data visualized is bike trip data, and each trip is loaded in. Right now, there are over 750,000 records. The JSON file I'm using is 70 MB, and it will only grow as I receive more data in the coming months.

So my question is: how can I make the data leaner so it scales well? Right now it takes approximately 15 seconds to load on my internet connection, and I'm worried that it will take too long once I have much more data. Also, I've tried to display a progress bar/spinner while the data loads, but without success.

The columns I need for the data are start_date, start_time, usertype, gender, tripduration, meters, age. I have shortened these fields in my JSON to start_date, start_time, u, g, dur, m, age so the file is smaller. At the top of the crossfilter there is a line chart showing the total number of trips per day. Below that there are row charts for the day of week (calculated from the data) and month (also calculated), and pie charts for usertype, gender, and age. Below that there are two bar charts for the start_time (rounded down to the hour) and tripduration (rounded up to the minute).

The project is on GitHub: https://github.com/shaunjacobsen/divvy_explorer (the dataset is in data2.json). I tried to create a jsfiddle but it is not working (likely due to the data, even gathering only 1,000 rows and loading it into the HTML with <pre> tags): http://jsfiddle.net/QLCS2/

Ideally it would function so that only the data for the top chart would load in first: this would load quickly since it's just a count of data by day. However, once it gets down into the other charts it needs progressively more data to drill down into finer details. Any ideas on how to get this to function?

Recommended answer

I'd recommend shortening all of the field names in your JSON to one character (including "start_date" and "start_time"). That should help a little. Also, make sure that compression is enabled on your server. That way the data sent to the browser is compressed automatically in transit, which should speed things up considerably if it isn't already enabled.
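As a rough illustration of the field-name shortening, here is a minimal sketch of a one-off conversion step. The one-character keys in `KEY_MAP`, the helper name `shortenKeys`, and the sample record are all made up for this example; they are not taken from the project.

```javascript
// Hypothetical mapping from the original column names to one-character keys.
var KEY_MAP = {
  start_date: 'd',
  start_time: 't',
  usertype: 'u',
  gender: 'g',
  tripduration: 'r',
  meters: 'm',
  age: 'a'
};

// Return a copy of each record with its keys renamed according to keyMap.
// Keys that are missing from keyMap are kept unchanged.
function shortenKeys(rows, keyMap) {
  return rows.map(function (row) {
    var out = {};
    Object.keys(row).forEach(function (key) {
      var hasShortKey = Object.prototype.hasOwnProperty.call(keyMap, key);
      out[hasShortKey ? keyMap[key] : key] = row[key];
    });
    return out;
  });
}

// Made-up sample record, used only to compare serialized sizes.
var sample = [{
  start_date: '2013-06-27',
  start_time: '12:11',
  usertype: 'Subscriber',
  gender: 'Male',
  tripduration: 316,
  meters: 1200,
  age: 31
}];
var compact = shortenKeys(sample, KEY_MAP);

// Log the character counts of the long-key and short-key JSON.
console.log(JSON.stringify(sample).length, JSON.stringify(compact).length);
```

The saving per record is small, but it is repeated for every one of the 750,000+ rows. With server-side compression enabled the difference on the wire shrinks, since repeated keys compress well, though the browser still has less text to parse.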

For better responsiveness, I'd also recommend first setting up your Crossfilter (empty), all your dimensions and groups, and all your dc.js charts, then using the add() method of your Crossfilter instance to add more data in chunks. The easiest way to do this is to divide your data into bite-sized chunks (a few MB each) and load them serially. So if you are using d3.json, start the next file load in the callback of the previous file load. This results in a bunch of nested callbacks, which is a bit nasty, but it should allow the user interface to stay responsive while the data is loading.
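The serial chunk loading described above can be sketched without deep nesting by using a small recursive helper. This is only a sketch under assumptions: `loadChunksSerially` is a hypothetical name, the chunk file names are invented, and `loadJson` is assumed to follow the `callback(error, rows)` convention that `d3.json(url, callback)` uses in the callback-based versions of d3 (newer d3 releases return a Promise instead). The loader here is a fake so the block runs on its own.

```javascript
// Load chunk files one at a time and hand each chunk to onChunk as it arrives.
// loadJson(url, callback) must call callback(error, rows).
// onDone(error) is called once, with null on success.
function loadChunksSerially(urls, loadJson, onChunk, onDone) {
  function loadNext(index) {
    if (index >= urls.length) {
      onDone(null);
      return;
    }
    loadJson(urls[index], function (error, rows) {
      if (error) {
        onDone(error);
        return;
      }
      onChunk(rows, index, urls.length);
      // Only start the next request after this chunk has been handled.
      loadNext(index + 1);
    });
  }
  loadNext(0);
}

// Fake loader with invented file names, standing in for d3.json.
var fakeFiles = {
  'trips-0.json': [{ d: '2013-06-27' }],
  'trips-1.json': [{ d: '2013-06-28' }, { d: '2013-06-29' }]
};
function fakeLoadJson(url, callback) {
  callback(null, fakeFiles[url]);
}

var totalRows = 0;
loadChunksSerially(
  ['trips-0.json', 'trips-1.json'],
  fakeLoadJson,
  function (rows, index, count) {
    totalRows += rows.length;
    console.log('loaded chunk ' + (index + 1) + ' of ' + count);
  },
  function (error) {
    console.log(error ? 'failed' : 'done, rows: ' + totalRows);
  }
);
```

In the page itself, the charts would be rendered once against the empty Crossfilter, and `onChunk` would then do something like `ndx.add(rows); dc.redrawAll();`, where `ndx` is assumed to be the Crossfilter instance. The `index` and `count` arguments are also a convenient place to update a progress bar.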

Lastly, with this much data I believe you will start running into performance issues in the browser, not just while loading the data. I suspect you are already seeing this and that the 15-second pause you are seeing is at least partially in the browser. You can check by profiling in your browser's developer tools. To address this, you'll want to profile and identify performance bottlenecks, then try to optimize those. Also, be sure to test on slower computers if your audience uses them.
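Besides the profiler in the developer tools, a quick way to see where the time goes is to wrap individual steps in a timer. The helper below is a hypothetical sketch (`timeIt` is not part of d3, crossfilter, or dc.js), and the data is a stand-in so the block runs on its own.

```javascript
// Run fn, report how long it took, and return its result.
// Date.now() has millisecond resolution, which is enough for multi-second stalls.
// log is optional and defaults to console.log.
function timeIt(label, fn, log) {
  var start = Date.now();
  var result = fn();
  var elapsedMs = Date.now() - start;
  (log || console.log)(label + ': ' + elapsedMs + ' ms');
  return result;
}

// Stand-in text; in the page this would be the downloaded JSON.
var text = JSON.stringify([
  { d: '2013-06-27', t: '12:11' },
  { d: '2013-06-28', t: '08:30' }
]);

var rows = timeIt('JSON.parse', function () {
  return JSON.parse(text);
});
```

Wrapping the parse step, the call that adds records to the Crossfilter, and the chart rendering separately gives a rough split between network time and browser time before reaching for the full profiler.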
