Large public datasets?
Question
I am looking for some large public datasets, in particular:
- Large sample web server logs that have been anonymized.
- Datasets used for database performance benchmarking.
Any other links to large public datasets would be appreciated. I already know about Amazon's public datasets at: http://aws.amazon.com/publicdatasets/
Recommended answer
1. Large sample web server logs that have been anonymized.
These should work as a starting point:
- UCI Machine Learning Repository
- Anonymous Microsoft Web Data
- MSNBC.com Anonymous Web Data
- Syskill and Webert Web Page Ratings
There are many, many more data sets available than these (see the gamut of other answers), but this is the lowest hanging fruit that meets your original criteria. As a bonus, they have a contact link if you have specific needs they may know of.
2. Datasets used for database performance benchmarking.
This sounds like a misnomer, because you're asking for empirical data sets that describe well-defined algorithmic problems. Specifically, it sounds like you're trying to find sets of data that you can use to test and benchmark various database systems in real time, using well-defined, normalized relational data that can be used as a set of test cases for determining the most efficient solution that meets your needs.
I don't agree with this approach. Instead of finding a litany of database systems and their canned implementations, it's far better to explore the algorithmic guarantees of these systems as your first port of call. Once you've determined the algorithmic constraints that meet your needs, you can hone in on a set of canned solutions that you can benchmark on efficiency of, for example, indexing, sorting, searching, insertion, deletion, and retrieval.
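As a rough illustration of what an algorithmic guarantee buys you before any benchmarking, the sketch below counts the work done by an unindexed scan versus a lookup over sorted keys (a simplified stand-in for an ordered index). The function names are hypothetical, and this does not model any particular database system.

```python
def scan_lookup(keys, target):
    """Unindexed lookup: check keys one by one.

    Returns (found, number of keys examined).
    """
    examined = 0
    for key in keys:
        examined += 1
        if key == target:
            return True, examined
    return False, examined


def sorted_lookup(sorted_keys, target):
    """Binary search over keys that are already sorted.

    Returns (found, number of loop iterations).
    """
    lo, hi = 0, len(sorted_keys) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] == target:
            return True, steps
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, steps


if __name__ == "__main__":
    keys = list(range(1_000_000))
    # Worst case for the scan: the target is the last key.
    print("scan:  ", scan_lookup(keys, 999_999))
    print("sorted:", sorted_lookup(keys, 999_999))
```

The scan examines every key in the worst case, while the sorted lookup needs only about log2(n) iterations. That gap is the kind of guarantee worth pinning down before comparing products.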
Wikipedia provides a terse article on database testing concepts that you can use to determine and write test cases for benchmarking performance. For example, you might use an agnostic data access interface like JDBC and JDBC Benchmark to determine the relative timings of each operation. From here, you can hone in on a correct solution.
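A minimal sketch of that kind of per-operation timing follows. It deliberately swaps in Python's DB-API with the standard-library `sqlite3` driver instead of JDBC so that it runs with no extra setup. The table name `bench`, the function `time_operations`, and the row count are assumptions made for illustration only.

```python
import sqlite3
import time


def time_operations(conn, n_rows=10_000):
    """Time a few basic operations on one DB-API connection.

    Returns a dict mapping operation name to elapsed seconds.
    Note: the '?' placeholders are sqlite3's parameter style;
    other DB-API drivers may use a different one.
    """
    timings = {}
    cur = conn.cursor()
    cur.execute("CREATE TABLE bench (id INTEGER PRIMARY KEY, payload TEXT)")

    def timed(name, fn):
        # Measure wall-clock time for the operation plus its commit.
        start = time.perf_counter()
        fn()
        conn.commit()
        timings[name] = time.perf_counter() - start

    rows = [(i, f"row-{i}") for i in range(n_rows)]
    timed("insert", lambda: cur.executemany(
        "INSERT INTO bench VALUES (?, ?)", rows))
    timed("indexed_lookup", lambda: cur.execute(
        "SELECT payload FROM bench WHERE id = ?", (n_rows // 2,)).fetchall())
    timed("full_scan", lambda: cur.execute(
        "SELECT COUNT(*) FROM bench WHERE payload LIKE '%9'").fetchall())
    timed("delete", lambda: cur.execute(
        "DELETE FROM bench WHERE id % 2 = 0"))
    return timings


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    for name, seconds in time_operations(connection).items():
        print(f"{name:15s} {seconds:.6f} s")
```

Running the same function against connections to different systems gives relative timings per operation. A single run like this is noisy, so a real comparison would repeat each operation several times and use data sizes close to the intended workload.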
In short, go to the research first for determining database guarantees. Once a set of candidate solutions has been identified, you can select amongst those by testing (or otherwise determining) the constant time performance of each desired operation.