Database optimized for searching a large number of objects with different attributes


Question


I am currently searching for an alternative to our aging MySQL database, which uses an EAV approach. Our current projects seem to have outgrown traditional table-oriented database structures, and especially searches in such a database. I have read about and researched various NoSQL database systems, but I can't find anything that seems to be what I'm looking for. Maybe you can help.

I'll show you a generalized example of what kind of data I have and what operations I want to execute on it:

I have an object that has a small number of META attributes, that is, attributes that are common to all instances of my objects. For example:

DataObject Common (META) Attributes

  • Unique ID (Some kind of string containing a unique identifier)
  • Created Date (A date time showing creation time of the object)
  • Type (Some kind of type identifier, maybe something like "Article", "News", "Image" or "Video")
  • ... I think you get the idea

Then each of my objects has a variable number of other attributes. Most probably, many objects will share a number of these attributes, but there is no rule. For my sample, let's say each object instance has between 5 and 20 such attributes. Here are some samples:

DataObject Variable Attributes

  • Color (Some CSS-like color string)
  • Name (A string)
  • Category (The category or tag of this item; maybe we also have more than one of these?)
  • URL (A URL pointing to some website)
  • Cost (A number with decimals)
  • ... And a whole lot of other stuff, mostly of the usual column types

References to other data are an idea, but not a must at the moment. I could provide those within my application logic if needed.

A small sample:

Image

  • Unique ID = "0s987tncsgdfb64s5dxnt"
  • Created Date = "2013-11-21 12:23:11"
  • Type = "Image"
  • Title = "A cute cat"
  • Category = "Animal"
  • Size = "10234"
  • Mime = "image/jpeg"
  • Filename = "cat_123.jpg"
  • Copyright = "None"
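To make the contrast with EAV concrete, here is a minimal sketch of the sample object written as one self-contained document, plus a helper that flattens it into the rows an EAV table would hold. The lowercase field names, `META_KEYS` and `to_eav_rows` are assumptions made for illustration; they do not come from the question.

```python
import json

# The sample object from the question, written as a single document.
# Field names are hypothetical; only the values come from the sample.
image = {
    "uid": "0s987tncsgdfb64s5dxnt",
    "created": "2013-11-21 12:23:11",
    "type": "Image",
    "title": "A cute cat",
    "category": "Animal",
    "size": "10234",
    "mime": "image/jpeg",
    "filename": "cat_123.jpg",
    "copyright": "None",
}

# Attributes shared by every object (the META attributes).
META_KEYS = ("uid", "created", "type")


def to_eav_rows(doc):
    """Flatten a document into (uid, attribute, value) rows,
    one row per variable attribute, as an EAV table would store it."""
    return [(doc["uid"], key, value)
            for key, value in doc.items()
            if key not in META_KEYS]


rows = to_eav_rows(image)
print(json.dumps(image, indent=2))
print(len(rows), "EAV rows are needed for this one object")
```

Read as a document, the object comes back in one piece; read as EAV, it has to be reassembled from one row per variable attribute.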

Typical Operations

An average storage would probably have around 1-5 million such objects, each with 5-20 attributes.

Apart from the usual stuff like writing one object to the database or reading it by its uid, the most problematic operations are these:

  • Search by several attributes - Select every DataObject that has Type "News", whose Title contains "blue", and whose Created Date is after 2012.
  • Paged bulk read - Get a large number of objects from a search (see above), starting at element 100 and ending at 250.
  • Get many objects with all of their attributes - When reading larger numbers of objects, I need to get every object with all of its attributes in one call.
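The three operations above can be sketched against a small EAV layout. This is only an illustration using Python's built-in `sqlite3` module rather than MySQL; the table names (`data_object`, `attribute`), the sample rows and the helpers `search` and `load_objects` are all assumptions, not the schema from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_object (uid TEXT PRIMARY KEY, created TEXT, type TEXT);
CREATE TABLE attribute (uid TEXT, name TEXT, value TEXT);
CREATE INDEX idx_attribute ON attribute (name, value);
""")

# Made-up sample data: (uid, created, type, variable attributes).
objects = [
    ("a1", "2013-03-01 10:00:00", "News", {"Title": "The blue whale", "Category": "Animal"}),
    ("a2", "2011-06-15 09:30:00", "News", {"Title": "Blue skies ahead", "Category": "Weather"}),
    ("a3", "2013-11-21 12:23:11", "Image", {"Title": "A cute cat", "Category": "Animal"}),
    ("a4", "2014-01-05 08:00:00", "News", {"Title": "Blue moon rising", "Category": "Space"}),
]
for uid, created, type_, attrs in objects:
    conn.execute("INSERT INTO data_object VALUES (?, ?, ?)", (uid, created, type_))
    conn.executemany("INSERT INTO attribute VALUES (?, ?, ?)",
                     [(uid, name, value) for name, value in attrs.items()])


def search(conn, type_, title_part, created_after, offset, limit):
    """Search by several attributes and return one page of matching uids.
    Every additional variable attribute in the filter needs another join."""
    sql = """
        SELECT o.uid
        FROM data_object AS o
        JOIN attribute AS t ON t.uid = o.uid AND t.name = 'Title'
        WHERE o.type = ? AND t.value LIKE ? AND o.created > ?
        ORDER BY o.created
        LIMIT ? OFFSET ?
    """
    params = (type_, f"%{title_part}%", created_after, limit, offset)
    return [row[0] for row in conn.execute(sql, params)]


def load_objects(conn, uids):
    """Fetch all variable attributes of the given objects with one query."""
    if not uids:
        return {}
    marks = ",".join("?" for _ in uids)
    result = {uid: {} for uid in uids}
    query = f"SELECT uid, name, value FROM attribute WHERE uid IN ({marks})"
    for uid, name, value in conn.execute(query, list(uids)):
        result[uid][name] = value
    return result


page = search(conn, "News", "blue", "2012-12-31 23:59:59", offset=0, limit=10)
print(page)
print(load_objects(conn, page))
```

For the paged bulk read in the list, the same call would use `offset=100, limit=150`. The sketch also shows where the cost comes from: the search needs a join per filtered attribute, and the full objects have to be rebuilt from attribute rows in a second step.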

Storage Requirements

  • Persistence - The storage needs to be persistent and not in memory only. If the server reboots, the data has to be at the same point in time as when it shut down. No memory-only systems.
  • Integrity - All data is important; nothing can be ignored. So every single write action has to be securely stored. Systems (Redis?) that tend to lose something now and then aren't usable. Systems with heavy asynchronicity are also problematic. If data changes, every responsible node should see that.
  • Complexity - The system should be fairly easy to set up and maintain. So systems that force the admin to take week-long courses in their use aren't really a solution here. The same goes for huge data warehouses with loads of nodes. Clustering is nice, but it should also be possible to get a cheap system with one node.

tl;dr

Need a super fast database system with object-oriented data and fast searches, even with hundreds of thousands of items.

The reason why I am searching for a better alternative to MySQL can be found here: Need MySQL optimization for complex search on EAV structured data


Update

Key-value stores like Redis weren't an option, as we need to do some heavy searching inside our data, something which isn't possible in a typical key-value store.

In the end, we are using MongoDB with a slightly optimized schema to make the best use of MongoDB's indexes.

Some small drawbacks still remain but are acceptable at the moment:

  • MongoDB's aggregate function cannot work with very large result sets. We have to use find (and refine our data structure to make that sufficient).
  • You cannot sort large datasets on specific values, as it would take up too much memory. You also can't create indexes on those values, as they are schema-free.
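As a rough illustration of the `find`-based approach mentioned in the update, the search from the question could look like the sketch below. The field names (`type`, `title`, `created`) and the helpers `build_filter` and `search_page` are assumptions; the actual schema is not shown in the post. `collection` is assumed to behave like a PyMongo collection, so the code only builds the query and does not connect to a server.

```python
import re
from datetime import datetime


def build_filter(type_, title_part, created_after):
    """Build a MongoDB filter document for the search in the question.
    Assumes `created` is stored as a datetime, not as a string."""
    return {
        "type": type_,
        # Case-insensitive "contains" match on the title.
        "title": {"$regex": re.escape(title_part), "$options": "i"},
        "created": {"$gt": created_after},
    }


def search_page(collection, query, first, last):
    """Return the matching documents from position `first` up to
    (but not including) `last`, each with all of its attributes.
    `collection` is assumed to be a pymongo Collection or a look-alike."""
    cursor = (collection.find(query)
              .sort("created", 1)   # 1 means ascending
              .skip(first)
              .limit(last - first))
    return list(cursor)


query = build_filter("News", "blue", datetime(2012, 12, 31, 23, 59, 59))
print(query)
```

With a real collection, `search_page(collection, query, 100, 250)` would cover the paged bulk read. Sorting on an indexed field such as the assumed `created` sidesteps the memory limit described above, whereas sorting on an arbitrary variable attribute would not.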

Answer

I don't know if you want a more sophisticated answer than mine, but maybe I can inspire you a little.

MySQL is scalable and can be used for exactly your purpose. I think it's more of an optimization and server problem if your database is slow. Many systems with massive amounts of data use MySQL and work perfectly, though NoSQL (Not Only SQL) is built for large amounts of data with different attributes.

There are many different NoSQL providers, and they have different ways of handling your data. Think about that before you choose a NoSQL platform.

The possibilities are:

  • Key-value store - e.g. Redis, Voldemort, Oracle BDB
  • Column store - e.g. Cassandra, HBase
  • Document store - e.g. CouchDB, MongoDB
  • Graph database - e.g. Neo4j, InfoGrid, InfiniteGraph

Most websites use document-based storage, but Facebook, for example, uses the column-based kind because of its many dynamic attributes.

You can try the document-based NoSQL at http://try.mongodb.org/

In the end, it really depends on how you build and optimize your database, and not on which technology you choose, though choosing the right technology can save a bunch of time.

The system we have developed uses a combination of MySQL and NoSQL, depending on what data we are working with: MySQL for the system itself and NoSQL for all the data we import via APIs.

Hope this inspires you a little, and feel free to ask any questions.
