How to remove duplicate rows without losing the ability to do analytics on the count of those duplicate rows?


Problem description


I am working on an analytics application which contains data from crawled websites.

Right now the crawled data is stored in separate rows, causing a lot of duplication.

I will give an example to explain the situation:

The crawler goes to Google.com and Yahoo.com, and figures out the main keywords on each website.

For Google the keywords are "Search" and "Internet", and for Yahoo the keywords are "Search" and "News". (The accuracy of the keywords doesn't matter in this example.)

These keywords are then inserted into the keywords table:

KEYWORDS
---------------------------
ID
Keyword
Created_at

insert into keywords (Keyword, Created_at) values ('Search', now());
insert into keywords (Keyword, Created_at) values ('Internet', now());
insert into keywords (Keyword, Created_at) values ('Search', now());
insert into keywords (Keyword, Created_at) values ('News', now());

So as you can see, the keywords for both Google and Yahoo were inserted into the database. But the problem is 'Search' is now in the database twice.

For analytics purposes, I need to keep a record that both Google and Yahoo had the keyword 'Search' (to keep this example simple I am excluding any foreign keys), so I can't simply not add 'Search' if it already exists in the database.

So this is my question:

I would like to only store the keyword once (the first time it is found), and I also need a corresponding record somewhere which keeps track of every time it is found.

So it's almost like I need something like this:

KEYWORDS                          KEYWORDS_FREQUENCY                    
---------------------------       ---------------------------
ID                                ID
Keyword (unique)                  Keyword_ID
                                  Created_at

But I'm not sure whether I'm confused and thinking about this the wrong way. I'm also concerned that analytics queries on this will become very slow because they need joins, whereas the simple redundant design (the original design described above) would probably be quite fast as it's one big dumb table.
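For concreteness, the two-table layout proposed above might be sketched as follows. This is only an illustration under assumptions: it uses SQLite through Python's `sqlite3` module (the question does not name a DBMS), the lowercase table names and the `record_keyword` helper are hypothetical, and foreign keys to the crawled sites are left out, as in the question.

```python
import sqlite3

# Hypothetical schema mirroring the two-table proposal (SQLite dialect assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE keywords (
        id      INTEGER PRIMARY KEY,
        keyword TEXT NOT NULL UNIQUE
    );
    CREATE TABLE keywords_frequency (
        id         INTEGER PRIMARY KEY,
        keyword_id INTEGER NOT NULL REFERENCES keywords(id),
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
""")

def record_keyword(conn, keyword):
    """Store the keyword once, then log one occurrence row for it."""
    # The UNIQUE constraint makes a repeated insert a no-op.
    conn.execute("INSERT OR IGNORE INTO keywords (keyword) VALUES (?)", (keyword,))
    (keyword_id,) = conn.execute(
        "SELECT id FROM keywords WHERE keyword = ?", (keyword,)
    ).fetchone()
    conn.execute(
        "INSERT INTO keywords_frequency (keyword_id) VALUES (?)", (keyword_id,)
    )

# Google: Search, Internet; Yahoo: Search, News (the example from the question).
for kw in ["Search", "Internet", "Search", "News"]:
    record_keyword(conn, kw)

# Counting occurrences now requires the join the question is worried about.
counts = conn.execute("""
    SELECT k.keyword, COUNT(*) AS occurrences
    FROM keywords AS k
    JOIN keywords_frequency AS f ON f.keyword_id = k.id
    GROUP BY k.keyword
    ORDER BY k.keyword
""").fetchall()
print(counts)  # [('Internet', 1), ('News', 1), ('Search', 2)]
```

Each keyword string is stored once, but every write and every count query goes through the id mapping.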

If you can understand what I am trying to achieve, could you please give me advice on how to design this?

Thank you.

Solution

TL;DR You have not justified introducing string ids, but avoiding duplicates is always a good start.


It is not a problem per se to have "redundancy" in the sense of the same value appearing multiple times in a database. In your own proposal, the introduced id values appear in exactly the places where their corresponding string values used to appear, so by also adding a mapping table from id to string you have actually added inessential data to the database, which is clearly "redundant" in another sense. (That includes the consequent "redundant" joins.) Meaningful redundancy arises when multiple rows overlap in what they state about the application situation, and problematic redundancy of that kind arises when it is not controlled by organizing according to good design principles.

Your concern with ids seems to be an attempt to address anticipated performance problems via data compression. For your particular DBMS and patterns of update and querying, one design might have some implementation performance problem that another design improves (at the expense of complexity, maintainability and merely different tradeoffs of time and space). But assuming so without evidence is unjustified.

You probably want either a unique identifier for each occurrence of a value (typically using the DBMS's auto-increment id facility), or a count of its occurrences. The benefit of this versus allowing duplicates is that your tables are relations, and DBMS querying of data in terms of generic properties expressed by logical conditions is based on tables being relations. (Essentially, to query using logical conditions you have to convert a representation with duplicates to one without, then query, then, if wanted, convert back to a representation with duplicates.) On the other hand, performance may again lead you to store lists of values rather than relations, with inserts being faster at the expense of queries being slower.
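The two alternatives named above could look roughly like this. It is a minimal sketch, assuming SQLite via Python's `sqlite3`; the table names `keyword_occurrence` and `keyword_count` and the helpers `add_occurrence` and `bump_count` are invented for illustration and do not come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Option A: one row per occurrence; the auto-assigned id makes each row unique.
    CREATE TABLE keyword_occurrence (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        keyword    TEXT NOT NULL,
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    -- Option B: one row per keyword, carrying a count of its occurrences.
    CREATE TABLE keyword_count (
        keyword     TEXT PRIMARY KEY,
        occurrences INTEGER NOT NULL
    );
""")

def add_occurrence(conn, keyword):
    """Option A: append a new, uniquely identified occurrence row."""
    conn.execute("INSERT INTO keyword_occurrence (keyword) VALUES (?)", (keyword,))

def bump_count(conn, keyword):
    """Option B: increment the keyword's counter, creating it on first sight."""
    cur = conn.execute(
        "UPDATE keyword_count SET occurrences = occurrences + 1 WHERE keyword = ?",
        (keyword,),
    )
    if cur.rowcount == 0:  # no existing row was updated
        conn.execute(
            "INSERT INTO keyword_count (keyword, occurrences) VALUES (?, 1)",
            (keyword,),
        )

for kw in ["Search", "Internet", "Search", "News"]:
    add_occurrence(conn, kw)
    bump_count(conn, kw)

# Option A derives the count at query time; no join is involved.
counts_a = conn.execute(
    "SELECT keyword, COUNT(*) FROM keyword_occurrence GROUP BY keyword ORDER BY keyword"
).fetchall()
# Option B reads the stored count directly.
counts_b = conn.execute(
    "SELECT keyword, occurrences FROM keyword_count ORDER BY keyword"
).fetchall()
print(counts_a)  # [('Internet', 1), ('News', 1), ('Search', 2)]
print(counts_b)  # [('Internet', 1), ('News', 1), ('Search', 2)]
```

Neither table contains duplicate rows, so both are relations. Option A keeps a timestamp for every occurrence, while option B keeps only the totals, so the choice depends on which statements about the application the analytics actually need.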

Always start with the most straightforward design. A relational table holds the rows that make a true statement (proposition) from that table's fill-in-the-blanks sentence template (predicate). Choose sufficient tables/predicates to be able to describe any application situation that can arise. Optimize when a problem is demonstrated, and demonstrate that your "optimization" is "better".
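One way to "demonstrate" rather than assume is to load both candidate designs with the same data, check that they give the same answer, and time the query you care about. The sketch below is only a starting point under assumptions: it uses SQLite via Python's `sqlite3` and `timeit`, the workload is synthetic (200 made-up keywords, 20,000 occurrences), and the table names are illustrative. Timings on your own DBMS, indexes and data volumes may look quite different.

```python
import sqlite3
import timeit

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flat (id INTEGER PRIMARY KEY, keyword TEXT NOT NULL);
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, keyword TEXT NOT NULL UNIQUE);
    CREATE TABLE keywords_frequency (
        id         INTEGER PRIMARY KEY,
        keyword_id INTEGER NOT NULL REFERENCES keywords(id)
    );
""")

# Synthetic workload, made up for this benchmark only.
names = [f"kw{i}" for i in range(200)]
occurrences = [i % 200 for i in range(20000)]
conn.executemany(
    "INSERT INTO keywords (id, keyword) VALUES (?, ?)",
    [(i + 1, name) for i, name in enumerate(names)],
)
conn.executemany(
    "INSERT INTO flat (keyword) VALUES (?)", [(names[i],) for i in occurrences]
)
conn.executemany(
    "INSERT INTO keywords_frequency (keyword_id) VALUES (?)",
    [(i + 1,) for i in occurrences],
)

FLAT_SQL = "SELECT keyword, COUNT(*) FROM flat GROUP BY keyword ORDER BY keyword"
JOIN_SQL = """
    SELECT k.keyword, COUNT(*)
    FROM keywords AS k
    JOIN keywords_frequency AS f ON f.keyword_id = k.id
    GROUP BY k.keyword
    ORDER BY k.keyword
"""

def run(sql):
    return conn.execute(sql).fetchall()

# A candidate optimization only counts if both designs give the same answer.
same_answer = run(FLAT_SQL) == run(JOIN_SQL)

# Time each query over 20 repetitions; the numbers vary by machine and data.
flat_seconds = timeit.timeit(lambda: run(FLAT_SQL), number=20)
join_seconds = timeit.timeit(lambda: run(JOIN_SQL), number=20)
print(f"same answer: {same_answer}")
print(f"flat: {flat_seconds:.4f}s  join: {join_seconds:.4f}s")
```

No expected timings are shown because they depend entirely on the environment; the point is to compare measurements from your real workload before committing to the more complex design.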
