如何识别从多个Feed收集的重复项目并在数据库中链接到它们 [英] How to identify duplicate items gathered from multiple feeds and link to them in a Database

查看:179
本文介绍了如何识别从多个Feed收集的重复项目并在数据库中链接到它们的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据库存储从许多网站获取的产品的详细信息,并通过各个网站API收集。当我调用feed时,详细信息存储在数据库表中。



我遇到的问题是,因为完全相同的产品在许多网站上被卖家列出,我最终在我的数据库中有重复的项目,然后当我显示在一个网页上有很多重复。



问题是,该项目没有任何明显的唯一标识符,它有项目的具体细节其中可能有很多),然后是来自卖方的项目的描述。



我想要的是让项目显示一次,然后给用户详细信息列出项目的其他地方。



如何识别进入的重复项,而不会拖慢整个数据库?然后我如何选择一个广告从所有重复,然后存储广告显示其他网站。



感谢您的帮助。

解决方案

问题是双重的,两者都在你身边。当你找出如何处理,编写代码到一个程序(Java或SQL将很容易)。


  1. 由于某些未知原因,您假定收集产品说明


  2. 您习惯于常见和无意义的 Id 列,这是很好,当您使用电子表格原型功能;但它不在数据库或开发级功能所需的附近。您的用户(或老板)自然希望数据库的数据库功能,而您没有提供任何。 (不,不需要模糊字符串逻辑或任何类型的魔法。)


解决方案



这是的精简版本 IDEF1X 关系数据库建模标准;


  1. 您需要考虑数据库术语,并考虑执行函数所需的数据库表,这意味着您不能使用自动递增 Id 列。该列给出了一个 RowId 的电子表格,但并不意味着表格的内容或标识产品的列。


  2. 你不能简单地从另一个网站上窃取数据,你需要考虑你的网站对产品的要求。


  3. 识别列的所有列和数据类型。


  4. 确定哪些列是强制性的

  5. 。例如。 制造商模型;短产品名称,而不是长说明(或可能是贵公司的长说明 标识符)。


  6. 您会发现您实际上有产品,例如制造商 ProductType ,也许供应商等。


  7. 整理这些表格,并规范化,以避免重复数据。


  8. 请务必尊重这些标识符。选择将是唯一的。这些是候选键。您需要每个表至少一个,并且产品中将有多个。将搜索的所有标识符都需要索引(唯一或不是)。请注意,唯一索引不能为Nullable,因此您不能选择可选列。


  9. code>可能不是单列。没关系,我们可以为数据库中的键评估多个列;


  10. 取最好的,最稳定的(不会改变的)唯一标识符,其中一个候选键,并将其设为主键


  11. 如果且仅当唯一标识符(可能是复合键)的主键非常长,因此不适用于主键,将其迁移到子表,然后添加 代理键。这将是 Id 列。注意,这是一个附加列和附加索引。它不能替代产品的标识符,候选键;


到目前为止,我们在您公司的网站上有一个产品数据库,到它。现在我们可以评估网络另一方的产品;当我们这样做时,我们在我们这边有一个强大的框架,我们可以衡量我们从网络的另一端获得的垃圾。



strong> Feed


$ b


  • 产品和 WebSite 。让我们把它叫做 ProductSite 。它将只包含我们的 ProductId WebSiteCode。它可能包含 Price`。


  • 将每个Feed加载到临时数据库或模式中,传入 ProductIn 表,也许每个源网站一个。这只是来自外部源的平面文件。添加一列 IsValid 并将Default设置为true。


  • code> ProductIn 表及其松散和软盘内容,以及带有强标识符的 Product 表。




    • 我会这样做的方式是,几波单独的检查,每个标记失败的行, IsValid 更改为false。最后将 IsValid 行插入到我们的 ProductSite 中。


    • 你可能很幸运,并以乐观的方式逃避。也就是说,只要在几个重要的列上找到匹配项,匹配就是有效的。 (与 IsValid 布尔值的默认和更新相反)。


    • 一些来回工作,直到它安定下来。这就是为什么你需要与您的用户重新使用标识符。目标是排除没有外部产品,但你的出发点将排除许多。这将包括返回到我们的 Product 表,并改进标识符的内容(行中的值)以及用于标识匹配行的其他相关列。 p>



  • 对每个网站重复此操作。


  • 现在使用我们有信心的信息从我们的产品填充我们的网站,并从 ProductSite 显示哪些网站有产品销售code>。



  • I have a Database storing details of products which are taken from many sites, and gathered through the individual sites API's. When I call the feed, the details are stored in a database table.

    The problem I'm having is that because the exact same product is listed on many sites by the seller I end up having duplicate items in my database, and then when I display them on a web page there are many duplicates.

    The problem is that the item doesn't have any obvious unique identifier, it has specific details of the item (of which there could be many), and then a description of the item from the seller.

    What I would like is for the item to show up once, and then give the user details of where else the item is listed.

    How would I identify the duplicates that have come in, without slowing down the entire database? How would I also then pick one advert from all the duplicates, and then store what other sites the advert is displayed on.

    Thanks for any help.

    解决方案

    The problem is two-fold, and both are on your side. When you figure out how to deal with that, writing the code into a program (Java or SQL will be easy). I'll name them first and then identify the solutions.

    1. For some unknown reason, you have assumed that collecting product descriptions from mulitple sites will not collect the same product.

    2. You are used to the common and nonsensical Id column, which is fine when you are working with spreadsheets prototyping functionality; but it is nowhere near what is required for a database or Development-level functionality. Your users (or boss) have naturally expected database capability from the database, and you did not provide any. (And no, it does not require fuzzy string logic or magic of any kind.)

    Solution

    This is a condensed version of the IDEF1X Standard for modelling Relational Databases; the portion re Identifiers.

    1. You need to think in database terms, and think about the database tables you need to perform your function, which means you are not allowed to use an auto-increment Id column. That column gives a spreadsheet a RowId, but it does not imply anything about the content of the table, or the columns that identify a product.

    2. And you cannot simply rip data off another website, you need to think about what your website requires for products. What does your company understand a product to be, and how does it identify a product ?

    3. Identify all the columns and datatypes for the columns.

    4. Identify which columns are mandatory and which are optional.

    5. Identify which are strong Identifiers. Eg. Manufacturer and Model; the short Product Name, not the long Description (or may be for your company, the long description is an Identifier). Work with your users, and work that out.

    6. You will find you actually have a small cluster of tables around Product, such as Manufacturer, ProductType, perhaps Vendor, etc.

    7. Organise those tables, and Normalise them, so that you are not duplicating data.

    8. Make sure you treat those Identifiers with a bit of respect. Choose which will be unique. Those are Candidate Keys. You need at least one per table, and there will be more than one in Product. All the Identifiers that will be searched on will need to be indexed (Unique or not). Note that Unique Indices cannot be Nullable, so you cannot choose an optional column.

    9. What makes a single Unique Identifier for Product may not be a single column. That's ok, we can evaluate multiple columns for keys in databases; they are called Compound Keys.

    10. Take the best, most stable (one which will not change) Unique Identifier, one of the Candidate Keys, and make that the Primary Key.

    11. If, and only if, the Unique Identifier, the Primary Key, which may be a Compound Key, is very long, and therefore unsuitable for a Primary Key, which is migrated to the child tables, then add a Surrogate Key. That will be the Id column. Note that that is an additional column and additional Index. It is not a substitute for the Identifiers of Product, the Candidate Keys; they cannot be removed.

    So far we have a Product database on your companies side of the web, that is meaningful to it. Now we are in a position to evaluate products from the other side of the web; and when we do, we have a framework on our side that is strong, against which we can measure the rubbish that we get from the other side of the web.

    Feeds

    1. You need a WebSite table to manage the feeds.

    2. There will be an Associative table (many-to-many) between Product and WebSite. Let's call it ProductSite. It will contain only our ProductId, and the WebSiteCode. It may containPrice`. The contents are valid for a single feed cycle.

    3. Load each feed into a staging database or schema, an incoming ProductIn table, maybe one per source website. This is just the flat file from the external source. Add a column IsValid and set the Default to true.

    4. Then write some SQL that compares that ProductIn table, with its loose and floppy contents, with our Product table with its strong Identifiers.

      • The way I would do it is, several waves of separate checks, each marking the rows that fail, with IsValid to false. At the end Insert the IsValid rows into our ProductSite.

      • You might be lucky, and get away with an optimistic approach. That is, as long as you find a match on a few important columns, the match is valid. (reverse the Default and update of the IsValid boolean).

      • This is the proc that will require some back-and-forth work, until it settles down. That is why you need to work with your users re the Indentifiers. The goal is to exclude no external products, but your starting point will exclude many. That will include going back to our Product table and improving the content (values in the rows) of the Identifiers, and other relevant columns that you use to identify matching rows.

    5. Repeat for each WebSite.

    6. Now populate our website from our Product table, using information that we are confident about, and show which sites have the product for sale from ProductSite.

    这篇关于如何识别从多个Feed收集的重复项目并在数据库中链接到它们的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

    查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆