How can I insert common data into a temp table from disparate schemas?

Question

I am not sure how to solve this problem:

We import order information from a variety of online vendors ( Amazon, Newegg etc ). Each vendor has their own specific terminology and structure for their orders that we have mirrored into a database. Our data imports into the database with no issues, however the problem I am faced with is to write a method that will extract required fields from the database, regardless of the schema.

For instance assume we have the following structures:

Newegg structure:

"OrderNumber" integer NOT NULL, -- The Order Number
"InvoiceNumber" integer, -- The invoice number
"OrderDate" timestamp without time zone, -- Create date.

Amazon structure:

"amazonOrderId" character varying(25) NOT NULL, -- Amazon's unique, displayable identifier for an order.
"merchant-order-id" integer DEFAULT 0, -- A unique identifier optionally supplied for the order by the Merchant.
"purchase-date" timestamp with time zone, -- The date the order was placed.

How can I select these items and put them into a temp table that I can query against?

The temp table could look like this:

"OrderNumber" character varying(25) NOT NULL,
"TransactionId" integer,
"PurchaseDate" timestamp with time zone

I understand that some of the databases represent an order number with an integer and others a character varying; to handle that I plan on casting the datatypes to String values.

Does anyone have a suggestion for me to read about that will help me figure this out?

I don't need an exact answer, just a nudge in the right direction.

The data will be consumed by Java, so if any particular Java classes will help, feel free to suggest them.

Recommended answer

First, you can create a VIEW to provide this functionality:

CREATE VIEW orders AS
SELECT '1'::int            AS source -- or any other tag to identify source
      ,"OrderNumber"::text AS order_nr
      ,"InvoiceNumber"     AS transaction_id -- no cast .. is int already
      ,"OrderDate" AT TIME ZONE 'UTC' AS purchase_date -- !! see explanation
FROM   tbl_newegg

UNION  ALL  -- not UNION!
SELECT 2
      ,"amazonOrderId"
      ,"merchant-order-id"
      ,"purchase-date"
FROM   tbl_amazon;

You can query this view like any other table:

SELECT * FROM orders WHERE order_nr = '123' AND source = 2; -- order_nr is text, so compare to a string




  • The source column is necessary if order_nr is not unique across sources. How else would you guarantee unique order numbers over different sources?

      A timestamp without time zone is ambiguous in a global context. It is only meaningful in connection with its time zone. If you mix timestamp and timestamptz, you need to place the timestamp in a certain time zone with the AT TIME ZONE construct to make this work. For more explanation read this related answer.

      I use UTC as the time zone; you might want to provide a different one. A simple cast "OrderDate"::timestamptz would assume your current time zone. AT TIME ZONE applied to a timestamp results in timestamptz. That's why I did not add another cast.
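The difference between the plain cast and AT TIME ZONE can be checked with a short query. This is only a sketch: the session time zone and the sample timestamp below are arbitrary assumptions, picked to make the two results visibly differ.

```sql
-- Assumption: any session time zone other than UTC, chosen only for illustration.
SET timezone = 'America/New_York';

SELECT ts::timestamptz                  AS plain_cast     -- interpreted in the session time zone
     , ts AT TIME ZONE 'UTC'            AS placed_in_utc  -- interpreted as UTC
     , pg_typeof(ts AT TIME ZONE 'UTC') AS result_type    -- timestamp with time zone
FROM  (SELECT timestamp '2012-03-05 12:00' AS ts) sub;
```

The two timestamptz values should differ by the session's UTC offset, which is why the view pins the zone explicitly instead of relying on the session setting.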

      While you can, I advise never using camel-case identifiers in PostgreSQL. That avoids many kinds of possible confusion. Note the lower-case identifiers (without the now unnecessary double quotes) I supplied.
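A small sketch of the confusion in question: PostgreSQL folds unquoted identifiers to lower case, so a camel-case column has to be double-quoted everywhere it is used. The table ident_demo is hypothetical and exists only for this demonstration.

```sql
-- Hypothetical demo table: one quoted camel-case column, one plain lower-case column.
CREATE TEMP TABLE ident_demo ("OrderNumber" int, order_number int);

SELECT "OrderNumber" FROM ident_demo;   -- works: quoted, case preserved
SELECT Order_Number  FROM ident_demo;   -- works: unquoted, folded to order_number
-- SELECT OrderNumber FROM ident_demo;  -- would fail: folded to ordernumber, no such column
```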

      Don't use varchar(25) as type for the order_nr. Just use text without arbitrary length modifier if it has to be a string. If all order numbers consist of digits exclusively, integer or bigint would be faster.

      One way to make this fast would be to materialize the view. I.e., write the result into a (temporary) table:

      CREATE TEMP TABLE tmp_orders AS
      SELECT * FROM orders;
      
      ANALYZE tmp_orders; -- temp tables are not auto-analyzed!
      
      ALTER TABLE tmp_orders
      ADD constraint orders_pk PRIMARY KEY (order_nr, source);
      

      You need an index. In my example, the primary key constraint provides the index automatically.

      If your tables are big, make sure you have enough temporary buffers to handle this in RAM before you create the temp table. Else it will actually slow you down.

      SET temp_buffers = '1000MB'; -- values with a unit are written as quoted strings
      

      This has to be set before the first use of temp objects in your session. Don't set it high globally, just for your session. A temp table is dropped automatically at the end of your session anyway.

      To get an estimate how much RAM you need, create the table once and measure:

      SELECT pg_size_pretty(pg_total_relation_size('tmp_orders'));
      

      More details in a related question on dba.SE.

      All the overhead only pays off if you have to process a number of queries within one session. For other use cases there are other solutions. If you know the source table at the time of the query, it would be much faster to direct your query to the source table instead. If you don't, I would question the uniqueness of your order_nr once more. If it is, in fact, guaranteed to be unique, you can drop the column source I introduced.

      For only one or a few queries, it might be faster to use the view instead of the materialized view.
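Which variant is faster for a given workload is easiest to settle by measuring. A minimal sketch, assuming the view orders and the temp table tmp_orders from above both exist, and using an arbitrary sample order number:

```sql
-- Run each statement a few times and compare the reported execution times.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders     WHERE order_nr = '123' AND source = 2;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tmp_orders WHERE order_nr = '123' AND source = 2;
```

For a fair comparison, the time spent creating, analyzing and indexing the temp table has to be added to the second variant.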

      I would also consider a plpgsql function that queries one table after the other until the record is found. That might be cheaper for a couple of queries, considering the overhead. Each table needs an index, of course.
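Such a function could look roughly like the following. It is a sketch, not a tested implementation: the function name f_find_order and its parameter are made up for illustration, and the table and column names are the ones used in the view above. It relies on RETURN QUERY setting FOUND when at least one row is returned.

```sql
-- Hypothetical helper: look up an order by number, one source table at a time.
CREATE OR REPLACE FUNCTION f_find_order(_order_nr text)
  RETURNS TABLE (source int, order_nr text, transaction_id int, purchase_date timestamptz)
  LANGUAGE plpgsql STABLE AS
$func$
BEGIN
   -- Newegg order numbers are integers: only search there for input that
   -- safely fits into an integer, and compare as integer so an index on
   -- "OrderNumber" can be used.
   IF _order_nr ~ '^[0-9]{1,9}$' THEN
      RETURN QUERY
      SELECT 1, n."OrderNumber"::text, n."InvoiceNumber"
           , n."OrderDate" AT TIME ZONE 'UTC'
      FROM   tbl_newegg n
      WHERE  n."OrderNumber" = _order_nr::int;

      IF FOUND THEN
         RETURN;   -- stop at the first source that has the order
      END IF;
   END IF;

   RETURN QUERY
   SELECT 2, a."amazonOrderId"::text, a."merchant-order-id", a."purchase-date"
   FROM   tbl_amazon a
   WHERE  a."amazonOrderId" = _order_nr;
END
$func$;
```

A call would then be SELECT * FROM f_find_order('123');. Stopping at the first hit only makes sense if an order number cannot appear in more than one source, which ties back to the uniqueness question above.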

      Also, if you stick to text or varchar for your order_nr, consider COLLATE "C" for it.
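One possible way to apply this to the temp table from above, as a sketch that assumes tmp_orders exists with order_nr of type text:

```sql
-- Switch the column to the "C" collation; existing indexes on the column are rebuilt.
ALTER TABLE tmp_orders
  ALTER COLUMN order_nr TYPE text COLLATE "C";
```

With COLLATE "C", strings are compared and sorted by byte value rather than by locale rules. That is typically cheaper, but the sort order will differ from locale-aware sorting.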
