Batch/offline processing design book/documentation


Question

Is there a book or any documentation available that describes the best practice for designing batch (offline) processes for sharing data between two parties?

I have found some useful information on the Spring Batch site, but it is quite low level: batch processing strategies and batch principles guidelines.

There are a lot of considerations for batch, for example:

  1. data transfer method (e.g. files)
  2. control protocol between the two parties
  3. error handling
  4. file naming conventions (if using files for transfer)
  5. synchronising cut-off times between the two parties
  6. etc.

It would be good if there were an authoritative document or checklist to ensure that designs follow best practice in the field.


UPDATE:

I'll add answers to this section as I come across them.

General Batch/Offline Processing info

This section is taken from @user1813068's answer.

You can find some architectural design patterns at this link and also at this link that describe approaches for partner to partner integration and for data synchronization.

This Wikipedia page also gives a high-level overview of architectural patterns and includes patterns for data integration: architectural patterns.

The book Data Integration Blueprint and Modeling is very good too.

Data Files

Most of the content in this section has come from here: source

The use of headers and footers for flat file exchange is considered best practice. Flat files can be exchanged without headers and footers, in which case the file name can carry some of the same information as the header. When using a delimited file, the field list header is always required.

Headers

When exchanging data between systems, it is very important for the receiving party to know exactly what type of data is being sent. One way to ensure this is to provide a header row that includes pertinent information regarding the content of the data and how it should be processed.

When working with flat files, the filename itself can also be used to inform the receiving party of the content of the file. However, a header row provides better support for all options that may be available.

When working with an API these header fields can be provided in a similar fashion. Implementation will be determined by the developer of the API service.

If the header is included, it consists of a single set of data, and must always be the first data in the file.
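As a rough illustration of the header idea, here is a minimal Python sketch. The `HDR` marker, the choice of header fields (file type and creation time), and the function names are all assumptions made for this example; the source does not prescribe a layout.

```python
from datetime import datetime, timezone

HEADER_MARKER = "HDR"  # hypothetical marker, not taken from any standard


def build_file(file_type, created_at, fields, rows):
    """Return file text: one header record, the field list, then the data rows."""
    header = "\t".join(
        [HEADER_MARKER, file_type, created_at.strftime("%Y-%m-%dT%H:%M:%SZ")]
    )
    field_list = "\t".join(fields)
    body = ["\t".join(row) for row in rows]
    return "\n".join([header, field_list, *body]) + "\n"


def read_header(text):
    """Parse the first line and reject files that do not start with a header."""
    first = text.split("\n", 1)[0].split("\t")
    if first[0] != HEADER_MARKER:
        raise ValueError("the first record in the file must be the header")
    return {"file_type": first[1], "created_at": first[2]}


text = build_file(
    "ORDERS",
    datetime(2024, 1, 31, 23, 0, tzinfo=timezone.utc),
    ["id", "amount"],
    [["1", "9.99"], ["2", "15.00"]],
)
info = read_header(text)
```

The receiver can check `info["file_type"]` before touching any data rows, which is the point of putting the header first.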

Footers

A footer may be provided when using file-based formats to indicate that there is no more data left to process.

When processing, the data found after the footer row should be ignored. Also, when creating the data, be aware that any data after the footer row will be ignored.
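A reader that follows this rule might look like the sketch below. The `FTR` marker is hypothetical, and the record count carried in the footer is an extra assumption of this example (a common way to detect truncated files), not something the source requires.

```python
FOOTER_MARKER = "FTR"  # hypothetical marker chosen for this example


def read_until_footer(lines):
    """Collect tab-delimited records up to the footer row.

    Anything after the footer is ignored. A missing footer is treated as a
    truncated file. The footer's second field is assumed to hold the number
    of data records.
    """
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == FOOTER_MARKER:
            expected = int(fields[1])
            if expected != len(records):
                raise ValueError(
                    f"footer says {expected} records, found {len(records)}"
                )
            return records
        records.append(fields)
    raise ValueError("footer row missing: the file may be truncated")


sample = ["1\t9.99\n", "2\t15.00\n", "FTR\t2\n", "stray\tdata\n"]
records = read_until_footer(sample)
```

Here the `stray` line after the footer never reaches `records`, and a file with no footer at all is rejected rather than silently accepted.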

Data Formats

Delimited Files

The de facto industry standard is delimited files.

Comma-delimited (CSV, or comma-separated values) files usually require data encapsulation, typically with double quotes ("); any double quotes inside the data must then be escaped, either with a backslash (\) or by doubling them (""). Because CSV implementations are inconsistent, it is recommended to use tabs as the delimiter, with no encapsulation. In this case, tab characters must be removed from the data. Delimited files are usually quicker to process than XML files.
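To make the difference concrete, the sketch below writes the same row both ways using Python's standard `csv` module for the CSV case. The helper names are made up for this example, and replacing tabs and line breaks with a space (rather than dropping them) is an assumption.

```python
import csv
import io


def to_csv_line(values):
    """Write one CSV row; the csv module quotes fields and doubles embedded quotes."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(values)
    return buf.getvalue()


def to_tsv_line(values):
    """Write one tab-delimited row with no encapsulation.

    Tabs and line breaks inside the data are replaced with a space so they
    cannot be mistaken for delimiters.
    """
    cleaned = [
        v.replace("\t", " ").replace("\r", " ").replace("\n", " ") for v in values
    ]
    return "\t".join(cleaned) + "\n"


row = ["Widget, large", 'He said "hi"', "tab\there"]
csv_line = to_csv_line(row)
tsv_line = to_tsv_line(row)
```

The CSV line needs quoting around the first two fields, while the tab-delimited line carries the comma and the quotes through untouched and only loses the embedded tab.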

XML Files

Some in the industry prefer XML files. XML allows a clearer representation of the information, since it supports nested data. However, many companies have limited or no support for this format, so it is not recommended.

Encoding

UTF-8 Encoding

All data should be UTF-8 encoded to ensure maximum compatibility between all systems.
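In Python this mostly comes down to never relying on the platform default encoding. A minimal sketch, with hypothetical helper names, that encodes explicitly on the sending side and fails loudly on the receiving side:

```python
def encode_payload(text: str) -> bytes:
    """Encode outgoing file content as UTF-8, regardless of the platform default."""
    return text.encode("utf-8")


def decode_payload(data: bytes) -> str:
    """Decode incoming content strictly, so bytes in another encoding raise an error."""
    return data.decode("utf-8", errors="strict")


original = "id\tname\n1\tZoë 東京\n"
payload = encode_payload(original)
round_trip = decode_payload(payload)
```

Strict decoding means a file accidentally produced in, say, Latin-1 is rejected with `UnicodeDecodeError` instead of being loaded with corrupted characters.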

Dates & Times

It is recommended to use UTC time for all date & time fields to prevent confusion.
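A small sketch of normalising timestamps before they go into a file. The ISO 8601 output format with a trailing `Z` and the function name are assumptions of this example; the source only asks for UTC.

```python
from datetime import datetime, timedelta, timezone


def to_utc_string(dt: datetime) -> str:
    """Convert a timezone-aware datetime to a UTC string for a date/time field."""
    if dt.tzinfo is None:
        # A naive datetime has no offset, so its UTC equivalent would be a guess.
        raise ValueError("naive datetime: attach a timezone before exporting")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# 01:30 on 1 March in a UTC+2 zone is still 29 February in UTC.
local = datetime(2024, 3, 1, 1, 30, tzinfo=timezone(timedelta(hours=2)))
stamp = to_utc_string(local)
```

The example value shows why this matters for cut-off times: the same instant falls on different calendar days for the two parties unless both agree on UTC.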


Some more best practices: EDI Scheduling and File Transfer

Answer

You can find some architectural design patterns at this link and also at this link that describe approaches for partner to partner integration and for data synchronization.

This Wikipedia page also gives a high-level overview of architectural patterns and includes patterns for data integration: architectural patterns.

The book Data Integration Blueprint and Modeling is very good too.
