I want to load a YAML file, possibly edit the data, and then dump it again. How can I preserve formatting?

Problem Description

This question tries to collect information spread over questions about different languages and YAML implementations in a mostly language-agnostic manner.

Suppose I have a YAML file like this:

first:
  - foo: {a: "b"}
  - "bar": [1, 2, 3]
second: |   # some comment
  some long block scalar value

I want to load this file into a native data structure, possibly change or add some values, and dump it again. However, when I dump it, the original formatting is not preserved:

  • The scalars are formatted differently, e.g. "b" loses its quotation marks, the value of second is not a literal block scalar anymore, etc.
  • The collections are formatted differently, e.g. the mapping value of foo is written in block style instead of the given flow style, similarly the sequence value of "bar" is written in block style
  • The order of mapping keys (e.g. first/second) changes
  • The comment is gone
  • The indentation level differs, e.g. the items in first are not indented anymore.

How can I preserve the formatting of the original file?

Solution

Preface: Throughout this answer, I mention some popular YAML implementations. Those mentions are never exhaustive since I do not know all YAML implementations out there.

I will use YAML terms for data structures: Atomic text content (even numbers) is a scalar. Item sequences, known elsewhere as arrays or lists, are sequences. A collection of key-value pairs, known elsewhere as dictionary or hash, is a mapping.

If you are using Python, ruamel (the ruamel.yaml package) will help you preserve a good deal of formatting, since it implements round-tripping up to native structures. However, it isn't perfect and cannot preserve all formatting.

Background

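The process of loading YAML is also a process of losing information. Let's have a look at the process of loading/dumping YAML, as given in the spec. Its processing overview diagram (not reproduced here) shows four boxes, from left to right: Native (Data Structure), Representation (Node Graph), Serialization (Event Tree) and Presentation (Character Stream). Dumping moves from left to right, loading from right to left.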

When you are loading a YAML file, you are executing some or all of the steps in the Load direction, starting at the Presentation (Character Stream). YAML implementations usually promote their most high-level APIs, which load the YAML file all the way to Native (Data Structure). This is true for most common YAML implementations, e.g. PyYAML/ruamel, SnakeYAML, go-yaml, and Ruby's YAML module. Other implementations, such as libyaml and yaml-cpp, only provide deserialization up to the Representation (Node Graph), possibly due to restrictions of their implementation languages (loading into native data structures requires either compile-time or runtime reflection on types).

The important information for us is what is contained in those boxes. Each box mentions information which is not available anymore in the box to its left. So this means that styles and comments, according to the YAML specification, are only present in the actual YAML file content, but are discarded as soon as the YAML file is parsed. For you, this means that once you have loaded a YAML file to a native data structure, all information about how it originally looked in the input file is gone. Consequently, when you dump the data, the YAML implementation chooses a representation it deems useful for your data. Some implementations let you give general hints/options, e.g. that all scalars should be quoted, but that doesn't help you restore the original formatting.
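To make the loss concrete, here is a small sketch with PyYAML (one possible implementation; others behave similarly but not identically). It loads the file from the question all the way to native Python objects and dumps it again. The variable names are mine.

```python
import yaml  # third-party package: PyYAML

SOURCE = """\
first:
  - foo: {a: "b"}
  - "bar": [1, 2, 3]
second: |   # some comment
  some long block scalar value
"""

# Load all the way to native dict / list / str / int objects.
data = yaml.safe_load(SOURCE)

# Dump again; the emitter has to pick styles on its own because the
# native objects carry no formatting information.
dumped = yaml.safe_dump(data, default_flow_style=False)
print(dumped)
```

The content survives the trip, but the comment, the double quotes, the flow collections and the literal block scalar do not.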

Thankfully, this diagram only describes the logical process of loading YAML; a conforming YAML implementation does not need to slavishly conform to it. Most implementations actually preserve data longer than they need to. This is true for PyYAML/ruamel, SnakeYAML, go-yaml, yaml-cpp, libyaml and others. In all these implementations, the style of scalars, sequences and mappings is remembered up until the Representation (Node Graph) level.

On the other hand, comments are discarded rather early since they do not belong to an event or node (the exceptions here are ruamel, which links comments to the following event, and go-yaml, which remembers comments before, at and after the line that created a node). Some YAML implementations (libyaml, SnakeYAML) provide access to a token stream which is even more low-level than the Event Tree. This token stream does contain comments; however, it is only usable for things like syntax highlighting, since the APIs do not contain methods for consuming the token stream again.

So what to do?

Loading & Dumping

If you only need to load your YAML file and then dump it again, use one of the lower-level APIs of your implementation to load the YAML only up to the Representation (Node Graph) or Serialization (Event Tree) level. The API functions to search for are compose/parse and serialize/present respectively.

It is preferable to use the Event Tree instead of the Node Graph as some implementations already forget the original order of mapping keys (due to internally using hashmaps) when composing. This question, for example, details loading / dumping events with SnakeYAML.
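With PyYAML, for instance, the event-level round trip can be sketched with `yaml.parse` and `yaml.emit`. This is only one implementation's spelling of the parse/present pair, and the variable names are mine.

```python
import yaml  # third-party package: PyYAML

SOURCE = """\
first:
  - foo: {a: "b"}
  - "bar": [1, 2, 3]
second: |   # some comment
  some long block scalar value
"""

# parse() stops at the event level; nothing is converted to native types.
events = list(yaml.parse(SOURCE))

# emit() presents the same events as YAML text again.
round_tripped = yaml.emit(events)
print(round_tripped)
```

Scalar and collection styles and the key order come back as written. The comment is gone, and the sequence items are no longer indented under `first`.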

Information that is already lost in the event stream of your implementation, for example comments in most implementations, is impossible to preserve. Also impossible to preserve is scalar layout, like in this example:

"1 \x2B 1"

This loads as string "1 + 1" after resolving the escape sequence. Even in the event stream, the information about the escape sequence has already been lost in all implementations I know. The event only remembers that it was a double-quoted scalar, so writing it back will result in:

"1 + 1"

Similarly, a folded block scalar (starting with >) will usually not remember where line breaks in the original input have been folded into space characters.
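The escape sequence case can be checked at the event level. The sketch below uses PyYAML as one example implementation; note that the Python string literal needs a doubled backslash to produce the YAML text `"1 \x2B 1"`.

```python
import yaml  # third-party package: PyYAML

# The YAML text is: "1 \x2B 1"
SOURCE = '"1 \\x2B 1"\n'

events = list(yaml.parse(SOURCE))
scalar = next(e for e in events if isinstance(e, yaml.ScalarEvent))

# The event holds the resolved content plus the scalar style, nothing more.
print(repr(scalar.value), repr(scalar.style))

emitted = yaml.emit(events)
print(emitted)
```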

To sum up, loading to the Event Tree and dumping again will usually preserve:

  • Style: unquoted/quoted/block scalars, flow/block collections (sequences & mappings)
  • Order of keys in mappings
  • YAML tags and anchors

You will usually lose:

  • Information about escape sequences and line breaks in flow scalars
  • Indentation and non-content spacing
  • Comments – unless the implementation specifically supports putting them in events and/or nodes

If you use the Node Graph instead of the Event Tree, you will likely lose anchor representations (i.e. &foo may later be written out as &a, with all aliases referring to it using *a instead of *foo). You might also lose key order in mappings. Some APIs, like go-yaml, don't provide access to the Event Tree, so you have no choice but to use the Node Graph instead.
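The anchor renaming is easy to observe by sending the same input through both levels. This sketch again assumes PyYAML; the generated anchor name (`id001` here) is an implementation detail and will differ in other libraries.

```python
import yaml  # third-party package: PyYAML

SOURCE = """\
base: &foo {x: 1}
copy: *foo
"""

# Event level: anchor names are part of the events and are written back.
via_events = yaml.emit(yaml.parse(SOURCE))

# Node level: the alias is resolved to the same node object, and the
# serializer invents a fresh anchor name when writing it out.
via_nodes = yaml.serialize(yaml.compose(SOURCE))

print(via_events)
print(via_nodes)
```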

Modifying Data

If you want to modify data and still preserve what you can of the original formatting, you need to manipulate your data without loading it to a native structure. This usually means that you operate on YAML scalars, sequences and mappings, instead of strings, numbers, lists or whatever structures the target programming language provides.

You have the option to either process the Event Tree or the Node Graph (assuming your API gives you access to it). Which one is better usually depends on what you want to do:

  • The Event Tree is usually provided as a stream of events. It may be better for large data since you do not need to load the complete data in memory; instead you inspect each event, track your position in the input structure, and place your modifications accordingly. The answer to this question shows how to append items to a given YAML file, given a path and a value, with PyYAML's event API.
  • The Node Graph is better for highly structured data. If you use anchors and aliases, they will be resolved there but you will probably lose information about their names (as explained above). Unlike with events, where you need to track the current position yourself, the data is presented as a complete graph here, and you can just descend into the relevant sections.
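A node-level edit might look like the following sketch, which assumes PyYAML's `compose`/`serialize` pair. The `child` helper is a hypothetical convenience of mine, not part of the library.

```python
import yaml  # third-party package: PyYAML

SOURCE = """\
first:
  - foo: {a: "b"}
  - "bar": [1, 2, 3]
second: |
  some long block scalar value
"""

def child(mapping_node, key):
    """Return the value node stored under a scalar key (hypothetical helper)."""
    for key_node, value_node in mapping_node.value:
        if isinstance(key_node, yaml.ScalarNode) and key_node.value == key:
            return value_node
    raise KeyError(key)

root = yaml.compose(SOURCE)          # node graph, no native objects
items = child(root, "first").value   # list of nodes in the sequence

# Change an existing scalar; its double-quoted style stays attached to the node.
child(child(items[0], "foo"), "a").value = "changed"

# Append a new scalar node to the flow sequence, tagged as an integer.
child(items[1], "bar").value.append(
    yaml.ScalarNode(tag="tag:yaml.org,2002:int", value="4")
)

output = yaml.serialize(root)
print(output)
```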

In any case, you need to know a bit about YAML type resolution to work with the given data correctly. When you load a YAML file into a declared native structure (typical in languages with a static type system, e.g. Java or Go), the YAML processor will map the YAML structure to the target type if that's possible. However, if no target type is given (typical in scripting languages like Python or Ruby, but also possible in Java), types are deduced from node content and style.

Since we are not working with native loading because we need to preserve formatting information, this type resolution will not be executed. However, you need to know how it works in two cases:

  • When you need to decide on the type of a scalar node or event, e.g. you have a scalar with content 42 and need to know whether that is a string or integer.
  • When you need to create a new event or node that should later be loaded as a specific type. E.g. if you create a scalar containing 42, you might want to control whether it is later loaded as integer 42 or string "42".

I won't discuss all the details here; in most cases, it suffices to know that if a string is encoded as a scalar but looks like something else (e.g. a number), you should use a quoted scalar.
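To see how content and style drive type resolution, one can inspect the tag an implementation assigns when composing. This sketch uses PyYAML, whose composer resolves tags at the node level; `resolved_tag` is a helper name of mine.

```python
import yaml  # third-party package: PyYAML

def resolved_tag(text):
    """Compose a one-scalar document and return the tag given to its root node."""
    return yaml.compose(text).tag

plain_number = resolved_tag("42")     # plain scalar that looks like an integer
quoted_number = resolved_tag('"42"')  # same content, but double-quoted
plain_word = resolved_tag("hello")    # plain scalar that matches nothing special

print(plain_number, quoted_number, plain_word)
```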

Depending on your implementation, you may come into contact with YAML tags. Tags are seldom used in YAML files (they look like !!str, !!map, !!int and so on); they contain type information about a node, which can be used in collections with heterogeneous data. More importantly, YAML defines that all nodes without an explicit tag will be assigned one as part of type resolution. This may or may not have already happened at the Node Graph level. So in your node data, you may see a node's tag even when the original node does not have one.

Tags starting with two exclamation marks are actually shorthands, e.g. !!str is a shorthand for tag:yaml.org,2002:str. You may see either in your data, since implementations handle them quite differently.

What matters for you is that when you create a node or event, you may be able, and may also need, to assign a tag. If you don't want the output to contain an explicit tag, use the non-specific tags ! for non-plain scalars and ? for everything else on event level. On node level, consult your implementation's documentation about whether you need to supply resolved tags. If not, the same rule for the non-specific tags applies. If the documentation does not mention it (few do), try it out.
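How this looks in practice varies by library. In PyYAML, as far as I can tell, events do not carry literal ! or ? tags; instead each scalar event has an `implicit` pair stating whether the tag may be omitted when the scalar is written plain or non-plain. The sketch below builds a tiny event stream by hand under that assumption. The key names are mine.

```python
import yaml  # third-party package: PyYAML

PLAIN_OK = (True, False)    # tag may be omitted if the scalar is written plain
QUOTED_OK = (False, True)   # tag may be omitted if the scalar is written non-plain

events = [
    yaml.StreamStartEvent(),
    yaml.DocumentStartEvent(),
    yaml.MappingStartEvent(anchor=None, tag=None, implicit=True),
    yaml.ScalarEvent(None, None, PLAIN_OK, "as_int"),
    yaml.ScalarEvent(None, None, PLAIN_OK, "42"),              # plain: reloads as int
    yaml.ScalarEvent(None, None, PLAIN_OK, "as_str"),
    yaml.ScalarEvent(None, None, QUOTED_OK, "42", style='"'),  # quoted: reloads as str
    yaml.MappingEndEvent(),
    yaml.DocumentEndEvent(),
    yaml.StreamEndEvent(),
]

text = yaml.emit(events)
print(text)
```

Neither scalar gets an explicit tag in the output; the choice between plain and quoted style alone decides how the value is resolved on the next load.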

So to sum up: You modify data by loading either the Event Tree or the Node Graph, you add, delete or modify events or nodes in the data you get, and then you present the modified data as YAML again. Depending on what you want to do, it may help you to create the data you want to add to your YAML file as a native structure, serialize it to YAML and then load it again as a Node Graph or Event Tree. From there, you can include it in the structure of the YAML file you want to modify.
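The detour through a native structure can be sketched as follows, assuming PyYAML and a node-level edit. The key name `third` and the sample data are made up for illustration.

```python
import yaml  # third-party package: PyYAML

SOURCE = """\
first:
  - foo: {a: "b"}
second: plain
"""

# 1. Build the new data as ordinary Python objects.
new_native = {"hosts": ["alpha", "beta"]}

# 2. Dump it to YAML text, then compose that text into a node graph.
fragment = yaml.compose(yaml.safe_dump(new_native))

# 3. Graft the fragment into the node graph of the file being edited.
root = yaml.compose(SOURCE)
new_key = yaml.ScalarNode(tag="tag:yaml.org,2002:str", value="third")
root.value.append((new_key, fragment))

merged = yaml.serialize(root)
print(merged)
```

The existing nodes keep their styles, while the grafted part is formatted however the dumper chose to write it.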

Conclusion / TL;DR

YAML has not been designed for this task. In fact, it has been defined as a serialization language, assuming that your data is authored as native data structures in some programming language and from there dumped to YAML. However, in reality, YAML is used a lot for configuration, meaning that you typically write YAML by hand and then load it into native data structures.

This contrast is the reason why it is so difficult to modify YAML files while preserving formatting: The YAML format has been designed as a transient data format, to be written by one application, and then to be loaded by another (or the same) application. In that process, preserving formatting does not matter. It does matter, however, for data that is checked in to version control (you want your diff to only contain the line(s) with data you actually changed), and in other situations where you write your YAML by hand, because you want to keep the style consistent.

There is no perfect solution for changing exactly one data item in a given YAML file and leaving everything else intact. Loading a YAML file does not give you a view of the YAML file, it gives you the content it describes. Therefore, everything that is not part of the described content – most importantly, comments and whitespace – is extremely hard to preserve.

If format preservation is important to you and you can't live with the compromises made by the suggestions in this answer, YAML is not the right tool for you.
