Simplified Yaml
Yaml Documents
Conceptually, a yaml document contains a hierarchy of nodes. Each node may contain either a single value or a collection of other nodes.
The spec provides an extensible type system, with several standardized collection and scalar types. The spec also defines several different styles for specifying both collections and scalars.
Simplified Yaml defines three node types: two collection nodes (Ordered Sequences and Unordered Maps) and one scalar node (the String). While Simplified Yaml provides only one way of specifying each collection type, it provides two ways of specifying strings.
Collections must appear using yaml’s block syntax. Strings can appear either as plain scalars, sans any and all quotation marks, or as “double quoted strings”.
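As a rough illustration of the conceptual side, the three node types could be modelled as below. The `Node` alias and the sample data are hypothetical names of mine, and restricting map keys to strings is an assumption of this sketch rather than something the spec requires.

```python
from typing import Dict, List, Union

# Hypothetical alias: a node is a String, an Ordered Sequence of nodes,
# or a Map of nodes. String-only keys are an assumption of this sketch.
Node = Union[str, List["Node"], Dict[str, "Node"]]

# A small hand-built hierarchy: a map holding a string, a sequence, and a map.
example: Node = {
    "name": "simplified yaml",
    "styles": ["plain scalar", "double quoted string"],
    "nested": {"empty": ""},
}
```

Note that a Python dict remembers insertion order, whereas an application should not rely on the order of a yaml map.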
Lines
Physically, a yaml document may contain one or more lines of text. Each line in turn may contain indentation, directives, content, and comments. It is the job of the parser to turn this physical representation into the conceptual representation. I will document each of the physical components in turn to show how they generate the conceptual components.
- Indentation communicates the hierarchy of a yaml document via whitespace. Increases in indentation indicate an increased depth in the conceptual hierarchy: movement further from the root. Decreases in indentation indicate a decreased depth in the conceptual hierarchy: movement back towards the root. Because the conceptual hierarchy is created via collections, changes in indentation only occur at the beginning and end of a collection.
- Directives modify the interpretation of a yaml document. Directives begin with a single indicator, but may require the specification of additional indicators to complete the directive. Directives can almost be thought of as commands to the yaml parser, telling the parser what’s coming next.
- Content delivers data stored in a yaml document to an application. Content begins with a directive that tells the parser how to interpret the upcoming data.
- Comments provide user annotations to the yaml-data.
One important note on line processing: different operating systems define different kinds of new lines. Even a simple parser must cope with the different kinds, and normalize them into simple line feeds (\n). See (4.1.4,[22-29]) and (4.1.6,[50,53]).
Once normalized, the parser can rely on a single standardized line feed to detect and handle line breaks in the yaml document.
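A minimal sketch of that normalization step might look like the following. The function name is hypothetical, and treating only CR LF, CR, and NEL (0x85) as foreign line breaks is an assumption of this sketch, not a full reading of the spec.

```python
import re

# Assumption: CR LF, a lone CR, and NEL (0x85) are the line breaks to fold.
_LINE_BREAK = re.compile(r"\r\n|\r|\x85")


def normalize_line_breaks(text: str) -> str:
    """Replace every recognized line break with a single line feed."""
    return _LINE_BREAK.sub("\n", text)


# Windows, old Mac, and Unix breaks all end up as plain line feeds.
lines = normalize_line_breaks("a\r\nb\rc\nd").split("\n")
```

Listing the two-character CR LF first in the pattern keeps it from being counted as two separate breaks.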
Indentation
Indentation defines the node hierarchy that a parser produces from the yaml document. Indentation should only increase on the first new line of a new collection. A decrease of indentation designates the end of a previous collection.
Examples of indentation changes appear in the collection documentation (below).
The spec only allows the use of actual spaces (Ascii 0x20) for indentation. Tabs are considered dangerous.
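Measuring indentation then comes down to counting leading spaces. The helper below is a sketch with a hypothetical name; rejecting a tab outright is my own choice for the simplified subset.

```python
def indent_of(line: str) -> int:
    """Return the number of leading spaces (0x20) on a line.

    Raises ValueError if a tab appears where indentation is expected,
    since only actual spaces may be used for indentation.
    """
    stripped = line.lstrip(" ")
    if stripped.startswith("\t"):
        raise ValueError("tabs are not allowed in indentation")
    return len(line) - len(stripped)
```

A parser can compare the result for consecutive lines: a larger value opens a nested collection, a smaller value closes one.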
Directives
The yaml spec reserves 19 indicators. For reference’s sake they are:
Simplified Yaml makes use of only the sequence dash ('-'), the mapping colon (':'), the comment hash ('#'), the string double-quote ('"'), and the escape character ('\').
More information on these directives can be found in the sections that make use of each directive.
In particular see the sections: Comments, Sequence Blocks, and Mapping Blocks.
Comments
The spec (3.2.3.3) defines comments as a communication mechanism between author(s) of a yaml document. Comments have no effect on the processing of other yaml document elements. For all intents and purposes of the application: comments do not exist. The parser, however, must recognize comments to the extent that it can successfully ignore them.
There are three types of comments: Implicit Comments, Leading Comments, Trailing Comments.
- Leading Comments appear on their own lines, and may or may not have spaces before them. The preceding spaces are completely ignored; they don’t carry any indentation information.
- Trailing Comments appear on the same line as directives or content, and span from the hash to the end of the line. A space must precede the hash for a parser to successfully recognize the trailing comment.
While the spec requires that trailing comments be preceded by a space [70], in reality, as with multi-character directives, this results from how plain scalars work. It’s not something special in the parsing of comments.
Comments can contain any character except new lines; the end of a line defines the end of a given comment.
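The leading and trailing comment rules can be sketched as a single pass over a line. The function name is hypothetical, and skipping over hashes inside double quoted strings is an assumption of this sketch about how quoting and comments interact.

```python
def strip_comment(line: str) -> str:
    """Return the line with any leading or trailing comment removed."""
    # Leading comment: the hash is the first non-space character,
    # and the spaces before it carry no indentation information.
    if line.lstrip(" ").startswith("#"):
        return ""
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes and ch == "\\":
            i += 2  # skip the escaped character inside a quoted string
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "#" and not in_quotes and i > 0 and line[i - 1] == " ":
            # Trailing comment: a hash preceded by a space, outside quotes.
            return line[:i].rstrip(" ")
        i += 1
    return line
```

A hash with no space in front of it is left alone here; whether such a line is otherwise valid is a question for the plain scalar rules, not for comment handling.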
Content
The three forms of content in simple yaml are: Strings, Sequence Blocks, and Mapping Blocks.
Strings
Simplified Yaml supports two string styles: Plain Scalars, and Double-Quoted Strings.
Plain Scalars
One of the most important aspects of a good format is a set of clear rules that can be easily conveyed to end users. Yaml’s plain scalars look great when used right, but have parsing rules that can send the weak-willed off to INI files. Let’s look briefly at some of the complete spec’s rules.
In the complete spec almost every character is allowed in a plain scalar, but the indicators are only allowed in limited contexts. For instance:
- the colon (':') can be used, but it cannot be followed by a space,
- the exclamation (’!') can be used so long as it’s not the first character,
- the question (’?') can be used as a first character, so long as it’s not followed by a space,
- and so on.
These rules make sense when paired with knowledge about how each of the indicators is used. They are not, however, intuitive.
The point of having plain scalars at all is to alleviate the need for users to add string quotes, and to keep the number of actual directives used in a given yaml document down to a minimum. In this spirit, it’s important for a simplified subset to allow the heavy usage of plain scalars, but at the same time it’s important to keep the rules clean, clear, and simple. Most importantly: it’s necessary to make parsing feel consistent to even the most non-technical of users. If sometimes the comma (',') works and sometimes it doesn’t, people will either give up and not use commas at all, or get bitten by a rule they don’t understand.
To encourage the goal of consistency, Simplified Yaml excludes punctuation from plain scalars altogether, and directs end users to encase their strings in double-quotes ('"') when punctuation is needed. Like the complete spec, Simplified Yaml allows spaces but excludes tabs. Finally, under Simplified Yaml, all plain scalars must fit on a single line.
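One possible reading of that rule, as a sketch: a plain scalar is a non-empty run of letters, digits, and spaces. The function name is hypothetical, and both the exact character set and the rejection of leading or trailing spaces are assumptions of mine.

```python
def is_plain_scalar(text: str) -> bool:
    """Check a candidate against the simplified plain scalar rule.

    Assumed rule: letters, digits, and spaces only; no tabs, no line
    breaks, no punctuation, and no leading or trailing spaces.
    """
    if not text or text != text.strip(" "):
        return False
    return all(ch.isalnum() or ch == " " for ch in text)
```

Anything that fails the check, such as `hello, world`, would have to be written as a double quoted string instead.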
Double Quoted Strings
Double quoted strings allow the expression of a relatively arbitrary series of characters.
Simplified double quoted strings cannot span multiple lines unless the line breaks are escaped. The escaped line break [135] allows the user to split lines for readability in the yaml document, but escaped breaks don’t result in any new lines in the actual string itself.
Space characters — what otherwise would be indentation — after escaped line breaks get ignored. Such spaces do not become part of the string itself. This is how the first string above would look to an application:
“This looks split across multiple lines but the string isn’t.”
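The folding of escaped line breaks can be sketched as follows. The function name is hypothetical; it works on the text between the quotes, assumes line breaks have already been normalized to line feeds, and leaves every other escape sequence untouched for a later pass.

```python
import re

# A backslash followed either by a line feed plus any run of spaces,
# or by any other single character (so an escaped backslash is consumed
# as a pair and cannot be mistaken for the start of an escaped break).
_ESCAPE_PAIR = re.compile(r"\\(\n *|.)")


def fold_escaped_breaks(body: str) -> str:
    """Drop escaped line breaks and the indentation that follows them."""
    def replace(match):
        if match.group(1).startswith("\n"):
            return ""            # escaped break: contributes nothing
        return match.group(0)    # any other escape: keep as written
    return _ESCAPE_PAIR.sub(replace, body)


folded = fold_escaped_breaks(
    "This looks split \\\n    across multiple lines \\\n    but the string isn't."
)
```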
Actual new line characters, however, can be added to a string using an escape sequence:
Escape Sequences
Escape sequences allow the inclusion of characters in double quoted strings that would either be hard to represent in plain text, or, would, in some way, conflict with string parsing.
The yaml spec defines a superset of the C programming language’s escape sequences.
The simplified, non-unicode, sub-set follows:
- \0 (0x0) the null
- \a (0x7) the bell
- \b (0x8) the backspace
- \t (0x9) the tab
- \n (0xA) the linefeed (aka newline)
- \v (0xB) the vertical tab
- \f (0xC) the form feed
- \r (0xD) the carriage return
- \e (0x1B) the escape character
- \ the escaped tab
- \ the escaped space
- \" the escaped quote
- \\ the escaped backslash
Note: the escaped space and the escaped tab don’t show up well in html; each is the backslash character followed by an actual space or actual tab.
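The table above translates fairly directly into a lookup. This is a sketch with hypothetical names; it expects the text between the quotes, and raising an error on an unknown escape is my own choice rather than anything the simplified subset dictates.

```python
# One entry per escape in the simplified, non-unicode subset.
_ESCAPES = {
    "0": "\0", "a": "\a", "b": "\b", "t": "\t", "n": "\n",
    "v": "\v", "f": "\f", "r": "\r", "e": "\x1b",
    "\t": "\t",   # backslash followed by an actual tab
    " ": " ",     # backslash followed by an actual space
    '"': '"', "\\": "\\",
}


def unescape(body: str) -> str:
    """Decode the escape sequences in the body of a double quoted string."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of string")
        code = body[i + 1]
        if code not in _ESCAPES:
            raise ValueError(f"unknown escape sequence: \\{code}")
        out.append(_ESCAPES[code])
        i += 2
    return "".join(out)
```

With this in place, a body written as `line one\nline two` reaches the application as two lines.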
Collections
Sequence Blocks
The following examples are based on those that appear in the Yaml for Ruby Cookbook, except that Why’s colors look nicer.
The simplest of sequences, just an array of strings, looks like:
    - one
    - two
Specifying a blank sequence indicator starts a new, nested, sequence:
    - one
    - two
    -
      - ONE
      - TWO
Specifying a blank indicator with no sub-sequence indicates merely an empty node. (4.4.5.2)
    - one
    -
What kind of empty node it is appears to be up to the parser, though (Example 4.51) does seem to indicate an empty string would be best.
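To make the three cases concrete, here is a sketch of a recursive reader for sequence blocks. All names are hypothetical. It assumes comments have been stripped, that there are no blank lines, that items are plain scalars, and it follows the suggestion above by turning an empty node into an empty string.

```python
from typing import List, Tuple, Union

SeqNode = Union[str, List["SeqNode"]]


def parse_sequence(lines: List[str], start: int = 0,
                   indent: int = 0) -> Tuple[List[SeqNode], int]:
    """Parse a sequence block; return (items, index of the next unread line)."""
    items: List[SeqNode] = []
    i = start
    while i < len(lines):
        line = lines[i]
        depth = len(line) - len(line.lstrip(" "))
        if depth < indent:
            break  # a decrease in indentation ends this collection
        entry = line[depth:]
        if depth > indent or not (entry == "-" or entry.startswith("- ")):
            raise ValueError(f"unexpected content on line {i + 1}: {line!r}")
        value = entry[1:].strip(" ")
        i += 1
        if value:
            items.append(value)
            continue
        # Blank indicator: a nested sequence if the next line is deeper,
        # otherwise an empty node.
        next_depth = -1
        if i < len(lines):
            next_depth = len(lines[i]) - len(lines[i].lstrip(" "))
        if next_depth > indent:
            nested, i = parse_sequence(lines, i, next_depth)
            items.append(nested)
        else:
            items.append("")
    return items, i


nested_items, _ = parse_sequence(["- one", "- two", "-", "  - ONE", "  - TWO"])
```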
Mapping Blocks
The simplest of associative collections. Asking for “A” will yield “a”; “B” yields “b”.
Here, asking for “C” yields our simple map again:
Again, specifying a blank indicator indicates merely an empty node. It’s not entirely clear to me whether the colon (':') provides enough information to satisfy the requirement that: “Completely empty block nodes may only appear when there is some explicit indicator for their existance.” (4.4.5.2)
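The mapping cases described in this section can be sketched the same way as sequences. The names and the sample lines are hypothetical stand-ins of mine, not the original examples. The sketch assumes plain scalar keys and values, no comments or blank lines, and it treats a bare colon with nothing nested beneath it as an empty string.

```python
from typing import Dict, List, Tuple, Union

MapNode = Union[str, Dict[str, "MapNode"]]


def parse_mapping(lines: List[str], start: int = 0,
                  indent: int = 0) -> Tuple[Dict[str, MapNode], int]:
    """Parse a mapping block; return (mapping, index of the next unread line)."""
    mapping: Dict[str, MapNode] = {}
    i = start
    while i < len(lines):
        line = lines[i]
        depth = len(line) - len(line.lstrip(" "))
        if depth < indent:
            break  # a decrease in indentation ends this collection
        key, colon, rest = line[depth:].partition(":")
        # The colon must be present and followed by a space or the line end.
        if depth > indent or not colon or (rest and not rest.startswith(" ")):
            raise ValueError(f"unexpected content on line {i + 1}: {line!r}")
        key = key.strip(" ")
        value = rest.strip(" ")
        i += 1
        if value:
            mapping[key] = value
            continue
        next_depth = -1
        if i < len(lines):
            next_depth = len(lines[i]) - len(lines[i].lstrip(" "))
        if next_depth > indent:
            mapping[key], i = parse_mapping(lines, i, next_depth)
        else:
            mapping[key] = ""  # blank indicator: an empty node
    return mapping, i


doc, _ = parse_mapping(["A: a", "B: b", "C:", "  A: a", "  B: b"])
```

Here `doc["A"]` is `"a"`, and `doc["C"]` is the simple two-entry map again.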
Mixed Blocks
Sequences can contain any collection, even maps:
Maps can contain any collection, even sequences:
According to the spec (4.6.1.2) the dash ('-') counts towards indentation; this is intended to make embedded sequences more readable. I’m not sure how well it actually works, but here it is anyway.
The Ruby Cookbook refers to this as a map shortcut.
New lines in collections
One final question worth touching on: can new lines appear in collection blocks?
The spec largely answers this via its BNF productions, so both the result and the intent are somewhat obscured. Rather than following the letter but not the spirit, or perhaps worse, vice-versa, I advocate disallowing newlines except as already documented above: for use with empty blocks, and collection-in-collection blocks.
This might be overly restrictive based on what the complete spec may allow, but seems a workable rule for this simplified subset.