Simplified Yaml

The Simplified Yaml format attempts to define the smallest whole unit of the Yaml language that’s still meaningful. This page introduces the simplified format, touching where appropriate on areas that relate to the complete Yaml 1.1 Working Draft 2004-12-28 Specification (“the spec”) as a whole. It’s mainly targeted at programmers interested in understanding Yaml better, and is intended to help interested parties implement a simple yaml-compliant parser quickly.

Yaml Documents

Conceptually, a yaml document contains a hierarchy of nodes. Each node may contain either a single value or a collection of other nodes.

The spec provides an extensible type system, with several standardized collection and scalar types. The spec also defines several different styles for specifying both collections and scalars.

Simplified Yaml defines three node types: two collection nodes, Ordered Sequences and Unordered Maps, and one scalar node, the String. While Simplified Yaml provides only one way of specifying each collection type, it provides two ways of specifying strings.

Collections must appear using yaml’s block syntax. Strings can appear either as plain scalars, sans any and all quotation marks, or as “double quoted strings”.

Lines

Physically, a yaml document may contain one or more lines of text. Each line in turn may contain indentation, directives, content, and comments. It is the job of the parser to turn this physical representation into the conceptual representation. I will document each of the physical components in turn to show how they generate the conceptual components.

  • Indentation communicates the hierarchy of a yaml document via whitespace. An increase in indentation indicates increased depth in the conceptual hierarchy: movement further from the root. A decrease in indentation indicates decreased depth: movement back towards the root. Because the conceptual hierarchy is created via collections, changes in indentation only occur at the beginning and end of a collection.
  • Directives modify the interpretation of a yaml document. Directives begin with a single indicator, but may require the specification of additional indicators to complete the directive. Directives can almost be thought of as commands to the yaml parser, telling the parser what’s coming next.
  • Content delivers data stored in a yaml document to an application. Content begins with a directive that tells the parser how to interpret the upcoming data.
  • Comments provide user annotations to the yaml data.

One important note on line processing: different operating systems define different kinds of new lines. Even a simple parser must cope with the different kinds, and normalize them into simple line feeds (\n). See (4.1.4,[22-29]) and (4.1.6,[50,53]).

New line characters:

  • ‘0xA’: Line Feed (\n)
  • ‘0xD’: Carriage Return (\r)

New line styles:

  • DOS/Windows: \r \n
  • Macintosh: \r
  • Unix: \n

Once the text is normalized, the parser can rely on the line feed alone to detect and handle line breaks in the yaml document.
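The normalization step can be sketched in a couple of lines. This is only an illustration, not part of the spec: the function name `normalize_newlines` is made up here, and the sketch assumes the whole document has already been read into a single string.

```python
def normalize_newlines(text: str) -> str:
    # Replace the two-character DOS/Windows break first, so that its
    # carriage return is not converted on its own into a second line feed.
    text = text.replace("\r\n", "\n")
    # Whatever carriage returns remain are old Macintosh style breaks.
    return text.replace("\r", "\n")


# A document mixing all three styles splits into clean lines afterwards.
lines = normalize_newlines("- one\r\n- two\r- three\n").split("\n")
```

The order of the two replacements matters: converting lone carriage returns first would turn every DOS break into two line feeds.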

Indentation

Because collections can be nested, it’s best if parsers track the current indentation level using a simple stack. Changes in indentation at unexpected times should be flagged as an error.

Indentation defines the node hierarchy that a parser produces from the yaml document. Indentation should only increase on the first new line of a new collection. A decrease of indentation designates the end of a previous collection.

Examples of indentation changes appear in the collection documentation (below).

The spec only allows the use of actual spaces (ASCII 0x20) for indentation. Tabs are considered dangerous.
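A minimal sketch of the stack idea follows. The names `leading_spaces` and `depths` are hypothetical, and the sketch assumes blank lines and comments were removed beforehand. It only checks that a decrease returns to a level that is still open; it does not check that an increase follows a line that actually opens a collection.

```python
def leading_spaces(line: str) -> int:
    stripped = line.lstrip(" ")
    if stripped.startswith("\t"):
        raise ValueError("tabs are not allowed in indentation")
    return len(line) - len(stripped)


def depths(lines):
    """Return the nesting depth of each line, tracking indentation on a stack."""
    stack = [0]                      # indentation widths of the open collections
    result = []
    for line in lines:
        width = leading_spaces(line)
        if width > stack[-1]:
            stack.append(width)      # deeper: a nested collection begins
        else:
            while stack[-1] > width:
                stack.pop()          # shallower: one or more collections end
            if stack[-1] != width:
                raise ValueError(f"unexpected indentation: {line!r}")
        result.append(len(stack) - 1)
    return result
```

Because the root level 0 is never popped, a line that dedents to a width no open collection uses is reported as an error rather than silently attached to some parent.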

Directives

The yaml spec reserves 19 indicators. For reference’s sake they are:

  • ‘-’ | ‘?’ | ‘:’ | ‘,’ | ‘[’ | ‘]’ | ‘{’ | ‘}’
  • ‘#’ | ‘&’ | ‘*’ | ‘!’ | ‘|’ | ‘>’ | ‘'’ | ‘"’
  • ‘%’ | ‘@’ | ‘`’

Simplified Yaml makes use of only the sequence dash (‘-’), the mapping colon (‘:’), the comment hash (‘#’), the string double-quote (‘"’), and the escape character (‘\’).

Directives

The spec over-specifies and makes multi-character directives mandatory ([196], [221]). Multi-character directives actually result from the way plain scalars (4.5.1.3) get processed; they are not something explicit in the way indicators need to work. The simplified plain scalars don’t have this conflict, so, ironically, multi-character directives must be handled explicitly in a simplified parser.

Directives determine how a parser interprets yaml text. Every directive begins with a single indicator character, though some may be multi-character. The following 3 directives are the only ones used by simplified yaml. (Note that the last two directives are multi-character: an indicator followed by a space.)

  • "#": Comment
  • "- ": Sequence entry
  • ": ": Mapped value

More information on these directives can be found in the sections that make use of each directive.
In particular see the sections: Comments, Sequence Blocks, and Mapping Blocks.
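To make the three directives concrete, here is one hedged way a parser might classify a single line. The function name `classify` and the tuple shapes it returns are invented for this sketch. It assumes indentation has already been stripped, that map keys are plain scalars, and that an indicator at the very end of a line (a bare dash, or a key followed by a colon) counts even without the trailing space.

```python
def classify(content: str):
    """Classify one line of content by the directive it starts with."""
    if content.startswith("#"):
        return ("comment", content[1:].strip())
    # The sequence directive is a dash followed by a space (or end of line).
    if content == "-" or content.startswith("- "):
        return ("sequence entry", content[2:].strip())
    # A line that opens with a double quote is a string, whatever it contains.
    if not content.startswith('"'):
        key, sep, value = content.partition(": ")
        if sep:
            return ("mapped value", key, value.strip())
        if content.endswith(":"):
            return ("mapped value", content[:-1], "")
    return ("scalar", content)
```

Note how `"-one"` is reported as a scalar rather than a sequence entry: without the space there is no directive, which is exactly the multi-character behavior described above.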

Comments

Certain kinds of content in the complete spec may extend to the end of a line or even across multiple lines. In these cases comments may or may not be allowed depending on the particular parsing mode the content requires. These “greedy” definitions are explicitly called out where they occur. Any specialized management of comments under those modes is left wholly to those modes.

The spec (3.2.3.3) defines comments as a communication mechanism between author(s) of a yaml document. Comments have no effect on the processing of other yaml document elements. For all intents and purposes of the application: comments do not exist. The parser, however, must recognize comments to the extent that it can successfully ignore them.

There are three types of comments: Implicit Comments, Leading Comments, Trailing Comments.

  • Implicit Comments (4.2.2;4.2.3) are simply blank lines.

Leading and Trailing Comments both start with the hash (#) character.

  • Leading Comments appear on their own lines, and may or may not have spaces before them. The preceding spaces are completely ignored; they don’t carry any indentation information.

  • While the spec requires that trailing comments be preceded by a space [70], in reality, like multi-character directives, this actually results from how plain scalars work. It’s not something special in the parsing of comments.

    Trailing comments appear on the same line as directives or content, and span from the hash to the end of the line. A space must precede the hash for a parser to successfully recognize the trailing comment.

Comment Styles

  • # a leading comment
  • - random yaml content # a trailing comment
  • # the previous line was an implicit comment

Comments can contain any character except new lines; the end of a line defines the end of a given comment.
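The comment rules above can be sketched as a single pass over a line. The name `strip_comment` is hypothetical, and the handling of hashes inside double-quoted strings is an assumption of this sketch: it treats anything between double quotes as string content, skipping escaped characters.

```python
def strip_comment(line: str) -> str:
    """Return the line without its comment; '' for blank or comment-only lines."""
    if line.lstrip(" ").startswith("#"):
        return ""                    # leading comment: its indentation is ignored
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes and ch == "\\":
            i += 2                   # skip the escaped character, e.g. \"
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "#" and not in_quotes and i > 0 and line[i - 1] == " ":
            return line[:i].rstrip(" ")   # trailing comment: space, then hash
        i += 1
    return line.rstrip(" ")
```

A blank line falls out of the same function as an empty result, which matches the idea that implicit comments are simply ignored.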

Content

The three forms of content in simple yaml are: Strings, Sequence Blocks, and Mapping Blocks.

Strings

Simplified Yaml supports two string styles: Plain Scalars, and Double-Quoted Strings.

Plain Scalars

One of the most important aspects of a good format is clear rules that can be easily conveyed to end users. Yaml’s plain scalars look great when used right, but have parsing rules that can send the weak-willed off to INI files. Let’s look briefly at some of the complete spec’s rules.

In the complete spec most every character is allowed in a plain scalar, but the indicators are only allowed in limited contexts. For instance:

  • the colon (‘:’) can be used, but it cannot be followed by a space,
  • the exclamation (‘!’) can be used so long as it’s not the first character,
  • the question (‘?’) can be used as a first character, so long as it’s not followed by a space,
  • and so on.

These rules make sense when paired with knowledge about how each of the indicators is used. They are not, however, intuitive.

The point of having plain scalars at all is to alleviate the need for users to add string quotes, and to keep the number of actual directives used in a given yaml document down to a minimum. In this spirit, it’s important for a simplified subset to allow the heavy usage of plain scalars, but at the same time it’s important to keep the rules clean, clear, and simple. Most importantly: it’s necessary to make parsing feel consistent to even the most non-technical of users. If sometimes the comma (‘,’) works and sometimes it doesn’t, people will either give up and not use commas at all, or get bitten by a rule they don’t understand.

To encourage the goal of consistency, Simplified Yaml excludes punctuation from plain scalars altogether, and directs end users to encase their strings in double-quotes (‘"’) when punctuation is needed. Like the complete spec, Simplified Yaml allows spaces but excludes tabs. Finally, under Simplified Yaml, all plain scalars must fit on a single line.
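One possible reading of this rule, as a sketch: a plain scalar is a non-empty run of letters, digits, and spaces. The text above does not enumerate exactly which characters count as punctuation, so treating everything that is not alphanumeric or a space as punctuation is an assumption made here, as is the function name `is_plain_scalar`.

```python
def is_plain_scalar(text: str) -> bool:
    """True if text could be written without quotes under the simplified rules."""
    # Tabs and line breaks are neither alphanumeric nor spaces, so the same
    # test also enforces "no tabs" and "fits on a single line".
    return bool(text) and all(ch.isalnum() or ch == " " for ch in text)
```

A writer or emitter could use the same test in reverse: anything that fails it gets wrapped in double quotes.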

Double Quoted Strings

Double quoted strings allow the expression of a relatively arbitrary series of characters.

Single line double-quoted string

  • - "It’s all on one line alright. And look: (arbitrary) punctuation!"

This limited definition avoids complicating simplified strings with the complete line folding (4.2.6) rules.

Simplified double quoted strings cannot span multiple lines unless the line breaks are escaped. The escaped line break [135] allows the user to split lines for readability in the yaml document, but escaped breaks don’t result in any new lines in the actual string itself.

Split line double-quoted string

  • Split line: "This looks split \
  • across multiple lines \
  • but the string isn’t."
  • An equivalent string: "This doesn’t even look split."

Space characters — what otherwise would be indentation — after escaped line breaks get ignored. Such spaces do not become part of the string itself. This is how the first string above would look to an application:
“This looks split across multiple lines but the string isn’t.”
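The joining of escaped line breaks can be sketched as follows. The name `join_escaped_breaks` is made up for this example, and the sketch assumes it receives the text between the quotes with line breaks already normalized to line feeds. Other escape sequences are passed through untouched for a later decoding step.

```python
def join_escaped_breaks(body: str) -> str:
    """Remove backslash + line feed pairs and the spaces that follow them."""
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            if body[i + 1] == "\n":
                i += 2
                while i < len(body) and body[i] == " ":
                    i += 1           # drop what would otherwise be indentation
                continue
            out.append(body[i:i + 2])    # some other escape: keep it as is
            i += 2
            continue
        out.append(body[i])
        i += 1
    return "".join(out)
```

Consuming escapes two characters at a time means an escaped backslash followed by a real line break is not mistaken for an escaped line break.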

Actual new line characters, however, can be added to a string using an escape sequence:

String with newlines

  • - "This one\n contains multiple lines\n in the actual string."

Escape Sequences

Escape sequences allow the inclusion of characters in double quoted strings that would either be hard to represent in plain text, or, would, in some way, conflict with string parsing.
The yaml spec defines a super-set of the C programming language’s escape sequences.

The simplified, non-unicode, sub-set follows:

  • \0 (0x0) the null
  • \a (0x7) the bell
  • \b (0x8) the backspace
  • \t (0x9) the tab
  • \n (0xA) the linefeed (aka newline)
  • \v (0xB) the vertical tab
  • \f (0xC) the form feed
  • \r (0xD) the carriage return
  • \e (0x1B) the escape character
  • \ the escaped tab
  • \ the escaped space
  • \" the escaped quote
  • \\ the escaped backslash

Note: the escaped space and the escaped tab don’t show up well in HTML; each is the backslash character followed by an actual space or an actual tab.

String with escaped characters

  • - “\”I have a\t tab\”, said the string.”
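A table-driven decoder is one straightforward way to handle the simplified escape set. The names `ESCAPES` and `unescape` are hypothetical. The sketch maps \e to 0x1B, the ASCII escape character, assumes escaped line breaks were already joined, and chooses to reject unknown sequences rather than pass them through.

```python
# Character after the backslash -> character it stands for.
ESCAPES = {
    "0": "\x00", "a": "\x07", "b": "\x08", "t": "\x09", "n": "\x0a",
    "v": "\x0b", "f": "\x0c", "r": "\x0d", "e": "\x1b",
    "\t": "\t", " ": " ", '"': '"', "\\": "\\",
}


def unescape(body: str) -> str:
    """Decode the escape sequences in the text between the double quotes."""
    out = []
    chars = iter(body)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        code = next(chars, None)     # None if the backslash ends the string
        if code not in ESCAPES:
            raise ValueError(f"unknown escape sequence: \\{code}")
        out.append(ESCAPES[code])
    return "".join(out)
```

Applied to the example above, the escaped quotes become real double quotes and \t becomes an actual tab character.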

Collections

Sequence Blocks

The following examples are based on those that appear in the Yaml for Ruby Cookbook, except that Why’s colors look nicer.

The simplest of sequences, just an array of strings, looks like:

Simple sequence

  • - one
  • - two

Specifying a blank sequence indicator starts a new, nested, sequence:

Nested sequence

  • - one
  • - two
  • -
    • - ONE
    • - TWO

Specifying a blank indicator with no sub-sequence indicates merely an empty node. (4.4.5.2)

Empty Item

  • - one
  • -

What kind of empty node it is is, apparently, up to the parser, though (Example 4.51) does seem to indicate an empty string would be best.

Mapping Blocks

The simplest of associative collections. Asking for “A” will yield “a”; “B” yields “b”.

Simple map

  • A: a
  • B: b

Here, asking for “C” yields our simple map again:

Nested map

  • C:
    • A: a
    • B: b

Again, specifying a blank indicator indicates merely an empty node. It’s not entirely clear to me whether the colon (‘:’) provides enough information to satisfy the requirement that: “Completely empty block nodes may only appear when there is some explicit indicator for their existence.” (4.4.5.2)

Empty mapping

  • A: a
  • :

Mixed Blocks

Sequences can contain any collection, even maps:

Map in a sequence

  • - one
  • - two
  • -
    • A: a
    • B: b

Maps can contain any collection, even sequences:

Sequence in a map

  • A: a
  • C:
    • - one
    • - two
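Pulling the block rules together, here is a rough recursive sketch that turns sequence and mapping blocks into Python lists and dicts. Every name in it (`parse`, `parse_block`, `is_entry`, `scalar`) is invented for illustration. It assumes comments are already stripped, line breaks are normalized, the document's root is a collection, map keys are plain scalars, and empty nodes become empty strings. Escape sequences are left undecoded, and the map shortcut form is not handled.

```python
def is_entry(content):
    return content == "-" or content.startswith("- ")


def scalar(text):
    # Strip surrounding double quotes; escape sequences are not decoded here.
    if len(text) >= 2 and text[0] == text[-1] == '"':
        return text[1:-1]
    return text


def parse_block(rows, pos, indent):
    """Parse the collection whose entries all sit at exactly `indent` spaces."""
    is_seq = is_entry(rows[pos][1])
    node = [] if is_seq else {}
    while pos < len(rows) and rows[pos][0] == indent:
        content = rows[pos][1]
        pos += 1
        if is_seq != is_entry(content):
            raise ValueError(f"mixed sequence and map entries: {content!r}")
        if is_seq:
            key, value = None, content[2:].strip(" ")
        else:
            key, sep, value = content.partition(":")
            if not sep or (value and not value.startswith(" ")):
                raise ValueError(f"expected 'key: value': {content!r}")
            value = value.strip(" ")
        if value == "" and pos < len(rows) and rows[pos][0] > indent:
            # Blank indicator followed by deeper lines: a nested collection.
            value, pos = parse_block(rows, pos, rows[pos][0])
        else:
            value = scalar(value)    # a blank indicator alone yields ""
        if is_seq:
            node.append(value)
        else:
            node[key] = value
    if pos < len(rows) and rows[pos][0] > indent:
        raise ValueError(f"unexpected indentation: {rows[pos][1]!r}")
    return node, pos


def parse(text):
    rows = [(len(line) - len(line.lstrip(" ")), line.strip(" "))
            for line in text.split("\n") if line.strip(" ")]
    if not rows:
        return ""
    node, pos = parse_block(rows, 0, rows[0][0])
    if pos != len(rows):
        raise ValueError(f"unexpected indentation: {rows[pos][1]!r}")
    return node
```

Recursion stands in for the explicit indentation stack here: each nested call remembers the indentation of the collection it is parsing, and returning from the call is the equivalent of popping that level.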

There’s another set of syntax shortcuts, what the spec refers to as a block’s compact in-line form, but, for the moment, I’m advocating leaving them out of simplified yaml due to their strict-seeming whitespace rules.

According to the spec (4.6.1.2) the dash (‘-’) counts towards indentation; this is intended to make embedded sequences more readable. I’m not sure how well it actually works, but here it is anyway.

The Ruby Cookbook refers to this as a map shortcut.

Sequence in a map shortcut

  • A: a
  • C:
  • - one
  • - two

New lines in collections

One final question worth touching on: can new lines appear in collection blocks?

The spec (Example 4.20. Separation Spaces) does seem to indicate that new lines are allowed after the mapping indicator, the colon (:), though not before, unless perhaps as part of the folding rules of multi-line plain scalars. No examples seem to indicate that newlines can be left after the sequence indicator, the dash (-), but again it may be allowed implicitly due to folding rules.

The spec largely answers this via its BNF productions, so both the result and the intent are somewhat obfuscated. Rather than following the letter but not the spirit, or perhaps worse, vice-versa, I advocate disallowing newlines except as already documented above: for use with empty blocks, and collection-in-collection blocks.

This might be overly restrictive based on what the complete spec may allow, but seems a workable rule for this simplified subset.