YAML Data Validation

By ubaumann, Tue 01 February 2022, in category DevOps

CICD, DevNet, DevOps, GitOps, YAML

YAML files (and other data files such as JSON) are becoming more and more important in infrastructure deployments and projects. We often edit YAML files in a text editor, and a mistake can have a big impact. Before something is deployed to production, it should be validated, tested and verified. But how can we check that a YAML file is not only syntactically correct, but that its data structure is correct as well?

JSON Schema is probably the de facto standard for validating JSON data and can also be used for YAML files. A nice side effect is syntax highlighting in most text editors, which makes editing YAML files more pleasant and less error-prone.

JSON Schema

This blog post only gives a general overview and some examples of JSON Schema. A good starting point for learning is Understanding JSON Schema. There are also many good tools and libraries available that help with generating schemas. A list of implementations can be found on json-schema.org.

In [1]:
cat > urs.yaml <<EOF
---
name: urs
ipv4: 127.0.0.1
...
EOF

The YAML data can be validated with a JSON Schema. Suppose we want YAML files that contain a name and an IPv4 address. To validate the content, we need to describe it in a schema. A mapping in YAML corresponds to an object in JSON. In this case the object has two properties named "name" and "ipv4", both of type "string".

In [2]:
# 'EOF' is quoted so that the shell does not expand "$schema" inside the here-document
cat > name_schema.json <<'EOF'
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "additionalProperties": false,
    "properties": {
        "name": {
            "type": "string"
        },
        "ipv4": {
            "type": "string"
        }
    }
}
EOF

The YAML file can be validated with a CLI tool such as yajsv. ajv-cli is also a good option.

In [3]:
curl https://github.com/neilpa/yajsv/releases/download/v1.4.0/yajsv.linux.amd64 -o yajsv -L -s
chmod +x yajsv
In [4]:
./yajsv -s name_schema.json urs.yaml
urs.yaml: pass
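
To make the rules in name_schema.json more concrete, the following minimal Python sketch hand-codes the same three checks: the document is a mapping, no additional properties are present, and the values are strings. It is only an illustration and not a JSON Schema implementation; the function name check_name_document is hypothetical. Note that the schema has no "required" keyword, so an empty mapping passes as well.

```python
# Hand-written equivalent of the checks in name_schema.json (illustration only).
ALLOWED_PROPERTIES = {"name": str, "ipv4": str}


def check_name_document(data):
    """Return a list of error messages; an empty list means the data passes."""
    if not isinstance(data, dict):  # "type": "object"
        return ["document is not a mapping"]
    errors = []
    for key, value in data.items():
        if key not in ALLOWED_PROPERTIES:  # "additionalProperties": false
            errors.append(f"{key}: additional property not allowed")
        elif not isinstance(value, ALLOWED_PROPERTIES[key]):  # "type": "string"
            errors.append(f"{key}: not a string")
    return errors


print(check_name_document({"name": "urs", "ipv4": "127.0.0.1"}))
print(check_name_document({"name": "urs", "mac": "00:00:00:00:00:00"}))
print(check_name_document({}))
```

In practice a real validator such as yajsv should do this work; the sketch only shows what each keyword means for the data.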

Data Validation

JSON Schema can validate more than just data types. For the type "number", the keywords minimum, exclusiveMinimum, maximum, exclusiveMaximum and multipleOf are available, among others. For "string", you can validate patterns with regular expressions, and there are also predefined formats.

In [5]:
cat > name_schema.json <<'EOF'
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "additionalProperties": false,
    "properties": {
        "name": {
            "type": "string",
            "pattern": "^[A-Z].*$"
        },
        "ipv4": {
            "type": "string",
            "format": "ipv4"
        }
    }
}
EOF
In [6]:
./yajsv -s name_schema.json urs.yaml
urs.yaml: fail: name: Does not match pattern '^[A-Z].*$'
1 of 1 failed validation
urs.yaml: fail: name: Does not match pattern '^[A-Z].*$'

The regex checks whether the name starts with a capital letter, so the validation now fails for "urs". The IP address is valid. After the name is corrected, the file passes validation again.

In [7]:
cat > urs.yaml <<EOF
---
name: Urs
ipv4: 127.0.0.1
...
EOF
In [8]:
./yajsv -s name_schema.json urs.yaml
urs.yaml: pass
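
The two checks used above can be approximated in Python with the standard library. This is a sketch under assumptions: the helper names name_matches and is_ipv4 are hypothetical, and the ipaddress module is only a stand-in for "format": "ipv4", so edge cases may differ from a given validator. Whether "format" is enforced at all also depends on the validator.

```python
import ipaddress
import re

# Same regular expression as the "pattern" keyword in name_schema.json
NAME_PATTERN = re.compile(r"^[A-Z].*$")


def name_matches(name):
    # JSON Schema patterns are not implicitly anchored, so re.search is the closest match.
    return NAME_PATTERN.search(name) is not None


def is_ipv4(value):
    # Approximates "format": "ipv4" with the standard library parser.
    try:
        ipaddress.IPv4Address(value)
    except ValueError:
        return False
    return True


print(name_matches("urs"), name_matches("Urs"))
print(is_ipv4("127.0.0.1"), is_ipv4("127.0.0.256"))
```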

JSON Schema also has generic annotations that are not used for validation but describe and self-document the schema. Tools such as editors make use of them as well.

In [9]:
cat > name_schema.json <<'EOF'
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "schema/schemas/name.json",
    "type": "object",
    "$comment": "Only the defined properties are allowed",
    "additionalProperties": false,
    "properties": {
        "name": {
            "type": "string",
            "pattern": "^[A-Z].*$",
            "title": "Name",
            "description": "Name beginning with a capital letter",
            "examples": [
                "Jane Doe",
                "John Doe",
                "Jane"
            ]
        },
        "ipv4": {
            "type": "string",
            "format": "ipv4",
            "title": "IP Address",
            "description": "IPv4 Address belonging to the name",
            "examples": [
                "127.0.0.1",
                "10.11.12.13"
            ]
        }
    }
}
EOF

Structuring Schema

The keyword $ref can be used to structure schemas. It makes bigger schemas in particular more readable by defining (sub)schemas once and (re)using them.

In [10]:
cat > service_schema.json <<'EOF'
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "additionalProperties": false,
    "properties": {
        "services": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string"
                    },
                    "ports": {
                        "type": "array",
                        "items": {
                            "$ref": "#/$defs/Port"
                        }
                    }
                },
                "required": [
                    "name",
                    "ports"
                ]
            }
        }
    },
    "required": [
        "services"
    ],
    "$defs": {
        "Port": {
            "type": "object",
            "properties": {
                "port": {
                    "type": "integer"
                },
                "name": {
                    "type": "string"
                },
                "targetPort": {
                    "type": "integer"
                }
            },
            "required": [
                "port",
                "name"
            ]
        }
    }
}
EOF
In [11]:
cat > myService.yaml <<EOF
---
services:
  - name: app01
    ports:
      - name: http
        port: &http 80
      - port: 8080
        name: http_alt
        targetPort: *http
      - name: https
        port: 443
  - name: db01
    ports:
      - port: 5432
        name: sql
...
EOF
In [12]:
./yajsv -s service_schema.json myService.yaml
myService.yaml: pass
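
A local reference such as "#/$defs/Port" is a JSON Pointer into the same schema document. The sketch below shows how such a reference could be resolved by hand. It assumes only local references, ignores percent-encoding in the fragment, and uses a shortened copy of the schema above; the function name resolve_local_ref is hypothetical.

```python
def resolve_local_ref(schema, ref):
    """Resolve a reference such as "#/$defs/Port" inside the same schema document."""
    if not ref.startswith("#"):
        raise ValueError("only local references are handled in this sketch")
    node = schema
    for token in ref[1:].split("/")[1:]:
        # JSON Pointer escapes: "~1" stands for "/", "~0" stands for "~"
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


# Shortened version of service_schema.json, kept small for illustration
schema = {
    "$defs": {
        "Port": {"type": "object", "required": ["port", "name"]},
    },
    "properties": {"ports": {"items": {"$ref": "#/$defs/Port"}}},
}

ref = schema["properties"]["ports"]["items"]["$ref"]
print(resolve_local_ref(schema, ref))
```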

Schema Generator

Generators provide a good starting point for creating a schema. https://app.quicktype.io/ is one of the most popular JSON Schema generators. It only accepts JSON as input, so YAML data needs to be converted first. Depending on the structure, a single JSON document can be used, or several JSON documents with the "Source type" set to Multiple JSON. Most of the time the generated schema needs adjustments, such as adding semantic checks like pattern, format, enum or number restrictions, but it shortens the time needed to create a schema enormously.

[Image: quicktype]

Editor Support

Many editors support JSON Schema for YAML files and thus offer autocompletion, tooltips and validation. This makes editing YAML files easier and less error-prone, since you get feedback while editing, before you save the file. Many editors use the yaml-language-server implementation from Red Hat. The following examples were tested in VS Code with the YAML extension.

[Image: YAML validation in VS Code]

Like many editors, the yaml-language-server supports the JSON Schema Store. A list of schemas with associated fileMatch patterns is retrieved from its API. If a file matches a pattern, the associated schema is used. For example, all YAML files matching the path .github/workflows/*.yaml are automatically validated with the schema github-workflow.json.

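To list the schemas that are associated with YAML files, jq can be used. The output of the following command is limited to the first 15 lines. An entry can appear more than once, because the filter can emit several results for a schema that has multiple fileMatch patterns.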

In [13]:
curl https://www.schemastore.org/api/json/catalog.json -s | jq '.schemas[] | select((.fileMatch != null) and ((.fileMatch[] | contains("yaml")) or (.fileMatch[] | contains("yml")))) | { name: .name, fileMatch: .fileMatch }' 2>&1 | head -15
{
  "name": "AnyWork Automation Configuration",
  "fileMatch": [
    ".awc.yaml",
    ".awc.yml",
    ".awc.json",
    ".awc.jsonc",
    ".awc"
  ]
}
{
  "name": "AnyWork Automation Configuration",
  "fileMatch": [
    ".awc.yaml",
    ".awc.yml",

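The same selection can be sketched in Python, where any() returns each entry at most once. The sample data below is hypothetical and only mirrors the shape of catalog.json (a "schemas" list with "name" and an optional "fileMatch"); it is not real catalog content, and the function name yaml_schemas is an assumption.

```python
def yaml_schemas(catalog):
    """Return each catalog entry once if any fileMatch pattern mentions yaml or yml."""
    result = []
    for entry in catalog.get("schemas", []):
        patterns = entry.get("fileMatch") or []
        if any("yaml" in p or "yml" in p for p in patterns):
            result.append({"name": entry["name"], "fileMatch": patterns})
    return result


# Hypothetical sample in the shape of catalog.json (not real catalog content)
sample_catalog = {
    "schemas": [
        {"name": "Example YAML tool", "fileMatch": ["example.yaml", "example.yml"]},
        {"name": "Example JSON tool", "fileMatch": ["example.json"]},
        {"name": "No patterns"},
    ]
}

for entry in yaml_schemas(sample_catalog):
    print(entry["name"], entry["fileMatch"])
```
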
Schema Assignment

Schemas can also be stored on any web server, on the file system or in the project directory. In VS Code you can configure the schema assignment in the settings, either globally or per project.

In a project that contains the schemas, the .vscode/settings.json file might look like this using relative paths:

{
    "yaml.schemas": {
        "schema/schemas/hosts.json": [
            "host*.yaml",
            "host*.yml"
        ],
        "schema/schemas/groups.json": [
            "group*.yaml",
            "group*.yml"
        ],
        "schema/schemas/defaults.json": [
            "default.yaml",
            "default.yml"
        ]
    }
}

Because the schemas are included in the project, it is easy to use them in the CI/CD pipeline.
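
In a pipeline, the same mapping can be reused to decide which schema applies to which file. The following is a rough sketch, assuming that matching the file name with shell-style wildcards is close enough to the editor's behaviour; the actual glob matching in VS Code may differ, and the function name schema_for is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Same mapping as "yaml.schemas" in the settings above: schema -> file patterns
YAML_SCHEMAS = {
    "schema/schemas/hosts.json": ["host*.yaml", "host*.yml"],
    "schema/schemas/groups.json": ["group*.yaml", "group*.yml"],
    "schema/schemas/defaults.json": ["default.yaml", "default.yml"],
}


def schema_for(path, mapping=YAML_SCHEMAS):
    """Return the first schema whose pattern matches the file name, else None."""
    name = PurePosixPath(path).name
    for schema, patterns in mapping.items():
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            return schema
    return None


for path in ["inventory/hosts.yaml", "groups.yml", "README.md"]:
    print(path, "->", schema_for(path))
```

The resulting schema and file pairs could then be passed to a validator such as yajsv, one call per pair.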

Modeline

The schema can also be specified inline with a modeline comment at the beginning of the YAML file. The schema URL can be a web URL, a relative path or an absolute path.

# yaml-language-server: $schema=https://server/schema.json
# yaml-language-server: $schema=../relative/path/hosts.json
# yaml-language-server: $schema=/opt/schemas/groups.json
In [14]:
cat > urs.yaml <<'EOF'
---
# yaml-language-server: $schema=schema/schemas/name.json
name: Urs
ipv4: 127.0.0.1
...
EOF
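
A script can also read the modeline, for example to find the schema for a file in a pipeline. The following is a minimal sketch; the regular expression is an assumption about the comment layout shown above and is not the parser used by the yaml-language-server, and the function name schema_from_modeline is hypothetical.

```python
import re

# Matches comment lines like: # yaml-language-server: $schema=<location>
MODELINE = re.compile(r"^#\s*yaml-language-server:\s*\$schema=(\S+)", re.MULTILINE)


def schema_from_modeline(text):
    """Return the schema location from the first modeline comment, or None."""
    match = MODELINE.search(text)
    return match.group(1) if match else None


document = """---
# yaml-language-server: $schema=schema/schemas/name.json
name: Urs
ipv4: 127.0.0.1
...
"""

print(schema_from_modeline(document))
```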

Kubernetes

The yaml-language-server includes a Kubernetes schema but cannot tell whether a file is a Kubernetes file. Therefore, a file pattern is needed in the settings to identify these YAML files. To treat all YAML files starting with "k8s" as Kubernetes files, the following settings are required.

{
    "yaml.schemas": {
        "kubernetes": [
            "k8s*.yaml",
            "k8s*.yml"
        ]
    }
}

Inline specification works as well. Schemas generated from Swagger are available:

# yaml-language-server: $schema=https://raw.githubusercontent.com/yannh/kubernetes-json-schema/master/master-standalone-strict/all.json

Or for a specific Kubernetes version:

# yaml-language-server: $schema=https://raw.githubusercontent.com/yannh/kubernetes-json-schema/master/v1.23.1-standalone-strict/all.json