Container Orchestration Guidelines
This section describes a set of standards, conventions and guidelines for deploying application suites on Container Orchestration technologies. For hands-on training, please check:
Containerisation and Orchestration Concept PDF
Containerisation and Orchestration exercises
Warning
Information regarding examples on these pages may be out of date
Overview of Standards
These standards, best practices and guidelines are based on existing industry standards and tooling. The main references are the OCI specifications and CNF guidelines, together with the Kubernetes and Helm documentation.
The standards are broken down into the following areas:
Structuring application suites for orchestration - general guidelines for breaking up application suites for running in a container orchestration
Defining and building cloud native application suites - resource definitions, configuration, platform resource integration
Kubernetes primitives - a more detailed look at key components: Pods, Services, Ingress
Scheduling and running cloud native application suites - scheduling, execution, monitoring, logging, diagnostics, security considerations
Throughout this documentation, Kubernetes in conjunction with Helm is used as the reference implementation, with the canonical versions being Kubernetes v1.16.2 and Helm v3.1.2. However, the aim is to target compliance with the OCI specifications and CNF guidelines, so it is possible to substitute alternative Container Orchestration solutions and tooling.
A set of example Helm Charts are provided in the repository container-orchestration-chart-examples. These can be used to get an overall idea of how the components of a chart function together, and how the life cycle and management of a chart can be handled with make.
Structuring application suites for Orchestration
In order to understand how to structure application suites for orchestration, we first need to understand what the goals of Cloud Native software engineering are.
What is Cloud Native?
It is the embodiment of modern software delivery practices supported by tools, frameworks, processes and platform interfaces.
These capabilities are the next evolution of Cloud Computing, raising the level of abstraction for all actors against the architecture from the hardware unit to the application component.
What does this mean? Developers and system operators (DevOps) interface with the platform architecture using abstract resource concepts, and should have next to no concern regarding the plumbing or wiring of the platform, while still being able to deploy and scale applications according to cost and usage.
Cloud Native exploits the advantages of the Cloud Computing delivery model:
PaaS (Platform as a Service) layered on top of IaaS (Infrastructure as a Service)
CI/CD (Continuous Integration/Delivery) – fully automated build, test, deploy
Modern DevOps – auto-scaling, monitoring feedback loop to tune resource requirements
Software abstraction from platform compute, network, storage
Portability across Cloud Services providers
Why Cloud Native SDLC (Software Development Life Cycle)?
Kubernetes provides cohesion for distributed projects:
Codify standards through implementing testing gates
Ensures code quality, consistency and predictability of deployment success – CI/CD
Automation – build AND rebuild for zero day exploits at little cost
Portability of SDI (Software Defined Infrastructure) as well as code
Provides a codified reference implementation of best practices, and exemplars
Enables broad engagement – an open and collaborative system - a “Social Coding Platform”
Consistent set of standards for integration with SRC (SKA Regional Centres), and other projects – the future platform of integrated science projects through shared resources enabled by common standards
How does orchestration work?
At the core of Cloud Native is the container orchestration platform. For the purposes of these guidelines, this consists of Kubernetes as the orchestration layer, over Docker as the container engine.
Kubernetes provides an abstraction layer from hardware infrastructure resources enabling compute, network, storage, and other dependent services (other applications) to be treated as abstract concepts. A computing cluster is not a collection of machines but instead an opaque pool of resources that are advertised for availability through a consistent REST based API. These resources can be customised to provide access to and accounting of specialised devices such as GPUs.
Through the Kubernetes API, the necessary resources that make up an application suite (compute, network, storage) are addressed as objects in an idempotent way that declares the desired state eg: this number of Pods running these containers, backed by this storage, on that network. The scheduler will constantly move the cluster towards this desired state including in the event of application or node/hardware failure. This builds in robustness and auto-healing. See A Quick Introduction to Kubernetes for a general introduction.
Both platform and service resources can be classified by performance characteristics and reservation criteria using labelling, which in turn are used by scheduling algorithms to determine optimum placement of workloads across the cluster. All applications are deployed as sets of one or more containers in a minimum configuration called a Pod. Pods are the minimum scalable unit that are distributed and replicated across the cluster according to the scheduling algorithm. A Pod is essentially a single Kernel namespace holding one or more containers. It only makes sense to put together containers that are essentially tightly coupled and logically indivisible by design. These Pods can be scheduled in a number of patterns using Controllers (full list) including bare Pod (a single Pod instance), Deployment (a replicated Pod set), StatefulSet (a Deployment with certain guarantees about naming and ordering of replicated units), DaemonSets (one Pod per scheduled compute node), and Job/CronJob (run to completion applications).
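As a minimal, hypothetical sketch of one of these Controller patterns (the name, image and schedule below are placeholders), a CronJob that runs a run-to-completion Pod on a schedule looks like:
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cleanup-example          # placeholder name
spec:
  schedule: "0 2 * * *"          # run at 02:00 every day
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: alpine
              command: ["/bin/sh", "-c", "echo 'periodic clean up task' && sleep 5"]
          restartPolicy: OnFailure   # run-to-completion semantics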
A detailed discussion of these features can be found in the main Kubernetes documentation under Concepts.
Structuring Application Suites
Architecting software to run in an orchestration environment builds on the guidelines given in the Container Standards ‘Structuring Containerised Applications’ section. The key concept is to treat run time containers as immutable and atomic applications, where any application state is explicitly dealt with through connections to storage mechanisms.
The application should be broken into components, where each component is:
an application component with an independent development lifecycle
an individual process that performs a discrete task such as a micro service, specific database/web service, device, or computational task
a component that exposes a specific service to another application eg. a micro service or database
a reusable component that is applicable to multiple application deployments eg. a co-routine or proximity dependent service (logger, metrics collector, network helper, private database etc)
an independently scalable unit that can be replicated to match demand
the minimum unit required to match a resource profile at scheduling time such as storage, memory, cpu, or a specialised device
Above all, design software to scale horizontally through a UNIX process model so that individual components that have independent scaling characteristics can be replicated independently.
The application interface should be through the standard container run time interface contract:
inputs come via a configurable Port
outputs go to a configurable network service
logging goes to stdout/stderr and syslog and uses JSON to enrich metadata (see Container Standards ‘Logging’)
metrics are advertised via a standard such as Prometheus Exporters, or emit metrics in a JSON format over TCP consumable by ETL services such as LogStash
configuration is passed in using environment variables, and simple configuration files (eg: ini, or key/value pairs).
POSIX compliant storage IO is facilitated by bind mounted volumes.
connections to DBMS, queuing technologies and object storage are managed through configuration.
applications should have built-in recoverability so that prior state and context is automatically discovered on restart. This enables the cluster to auto-heal by re-launching workloads on other resources when nodes fail (a critical aspect of a micro-services architecture).
By structuring an application in this fashion, it can scale from the single instance desktop development environment up to a large parallel deployment in production without needing built-in knowledge of the plumbing and wiring of each specific environment, because this is handled through external configuration at the infrastructure management layer.
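A minimal sketch of this run time contract in a Pod container specification (the image, variable names and port below are illustrative placeholders only):
---
apiVersion: v1
kind: Pod
metadata:
  name: contract-example
spec:
  containers:
    - name: app
      image: myregistry/my-app:1.0.0      # placeholder image
      # logs are written to stdout/stderr and collected by the platform
      env:
        - name: LOG_LEVEL                 # configuration via environment variables
          value: "INFO"
      ports:
        - name: http                      # inputs arrive on a configurable port
          containerPort: 8080
          protocol: TCP
      volumeMounts:
        - name: data                      # POSIX storage via a bind mounted volume
          mountPath: /data
  volumes:
    - name: data
      emptyDir: {}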
Example: Tango Controls
To help illustrate the Cloud Native application architecture concepts, a walk through of a Tango application suite is used.
A Tango Controller System environment is typically made up of the following:
Database containing the system state eg: MySQL.
DatabaseDS Tango device server.
One or more Tango devices.
Optional components - Tango REST interface, Tango logviewer, SysAdmin and debugging tools such as Astor and Jive.
These components map to the following Kubernetes resources:
MySQL Database == StatefulSet.
DatabaseDS == Deployment or StatefulSet.
Tango REST interface == Deployment.
Tango Device == bare Pod, or single replica Deployment.
This example does not take into consideration an HA deployment of MySQL, treating MySQL as a single instance StatefulSet. Using a StatefulSet in this case gives the following guarantees above a Deployment:
Stable unique network identifiers.
Stable persistent storage.
Ordered graceful deployment and scaling.
Ordered automated rolling updates.
These characteristics are useful for stable service types such as databases and message queues.
DatabaseDS is a stateless and horizontally scalable service in its own right (state comes from MySQL). This makes it a fit for a Deployment (which in turn uses a ReplicaSet) or a StatefulSet. Deployments are a good fit for stateless components that require high availability through mechanisms such as rolling upgrades.
The Tango Devices are single instance applications that act as a proxy between the ‘real’ hardware being controlled and the DatabaseDS service that provides each Tango Device with a gateway to the Tango cluster state database (MySQL). Considering that in most cases an upgrade to a Device Pod is likely to be a delete and replace, we can use the simplest case of a bare Pod, which will enable us to name each Pod after its intended device without the random suffix generated for Deployments.
Example: MPI jobs
A typical MPI application consists of a head node, and worker nodes with the (run to completion) job being launched from the head node, which in turn controls the work distribution over the workers.
This can be broken in to:
a generic component type that covers head node and worker nodes.
a launcher that triggers the application on the designated head node.
These components map to the following Kubernetes resources:
Worker node == DaemonSet or StatefulSet.
Launcher and Head node == Job.
MPI jobs typically only require a single instance per physical compute node, and this is exactly the use case of DaemonSets where Kubernetes ensures exactly one instance of a Pod is running on each designated node. Using Jobs enables the launcher and the head node to be combined. Both Job and DaemonSet Pods will most likely need the same library and tools from MPI, so can be combined into a single container image.
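A hedged sketch of the launcher/head node expressed as a Job (the image, program path and mpirun arguments are placeholders and will differ per MPI distribution):
---
apiVersion: batch/v1
kind: Job
metadata:
  name: mpi-launcher-example
spec:
  backoffLimit: 0                  # do not automatically retry a failed run
  template:
    spec:
      containers:
        - name: mpi-head
          image: myregistry/my-mpi-app:latest   # placeholder image shared with the worker DaemonSet
          command: ["mpirun"]
          args: ["-np", "4", "--hostfile", "/etc/mpi/hostfile", "/app/my-mpi-program"]
      restartPolicy: Never         # run-to-completion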
Linking Components Together
Components of an application suite or even between suites should use DNS for service discovery. This is achieved by using the Service resource. Services should always be declared before Pods so that the automatic generation of associated Environment Variables happens in time for the subsequent Pods to discover them. Service names are permanent and predictable, and are tied to the Namespace that an application suite is deployed in. For example, in the namespace test, the DatabaseDS Tango component can find the MySQL database tangodb using the name tangodb or tangodb.test, which is distinctly different to the instance running in the qa namespace also named tangodb but addressable by tangodb.qa. This greatly simplifies configuration management for software deployment.
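As an illustrative sketch (the port and selector below are placeholders), a Service named tangodb deployed into the test namespace resolves using the standard naming scheme, where fully qualified names take the form <service>.<namespace>.svc.<cluster-domain> and the cluster domain defaults to cluster.local:
---
apiVersion: v1
kind: Service
metadata:
  name: tangodb
  # deployed with: kubectl --namespace test apply -f tangodb-service.yaml
spec:
  type: ClusterIP
  ports:
    - name: mysql
      port: 3306
      targetPort: mysql
      protocol: TCP
  selector:
    app.kubernetes.io/name: tangodb
# Inside the test namespace this Service resolves as tangodb;
# from other namespaces as tangodb.test (or fully qualified tangodb.test.svc.cluster.local).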
Defining and building cloud native application suites
All Kubernetes resource objects are described through the REST based API. The representations of the API documents are in either JSON or YAML, however the preference is for YAML as the description language as this tends to be more human readable. The API representations are declarative, specifying the end desired state. It is up to the Kubernetes scheduler to make this a reality.
It is important to use generic syntax and Kubernetes resource types. Specialised resource types reduce portability of resource descriptors and templates, and increase dependency on 3rd party integrations. This could lead to upgrade paralysis because the SDLC is out of our control. An example of this might be using a non-standard 3rd party Database Operator for MySQL instead of the official Oracle one.
Metadata
Each resource is described with:
apiVersion - API version that this document should invoke
kind - resource type (object) that is to be handled
metadata - descriptive information including name, labels, annotations, namespace, ownership, references
spec(ification) - the body of the specification for this resource type denoted by kind
The following is an example of the start of a StatefulSet for the Tango DatabaseDS:
Resource description
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: databaseds-integration-tmc-webui-test
labels:
app.kubernetes.io/name: databaseds-integration-tmc-webui-test
helm.sh/chart: integration-tmc-webui-0.1.0
app.kubernetes.io/instance: test
app.kubernetes.io/managed-by: helm
spec:
...
Namespaces
Even though it is possible to specify the namespace directly in the Metadata, it SHOULD NOT be, as this reduces the flexibility of any resource definition and templating solution employed such as Helm. The namespace can be specified at run time eg: kubectl --namespace test apply -f resource-file.yaml.
Name and Labels
Naming and labelling of all resources associated with a deployment should be consistent. This ensures that deployments that land in the same namespace can be identified along with all inter-dependencies. This is particularly useful when using the kubectl command line tool, as label based filtering can be employed to sieve out all related objects.
Labels are entirely flexible and free form, but as a minimum specify:
the name and app.kubernetes.io/name with the same identifier, with sufficient precision that the same application component deployed in the same namespace can be distinguished eg: a concatenation of <application>-<suite>-<release>. name and app.kubernetes.io/name are duplicated because label filter interaction between resources relies on labels eg: a Service exposing the Pods of a Deployment.
the labels of the deployment suite such as the helm.sh/chart for Helm, including the version.
the app.kubernetes.io/instance (which is the release) of the deployment suite.
app.kubernetes.io/managed-by - what tooling is used to manage this deployment - most likely helm.
Optional extras which are also useful for filtering are:
app.kubernetes.io/version - the component version.
app.kubernetes.io/component - the component type (most likely related to the primary container).
app.kubernetes.io/part-of - what kind of application suite this component belongs to.
The recommended core label set are described under Kubernetes common labels.
metadata:
name: databaseds-integration-tmc-webui-test
labels:
app.kubernetes.io/name: databaseds-integration-tmc-webui-test
helm.sh/chart: integration-tmc-webui-0.1.0
app.kubernetes.io/instance: test
app.kubernetes.io/version: "1.0.3"
app.kubernetes.io/component: databaseds
app.kubernetes.io/part-of: tango
app.kubernetes.io/managed-by: helm
Using this labelling scheme enables filtering for all deployment related objects eg: kubectl get all -l helm.sh/chart=integration-tmc-webui-0.1.0,app.kubernetes.io/instance=test.
kubectl label filtering
$ kubectl get all,configmaps,secrets,pv,pvc -l helm.sh/chart=integration-tmc-webui-0.1.0,app.kubernetes.io/instance=test
NAME READY STATUS RESTARTS AGE
pod/databaseds-integration-tmc-webui-test-0 1/1 Running 0 55s
pod/rsyslog-integration-tmc-webui-test-0 1/1 Running 0 55s
pod/tangodb-integration-tmc-webui-test-0 1/1 Running 0 55s
pod/tangotest-integration-tmc-webui-test 1/1 Running 0 55s
pod/webjive-integration-tmc-webui-test-0 0/6 Init:0/1 0 55s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/databaseds-integration-tmc-webui-test ClusterIP None <none> 10000/TCP 55s
service/rsyslog-integration-tmc-webui-test ClusterIP None <none> 514/TCP,514/UDP 55s
service/tangodb-integration-tmc-webui-test ClusterIP None <none> 3306/TCP 55s
service/webjive-integration-tmc-webui-test ClusterIP 10.97.135.8 <none> 80/TCP,5004/TCP,3012/TCP,8080/TCP,27017/TCP 55s
NAME READY AGE
statefulset.apps/databaseds-integration-tmc-webui-test 1/1 55s
statefulset.apps/rsyslog-integration-tmc-webui-test 1/1 55s
statefulset.apps/tangodb-integration-tmc-webui-test 1/1 55s
statefulset.apps/webjive-integration-tmc-webui-test 0/1 55s
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/rsyslog-integration-tmc-webui-test 10Gi RWO Retain Bound default/rsyslog-integration-tmc-webui-test standard 56s
persistentvolume/tangodb-integration-tmc-webui-test 1Gi RWO Retain Bound default/tangodb-integration-tmc-webui-test standard 55s
persistentvolume/webjive-integration-tmc-webui-test 1Gi RWO Retain Bound default/webjive-integration-tmc-webui-test standard 55s
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/rsyslog-integration-tmc-webui-test Bound rsyslog-integration-tmc-webui-test 10Gi RWO standard 56s
persistentvolumeclaim/tangodb-integration-tmc-webui-test Bound tangodb-integration-tmc-webui-test 1Gi RWO standard 55s
persistentvolumeclaim/webjive-integration-tmc-webui-test Bound webjive-integration-tmc-webui-test 1Gi RWO standard 55s
Templating the Application
While it is entirely possible to define all the necessary resources for an application suite to be deployed on Kubernetes in individual YAML files or a single YAML file, this approach is static and quickly reveals its limitations in terms of creating reusable and composable application suites. This is where Helm Charts have been adopted by the Kubernetes community as the leading templating solution for deployment. Helm provides a mechanism for generically describing an application suite, separating out configuration, and rolling out deployment releases, all in a declarative ‘configuration as code’ style. All Helm Charts should target a minimum of three environments:
Minikube - the standalone developer environment.
CI/CD - the Continuous Integration testing environment which is typically the same benchmark as Minikube.
Production Cluster - the target production Kubernetes environment.
Minikube should be the default target environment for a Chart, as this will have the largest audience and should be optimised to work without modification of any configuration if possible.
When designing a Chart it is important to have clear separation of concerns:
the application - essentially the containers to run.
configuration - any variables that influence the application run time.
resources - any storage, networking, configuration files, secrets, ACLs.
The general structure of a Chart should follow:
charts/myapp/
Chart.yaml # A YAML file containing information about the chart and listing
# dependencies for the chart (refer to Helm 2 vs Helm 3 differences).
LICENSE # OPTIONAL: A plain text file containing the license for the chart
README.md # OPTIONAL: A human-readable README file
values.yaml # The default configuration values for this chart
charts/ # A directory containing any charts upon which this chart depends.
templates/ # A directory of templates that, when combined with values,
# will generate valid Kubernetes manifest files.
templates/NOTES.txt # OPTIONAL: A plain text file containing short usage notes
templates/tests # A directory of test templates for running with 'helm test'
All template files in the templates/ directory should be named in a readily identifiable way after the component that they contain, and if further clarification is required then they should be suffixed with the Kind of resource eg: tangodb.yaml contains the StatefulSet for the Tango database, and tangodb-pv.yaml contains the PersistentVolume declaration for the Tango database. ConfigMaps should be clustered in configmaps.yaml and Secrets in secrets.yaml. The aim is to make it easy for others to understand the layout of the application suite being deployed.
Helm sub-chart architecture
Introduction to subcharts
A chart can have one or more dependent charts, called sub-charts. According to the Helm documentation:
a chart is stand-alone (cannot depend on a parent chart),
a sub-chart cannot access the values of its parent,
a parent chart can override values for its sub-charts and
all charts (parent and sub-chart) can access the global values.
Let’s consider two charts, A and B, where A depends on B. The Chart.yaml file for chart A specifies the dependency, and in its values file chart A can override any value of chart B, as sketched below:
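A minimal Helm 3 sketch of this arrangement, using placeholder chart names, versions and values:
# charts/A/Chart.yaml - chart A declares its dependency on chart B
apiVersion: v2
name: A
version: 0.1.0
dependencies:
  - name: B
    version: 0.1.0
    repository: "file://../B"      # placeholder location of chart B

# charts/A/values.yaml - values under the sub-chart's name override chart B's defaults
B:
  service:
    port: 9090                     # overrides the default service.port set in chart B
global:
  environment: minikube            # global values are visible to both A and B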
It is also important to consider the operational aspects of using dependencies which state that when Helm installs/upgrades a chart, the Kubernetes objects from the chart and all its dependencies are
aggregated into a single set; then
sorted by type followed by name; and then
created/updated in that order.
This means that if chart A defines the following k8s resources:
namespace “A-Namespace”
statefulset “A-StatefulSet”
service “A-Service”
and chart B defines the following k8s resources:
namespace “B-Namespace”
statefulset “B-ReplicaSet”
service “B-Service”
Then the result of the helm install command for chart A will be:
A-Namespace
B-Namespace
A-Service
B-Service
B-ReplicaSet
A-StatefulSet.
Subcharts architecture
Considering the Module Views for the evolutionary prototype (section “Primary representation: MVP Uses in Kubernetes Deployment”), a partial dependency diagram for the helm charts available within the gitlab.com/ska-telescope group can be represented by the following diagram:
All charts depend on the tango-base and, in general, all charts could need the archiver and the webjive interface. At the moment, this is modelled in the skampi repository, where there is one parent chart called skampi and all other charts are its subcharts. They are installed with Helm templating instead of a normal installation. There are a number of disadvantages in this model, specifically:
Common testing: one place for all integration testing. No clear distinction between system and integration tests
Not easy to find logs: many tests on the same namespace
Same namespace for many deployments
No versioning: charts are not versioned
Three solutions have been proposed and described in the Supporting model page:
One parent chart (umbrella) that contains everything needed and Subcharts with no dependency
Charts with dependencies and Subcharts enabled by levels
Charts with dependencies and Subcharts enabled with conditions and tags
The chosen solution is a hybrid approach which enables a single level hierarchy for the shared charts and umbrella charts for chart composition (i.e. a specific deployment or testing purpose). The rationale is:
Every chart can be deployed with its own tango eco-system
Every chart can have tango-base, webjive and the archiver as dependencies
Every dependency must have a common condition on it, so that it will be possible to disable the shared charts if they are included in the parent umbrella. For instance, if there is the need (for testing purposes) to have the TMC and the OET charts together, the result will be along the lines of the sketch below.
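A hedged sketch of what that umbrella composition could look like (chart names, versions and repository locations are placeholders, not the actual SKAMPI definitions):
# Chart.yaml of a hypothetical child chart (eg: tmc) sharing tango-base with a common condition
dependencies:
  - name: tango-base
    version: 0.1.0
    repository: "file://../tango-base"
    condition: tango-base.enabled

# Chart.yaml of a hypothetical umbrella chart combining the TMC and OET charts
dependencies:
  - name: tmc
    version: 0.1.0
    repository: "file://../tmc"
  - name: oet
    version: 0.1.0
    repository: "file://../oet"
  - name: tango-base
    version: 0.1.0
    repository: "file://../tango-base"
    condition: tango-base.enabled

# values.yaml of the umbrella chart: keep a single shared tango-base at the top level
# and disable the copies that the child charts would otherwise pull in
tango-base:
  enabled: true
tmc:
  tango-base:
    enabled: false
oet:
  tango-base:
    enabled: false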
The initial model described above will then be restructured accordingly.
Gitlab Helm/k8s testing pipeline
In order to enable the GitLab pipeline to deploy and test the specific component each ska-telescope repository must:
contain at least one helm chart (i.e. starting point is skampi charts): link to example
have an environment (i.e. test): link to example
adopt the Makefile for k8s testing: link to example
Also, note that each project/repository in the ska-telescope group has a Kubernetes cluster already enabled.
The test job of the GitLab pipeline needs to be:
test:
stage: test
tags:
- docker-executor
image: artefact.skao.int/ska-cicd-k8s-tools-build-deploy:0.6.1
script:
- kubectl version
- make install-chart
- make wait
- make smoketest
- make test
after_script:
- make uninstall-chart
- make delete_namespace
environment:
name: test
kubernetes:
namespace: ci-$CI_PROJECT_NAME-$CI_COMMIT_SHORT_SHA
artifacts:
name: "$CI_PROJECT_NAME-$CI_JOB_ID"
paths:
- "charts/build"
reports:
junit: charts/build/report.xml
where:
make install-chart: installs the chart in the namespace specified in the environment tag
make wait: waits for all jobs to be completed and all pods to be running
make smoketest: checks that no containers are waiting
make test:
creates a pod in the specified namespace
runs pytest
returns the test results
after_script: removes everything after the tests
The artifacts are the output of the tests: they include the report in both XML and JSON format, as well as other information such as the pytest output.
Tango-util library chart
A library chart is a type of Helm chart that defines chart primitives or definitions which can be shared by Helm templates in other charts. In SKAMPI, many charts are collections of device servers, so it is possible to harmonise their definitions with a library chart and keep the charts DRY.
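The actual tango-util templates are not shown here; as a generic, hedged illustration of the Helm 3 library chart mechanism (all names below are placeholders), a shared named template is defined once and included by consuming charts:
# my-library/Chart.yaml declares type: library so it renders no resources of its own
apiVersion: v2
name: my-library
version: 0.1.0
type: library

# my-library/templates/_labels.tpl defines a reusable named template
{{- define "my-library.labels" -}}
app.kubernetes.io/name: {{ .Chart.Name }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

# a consuming chart lists my-library as a dependency and includes the definition
metadata:
  labels:
    {{- include "my-library.labels" . | nindent 4 }}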
The following diagram shows the data model for the harmonized values file:
Elements:
Element | Description
Chart | collection of files that describe a related set of Kubernetes resources
Values | built-in object of Helm which provides access to the values passed into the chart for templating
DsConfig | dsconfig file configuration
DeviceServer | TANGO Device Server
Device | TANGO device
Global | global values accessible by all charts
Labels | labels to be added to all Kubernetes resources
Environment variables | name/value pairs available in the shell
Image | details of the docker image to be used
ResourceRequestandLimits | struct characterising the resource requests and limits for a device server
DB | struct characterising a DB software application
Rationale:
Almost all helm charts in the Skampi repository are device server configurations so it appears natural to start the modelling from that concept
The depends_on relationship has been added so that it is possible to extract the dependency map of the MVP prototype
Every chart of the ska-telescope can have the shared charts in the dependency list
The annotations block has been added to enable GitLab’s Deploy Boards
The DeviceServer struct specifies the shell args so that it is possible to start the related container instance of the linked image
Advantages
With this architecture, a number of advantages can be obtained:
By using a separate deployment (i.e. Namespace) for each test, searching for all the logs of a particular test will be easy: example
Requires teams to create versions of docker images and charts
Avoids the use of docker-compose in favour of Kubernetes testing
Harmonized values yml files (for «common» definitions i.e. TANGO device servers)
Unit and integration testing within the repositories of teams
Skampi testing becomes system testing
Helm Best Practices
The Helm community have a well defined set of best practices. The following highlights key aspects of these practices that will help with achieving consistency and reliability.
charts should be placed in a charts/ directory within the parent project.
chart names should be lowercase and hyphenated and must match the directory name eg. charts/my-app.
name, version, description, home, maintainers and sources must be included.
version must follow the Semantic Versioning standards.
the chart must pass the helm lint charts/<chart-name> test.
Warning
Helm 2 vs Helm 3
It should be noted that we have now migrated to using Helm 3. Feel free to upgrade Helm in your development environments using our Ansible Playbook upgrade_helm.yml, found in the SKA Ansible Playbooks repository.
There are a few changes that may impact specific cases; to read up on them, please refer to this blog post, as well as Helm’s own FAQ page.
Example Chart.yaml file:
name: my-app
version: 1.0.0
description: Very important app
keywords:
- magic
- mpi
home: https://www.skatelescope.org/
icon: http://www.skatelescope.org/wp-content/uploads/2016/07/09545_NEW_LOGO_2014.png
sources:
- https://gitlab.com/ska-telescope/my-app
maintainers:
- name: myaccount
email: myaccount@skatelescope.org
Metadata with Helm
All resources should have the following boilerplate metadata to ensure that all resources can be uniquely identified to the chart, application and release:
...
metadata:
name: <component>-{{ template "my-app.name" . }}-{{ .Release.Name }}
labels:
app.kubernetes.io/name: <component>-{{ template "my-app.name" . }}-{{ .Release.Name }}
helm.sh/chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
app.kubernetes.io/instance: "{{ .Release.Name }}"
app.kubernetes.io/managed-by: "{{ .Release.Service }}"
...
Defining resources
The Helm templating language is based on Go template.
All resources go in the templates/ directory, with the general rule of one Kubernetes resource per template file. Files that render resources are suffixed .yaml, whilst files that contain only expressions and macros go in files suffixed .tpl.
Sample resource template for a Service generated by ‘helm create mychart’
apiVersion: v1
kind: Service
metadata:
name: {{ include "mychart.fullname" . }}
labels:
app.kubernetes.io/name: {{ include "mychart.name" . }}
helm.sh/chart: {{ include "mychart.chart" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
spec:
type: {{ .Values.service.type }}
ports:
- port: {{ .Values.service.port }}
targetPort: http
protocol: TCP
name: http
selector:
app.kubernetes.io/name: {{ include "mychart.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
Expression or macro template generated by ‘helm create mychart’
{{/* vim: set filetype=mustache: */}}
{{/*
Expand the name of the chart.
*/}}
{{- define "mychart.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}}
{{- end -}}
{{/*
Create a default fully qualified app name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
If release name contains chart name it will be used as a full name.
*/}}
{{- define "mychart.fullname" -}}
{{- if .Values.fullnameOverride -}}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- $name := default .Chart.Name .Values.nameOverride -}}
{{- if contains $name .Release.Name -}}
{{- .Release.Name | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "mychart.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" -}}
{{- end -}}
Tightly coupled resources may go in the same template file where they are logically linked or there is a form of dependency.
An example of logically linked resources are PersistentVolume and PersistentVolumeClaim definitions. Keeping these together makes debugging and maintenance easier.
PersistentVolume and PersistentVolumeClaim definitions
---
kind: PersistentVolume
apiVersion: v1
metadata:
name: tangodb-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
namespace: {{ .Release.Namespace }}
labels:
app.kubernetes.io/name: tangodb-{{ template "tango-chart-example.name" . }}
app.kubernetes.io/instance: "{{ .Release.Name }}"
app.kubernetes.io/managed-by: "{{ .Release.Service }}"
helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
spec:
storageClassName: standard
capacity:
storage: 1Gi
accessModes:
- ReadWriteOnce
hostPath:
path: /data/tangodb-{{ template "tango-chart-example.name" . }}/
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: tangodb-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
namespace: {{ .Release.Namespace }}
labels:
app.kubernetes.io/name: tangodb-{{ template "tango-chart-example.name" . }}
app.kubernetes.io/instance: "{{ .Release.Name }}"
app.kubernetes.io/managed-by: "{{ .Release.Service }}"
helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
spec:
storageClassName: standard
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
volumeName: tangodb-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
An example of dependency is the declaration of a Service before the associated Pod/Deployment/StatefulSet/DaemonSet. The Pod will get the environment variables set from the Service as this will be evaluated by the Kubernetes API first as guaranteed by being in the same template file.
Service before the associated Pod/Deployment
---
apiVersion: v1
kind: Service
metadata:
name: tango-rest-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
namespace: {{ .Release.Namespace }}
labels:
app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }}
app.kubernetes.io/instance: "{{ .Release.Name }}"
app.kubernetes.io/managed-by: "{{ .Release.Service }}"
helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
spec:
type: ClusterIP
ports:
- name: rest
port: 80
targetPort: rest
protocol: TCP
selector:
app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }}
app.kubernetes.io/instance: "{{ .Release.Name }}"
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: tango-rest-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
namespace: {{ .Release.Namespace }}
labels:
app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }}
app.kubernetes.io/instance: "{{ .Release.Name }}"
app.kubernetes.io/managed-by: "{{ .Release.Service }}"
helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
spec:
replicas: {{ .Values.tangorest.replicas }}
template:
metadata:
labels:
app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }}
app.kubernetes.io/instance: "{{ .Release.Name }}"
app.kubernetes.io/managed-by: "{{ .Release.Service }}"
helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
spec:
containers:
- name: tango-rest
image: "{{ .Values.tangorest.image.registry }}/{{ .Values.tangorest.image.image }}:{{ .Values.tangorest.image.tag }}"
imagePullPolicy: {{ .Values.tangorest.image.pullPolicy }}
command:
- /usr/local/bin/wait-for-it.sh
- databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}:10000
- --timeout=30
- --strict
- --
- /usr/bin/supervisord
- --configuration
- /etc/supervisor/supervisord.conf
env:
- name: TANGO_HOST
value: databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}:10000
ports:
- name: rest
containerPort: 8080
protocol: TCP
restartPolicy: Always
{{- with .Values.nodeSelector }}
nodeSelector:
{{ toYaml . | indent 8 }}
{{- end }}
{{- with .Values.affinity }}
affinity:
{{ toYaml . | indent 8 }}
{{- end }}
{{- with .Values.tolerations }}
tolerations:
{{ toYaml . | indent 8 }}
{{- end }}
Note
It may also be necessary to consider the alphabetic ordering of template files if there is a declaration dependency wider than the immediate file, for instance when a Service definition and its environment variables are necessary for multiple Deployment/StatefulSet/DaemonSet definitions. In this case, it may be necessary to use a numerical file prefix such as 00-service-and-pod.yaml, 01-db-statefulset.yaml …
Use comments liberally in the template files to describe the intended purpose of the resource declarations and any other features of the template markup. YAML comments (#) get copied through to the rendered template output and are a valuable help when debugging template issues with helm template charts/chart-name/ ....
Managing configuration
Helm charts and the Go templating engine enable separation of application management concerns along multiple lines:
resources are broken out into related and named templates.
application specific configuration values are placed in ConfigMaps.
volatile run time configuration values are placed in the values.yaml file, and then templated into ConfigMaps, container command line parameters or environment variables as required.
sensitive configuration is placed in Secrets.
template content is programmable (iterators and operators) and this can be parameterised at template rendering time.
Variable names for template substitution should observe the following rules:
Use camel-case or lowercase variable names - never hyphenated.
Structure parameter values in shallow nested structures to make it easier to pass on the Helm command line eg: --set tangodb.db.connection.host=localhost is convoluted compared to --set tangodb.host=localhost.
Use explicitly typed values eg: enabled: false is not enabled: "false".
Be careful of how YAML parsers coerce value types - long integers get coerced into scientific notation, so if in doubt use strings and type casting eg: foo: "12345678" and {{ .Values.foo | int }}.
Use comments in the values.yaml liberally to describe the intended purpose of variables.
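A short, hypothetical values.yaml fragment applying these rules:
# values.yaml (illustrative placeholders only)
tangodb:
  enabled: true           # explicitly typed boolean, not "false"
  host: localhost         # shallow structure: easy to override with --set tangodb.host=myhost
  port: 3306
  # long integers are safer quoted as strings and cast in the template,
  # eg: {{ .Values.tangodb.cacheBytes | int }}
  cacheBytes: "1073741824"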
Config in ConfigMaps
ConfigMaps can be used to populate Pod configuration files, environment variables and command line parameters where the values are largely stable and should not be bundled with the container itself. This should include any (small) data artefacts that could be different (hence configured) between different instances of the running containers. Even files that already exist inside a given container image can be overwritten by using the volumeMounts example below.
ConfigMap values in Pods
---
apiVersion: v1
kind: ConfigMap
metadata:
name: special-config
namespace: default
data:
SPECIAL_LEVEL: very
SPECIAL_TYPE: charming
example.ini: |-
property.1=value-1
property.2=value-2
property.3=value-3
---
apiVersion: v1
kind: Pod
metadata:
name: dapi-test-pod
spec:
containers:
- name: test-container
image: k8s.gcr.io/busybox
# accessing ConfigMap values in the command line from env vars
command: [ "/bin/sh", "-c", "echo $(SPECIAL_LEVEL_KEY) $(SPECIAL_TYPE_KEY); cat /etc/config/example.ini" ]
env:
# reference the map and key to assign to env var
- name: SPECIAL_LEVEL_KEY
valueFrom:
configMapKeyRef:
name: special-config
key: SPECIAL_LEVEL
- name: SPECIAL_TYPE_KEY
valueFrom:
configMapKeyRef:
name: special-config
key: SPECIAL_TYPE
volumeMounts:
# mount a ConfigMap file blob as a configuration file
- name: config-volume
mountPath: /etc/config/example.ini
subPath: example.ini
readOnly: true
volumes:
- name: config-volume
configMap:
# Provide the name of the ConfigMap containing the files you want
# to add to the container
name: special-config
restartPolicy: Never
# check the logs with kubectl logs dapi-test-pod
# clean up with kubectl delete pod/dapi-test-pod configmap/special-config
Where configuration objects are large or have a sensitive format, then separate these out from the configmaps.yaml file, and include them using the template directive (tpl (.Files.Glob "configs/*").AsConfig . ), where the configs/ directory is relative to the charts/my-chart directory.
ConfigMap file blobs separated
---
apiVersion: v1
kind: ConfigMap
metadata:
name: config-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
labels:
app.kubernetes.io/name: config-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
app.kubernetes.io/instance: "{{ .Release.Name }}"
app.kubernetes.io/managed-by: "{{ .Release.Service }}"
helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
data:
{{ (tpl (.Files.Glob "configs/*").AsConfig . ) | indent 2 }}
Secrets
Secrets information is treated in almost exactly the same way as ConfigMaps. While the default configuration (as at v1.14.x) is for Secrets to be stored Base64 encoded in the etcd database, it is possible and expected that the Kubernetes cluster will be configured with encryption at rest (available from v1.13). All account details, passwords, tokens, keys and certificates should be extracted and managed using Secrets.
As was done for ConfigMaps, separate Secrets out into the secrets.yaml template.
Secret values in Pods
---
apiVersion: v1
kind: Secret
metadata:
name: mysecret
type: Opaque
stringData:
username: myuser
password: mypassword
config.yaml: |-
apiUrl: "https://my.api.com/api/v1"
username: myuser
password: mypassword
---
apiVersion: v1
kind: Pod
metadata:
name: secret-env-pod
spec:
containers:
- name: mycontainer
image: k8s.gcr.io/busybox
# accessing Secret values in the command line from env vars
command: [ "/bin/sh", "-c", "echo $(SECRET_USERNAME) $(SECRET_PASSWORD); cat /etc/config/example.yaml" ]
env:
- name: SECRET_USERNAME
valueFrom:
secretKeyRef:
name: mysecret
key: username
- name: SECRET_PASSWORD
valueFrom:
secretKeyRef:
name: mysecret
key: password
volumeMounts:
- name: foo
mountPath: "/etc/config"
volumes:
- name: foo
secret:
secretName: mysecret
items:
- key: config.yaml
path: example.yaml
mode: 511
restartPolicy: Never
# check the logs with kubectl logs secret-env-pod
# clean up with kubectl delete pod/secret-env-pod secret/mysecret
Where sensitive data objects are large or have a sensitive format, then separate these out from the secrets.yaml file, and include them using the template directive (tpl (.Files.Glob "secrets/*").AsSecrets . ), where the secrets/ directory is relative to the charts/my-chart directory.
Secret file blobs separated
---
apiVersion: v1
kind: Secret
metadata:
name: secret-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
labels:
app.kubernetes.io/name: secret-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
app.kubernetes.io/instance: "{{ .Release.Name }}"
app.kubernetes.io/managed-by: "{{ .Release.Service }}"
helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
type: Opaque
data:
{{ (tpl (.Files.Glob "secrets/*").AsSecrets . ) | indent 2 }}
Storage
PersistentVolumes and partner PersistentVolumeClaims should be defined by default in a separate template. This template should be bracketed with a switch to enable the storage declaration to be turned off (eg: {{ if .Values.tangodb.createpv }}), which will most likely be dependent on, and optimised for, each environment.
On the PersistentVolume:
All storage should be treated as ephemeral by setting persistentVolumeReclaimPolicy: Delete.
Explicitly set the volume mode eg: volumeMode: Filesystem so that it is clear whether Filesystem or Block is being requested.
Explicitly set the access mode eg: ReadWriteOnce, ReadOnlyMany, or ReadWriteMany so that it is clear what access rights containers are expected to have.
Always specify the storage class - this should always default to standard eg: storageClassName: standard given that the default target environment is Minikube.
On the PersistentVolumeClaim:
Always specify the matching storage class eg: storageClassName: standard, so that it will bind to the intended PersistentVolume storage class.
Where possible, always specify an explicit PersistentVolume with volumeName eg: volumeName: tangodb-tango-chart-example-test. This will force the PersistentVolumeClaim to bind to a specific PersistentVolume and storage class, avoiding the loose binding issues that volumes can have.
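A hedged sketch of the switch described above, wrapping a PersistentVolume so that it can be turned off per environment (re-using the tango-chart-example naming from the earlier examples):
{{ if .Values.tangodb.createpv }}
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: tangodb-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
spec:
  storageClassName: standard
  persistentVolumeReclaimPolicy: Delete   # treat all storage as ephemeral
  volumeMode: Filesystem                  # explicit volume mode
  accessModes:
    - ReadWriteOnce                       # explicit access mode
  capacity:
    storage: 1Gi
  hostPath:
    path: /data/tangodb-{{ template "tango-chart-example.name" . }}/
{{ end }}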
Storage In Kubernetes Clusters Managed by the Systems Team
In any of the existing deployed Kubernetes clusters there are a number of default StorageClasses available that are backed by Ceph and integrated using Rook. The StorageClasses expose RBD block devices and CephFS Network File System based storage to Kubernetes.
The StorageClasses are as follows:
Classname | Maps to | Usage
nfss1 | CephFS | Shared Network Filesystem - ReadWriteMany
nfs | alias to nfss1 | Shared Network Filesystem - ReadWriteMany
bds1 | RBD | Single concurrent use ext4 - ReadWriteOnce
block | alias to bds1 | Single concurrent use ext4 - ReadWriteOnce
The StorageClass naming convention follows this pattern:
<xxx type><x class><n version>[-<location>]
xxx type - bd=block device, nfs=network filesystem
x class - s=standard,i=iops optimised (could be ssd/nvme), t=throughput optimised (could be hdd, or cheaper ssd)
n version - 1=first version,…
location - future tag for denoting location context, rack, dc, etc
Current classes:
bds1 - block device - single mount (ReadWriteOnce) - standard - version 1
nfss1 - network filesystem enabled storage (ReadWriteMany) - standard - version 1
block = shortcut for bds1
nfs = shortcut for nfss1
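For example, a claim for shared, multi-writer storage on these clusters would select the nfss1 (or nfs) class; the claim name and size below are placeholders:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data-example
spec:
  storageClassName: nfss1        # CephFS backed shared filesystem
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi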
Tests
Helm Chart tests live in the templates/tests directory, and are essentially one Pod per file that must be run-to-completion (i.e. restartPolicy: Never). These Pods are annotated in one of two ways:
"helm.sh/hook": test-success - the Pod is expected to exit with return code 0
"helm.sh/hook": test-failure - the Pod is expected to exit with a return code not equal to 0
This is a simple solution for test assertions at the Pod scale.
As with any other resource definition, tests should have name and metadata correctly scoping them. End the Pod name with a string that indicates what the test is, suffixed with -test.
Helm tests must be self contained, and should be atomic and non-destructive, as the intention is that a chart user can use the tests to determine that the chart installed correctly. As in the following example, the test checks that Pods can reach the DatabaseDS service. Other tests might check that services are correctly exposed via Ingress.
Helm Chart test Pod - metadata and annotations on a simple connection test
---
apiVersion: v1
kind: Pod
metadata:
name: databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}-connection-test
namespace: {{ .Release.Namespace }}
labels:
app.kubernetes.io/name: databaseds-{{ template "tango-chart-example.name" . }}
app.kubernetes.io/instance: "{{ .Release.Name }}"
app.kubernetes.io/managed-by: "{{ .Release.Service }}"
helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
annotations:
"helm.sh/hook": test-success
spec:
{{- if .Values.pullSecrets }}
imagePullSecrets:
{{- range .Values.pullSecrets }}
- name: {{ . }}
{{- end}}
{{- end }}
containers:
- name: databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}-connection-test
image: "{{ .Values.powersupply.image.registry }}/{{ .Values.powersupply.image.image }}:{{ .Values.powersupply.image.tag }}"
imagePullPolicy: {{ .Values.powersupply.image.pullPolicy }}
command:
- sh
args:
- -c
- "( retry --max=10 -- tango_admin --ping-device test/power_supply/1 ) && echo 'test OK'"
env:
- name: TANGO_HOST
value: databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}:10000
restartPolicy: Never
Integrating a chart into the SKAMPI repo
Prerequisites
Verify that Docker, kubectl, Minikube and Helm are installed and working properly - refer to Incorporate my project into the integration environment.
The required docker images have been uploaded to, and are available from, the Central Artefact Repository; see the docker upload instructions
To integrate a helm chart into the SKAMPI repo, follow these steps:
Local steps
Clone the SKAMPI repo, available here.
Add a directory in charts with a descriptive name
Add your helm chart and associated files within that directory
Check the validity of the chart
Verify that the chart is formatted correctly
helm lint ./charts/<your_chart_directory>/
Verify that the templates are rendered correctly and the output is as expected
helm install --dry-run --debug ./charts/<your_chart_directory>/
For some debugging tips refer to: debugging tips.
Check that your chart deploys locally (utilising minikube as per our standards) and behaves as expected
make deploy KUBE_NAMESPACE=integration
make deploy KUBE_NAMESPACE=integration HELM_CHART=<your_chart_directory>
Once functionality has been confirmed, go ahead and commit and push the changes
Gitlab
Once the changes have been pushed, the project will be built in GitLab. Find the pipeline builds at https://gitlab.com/ska-telescope/ska-skampi/pipelines.
If the pipeline completes successfully, the full integration environment will be available in the selected deployment environment, which should be accessible from the pipeline, e.g. http://k8s.stfc.skao.int for STFC.
Kubernetes primitives
The following focuses on the core Kubernetes primitives - Pod, Service, and Ingress. These provide the core delivery chain of a networked application to the end consumer.
The Pod
The Pod is the basic deployable application unit in Kubernetes, and provides the primary configurable context of an application component. Within this construct, all configuration and resources are plugged in to the application.
This is a complete example that demonstrates container patterns, initContainers and life-cycle hooks discussed in the following sections.
Container patterns and life-cycle hooks
---
kind: Service
apiVersion: v1
metadata:
name: pod-examples
spec:
type: ClusterIP
selector:
app: pod-examples
ports:
- name: http
protocol: TCP
port: 80
targetPort: http
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: pod-examples
labels:
app: pod-examples
spec:
replicas: 1
template:
metadata:
labels:
app: pod-examples
spec:
volumes:
# lifecyle containers as hooks share state using volumes
- name: shared-data
emptyDir: {}
- name: the-end
hostPath:
path: /tmp
type: Directory
initContainers:
# initContainers can initialise data, and do pre-flight checks
- name: init-container
image: alpine
command: ['sh', '-c', "echo 'initContainer says: hello!' > /pod-data/status.txt"]
volumeMounts:
- name: shared-data
mountPath: /pod-data
containers:
# primary data generator container
- name: main-app-container
image: alpine
command: ["/bin/sh"]
args: ["-c", "while true; do echo 'Main app says: ' `date` >> /pod-data/status.txt; sleep 5;done"]
lifecycle:
# postStart hook is async task called on Pod boot
# useful for async container warmup tasks that are not hard dependencies
# definitely not guaranteed to run before main container command
postStart:
exec:
command: ["/bin/sh", "-c", "echo 'Hello from the postStart handler' >> /pod-data/status.txt"]
# preStop hook is async task called on Pod termination
# useful for initiating termination cleanup tasks
# definitely not guaranteed to complete before container termination (sig KILL)
preStop:
exec:
command: ["/bin/sh", "-c", "echo 'Hello from the preStop handler' >> /the-end/last.txt"]
volumeMounts:
- name: shared-data
mountPath: /pod-data
- name: the-end
mountPath: /the-end
# Sidecar helper that exposes data over http
- name: sidecar-nginx-container
image: nginx
ports:
- name: http
containerPort: 80
protocol: TCP
volumeMounts:
- name: shared-data
mountPath: /usr/share/nginx/html
livenessProbe:
httpGet:
path: /index.html
port: http
readinessProbe:
httpGet:
path: /index.html
port: http
# Ambassador pattern used as a proxy or shim to access external inputs
# gets date from Google and adds it to input
- name: ambassador-container
image: alpine
command: ["/bin/sh"]
args: ["-c", "while true; do echo 'Ambassador says: '`wget -S -q 'https://google.com/' 2>&1 | grep -i '^ Date:' | head -1 | sed 's/^ [Dd]ate: //g'` > /pod-data/input.txt; sleep 60; done"]
volumeMounts:
- name: shared-data
mountPath: /pod-data
# Adapter pattern used as a proxy or shim to generate/render outputs
# fit for external consumption (similar to Sidecar)
# reformats input data from sidecar and ambassador ready for output
- name: adapter-container
image: alpine
command: ["/bin/sh"]
args: ["-c", "while true; do cat /pod-data/status.txt | head -3 > /pod-data/index.html; cat /pod-data/input.txt | head -1 >> /pod-data/index.html; cat /pod-data/status.txt | tail -1 >> /pod-data/index.html; echo 'All from your friendly Adapter' >> /pod-data/index.html; sleep 5; done"]
volumeMounts:
- name: shared-data
mountPath: /pod-data
This will produce output that demonstrates each of the containers fulfilling their role:
$ curl http://`kubectl get service/pod-examples -o jsonpath="{.spec.clusterIP}"`
initContainer says: hello!
Main app says:  Thu May 2 03:45:42 UTC 2019
Hello from the postStart handler
Ambassador says: Thu, 02 May 2019 03:45:55 GMT
Main app says:  Thu May 2 03:46:12 UTC 2019
All from your friendly Adapter

$ kubectl delete deployment/pod-examples service/pod-examples
deployment.extensions "pod-examples" deleted
service "pod-examples" deleted

$ cat /tmp/last.txt
Hello from the preStop handler
Container patterns
The Pod is a cluster of one or more containers that share the same resource namespaces. This enables the Pod members to communicate as though they are on the same host, which is ideal for preserving the one-process-per-container model while still delivering orchestrated processes as a single application that can be separately maintained.
All Pod deployments should be designed around having a core or leading container. All other containers in the Pod provide auxiliary or secondary services. There are three main patterns for multi-container Pods:
Sidecar - extend the primary container functionality eg: adds logging, metrics, health checks (as input to livenessProbe/readinessProbe).
Ambassador - container that acts as an out-bound proxy for the primary container by handling translations to external services.
Adapter - container that acts as an in-bound proxy for the primary container aligning interfaces with alternative standards.
initContainers
Any serial container action that does not neatly fit into the one-process-per-container pattern should be placed in an initContainer. These are typically actions like initialising databases, checking for upgrade processes, and executing migrations. initContainers are executed in order, and if any one of them fails, the Pod will be restarted in line with the restartPolicy. With this behaviour, it is important to ensure that the initContainer actions are idempotent, or there will be harmful side effects on restarts.
postStart/preStop
Life-cycle hooks have very few effective use cases, as there is no guarantee that a postStart task will run before the main container command does (this is demonstrated above), and there is no guarantee that a preStop task (which is only issued when a Pod is terminated - not completed) will complete before the KILL signal is issued to the parent container after the cluster wide configured grace period (30s).
The value of the lifecycle hooks is generally reserved for:
postStart - running an asynchronous non-critical task in the parent container that would otherwise slow down the boot time for the Pod and impact service availability.
preStop - initiating asynchronous clean up tasks via an external service - essentially an opportunity to send a quick message out before the Pod is fully terminated.
readinessProbe/livenessProbe
Readiness probes are used by the scheduler to determine whether the container is in a state ready to serve requests.
Liveness probes are used by the scheduler to determine whether the container continues to be in a healthy state for serving requests.
Where possible, livenessProbe and readinessProbe should be specified. These are automatically used to calculate whether a Pod is available and healthy, and whether it should be added and load balanced in a Service. These features can play an important role in the continuity of service when clusters are auto-healed, workloads are shifted from node to node, or during rolling updates to deployments.
The following shows the registered probes and their status for the sidecar container in the examples above:
$ kubectl describe deployment.apps/pod-examples
...
  sidecar-nginx-container:
    Image:        nginx
    Port:         80/TCP
    Host Port:    0/TCP
    Liveness:     http-get http://:http/index.html delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get http://:http/index.html delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /usr/share/nginx/html from shared-data (rw)
...
While probes can be a command, it is better to make health checks an http service that is combined with an application metrics handler so that external applications can use the same feature to do health checking (eg: Prometheus, or Icinga).
Use of Services
Service resources should be defined in the same template file as the associated application deployment, and ordered at the top. This will ensure that service related environment variables will be passed into the deployment at scheduling time. It is good practice to only have a single Service resource per deployment that covers the port mapping/exposure for each application port. It is also important to only have one deployment per Service, as mapping a Service to more than one application will make debugging considerably harder. As part of this, ensure that the selector definition is specific to the fully qualified deployment including release and version, to prevent leakage across multiple deployment versions. Fully qualify port definitions with name, port, protocol and targetPort so that the interface is self documenting. Using the same name for targetPort as name is encouraged, as this can give useful hints as to the function of the container interface.
Service resource with fully qualified port description and specific selector
---
apiVersion: v1
kind: Service
metadata:
name: tango-rest-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
namespace: {{ .Release.Namespace }}
labels:
app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }}
app.kubernetes.io/instance: "{{ .Release.Name }}"
app.kubernetes.io/managed-by: "{{ .Release.Service }}"
helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
spec:
type: ClusterIP
ports:
- name: rest
protocol: TCP
port: 80
targetPort: rest
selector:
app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }}
app.kubernetes.io/instance: "{{ .Release.Name }}"
type: ClusterIP is the default and should almost always be used and declared explicitly. NodePort should only be used in exceptional circumstances, as it reserves a fixed port on the underlying node and consumes the limited NodePort address range.
Only expose ports that are actually needed outside the deployment. This helps reduce clutter and reduces the attack surface of an application.
Use of Ingress
A Helm chart represents an application to be deployed, so it follows that it is best practice to have a single Ingress
resource per chart. This represents the single frontend for an application, exposing it to the outside world (relative to the Kubernetes cluster). If a chart seemingly requires multiple hostnames, and/or has services that want to inhabit the same port or URI space, then consider splitting it into multiple charts so that the component applications can be published independently.
It is useful to parameterise the control of SSL/TLS configuration so that it can be opted into under different deployment strategies (as below).
One Ingress per chart with TLS parameterised
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: rest-api-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
labels:
app.kubernetes.io/name: rest-{{ template "tango-chart-example.name" . }}
app.kubernetes.io/instance: "{{ .Release.Name }}"
app.kubernetes.io/managed-by: "{{ .Release.Service }}"
helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
annotations:
{{- range $key, $value := .Values.ingress.annotations }}
{{ $key }}: {{ $value | quote }}
{{- end }}
spec:
rules:
- host: {{ .Values.ingress.hostname }}
http:
paths:
- path: /
backend:
serviceName: tango-rest-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
servicePort: 80
{{- if .Values.ingress.tls.enabled }}
tls:
- secretName: {{ tpl .Values.ingress.tls.secretname . }}
hosts:
- {{ tpl .Values.ingress.hostname . }}
{{- end -}}
Scheduling and running cloud native application suites
Security
Security covers many things, but this section focuses on RBAC and Network Policies.
Roles
Kubernetes implements role based access control (RBAC), which will be used to control external and internal user access to scheduling and consuming resources.
While it is possible to create ServiceAccounts to modify the privileges for a deployment, this should generally be avoided so that the access control profile of the deploying user is inherited at launch time.
Do not create ClusterRole and ClusterRoleBinding resources, or allocate them to ServiceAccounts used in a deployment, as these carry extended system wide access rights. Role and RoleBinding resources are scoped to the deployment Namespace and so limit the scope for damage.
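Where a ServiceAccount genuinely needs Kubernetes API access, a minimal sketch of a Namespace scoped Role and RoleBinding (the resource names, the ServiceAccount and the read-only verbs are illustrative assumptions) is:
Namespace scoped Role and RoleBinding
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: {{ .Release.Namespace }}
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: {{ .Release.Namespace }}
subjects:
  - kind: ServiceAccount
    # hypothetical ServiceAccount used by the deployment
    name: app-service-account
    namespace: {{ .Release.Namespace }}
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io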
Pod Security Policies
Pod Security Policies affect what can be requested in the securityContext section.
It should be assumed that Kubernetes clusters will run restrictive Pod Security Policies, so it should be expected that:
Pods do not need to access resources outside the current Namespace.
Pods do not run as privileged: true and will not have privilege escalation.
hostNetwork activation will require discussion with operations.
hostIPC will be unavailable.
hostPID will be unavailable.
Containers should run as a non-root user.
Host ports will be restricted.
Host paths will be restricted (hostPath mounts).
A read only root filesystem (layer in the container) may be required.
Capabilities may be dropped, with a restricted list put in place to determine what can be added.
The default service account credentials will NOT be mounted into the running containers by default - applications should rarely need to query the Kubernetes API, so access will be removed by default.
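A minimal sketch of Pod and container level securityContext settings compatible with these restrictions (the user/group IDs and the image are illustrative assumptions):
Restrictive securityContext example
---
apiVersion: v1
kind: Pod
metadata:
  name: restricted-example
spec:
  # do not mount the default service account token
  automountServiceAccountToken: false
  securityContext:
    runAsNonRoot: true
    # illustrative non-root user and group IDs
    runAsUser: 1000
    runAsGroup: 1000
  containers:
    - name: main
      image: busybox:1.28.3
      command: ["sleep", "365d"]
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]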
In general, only system level deployments such as Kubernetes control plane components (eg: admission controllers, device drivers, Operators, etc.) should have cluster level rights.
Network Policies
Explicit Network Policies are encouraged to restrict unintended access across deployments, and to secure applications from some forms of intrusion.
The following restricts access to the deployed TangoDB to only the DatabaseDS application.
NetworkPolicy restricting TangoDB access to the DatabaseDS application
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: tangodb-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}-network-policy
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: tangodb-{{ template "tango-chart-example.name" . }}
app.kubernetes.io/instance: "{{ .Release.Name }}"
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
# enable the DatabaseDS interface
matchLabels:
app.kubernetes.io/name: databaseds-{{ template "tango-chart-example.name" . }}
app.kubernetes.io/instance: "{{ .Release.Name }}"
ports:
  - protocol: TCP
    port: 10000
egress:
- to:
# anywhere in the standard Pod Network address range to all ports
- ipBlock:
cidr: 10.0.0.0/16
Resource reservations and constraints
Compute platform level resources encompass:
Memory.
CPU.
Plugin based devices.
Extended resources - configured node level logical resources.
Resources can be specified in terms of:
Limits - the maximum amount of resource a container is allowed to consume before it may be restarted or evicted.
Requests - the amount of resource a container requires to be available before it will be scheduled.
Limits and requests are specified at the individual container level:
...
  containers:
    - name: tango-device-thing
      resources:
        requests:
          cpu: 4000m                    # 4 cores
          memory: 512M                  # 0.5GB
          skatelescope.org/widget: 3
        limits:
          cpu: 8000m                    # 8 cores
          memory: 1024M                 # 1GB
...
Resource requirements should be explicitly set in terms of both requests and limits (limits are not normally applicable to extended resources), as these are used by the scheduler to determine placement and load balancing policy, and to determine when an application is misbehaving. These parameters should be exposed as configurable values.yaml parameters.
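A minimal sketch of how this might be parameterised (the values.yaml keys and the toYaml helper usage are illustrative assumptions, not taken from the example charts):
Resource requests and limits sourced from values.yaml
---
# values.yaml (assumed keys)
resources:
  requests:
    cpu: 4000m
    memory: 512M
  limits:
    cpu: 8000m
    memory: 1024M
---
# deployment template fragment consuming the values
...
containers:
  - name: tango-device-thing
    resources:
{{ toYaml .Values.resources | indent 6 }}
...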
Restarts
Containers should be designed to crash cleanly - the main process should exit on a fatal error (no internal restart). This ensures that the configured livenessProbe and readinessProbe function correctly and, where necessary, remove the affected Pod from Services so that there are no dead service connections.
Logging
The SKA has adopted the SKA Log Message Format as the logging standard to be used by all SKA software. This should be considered a baseline standard that will be decorated with additional data by an infrastructure wide integrated logging solution (eg: ElasticStack). To ensure compliance, all containers must log to stdout/stderr and/or be configured to log to syslog. Connection to syslog should be configurable using standard container mechanisms such as mounted files (handled by ConfigMaps) or environment variables. This ensures that any deployed application can be automatically plugged into the infrastructure wide logging and monitoring solution. A simple way to achieve this is to use a logging client library whose output destination is dynamically configurable, such as import logging for Python.
Metrics
Each Pod should have an application metrics handler that emits the adopted container standard format. For efficiency, this should be amalgamated with the livenessProbe and readinessProbe.
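A minimal sketch of a container that serves metrics and health status from the same port, reused by both probes (the port number and paths are assumptions about the application, not a prescribed interface):
Combined metrics and health endpoint reused by the probes
---
...
containers:
  - name: tango-device-thing
    ports:
      # assumed application port serving both /metrics and /healthz
      - name: metrics
        containerPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz
        port: metrics
    readinessProbe:
      httpGet:
        path: /healthz
        port: metrics
...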
Scheduling
Scheduling in Kubernetes enables the resources of the entire cluster to be allocated using a fine grained model. These resources can be partitioned according to user policies, namespaces, and quotas. The default scheduler is a comprehensive rules processing engine that should be able to satisfy most needs.
The primary mechanism for routing incoming tasks to execution is by having a labelling system throughout the cluster that reflects the distribution profile of workloads and types of resources required, coupled with Node and Pod affinity/anti-affinity rules. These are applied like a sieve to the available resources that the Scheduler keeps track of to determine if resources are available and where the next Pod can be placed.
Scheduling on Kubernetes behaves similarly to a force directed graph, in that the tensions between the interdependent rules form the pressures of the spring bars that influence relative placement across the cluster.
When creating scheduling constraints, attempt to keep them as generic as possible. Concentrate on declaring rules related to the individual Helm chart, and to the current chart in relation to any dependent charts (subcharts). Avoid coding in node specific requirements. It is often more effective to outsource the rules to the values.yaml file, as they are almost guaranteed to change between environments.
---
...
{{- with .Values.nodeSelector }}
nodeSelector:
{{ toYaml . | indent 8 }}
{{- end }}
{{- with .Values.affinity }}
affinity:
{{ toYaml . | indent 8 }}
{{- end }}
{{- with .Values.tolerations }}
tolerations:
{{ toYaml . | indent 8 }}
{{- end }}
...
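A minimal sketch of the values.yaml keys consumed by the template fragment above (the rack label, master key and toleration values are illustrative, borrowed from the scheduling examples later in this section):
Scheduling rules supplied through values.yaml
---
# values.yaml (illustrative)
nodeSelector:
  rack: rack01

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/master
              operator: NotIn
              values:
                - ""

tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"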
Always remember that the Kubernetes API is declarative, and expect that deployments will use the apply semantics of kubectl, with the scheduler constantly trying to move the system towards the desired state as and when resources become available, as well as in response to failures. This means that scheduling is not guaranteed, so any downstream dependencies must be able to cope with that (also a tenet of micro-services architecture).
Examples of scheduling control patterns
The scheduling scenarios below demonstrate placement control for: a specific node; a type of node; density (1 per node, n per node); position next to another Pod (a specific Pod, or a Pod type); and soft and hard rules.
The scenarios are run using the following conditions:
Container replicas launched using a sleep command in busybox, defined in a StatefulSet.
A four node cluster - master and three minions.
The nodes have been split into two groups: rack01 - k8s-master-0 and k8s-minion-0, and rack02 - k8s-minion-1 and k8s-minion-2.
The master node has the labels node-role.kubernetes.io/headnode and node-role.kubernetes.io/master.
The aim is to demonstrate how the scheduler works, and how to configure it for the common use cases.
obs1 and obs2 - nodeAffinity
Use nodeSelector to force all 3 replicas onto rack: rack01
for obs1-rack01 and rack02
for obs2-rack02:
node select rack01 for obs1-rack01 and rack02 for obs2-rack02
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: obs1-rack01
labels:
group: scheduling-examples
app: obs1
spec:
replicas: 3
selector:
matchLabels:
app: obs1
serviceName: obs1
template:
metadata:
labels:
group: scheduling-examples
app: obs1
annotations:
description: node select rack01
spec:
containers:
- image: busybox:1.28.3
name: obs1-rack01
command: ["sleep", "365d"]
nodeSelector:
rack: rack01
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: obs2-rack02
labels:
group: scheduling-examples
app: obs2
spec:
replicas: 3
selector:
matchLabels:
app: obs2
serviceName: obs2
template:
metadata:
labels:
group: scheduling-examples
app: obs2
annotations:
description: node select rack02
spec:
containers:
- image: busybox:1.28.3
name: obs2-rack02
command: ["sleep", "365d"]
nodeSelector:
rack: rack02
Scenario obs1 - run 3 Pods on hosts allocated to rack01. Only nodes k8s-master-0 and k8s-minion-0 are used, reflecting rack01.
NAME DESC STATUS NODE
obs1-rack01-0 node select rack01 Running k8s-master-0
obs1-rack01-1 node select rack01 Running k8s-minion-0
obs1-rack01-2 node select rack01 Running k8s-master-0
and for Scenario obs2 - run 3 Pods on hosts allocated to rack02. Only k8s-minion-1 and k8s-minion-2 are used, reflecting rack02.
NAME DESC STATUS NODE
obs2-rack02-0 node select rack02 Running k8s-minion-2
obs2-rack02-1 node select rack02 Running k8s-minion-1
obs2-rack02-2 node select rack02 Running k8s-minion-2
obs3 - nodeAffinity exclusion
Use nodeAffinity operator: NotIn
rules to exclude the master node from scheduling:
nodeAffinity NotIn master
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: obs3-node-affinity-not-master
labels:
group: scheduling-examples
app: obs3
spec:
replicas: 4
selector:
matchLabels:
app: obs3
serviceName: obs3
template:
metadata:
labels:
group: scheduling-examples
app: obs3
annotations:
description: nodeAffinity NotIn master
spec:
containers:
- image: busybox:1.28.3
name: obs3-node-affinity-not-master
command: ["sleep", "365d"]
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-role.kubernetes.io/master
operator: NotIn
values:
- ""
Scenario obs3 - run 4 Pods on any host so long as it is not labelled node-role.kubernetes.io/master. In this case minion-0 and minion-1 have been selected; minion-2 could also have been used.
NAME DESC STATUS NODE
obs3-node-affinity-not-master-0 nodeAffinity NotIn master Running k8s-minion-1
obs3-node-affinity-not-master-1 nodeAffinity NotIn master Running k8s-minion-0
obs3-node-affinity-not-master-2 nodeAffinity NotIn master Running k8s-minion-1
obs3-node-affinity-not-master-3 nodeAffinity NotIn master Running k8s-minion-0
obs4 - nodeAntiAffinity
Use podAffinity (hard - requiredDuringSchedulingIgnoredDuringExecution) to position Pods on the same node as obs1-rack01, and node anti-affinity (soft - preferredDuringSchedulingIgnoredDuringExecution) to exclude the node labelled ‘node-role.kubernetes.io/headnode’ from scheduling:
podAffinity require obs1-rack01, nodeAntiAffinity prefer headnode
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: obs4-pod-affinity-obs1-pref-not-headnode
labels:
group: scheduling-examples
app: obs4
spec:
replicas: 5
selector:
matchLabels:
app: obs4
serviceName: obs4
template:
metadata:
labels:
group: scheduling-examples
app: obs4
annotations:
description: podAffinity req obs1, nodeAntiAffinity pref headnode
spec:
containers:
- image: busybox:1.28.3
name: obs4-pod-affinity-obs1-pref-not-headnode
command: ["sleep", "365d"]
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- obs1
topologyKey: kubernetes.io/hostname
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: node-role.kubernetes.io/headnode
operator: NotIn
values:
- ""
Scenario obs4 - run 5 Pods using required Pod Affinity with obs1 and preferred Node Anti Affinity with the headnode (master label). Pods have been scheduled on k8s-minion-0 and k8s-master-0, as this is where obs1 is running. The placement is further shaped by the anti-affinity rule with the headnode, which is why only one replica landed on k8s-master-0.
NAME DESC STATUS NODE
obs4-pod-affinity-obs1-pref-not-headnode-0 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-minion-0
obs4-pod-affinity-obs1-pref-not-headnode-1 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-minion-0
obs4-pod-affinity-obs1-pref-not-headnode-2 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-minion-0
obs4-pod-affinity-obs1-pref-not-headnode-3 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-master-0
obs4-pod-affinity-obs1-pref-not-headnode-4 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-minion-0
obs5 - podAntiAffinity
Use podAntiAffinity (hard requiredDuringSchedulingIgnoredDuringExecution) to ensure only one instance of self per node (topologyKey: “kubernetes.io/hostname”), and podAffinity to require a position on the same node as obs3:
podAntiAffinity require self and podAffinity require obs3
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: obs5-pod-one-per-node-and-obs3
labels:
group: scheduling-examples
app: obs5
spec:
replicas: 3
selector:
matchLabels:
app: obs5
serviceName: obs5
template:
metadata:
labels:
group: scheduling-examples
app: obs5
annotations:
description: podAntiAffinity req self, podAffinity req obs3
spec:
containers:
- image: busybox:1.28.3
name: obs5-pod-one-per-node-and-obs3
command: ["sleep", "365d"]
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- obs5
topologyKey: "kubernetes.io/hostname"
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- obs3
topologyKey: "kubernetes.io/hostname"
Scenario obs5 - run 3 Pods using required Pod Anti Affinity with self (forcing one Pod per node) and required Pod Affinity with obs3. This has forced scheduling of one Pod per node, and because obs3 is only running on two different nodes, the third replica is left in a constant state of Pending. The Pod Affinity is described with a topology key of kubernetes.io/hostname, ie. the node identifier. The topology key sets the scope for applying the rule, so it could be a node, a group of nodes, an OS or device classification, etc.
NAME DESC STATUS NODE
obs5-pod-one-per-node-and-obs3-0 podAntiAffinity req self, podAffinity req obs3 Running k8s-minion-0
obs5-pod-one-per-node-and-obs3-1 podAntiAffinity req self, podAffinity req obs3 Running k8s-minion-1
obs5-pod-one-per-node-and-obs3-2 podAntiAffinity req self, podAffinity req obs3 Pending <none>
obs6 - Taint NoSchedule
First, the master node is tainted to disallow scheduling, eg: with kubectl cordon <master node> or an explicit taint (see the comment below).
Use nodeSelector to force all 3 replicas onto rack: rack01. The taint prevents scheduling on the master, so all replicas are subsequently forced onto k8s-minion-0:
node select rack01, but trapped by Taint NoSchedule
---
# kubectl taint nodes k8s-master-0 key1=value1:NoSchedule, or kubectl cordon k8s-master-0
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: obs6-rack01-taint
labels:
group: scheduling-examples
app: obs6
spec:
replicas: 3
selector:
matchLabels:
app: obs6
serviceName: obs6
template:
metadata:
labels:
group: scheduling-examples
app: obs6
annotations:
description: node select rack01, but trapped by Taint NoSchedule
spec:
containers:
- image: busybox:1.28.3
name: obs6-rack01-taint
command: ["sleep", "365d"]
nodeSelector:
rack: rack01
The resulting schedule is:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
obs6-rack01-taint-0 1/1 Running 0 32s 192.168.105.180 k8s-minion-0 <none>
obs6-rack01-taint-1 1/1 Running 0 31s 192.168.105.177 k8s-minion-0 <none>
obs6-rack01-taint-2 1/1 Running 0 29s 192.168.105.181 k8s-minion-0 <none>
For obs6, a StatefulSet that has nodeSelector:
nodeSelector:
rack: rack01
The result shows that of the two nodes (k8s-master-0 and k8s-minion-0) in rack01, only k8s-minion-0 is available for these Pods.
obs7 - add toleration
Repeat obs6 as obs7 but add a toleration to the NoSchedule taint:
node select rack01, with Toleration to Taint NoSchedule
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: obs7-rack01-taint-and-toleration
labels:
group: scheduling-examples
app: obs7
spec:
replicas: 3
selector:
matchLabels:
app: obs7
serviceName: obs7
template:
metadata:
labels:
group: scheduling-examples
app: obs7
annotations:
description: node select rack01, with Toleration to Taint NoSchedule
spec:
containers:
- image: busybox:1.28.3
name: obs7-rack01-taint-and-toleration
command: ["sleep", "365d"]
nodeSelector:
rack: rack01
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
Now, with a Toleration added for the Taint, we have the following:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
obs7-rack01-taint-and-toleration-0 1/1 Running 0 33s 192.168.105.184 k8s-minion-0 <none>
obs7-rack01-taint-and-toleration-1 1/1 Running 0 32s 192.168.72.27 k8s-master-0 <none>
obs7-rack01-taint-and-toleration-2 1/1 Running 0 31s 192.168.105.182 k8s-minion-0 <none>
For a StatefulSet that has nodeSelector and Tolerations:
nodeSelector:
rack: rack01
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
The result shows that both nodes in rack01 (k8s-master-0 and k8s-minion-0) are available for these Pods.