AgileTV CDN Manager (esb3027)
- 1: Getting Started
- 2: System Requirements Guide
- 3: Architecture Guide
- 4: Quick Start Guide
- 5: Installation Guide
- 6: Configuration Guide
- 7: Networking
- 8: Storage Guide
- 9: Metrics and Monitoring
- 10: Operations Guide
- 11: Post Installation Guide
- 12: Releases
- 12.1: Release esb3027-1.4.0
- 12.2: Release esb3027-1.2.1
- 12.3: Release esb3027-1.2.0
- 12.4: Release esb3027-1.0.0
- 13: API Guides
- 13.1: Healthcheck API
- 13.2: Authentication API
- 13.3: Router API
- 13.4: Selection Input API
- 13.5: Operator UI API
- 14: Use Cases
- 14.1: Custom Deployments
- 15: Troubleshooting Guide
- 16: Glossary
1 - Getting Started
Introduction
The ESB3027 AgileTV CDN Manager is a suite of services responsible for coordinating the Content Delivery Network (CDN) operations. It provides essential APIs and features supporting the ESB3024 AgileTV CDN Director. Key capabilities include:
- Centralized user management for authentication and authorization
- Configuration services, APIs, and user interfaces
- CDN usage monitoring and metrics reporting
- License-based tracking, monitoring, and billing
- Core API services
- Event coordination and synchronization
The software can be deployed as either a self-managed cluster or in a public cloud environment such as AWS. Designed as a cloud-native application following CNCF best practices, its deployment varies slightly depending on the environment:
Self-hosted: A lightweight Kubernetes cluster runs on bare-metal or virtual machines within the customer’s network. The application is deployed within this cluster.
Public cloud: The cloud provider manages the cluster infrastructure, and the application is deployed into it.
The differences are primarily operational; the software's functionality remains consistent across environments, and distinctions are clearly noted in this guide.
Since deployment relies on Kubernetes, familiarity with key tools is essential:
helm: The package manager for Kubernetes, used for installing, upgrading, rolling back, and removing application charts. Helm charts are collections of templates and default values that generate Kubernetes manifests for deployment.
kubectl: The primary command-line tool for managing Kubernetes resources and applications. In a self-hosted setup, it’s typically used from the control plane nodes; in cloud environments, it may be run locally, often from your laptop or desktop.
Cloud provider tools: In cloud environments, familiarity with CLI tools like awscli and the WebUI is also required for managing infrastructure.
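As a minimal illustration of these tools, the following commands are typical day-to-day checks. They assume a working kubeconfig on the machine where they are run (a control plane node in a self-hosted cluster, or your workstation in a cloud environment).
kubectl get nodes     # list the cluster nodes and their status
kubectl get pods      # list the application pods in the current namespace
helm list             # show installed Helm releases and their revisions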
Architectural Overview
See the Architecture Guide.
Installation Overview
The installation process for the manager varies depending on the environment.
Self-hosted: Begin by deploying a lightweight Kubernetes cluster. The installation ISO includes an installer for a simple K3s cluster, a Rancher Labs Kubernetes distribution.
Public cloud: Use your cloud provider’s tooling to deploy the cluster. Specific instructions are beyond this document’s scope, as they vary by provider.
Once the cluster is operational, the remaining steps are the same: deploy the manager software using Helm.
The following sections provide an overview based on your environment. For detailed instructions, refer to the Installation Guide.
Hardware Requirements
In a Kubernetes cluster, each node has a fixed amount of resources—such as CPU, memory, and free disk space. Pods are assigned to nodes based on resource availability. The control plane uses a best-effort approach to schedule pods on nodes with the lowest overall utilization.
Kubernetes manifests for each deployment specify both resource requests and limits for each pod. A node must have at least the requested resources available to schedule a pod there. Since each replica of a deployment requires the same resource requests, the total resource consumption depends on the number of replicas, which is configurable.
Additionally, a Horizontal Pod Autoscaler can automatically adjust the number of replicas based on resource utilization, within defined minimum and maximum bounds.
Because of this, the hardware requirements for deploying the software depend heavily on expected load, configuration, and cluster size. Nonetheless, there are some general recommendations for hardware selection.
See the System Requirements Guide for details about the recommended hardware, supported operating systems, and networking requirements.
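As a rough illustration of how resource requests factor into scheduling, the following command shows how much of a node's capacity is currently requested by the pods scheduled on it. The node name is a placeholder.
kubectl describe node <node-name> | grep -A 8 "Allocated resources"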
Installation Guide
The installation instructions can be found in the Installation Guide.
Configuration Reference
A detailed look at the configuration can be found in the Configuration Reference Guide.
2 - System Requirements Guide
Cluster Sizing
The ESB3027 AgileTV CDN Manager requires a minimum of three machines for production deployment. While it is possible to run the software on a single node in a lab environment, such a setup will not offer optimal performance or high availability.
A typical cluster comprises nodes assigned to either a Server or Agent role. Server nodes are responsible for running the control plane software, which manages the cluster, and they can also host application workloads if configured accordingly. Agent nodes, on the other hand, execute the application containers (workloads) but do not participate in the control plane or quorum. They serve to scale capacity as needed. See the Installation Guide for more information about the role types and responsibilities.
For high availability, it is essential to have an odd number of Server nodes. The minimum recommended is three, which allows the cluster to tolerate the loss of one server node. Increasing the Server nodes to five enhances resilience, enabling the cluster to withstand the loss of two server nodes. The critical factor is that more than half of the Server nodes are available; this quorum ensures the cluster remains operational. The loss of Agent nodes does not impact quorum, though workloads on failed nodes are automatically migrated if there is sufficient capacity.
Hardware Requirements
Single-Node Lab Cluster (Acceptance Testing)
For customer acceptance testing in a single-node lab environment, the following hardware is required. These requirements match the Lab Install Guide and are intended for non-production, single-node clusters only:
|  | CPU | Memory | Disk |
|---|---|---|---|
| Minimum | 8 Cores | 16GB | 128GB |
| Recommended | 12 Cores | 24GB | 128GB |
- Disk space should be available in the /var partition
Note: These requirements are for lab/acceptance testing only. For production workloads, see below.
Production Cluster (3 or More Nodes)
The following tables outline the minimum and recommended hardware specifications for different node
roles within a production cluster. All disk space values refer to the available space on the
/var/lib/longhorn partition. Additional capacity may be needed in other locations not specified
here; it is advisable to follow the operating system vendor’s recommendations for those areas. For
optimal performance, it is recommended to use SSDs or similar high-speed disks for Longhorn storage.
Both virtual machines and bare-metal hardware are supported; however, hosting multiple nodes under a
single hypervisor can impact performance.
Server Role - Control Plane only
|  | CPU | Memory | Disk |
|---|---|---|---|
| Minimum | 4 Cores | 8GB | 64GB |
| Recommended | 8 Cores | 16GB | 128GB |
- Disk space should be available in the /var partition
Agent Role
|  | CPU | Memory | Disk |
|---|---|---|---|
| Minimum | 8 Cores | 16GB | 128GB |
| Recommended | 16 Cores | 32GB | 256GB |
- Disk space should be available in the /var partition
Server Role - Control Plane + Workloads
|  | CPU | Memory | Disk |
|---|---|---|---|
| Minimum | 12 Cores | 24GB | 128GB |
| Recommended | 24 Cores | 48GB | 256GB |
- Disk space should be available in the /var partition
Operating System Requirements
| Operating System | Supported |
|---|---|
| RedHat 7 | No |
| RedHat 8 | Yes |
| RedHat 9 | Yes |
| RedHat 10 | Untested |
We currently support Red Hat Enterprise Linux or any compatible clone, such as Oracle Linux, AlmaLinux, etc., as long as the major version is listed as supported in the above table.
SELinux support will be installed if SELinux is “Enforcing” when installing the ESB3027 AgileTV CDN Manager cluster.
Networking Requirements
A minimum of one Network Interface Card must be present, and the node's default route must be configured through it when the cluster is installed. If the node does not have an interface carrying a default route, one must be configured. See the Installation Guide for details.
3 - Architecture Guide
Kubernetes Architecture
Kubernetes is an open-source container orchestration platform that simplifies the deployment, management, and scaling of containerized applications. It provides a robust framework to run applications reliably across a cluster of machines by abstracting the complexities of the underlying infrastructure. At its core, Kubernetes manages resources through various objects that define how applications are deployed and maintained.
Nodes are the physical or virtual machines that make up the Kubernetes cluster. Each node runs a container runtime, the kubelet agent, and other necessary components to host and manage containers. The smallest deployable units in Kubernetes are Pods, which typically consist of one or more containers sharing storage, network, and a specified way to run the containers. Containers within Pods are the actual runtime instances of the applications.
To manage the lifecycle of applications, Kubernetes offers different controllers such as Deployments and StatefulSets. Deployments are used for stateless applications, enabling easy rolling updates and scaling. StatefulSets, on the other hand, are designed for stateful applications that require persistent storage and stable network identities, like databases. Kubernetes also uses Services to provide a stable network endpoint that abstracts Pods, facilitating reliable communication within the application or from outside the cluster, often distributing traffic load across multiple Pods.
graph TD
subgraph Cluster
direction TB
Node1["Node"]
Node2["Node"]
end
subgraph "Workloads"
Deployment["Deployment (stateless)"]
StatefulSet["StatefulSet (stateful)"]
Pod1["Pod"]
Pod2["Pod"]
Container1["Container"]
Container2["Container"]
end
subgraph "Networking"
Service["Service"]
end
Node1 -->|Hosts| Pod1
Node2 -->|Hosts| Pod2
Deployment -->|Manages| Pod1
StatefulSet -->|Manages| Pod2
Pod1 -->|Contains| Container1
Pod2 -->|Contains| Container2
Service -->|Provides endpoint to| Pod1
Service -->|Provides endpoint to| Pod2

Additional Concepts
Both Deployments and StatefulSets can be scaled by adjusting the number of Pod replicas. In a Deployment, replicas are treated as identical clones of the Pod, and a Service typically load balances across them. In a StatefulSet, each replica is assigned a stable name following a pattern like <name>-<index>, for example, postgresql-0, postgresql-1, and so on.
Many applications use a fixed number of replicas set through Helm, which remains constant regardless of system load. Alternatively, for more dynamic scaling, a Horizontal Pod Autoscaler (HPA) can be used to automatically adjust the number of replicas between a defined minimum and maximum based on real-time load metrics. In public cloud environments, a Vertical Pod Autoscaler (VPA) may also be employed to dynamically adjust the resources requested by Pods, but since this feature is not supported in self-hosted setups and depends on the specific cloud provider's implementation, it is less commonly used in on-premises environments.
Architectural Diagram
graph TD
subgraph Cluster
direction TB
PostgreSQL[PostgreSQL Database]
Kafka[kafka-controller Pods]
Redis[Redis Master & Replicas]
VictoriaMetrics[VictoriaMetrics]
Prometheus[Prometheus Server]
Grafana[Grafana Dashboard]
Gateway[Nginx Gateway]
Confd[Confd]
Manager[ACD-Manager]
Frontend[MIB Frontend]
ZITADEL[Zitadel]
Telegraf[Telegraf]
AlertManager[Alertmanager]
end
PostgreSQL -->|Stores data| Manager
Kafka -->|Streams data| Manager
Redis -->|Cache / Message Broker| Manager
VictoriaMetrics -->|Billing data| Grafana
Prometheus -->|Billing data| VictoriaMetrics
Prometheus -->|Monitoring data| Grafana
Manager -->|Metrics & Monitoring| Prometheus
Manager -->|Alerting| AlertManager
Manager -->|User Interface| Frontend
Manager -->|Authentication| ZITADEL
Frontend -->|Authentication| Manager
Confd -->|Config Updates| Manager
Telegraf -->|System Metrics| Prometheus
Gateway -->|Proxies| Director[Director APIs]
style PostgreSQL fill:#f9f,stroke:#333,stroke-width:1px
style Kafka fill:#ccf,stroke:#333,stroke-width:1px
style Redis fill:#cfc,stroke:#333,stroke-width:1px
style VictoriaMetrics fill:#ffc,stroke:#333,stroke-width:1px
style Prometheus fill:#ccf,stroke:#333,stroke-width:1px
style Grafana fill:#f99,stroke:#333,stroke-width:1px
style Gateway fill:#eef,stroke:#333,stroke-width:1px
style Confd fill:#eef,stroke:#333,stroke-width:1px
style Manager fill:#eef,stroke:#333,stroke-width:1px
style Frontend fill:#eef,stroke:#333,stroke-width:1px
style ZITADEL fill:#eef,stroke:#333,stroke-width:1px
style Telegraf fill:#eef,stroke:#333,stroke-width:1px
style AlertManager fill:#eef,stroke:#333,stroke-width:1px

Cluster Scaling
Most components of the cluster can be horizontally scaled, as long as sufficient resources exist in the cluster to support the additional pods. There are a few exceptions, however. The Selection Input service currently does not support scaling, as the order of Kafka records would no longer be maintained among different consumer group members. Services such as PostgreSQL, Prometheus, and VictoriaMetrics also do not support scaling at the present time due to additional configuration requirements. Most, if not all, of the other services may be scaled, either by explicitly setting the number of replicas in the configuration or, in some cases, by enabling and configuring the Horizontal Pod Autoscaler.
The Horizontal Pod Autoscaler monitors the resource utilization of the Pods in a deployment and, based on configurable metrics, manages scaling between a preset minimum and maximum number of replicas. See the Configuration Guide for more information.
Kubernetes automatically selects which node will run each pod based on several factors, including the resource utilization of the nodes, any pod and node affinity rules, and selector labels. By default, all workload-capable nodes of both Server and Agent roles are considered unless specific node or pod affinity rules have been defined.
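As an illustration, the following commands show how scaling can be observed and adjusted. The replica count is an example value; manager.replicaCount is described in the Configuration Guide.
kubectl get hpa            # current autoscaler targets and replica counts, if any HPAs are enabled
kubectl get deployments    # desired versus ready replicas for each deployment
helm upgrade acd-manager /mnt/esb3027/helm/charts/acd-manager --values ~/values.yaml --set manager.replicaCount=3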
Summary
- The acd-manager interacts with core components like PostgreSQL, Kafka, and Redis for data storage, messaging, and caching.
- It exposes APIs via the API Gateway and integrates with Zitadel for authentication.
- Monitoring and alerting are handled through Prometheus, VictoriaMetrics, Grafana, and Alertmanager.
- Supporting services like Confd facilitate configuration management, while Telegraf collects system metrics.
4 - Quick Start Guide
Lab Install Guide
This section describes a simplified installation process for customer acceptance testing in a single-node lab environment. Unlike the production Quick Start Guide (which assumes 3 or more nodes), the Lab Install Guide is intended for customers to perform acceptance testing prior to installing a production environment.
System Requirements:
- RHEL 8 or 9 (or equivalent) with at least a minimal installation
- 8-core CPU
- 16 GB RAM
- 128 GB available disk space in the /var partition
Step 1: Mount the ISO
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027
Step 2: Install the Base Cluster Software
/mnt/esb3027/install
Step 3: (Air-gapped only) Mount the Extras ISO and Load Images
mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras
/mnt/esb3027-extras/load-images
Step 4: Deploy the Cluster Helm Chart
helm install --wait --timeout 10m acd-cluster /mnt/esb3027/helm/charts/acd-cluster
Step 5: Deploy the Manager Helm Chart
helm install acd-manager /mnt/esb3027/helm/charts/acd-manager --values ~/values.yaml --timeout 10m
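While the chart deploys, you can optionally watch the rollout; the installation is complete once every pod reports the same number on both sides of the READY column (see the full Installation Guide for example output).
kubectl get pods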
Step 6: Next Steps
See the Post Install Guide for post-installation steps and recommendations.
You can now access the manager and begin acceptance testing. For full configuration details, see the full Installation Guide.
Quick Start Guide
This section provides a concise, step-by-step summary for installing the ESB3027 AgileTV CDN Manager cluster in a production environment. The Quick Start Guide is intended for production deployments with three or more nodes, providing high availability and scalability. For full details, see the full Installation Guide.
Step 1: Mount the ISO
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027
Step 2: Install the Base Cluster Software
/mnt/esb3027/install
Step 3: (Air-gapped only) Mount the Extras ISO and Load Images
mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras
/mnt/esb3027-extras/load-images
Step 4: Fetch the Node Token
cat /var/lib/rancher/k3s/server/node-token
Step 5: Join Additional Nodes
On each additional node, repeat Step 1, then run:
/mnt/esb3027/join-server https://<primary-server-ip>:6443 <node-token>
# or for agent nodes:
/mnt/esb3027/join-agent https://<primary-server-ip>:6443 <node-token>
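Optionally, confirm from any server node that all nodes have joined and report the Ready status before continuing.
kubectl get nodes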
Step 6: Deploy the Cluster Helm Chart
helm install --wait --timeout 10m acd-cluster /mnt/esb3027/helm/charts/acd-cluster
Step 7: Deploy the Manager Helm Chart
helm install acd-manager /mnt/esb3027/helm/charts/acd-manager --values ~/values.yaml --timeout 10m
Step 8: Next Steps
See the Post Install Guide for post-installation steps and recommendations.
For configuration details and troubleshooting, see the full Installation Guide.
5 - Installation Guide
SELinux Requirements
SELinux is fully supported provided it is enabled and set to “Enforcing” mode at the time of the initial cluster installation on all Nodes. This is the default configuration for Red Hat Enterprise Linux and its derivatives, such as Oracle Linux and AlmaLinux. If the mode is set to “Enforcing” prior to install time, the necessary SELinux packages will be installed, and the cluster will be started with support for SELinux. For these reasons, enabling SELinux after the initial cluster installation is not supported.
Firewalld Requirements
Please see the Networking Guide for the current firewall recommendations.
Hardware Requirements
Refer to the System Requirements Guide for the current Hardware, Operating System, and Network Requirements.
Networking Requirements
A minimum of one Network Interface Card must be present, and the node's default route must be configured through it when the cluster is installed. If the node does not have an interface carrying a default route, one must be configured; even a black-hole route via a dummy interface will suffice. The K3s software requires a default route in order to auto-detect the node's primary IP and for cluster routing to function properly. To add a dummy default route, do the following:
ip link add dummy0 type dummy
ip link set dummy0 up
ip addr add 203.0.113.254/31 dev dummy0
ip route add default via 203.0.113.255 dev dummy0 metric 1000
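You can confirm that the node now has a default route before proceeding; the output will reflect whatever addresses were configured above.
ip route show default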
Special Considerations when using Multiple Network Interfaces
If there are special network considerations, such as using a non-default interface for
cluster communication, they must be configured using the INSTALL_K3S_EXEC environment
variable, as shown below, before installing the cluster or joining nodes.
As an example, consider the case where the node contains two interfaces, bond0 and bond1, where the
default route exists through bond0, but where bond1 should be used for cluster communication. In
that case, ensure that the INSTALL_K3S_EXEC environment variable is set as follows in the environment
prior to installing or joining the cluster. Assuming that bond1 has the local IP address 10.0.0.10:
export INSTALL_K3S_EXEC="<MODE> --node-ip 10.0.0.10 --flannel-iface=bond1"
Where MODE should be one of server or agent depending on the role of the node. The initial
node used to create the cluster MUST be server, and additional nodes vary depending on the
role.
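For example, on the first node of the cluster in the scenario above, the variable could be set as follows before running the installer; the IP address and interface name come from the example and will differ in your environment.
export INSTALL_K3S_EXEC="server --node-ip 10.0.0.10 --flannel-iface=bond1"
/mnt/esb3027/install
# On an agent node joining later, the variable would instead begin with "agent"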
Air-Gapped Environments
In air-gapped environments—those without direct Internet access—additional considerations are
required. First, on each node, the Operating System’s ISO must be mounted so that dnf can be
used to install essential packages included with the OS. Second, the “Extras” ISO from the
ESB3027 AgileTV CDN Manager must be mounted to provide access to container images for
third-party software that would otherwise be downloaded from public repositories. Details on
mounting this ISO and loading the included images are provided below.
Introduction
Installing the ESB3027 AgileTV CDN Manager for production requires a minimum of three nodes. More details about node roles and sizing can be found in the System Requirements Guide. Before beginning the installation, select one node as the primary "Server" node. This node will serve as the main installation point. Once additional Server nodes join the cluster, all Server nodes are considered equivalent, and cluster operations can be managed from any of them. The typical process involves installing the primary node as a Server, then adding more Server nodes to expand the cluster, followed by joining Agent nodes as needed to increase capacity.
Roles
All nodes in the cluster have one of two roles. Server nodes run the control-plane software necessary to manage the cluster and provide redundancy. Agent nodes do not run the control-plane software; instead, they are responsible for running the Pods that make up the applications. Jobs are distributed among agent nodes to enable horizontal scalability of workloads. However, agent nodes do not contribute to the cluster’s high availability. If an agent node fails, the Pods assigned to that node are automatically moved to another node, provided sufficient resources are available.
Control-plane only Server nodes
Both server nodes and agent nodes run workloads within the cluster. However, a special attribute called the “CriticalAddonsOnly” taint can be applied to server nodes. This taint prevents the node from scheduling workloads that are not part of the control plane. If the hardware allows, it is recommended to apply this taint to server nodes to separate their responsibilities. Doing so helps prevent misbehaving applications from negatively impacting the overall health of the cluster.
graph TD
subgraph Cluster
direction TB
ServerNodes[Server Nodes]
AgentNodes[Agent Nodes]
end
ServerNodes -->|Manage cluster and control plane| ControlPlane
ServerNodes -->|Provide redundancy| Redundancy
AgentNodes -->|Run application Pods| Pods
Pods -->|Handle workload distribution| Workloads
AgentNodes -->|Failover: Pods move if node fails| Pods
ServerNodes -->|Can run Pods unless tainted with CriticalAddonsOnly| PodExecution
Taint[CriticalAddonsOnly Taint] -->|Applied to server nodes to restrict workload| ServerNodes

For high availability, at least three nodes running the control plane are required, along with at least three nodes running workloads. These can be a combination of server and agent roles, provided that the control-plane nodes are sufficient. If a server node has the "CriticalAddonsOnly" taint applied, an additional agent node must be deployed to ensure workloads can run. For example, the cluster could consist of three untainted server nodes, or two untainted servers, one tainted server, and one agent, or three tainted servers and three agents—all while maintaining at least three control-plane nodes and three workload nodes.
The “CriticalAddonsOnly” taint can be applied to server nodes at any time after cluster installation. However, it only affects Pods scheduled in the future. Existing Pods that have already been assigned to a server node will remain there until they are recreated or rescheduled due to an external event.
kubectl taint nodes <node-name> CriticalAddonsOnly=true:NoSchedule
Where node-name is the hostname of the node for which to apply the taint. Multiple node names
may be specified in the same command. This command should only be run from one of the server nodes.
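To review or undo the taint, the standard kubectl commands can be used; the node name is a placeholder.
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Remove the taint again if required (note the trailing "-")
kubectl taint nodes <node-name> CriticalAddonsOnly=true:NoSchedule-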
Installing the Primary Server Node
Mount the ESB3027 ISO
Start by mounting the core ESB3027 ISO on the system. There are no limitations on the exact
mountpoint used, but for this document, we will assume /mnt/esb3027.
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027
Run the installer
Run the install command to install the base cluster software.
/mnt/esb3027/install
(Air-gapped only) Mount the “Extras” ISO and Load Container Images
In an air-gapped environment, after running the installer, the "Extras" ISO must be mounted. This ISO contains publicly available container images that would otherwise be downloaded from their source repositories.
mkdir -p /mnt/esb3027-extras
mount -o loop,ro esb3027-acd-manager-extras-X.Y.Z.iso /mnt/esb3027-extras
The public container images for third-party products such as Kafka, Redis, Zitadel, etc., need to be loaded into the container runtime. An embedded registry mirror is used to distribute these images to other nodes within the cluster, so this only needs to be performed on one machine.
/mnt/esb3027-extras/load-images
Fetch the primary node token
In order to join additional nodes into the cluster, a unique node token must be provided. This token is automatically generated on the primary node during the installation process. Retrieve this now, and take note of it for later use.
cat /var/lib/rancher/k3s/server/node-token
Join Additional Server Nodes
From each additional server node, mount the core ISO and join the cluster using the following commands.
mkdir -p /mnt/esb3027
mount esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027
Obtain the node token from the primary server, as you will need to include it in the following command. You will also need the URL of the primary server to connect to.
/mnt/esb3027/join-server https://primary-server-ip:6443 abcdefg0123456...987654321
Where primary-server-ip is replaced with the IP address this node should use to reach the
primary server, and abcdef...321 is the contents of the node-token retrieved from the primary server.
Repeat the above steps on each additional Server node in the cluster.
Join Agent Nodes
From each additional agent node, mount the core ISO and join the cluster using the following commands.
mkdir -p /mnt/esb3027
mount esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027
Obtain the node token from the primary server, as you will need to include it in the following command. You will also need the URL of the primary server to connect to.
/mnt/esb3027/join-agent https://primary-server-ip:6443 abcdefg0123456...987654321
Where primary-server-ip is replaced with the IP address this node should use to reach the
primary server, and abcdef...321 is the contents of the node-token retrieved from the primary server.
Repeat the above steps on each additional Agent node in the cluster.
Verify the state of the cluster
At this point, a generic Kubernetes cluster should have multiple nodes connected and be marked Ready. Verify this is the case by running the following from any one of the Server nodes.
kubectl get nodes
Each node in the cluster should be listed in the output with the status “Ready”, and the Server nodes should have “control-plane” in the listed Roles. If this is not the case, see the Troubleshooting Guide to help diagnose the problem.
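The output should look similar to the following illustrative example from a three-node cluster; node names, roles, ages, and versions will differ in your environment.
NAME       STATUS   ROLES                       AGE   VERSION
server-1   Ready    control-plane,etcd,master   15m   v1.30.4+k3s1
server-2   Ready    control-plane,etcd,master   12m   v1.30.4+k3s1
agent-1    Ready    <none>                      5m    v1.30.4+k3s1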
Deploy the cluster helm chart
The acd-cluster helm chart, which is included on the core ISO, contains the clustering software which
is required for self-hosted clusters, but may be optional in Cloud deployments. Currently this consists
of a PostgreSQL database server, but additional components may be added in later releases.
helm install --wait --timeout 10m acd-cluster /mnt/esb3027/helm/charts/acd-cluster
Deploying the Manager chart
The acd-manager helm chart is used to deploy the acd-manager application as well as any of the
third-party services on which the chart depends. Installing this chart requires at least a minimal
configuration to be applied. To get started, either copy the default values.yaml file from the chart
directory /mnt/esb3027/helm/charts/acd-manager/values.yaml or copy the following minimal template to a
writable location such as the user’s home directory.
global:
hosts:
manager:
- host: manager.local
routers:
- name: director-1
address: 192.0.2.1
- name: director-2
address: 192.0.2.2
zitadel:
zitadel:
configmapConfig:
ExternalDomain: manager.local
Where:
- manager.local is either the external IP or a resolvable DNS name used to access the manager's cluster.
- All director instances should be listed in the global.hosts.routers section. The name field is used in URLs, and must consist of only alphanumeric characters or '.', '-', or '_'.
Further details on the available configuration options in the default values.yaml file can be found in
the Configuration Guide.
You must set at a minimum the following properties:
| Property | Type | Description |
|---|---|---|
| global.hosts.manager | Array | List of external IP addresses or DNS hostnames for each node in the cluster |
| global.hosts.routers | Array | List of name and address for each instance of ESB3024 AgileTV CDN Director |
| zitadel.zitadel.configmapConfig.ExternalDomain | String | External DNS domain name or IP address of one manager node. This must match the first entry from global.hosts.manager |
Note! The Zitadel ExternalDomain must match the hostname or IP address given in the first
global.hosts.manager entry, and MUST match the Origin used when accessing Zitadel. This is enforced by
CORS.
Hint: In non-air-gapped environments where no DNS server is available, the third-party service
sslip.io may be used to provide a resolvable DNS name which can be used for both the
global.hosts.manager and Zitadel ExternalDomain entries. Any IP address written as
W.X.Y.Z.sslip.io will resolve to the IP W.X.Y.Z
Only the value used for Zitadel’s ExternalDomain may be used to access Zitadel due to CORS
restrictions. E.g. if that is set to “10.10.10.10.sslip.io”, then Zitadel must be accessed via the URL
https://10.10.10.10.sslip.io/ui/console. This must match the first entry in global.hosts.manager as
that entry will be used by internal services that need to interact with Zitadel, such as the frontend
GUI and the manager API services.
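As an illustrative example only, the following writes a minimal configuration using an sslip.io name; 10.10.10.10 stands in for the IP address of the first manager node, and director-1 at 192.0.2.1 for a Director instance.
cat > ~/values.yaml <<'EOF'
global:
  hosts:
    manager:
      - host: 10.10.10.10.sslip.io
    routers:
      - name: director-1
        address: 192.0.2.1
zitadel:
  zitadel:
    configmapConfig:
      ExternalDomain: 10.10.10.10.sslip.io
EOF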
Importing TLS Certificates
By default, the manager will generate a self-signed TLS certificate for use with the cluster ingress.
In production environments, it is recommended to use a valid TLS certificate issued by a trusted Certificate Authority (CA).
To install the TLS certificate pair into the ingress controller, the certificate and key must be saved in a Kubernetes secret. The simplest way of doing this is to let Helm generate the secret by including the PEM formatted certificate and private key directly in the configuration values. Alternatively, the secret can be created manually and simply referenced by the configuration.
Option 1: Let Helm manage the secret
To have Helm automatically manage the secret based on the PEM formatted certificate and key, add a record
to ingress.secrets as described in the following snippet.
ingress:
secrets:
- name: <secret-name>
key: |-
-----BEGIN RSA PRIVATE KEY-----
...
-----END RSA PRIVATE KEY-----
certificate: |-
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
Option 2: Manually creating the secret
To manually create the secret in Kubernetes, execute the following command, which will create a secret named "secret-name":
kubectl create secret tls secret-name --cert=tls.crt --key=tls.key
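You can verify that the secret was created and has the expected type before configuring the ingress.
kubectl get secret secret-name
# The TYPE column should read kubernetes.io/tls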
Configure the Ingress
The ingress controllers must be configured with the name of the secret holding the certificate and key. Additionally, the DNS hostname or IP address covered by the certificate, which must be used to access the ingress, must be set in the configuration.
ingress:
hostname: <dns-hostname>
tls: true
secretName: <secret-name>
zitadel:
ingress:
tls:
- hosts:
- <dns-hostname>
secretName: <secret-name>
confd:
ingress:
hostname: <dns-hostname>
tls: true
secretName: <secret-name>
mib-frontend:
ingress:
hostname: <dns-hostname>
tls: true
secretName: <secret-name>
- dns-hostname - A valid DNS hostname for the cluster which is covered by the certificate. For compatibility with Zitadel and CORS restrictions, this MUST be the same DNS hostname listed as the first entry in global.hosts.manager.
- secret-name - An arbitrary name used to identify the Kubernetes secret containing the TLS certificate and key. This has a maximum length limitation of 53 characters.
Loading Maxmind GeoIP databases
The Maxmind GeoIP databases are required if GeoIP lookups are to be performed by the manager. If this functionality is used, then Maxmind formatted GeoIP databases must be configured. The following databases are used by the manager.
- GeoIP2-City.mmdb - The City database.
- GeoLite2-ASN.mmdb - The ASN database.
- GeoIP2-Anonymous-IP.mmdb - The VPN and Anonymous IP database.
A helper utility has been provided on the ISO called generate-maxmind-volume that will prompt the user
for the locations of these 3 database files, and the name of a volume, which will be created in
Kubernetes. After running this command, set the manager.maxmindDbVolume property in the configuration
to the volume name.
To run the utility, use:
/mnt/esb3027/generate-maxmind-volume
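For example, if the volume was named maxmind-db (a placeholder; use the name you entered in the utility), the corresponding setting can be appended to the values file that will be passed to Helm.
cat >> ~/values.yaml <<'EOF'
manager:
  maxmindDbVolume: maxmind-db
EOF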
Installing the Chart
Install the acd-manager helm chart using the following command: (This assumes the configuration is in
~/values.yaml)
helm install acd-manager /mnt/esb3027/helm/charts/acd-manager --values ~/values.yaml --timeout 10m
By default, there is not expected to be much output from the helm install command itself. If you would
like to see more detailed information in real-time throughout the deployment process, you can add the
--debug flag to the command:
helm install acd-manager /mnt/esb3027/helm/charts/acd-manager --values ~/values.yaml --timeout 10m --debug
Note: The --timeout 10m flag increases the default Helm timeout from 5 minutes to 10 minutes. This is recommended because the default may not be sufficient on slower hardware or in resource-constrained environments. You may need to adjust the timeout value further depending on your system's performance or deployment conditions.
Monitor the chart rollout with the following command:
kubectl get pods
The output of which should look similar to the following:
NAME READY STATUS RESTARTS AGE
acd-cluster-postgresql-0 1/1 Running 0 44h
acd-manager-6c85ddd747-5j5gt 1/1 Running 0 43h
acd-manager-confd-558f49ffb5-n8dmr 1/1 Running 0 43h
acd-manager-gateway-7594479477-z4bbr 1/1 Running 0 43h
acd-manager-grafana-78c76d8c5-c2tl6 1/1 Running 0 43h
acd-manager-kafka-controller-0 2/2 Running 0 43h
acd-manager-kafka-controller-1 2/2 Running 0 43h
acd-manager-kafka-controller-2 2/2 Running 0 43h
acd-manager-metrics-aggregator-f6ff99654-tjbfs 1/1 Running 0 43h
acd-manager-mib-frontend-67678c69df-tkklr 1/1 Running 0 43h
acd-manager-prometheus-alertmanager-0 1/1 Running 0 43h
acd-manager-prometheus-server-768f5d5c-q78xb 1/1 Running 0 43h
acd-manager-redis-master-0 2/2 Running 0 43h
acd-manager-redis-replicas-0 2/2 Running 0 43h
acd-manager-selection-input-844599bc4d-x7dct 1/1 Running 0 43h
acd-manager-telegraf-585dfc5ff8-n8m5c 1/1 Running 0 43h
acd-manager-victoria-metrics-single-server-0 1/1 Running 0 43h
acd-manager-zitadel-69b6546f8f-v9lkp 1/1 Running 0 43h
acd-manager-zitadel-69b6546f8f-wwcmx 1/1 Running 0 43h
acd-manager-zitadel-init-hnr5p 0/1 Completed 0 43h
acd-manager-zitadel-setup-kjnwh 0/2 Completed 0 43h
The output contains a "READY" column, which shows the number of ready containers on the left and the total number of containers in the pod on the right. Pods with status "Completed" are one-time jobs that have terminated successfully and can be ignored in this output. For "Running" pods, the rollout is complete once every pod shows the same number on both sides of the "READY" column.
If a Pod is marked as "CrashLoopBackOff" or "Error", this means that either one of the containers in the pod has failed to deploy, or that the container has terminated in an error state. See the Troubleshooting Guide to help diagnose the problem. The Kubernetes cluster will retry failed pod deployments several times, and the number in the "RESTARTS" column shows how many times that has happened. If a pod restarts during the initial rollout, it may simply be that the state of the cluster was not yet what the pod expected, and this can be safely ignored. After the initial rollout has completed, the pods should stabilize; repeated restarts at that point may indicate that something is wrong. In that case, refer to the Troubleshooting Guide for more information.
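When diagnosing a failing pod, the following commands are usually the first step; the pod name is taken from the example output above and will differ in your cluster.
kubectl describe pod acd-manager-6c85ddd747-5j5gt              # events, scheduling decisions, and container states
kubectl logs acd-manager-6c85ddd747-5j5gt                      # current container logs
kubectl logs acd-manager-6c85ddd747-5j5gt --previous           # logs from the previous crashed run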
Next Steps
For post-installation steps, see the Post Install Guide.
6 - Configuration Guide
Overview
When deploying the acd-manager helm chart, a configuration file containing the chart values must
be supplied to Helm. The default values.yaml file can be found on the ISO in the chart’s directory.
Helm does not require that the complete file be supplied at install time, as any files supplied via the
--values command will be merged with the defaults from the chart. This allows the operator to maintain
a much simpler configuration file containing only the modified values. Additionally, values may be
individually overridden by passing --set key=value to the Helm command. However, this is discouraged for
all but temporary cases, as the same arguments must be specified any time the chart is updated.
The default values.yaml file is located on the ISO under the subpath /helm/charts/acd-manager/values.yaml
Since the ISO is mounted read-only, you must copy this file to a writable location to make changes. Helm
supports multiple --values arguments where all files will be merged left-to-right before being merged
with the chart defaults.
Applying the Configuration
After updating the configuration file, you must perform a helm upgrade for the changes to be propagated
to the cluster. Helm tracks the changes in each revision, and supports rolling back to previous configurations.
During the initial chart installation, the configuration values will be supplied to Helm through the helm install
command, but to update an existing installation, the following command line shall be used instead.
helm upgrade acd-manager /mnt/esb3027/helm/charts/acd-manager --values /path/to/values.yaml
Note: Both the helm install and helm upgrade commands take many of the same arguments, and a shortcut,
helm upgrade --install, can be used in place of either to update an existing installation or to deploy a
new one if none previously existed.
If the configuration update was unsuccessful, you can roll back to a previous revision using the following
command. Keep in mind, this will not change the values.yaml file on disk, so you must revert the changes
to that file manually, or restore the file from a backup.
helm rollback acd-manager <revision_number>
You can view the current revision number of all installed charts with helm list --all
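For example, the following commands show the installed releases and the revision history of the manager chart.
helm list --all
helm history acd-manager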
If you wish to temporarily change one or more values, for instance to increase the manager log level from “info”
to “debug”, you can do so with the --set command.
helm upgrade acd-manager /mnt/esb3027/helm/charts/acd-manager --values /path/to/values.yaml --set manager.logLevel=debug
It is also possible to split the values.yaml into multiple individual files, for instance to separate manager
and metrics values in two files using the following commands. All files will be merged left to right by Helm.
Take notice however, that doing this will require all values files to be supplied in the same order any time
a helm upgrade is performed in the future.
helm upgrade acd-manager /mnt/esb3027/helm/charts/acd-manager --values /path/to/values1.yaml --values /path/to/values2.yaml
Before applying new configuration, it is recommended to perform a dry-run to ensure that the templates
can be rendered properly. This does not guarantee that the templates will be accepted by Kubernetes, only
that the templates can be properly rendered using the supplied values. The rendered templates will be output
to the console.
helm upgrade ... --dry-run
In the event that the helm upgrade fails to produce the desired results, e.g. if the correct configuration
did not propagate to all required pods, performing a helm uninstall acd-manager followed by the original
helm install command will force all pods to be redeployed. This is service affecting, however, and should
only be performed as a last resort, as all pods will be destroyed and recreated.
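For reference, that last-resort sequence is shown below; adjust the values file path to match your installation.
helm uninstall acd-manager
helm install acd-manager /mnt/esb3027/helm/charts/acd-manager --values /path/to/values.yaml --timeout 10m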
Configuration Reference
In this section, we break down the configuration file and look more in-depth into the options available.
Globals
The global section is a special-case section in Helm, intended for sharing values between charts.
Most of the configuration properties here can be ignored, as they are intended as a means of globally
providing defaults that affect nested subcharts. The only required field here is the hosts configuration.
global:
hosts:
manager:
- host: manager.local
routers:
- name: default
address: 127.0.0.1
edns_proxy: []
geoip: []
| Key | Type | Description |
|---|---|---|
| global.hosts.manager | Array | List of external IP addresses or DNS hostnames for all nodes in the Manager cluster |
| global.hosts.routers | Array | List of ESB3024 AgileTV CDN Director instances |
| global.hosts.edns_proxy | Array | List of EDNS Proxy addresses |
| global.hosts.geoip | Array | List of GeoIP Proxy addresses |
The global.hosts.manager record contains a list of objects, each containing a single host field. The first
of these is used by several internal services to contact Zitadel for user authentication and authorization.
Since Zitadel enforces CORS protections, this must exactly match the Origin used to access Zitadel.
The global.hosts.routers record contains a list of objects, each with a name and an address field. The
name field is a unique identifier used in URLs to refer to the Director instance, and the address field
is the IP address or DNS name used to communicate with the Director node. Only Director instances running
outside of this cluster need to be specified here, as instances running in Kubernetes can utilize the
cluster's auto-discovery system.
The global.hosts.edns_proxy record contains a list of objects each with an address and port field. This
list is currently unused.
The global.hosts.geoip record contains a list of objects each with an address and port field. This list
should refer to the GeoIP Proxies used by the Frontend GUI. Currently only one GeoIP proxy is supported.
Common Parameters
This section contains common parameters that are namespaced to the acd-manager chart. These should be left at their default values under most circumstances.
| Key | Type | Description |
|---|---|---|
| kubeVersion | String | Override the Kubernetes version reported by .Capabilities |
| apiVersion | String | Override the Kubernetes API version reported by .Capabilities |
| nameOverride | String | Partially override common.names.name |
| fullnameOverride | String | Fully override common.names.name |
| namespaceOverride | String | Fully override common.names.namespace |
| commonLabels | Object | Labels to add to all deployed objects |
| commonAnnotations | Object | Annotations to add to all deployed objects |
| clusterDomain | String | Kubernetes cluster domain name |
| extraDeploy | Array | List of extra Kubernetes objects to deploy with the release |
| diagnosticMode.enabled | Boolean | Enable Diagnostic mode (All probes will be disabled and the command will be overridden) |
| diagnosticMode.command | Array | Override the command when diagnostic mode is enabled |
| diagnosticMode.args | Array | Override the command line arguments when diagnostic mode is enabled |
Manager
This section represents the configuration options for the ACD Manager’s API server.
| Key | Type | Description |
|---|---|---|
| manager.image.registry | String | The docker registry |
| manager.image.repository | String | The docker repository |
| manager.image.tag | String | Override the image tag |
| manager.image.digest | String | Override a specific image digest |
| manager.image.pullPolicy | String | The image pull policy |
| manager.image.pullSecrets | Array | A list of secret names containing credentials for the configured image registry |
| manager.image.debug | boolean | Enable debug mode for the containers |
| manager.logLevel | String | Set the log level used in the containers |
| manager.replicaCount | Number | Number of manager replicas to deploy. This value is ignored if the Horizontal Pod Autoscaler is enabled |
| manager.containerPorts.http | Number | Port number exposed by the container for HTTP traffic |
| manager.extraContainerPorts | Array | List of additional container ports to expose |
| manager.livenessProbe | Object | Configuration for the liveness probe on the manager container |
| manager.readinessProbe | Object | Configuration for the readiness probe on the manager container |
| manager.startupProbe | Object | Configuration for the startup probe on the manager container |
| manager.customLivenessProbe | Object | Override the default liveness probe |
| manager.customReadinessProbe | Object | Override the default readiness probe |
| manager.customStartupProbe | Object | Override the default startup probe |
| manager.resourcePreset | String | Set the manager resources according to one common preset |
| manager.resources | Object | Set request and limits for different resources like CPU or memory |
| manager.podSecurityContext | Object | Set the security context for the manager pods |
| manager.containerSecurityContext | Object | Set the security context for all containers inside the manager pods |
| manager.maxmindDbVolume | String | Name of a Kubernetes volume containing Maxmind GeoIP, ASN, and Anonymous IP databases |
| manager.existingConfigmap | String | Reserved for future use |
| manager.command | Array | Command executed inside the manager container |
| manager.args | Array | Arguments passed to the command |
| manager.automountServiceAccountToken | Boolean | Mount Service Account token in manager pods |
| manager.hostAliases | Array | Add additional entries to /etc/hosts in the pod |
| manager.deploymentAnnotations | Object | Annotations for the manager deployment |
| manager.podLabels | Object | Extra labels for manager pods |
| manager.podAnnotations | Object | Extra annotations for the manager pods |
| manager.podAffinityPreset | String | Allowed values soft or hard |
| manager.podAntiAffinityPreset | String | Allowed values soft or hard |
| manager.nodeAffinityPreset.type | String | Allowed values soft or hard |
| manager.nodeAffinityPreset.key | String | Node label key to match |
| manager.nodeAffinityPreset.values | Array | List of node labels to match |
| manager.affinity | Object | Override the affinity for pod assignments |
| manager.nodeSelector | Object | Node labels for manager pod assignments |
| manager.tolerations | Array | Tolerations for manager pod assignment |
| manager.updateStrategy.type | String | Can be set to RollingUpdate or Recreate |
| manager.priorityClassName | String | Manager pods’ priorityClassName |
| manager.topologySpreadConstraints | Array | Topology Spread Constraints for manager pod assignment spread across the cluster among failure-domains |
| manager.schedulerName | String | Name of the Kubernetes scheduler for manager pods |
| manager.terminationGracePeriodSeconds | Number | Seconds manager pods need to terminate gracefully |
| manager.lifecycleHooks | Object | Lifecycle Hooks for manager containers to automate configuration before or after startup |
| manager.extraEnvVars | Array | List of extra environment variables to add to the manager containers |
| manager.extraEnvVarsCM | Array | List of Config Maps containing extra environment variables to pass to the Manager pods |
| manager.extraEnvVarsSecret | Array | List of Secrets containing extra environment variables to pass to the Manager pods |
| manager.extraVolumes | Array | Optionally specify extra list of additional volumes for the manager pods |
| manager.extraVolumeMounts | Array | Optionally specify extra list of additional volume mounts for the manager pods |
| manager.sidecars | Array | Add additional sidecar containers to the manager pods |
| manager.initContainers | Array | Add additional init containers to the manager pods |
| manager.pdb.create | Boolean | Enable / disable a Pod Disruption Budget creation |
| manager.pdb.minAvailable | Number | Minimum number/percentage of pods that should remain scheduled |
| manager.pdb.maxUnavailable | Number | Maximum number/percentage of pods that may be made unavailable |
| manager.autoscaling.vpa | Object | Vertical Pod Autoscaler Configuration. Not used for self-hosted clusters |
| manager.autoscaling.hpa | Object | Horizontal Pod Autoscaler. Automatically scale the number of replicas based on resource utilization |
Gateway
The parameters under the gateway namespace are mostly identical to those of the manager section above,
but they affect the Nginx Proxy Gateway service. The additional properties are described in the following
table.
| Key | Type | Description |
|---|---|---|
| gateway.service.type | String | Service Type |
| gateway.service.ports.http | Number | The service port |
| gateway.service.nodePorts | Object | Allows configuring the exposed node port if the service.type is “NodePort” |
| gateway.service.clusterIP | String | Override the ClusterIP address if the service.type is “ClusterIP” |
| gateway.service.loadBalancerIP | String | Override the LoadBalancer IP address if the service.type is “LoadBalancer” |
| gateway.service.loadBalancerSourceRanges | Array | Source CIDRs for the LoadBalancer |
| gateway.service.externalTrafficPolicy | String | External Traffic Policy for the service |
| gateway.service.annotations | Object | Additional custom annotations for the manager service |
| gateway.service.extraPorts | Array | Extra ports to expose in the manager service. (Normally used with the sidecar value) |
| gateway.service.sessionAffinity | String | Control where client requests go, to the same pod or round-robin |
| gateway.service.sessionAffinityConfig | Object | Additional settings for the sessionAffinity |
Selection Input
The parameters under the selectionInput namespace are mostly identical to those of the manager section above,
but they affect the Selection Input consumer service. The additional properties are described in the
following table.
| Key | Type | Description |
|---|---|---|
| selectionInput.kafkaTopic | String | Name of the selection input kafka topic |
Metrics Aggregator
The parameters under the metricsAggregator namespace are mostly identical to those of the manager section above,
but they affect the Metrics Aggregator service.
Traffic Exposure
These parameters determine how the various services are exposed over the network.
| Key | Type | Description |
|---|---|---|
| service.type | String | Service Type |
| service.ports.http | Number | The service port |
| service.nodePorts | Object | Allows configuring the exposed node port if the service.type is “NodePort” |
| service.clusterIP | String | Override the ClusterIP address if the service.type is “ClusterIP” |
| service.loadBalancerIP | String | Override the LoadBalancer IP address if the service.type is “LoadBalancer” |
| service.loadBalancerSourceRanges | Array | Source CIDRs for the LoadBalancer |
| service.externalTrafficPolicy | String | External Traffic Policy for the service |
| service.annotations | Object | Additional custom annotations for the manager service |
| service.extraPorts | Array | Extra ports to expose in the manager service. (Normally used with the sidecar value) |
| service.sessionAffinity | String | Control where client requests go, to the same pod or round-robin |
| service.sessionAffinityConfig | Object | Additional settings for the sessionAffinity |
| networkPolicy.enabled | Boolean | Specifies whether a NetworkPolicy should be created |
| networkPolicy.allowExternal | Boolean | Doesn’t require server labels for connections |
| networkPolicy.allowExternalEgress | Boolean | Allow the pod to access any range of port and all destinations |
| networkPolicy.allowExternalClientAccess | Boolean | Allow access from pods with client label set to “true” |
| networkPolicy.extraIngress | Array | Add extra ingress rules to the Network Policy |
| networkPolicy.extraEgress | Array | Add extra egress rules to the Network Policy |
| networkPolicy.ingressPodMatchLabels | Object | Labels to match to allow traffic from other pods. |
| networkPolicy.ingressNSMatchLabels | Object | Labels to match to allow traffic from other namespaces. |
| networkPolicy.ingressNSPodMatchLabels | Object | Pod labels to match to allow traffic from other namespaces. |
| ingress.enabled | Boolean | Enable the ingress record generation for the manager |
| ingress.pathType | String | Ingress Path Type |
| ingress.apiVersion | String | Force Ingress API version |
| ingress.hostname | String | Match HOST header for the ingress record |
| ingress.ingressClassName | String | Ingress Class that will be used to implement the Ingress |
| ingress.path | String | Default path for the Ingress record |
| ingress.annotations | Object | Additional annotations for the Ingress resource. |
| ingress.tls | Boolean | Enable TLS configuration for the host defined at ingress.hostname |
| ingress.selfSigned | Boolean | Create a TLS secret for this ingress record using self-signed certificates generated by Helm |
| ingress.extraHosts | Array | An array with additional hostnames to be covered by the Ingress record. |
| ingress.extraPaths | Array | An array of extra path entries to be covered by the Ingress record. |
| ingress.extraTls | Array | TLS configuration for additional hostnames to be covered with this Ingress record. |
| ingress.secrets | Array | Custom TLS certificates as secrets |
| ingress.extraRules | Array | Additional rules to be covered with this Ingress record. |
Persistence
The following values control how persistent storage is used by the manager. Currently these have no effect, as the manager does not use any persistent volume claims; however, they are documented here because the same properties are used in several subcharts to configure persistence.
| Key | Type | Description |
|---|---|---|
| persistence.enabled | Boolean | Enable persistence using Persistent Volume Claims |
| persistence.mountPath | String | Path where to mount the volume |
| persistence.subPath | String | The subdirectory of the volume to mount |
| persistence.storageClass | String | Storage class of backing Persistent Volume Claim |
| persistence.annotations | Object | Persistent Volume Claim annotations |
| persistence.accessModes | Array | Persistent Volume Access Modes |
| persistence.size | String | Size of the data volume |
| persistence.dataSource | Object | Custom PVC data source |
| persistence.existingClaim | String | The name of an existing PVC to use for persistence |
| persistence.selector | Object | Selector to match existing Persistent Volume for data PVC |
Other Values
The following are additional parameters for the chart.
| Key | Type | Description |
|---|---|---|
| defaultInitContainers | Object | Configuration for default init containers. |
| rbac.create | Boolean | Specifies whether Role-Based Access Control Resources should be created. |
| rbac.rules | Object | Custom RBAC rules to apply |
| serviceAccount.create | Boolean | Specifies whether a ServiceAccount should be created |
| serviceAccount.name | String | Override the ServiceAccount name. If not set, a name will be generated automatically. |
| serviceAccount.annotations | Object | Additional Service Account annotations (evaluated as a template) |
| serviceAccount.automountServiceAccountToken | Boolean | Automount the service account token for the service account. |
| metrics.enabled | Boolean | Enable the export of Prometheus metrics. Not currently implemented |
| metrics.serviceMonitor.enabled | Boolean | If true, creates a Prometheus Operator ServiceMonitor |
| metrics.serviceMonitor.namespace | String | Namespace in which Prometheus is running |
| metrics.serviceMonitor.annotations | Object | Additional custom annotations for the ServiceMonitor |
| metrics.serviceMonitor.labels | Object | Extra labels for the ServiceMonitor |
| metrics.serviceMonitor.jobLabel | String | The name of the label on the target service to use as the job name in Prometheus |
| metrics.serviceMonitor.honorLabels | Boolean | Chooses the metric’s labels on collisions with target labels |
| metrics.serviceMonitor.tlsConfig | Object | TLS configuration used for scrape endpoints used by Prometheus |
| metrics.serviceMonitor.interval | Number | Interval at which metrics should be scraped. |
| metrics.serviceMonitor.scrapeTimeout | Number | Timeout after which the scrape is ended. |
| metrics.serviceMonitor.metricRelabelings | Array | Specify additional relabeling of metrics. |
| metrics.serviceMonitor.relabelings | Array | Specify general relabeling |
| metrics.serviceMonitor.selector | Object | Prometheus instance selector labels |
Sub-components
Confd
| Key | Type | Description |
|---|---|---|
| confd.enabled | Boolean | Enable the embedded Confd instance |
| confd.service.ports.internal | Number | Port number to use for internal communication with the Confd TCP socket |
MIB Frontend
There are many additional properties that can be configured for the MIB Frontend service which are not
specified in the configuration file. The mib-frontend Helm chart follows the same basic template
as the acd-manager chart, so documenting them all here would be unnecessarily repetitive. Virtually every
property in this chart can be configured under the mib-frontend namespace and be valid.
| Key | Type | Description |
|---|---|---|
| mib-frontend.enabled | Boolean | Enable the Configuration GUI |
| mib-frontend.frontend.resourcePreset | String | Use a preset resource configuration. |
| mib-frontend.frontend.resources | Object | Use custom resource configuration. |
| mib-frontend.frontend.autoscaling.hpa | Object | Horizontal Pod Autoscaler configuration for MIB Frontend component |
ACD Metrics
There are many additional properties that can be configured for the ACD Metrics service which are not
specified in the configuration file. The acd-metrics Helm chart follows the same basic template
as the acd-manager chart, as do each of its subcharts, so documenting them all here would be
unnecessarily repetitive. Virtually any property in this chart can be configured under the acd-metrics
namespace and be valid. For example, the resource preset for Grafana can be set via
acd-metrics.grafana.resourcePreset, and so on.
| Key | Type | Description |
|---|---|---|
| acd-metrics.enabled | Boolean | Enable the ACD Metrics components |
| acd-metrics.telegraf.enabled | Boolean | Enable the Telegraf Database component |
| acd-metrics.prometheus.enabled | Boolean | Enable the Prometheus Service Instance |
| acd-metrics.grafana.enabled | Boolean | Enable the Grafana Service Instance |
| acd-metrics.victoria-metrics-single.enabled | Boolean | Enable Victoria Metrics Service instance |
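For example, the resource preset mentioned above can be set in the values file as follows (a minimal sketch; "small" is just an illustrative preset from the Resource Configuration section below):
acd-metrics:
  enabled: true
  grafana:
    enabled: true
    resourcePreset: "small"   # illustrative preset value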
Zitadel
Zitadel does not follow the same template as many of the other services. Below is a list of Zitadel specific properties.
| Key | Type | Description |
|---|---|---|
| zitadel.enabled | Boolean | Enable the Zitadel instance |
| zitadel.replicaCount | Number | Number of replicas in the Zitadel deployment |
| zitadel.image.repository | String | The full name of the image registry and repository for the Zitadel container |
| zitadel.setupJob | Object | Configuration for the initial setup job to configure the database |
| zitadel.zitadel.masterkeySecretName | String | The name of an existing Kubernetes secret containing the Zitadel Masterkey |
| zitadel.zitadel.configmapConfig | Object | The Zitadel configuration. See Configuration Options in ZITADEL |
| zitadel.zitadel.configmapConfig.ExternalDomain | String | The external domain name or IP address to which all requests must be made. |
| zitadel.service | Object | Service configuration options for Zitadel |
| zitadel.ingress | Object | Traffic exposure parameters for Zitadel |
The zitadel.zitadel.configmapConfig.ExternalDomain MUST be configured with the same
value used as the first entry in global.hosts.manager. Cross-Origin Resource Sharing (CORS)
is enforced by Zitadel, and only the origin specified here is allowed to access Zitadel.
The first entry in the global.hosts.manager array is used by internal services, and if it
does not match, authentication requests will not be accepted.
For example, if the global.hosts.manager entries look like this:
global:
  hosts:
    manager:
      - host: foo.example.com
      - host: bar.example.com
The Zitadel ExternalDomain must be set to foo.example.com, and all requests to Zitadel
must use foo.example.com, e.g. https://foo.example.com/ui/console. Requests made to
bar.example.com will result in HTTP 404 errors.
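Continuing the example above, a values sketch that keeps the two settings in sync might look like this (foo.example.com and bar.example.com are the illustrative hostnames from above):
global:
  hosts:
    manager:
      - host: foo.example.com
      - host: bar.example.com
zitadel:
  zitadel:
    configmapConfig:
      ExternalDomain: foo.example.com   # must match the first global.hosts.manager entry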
Redis and Kafka
Both the redis and kafka subcharts follow the same basic structure as the acd-manager
chart, and the configurable values in each are nearly identical. Documenting the configuration
of these charts here would be unnecessarily redundant. However, the operator may wish to
adjust the resource configuration for these charts at the following locations:
| Key | Type | Description |
|---|---|---|
| redis.master.resources | Object | Resource configuration for the Redis master instance |
| redis.replica.resources | Object | Resource configuration for the Redis read-only replica instances |
| redis.replica.replicaCount | Number | Number of Read-only Redis replica instances |
| kafka.controller.resources | Object | Resource configuration for the Kafka controller |
| kafka.controller.replicaCount | Number | Number of Kafka controller replica instances to deploy |
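As a sketch, the values below adjust the Redis replica count and give the Kafka controllers an explicit resource block; the numbers are purely illustrative and should be sized for your environment (the resources schema is described in the next section):
redis:
  replica:
    replicaCount: 2            # illustrative number of read-only replicas
kafka:
  controller:
    replicaCount: 3            # illustrative number of Kafka controllers
    resources:
      requests:
        cpu: "500m"
        memory: "1024Mi"
      limits:
        cpu: "1000m"
        memory: "2048Mi"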
Resource Configuration
All resource configuration blocks follow the same basic schema which is defined here.
| Key | Type | Description |
|---|---|---|
| resources.limits.cpu | String | The maximum CPU which can be consumed before the Pod is terminated. |
| resources.limits.memory | String | The maximum amount of memory the pod may consume before being killed. |
| resources.limits.ephemeral-storage | String | The maximum amount of storage a pod may consume |
| resources.requests.cpu | String | The minimum available CPU cores for each Pod to be assigned to a node. |
| resources.requests.memory | String | The minimum available Free Memory on a node for a pod to be assigned. |
| resources.requests.ephemeral-storage | String | The minimum amount of storage a pod requires to be assigned to a node. |
CPU values are specified in millicores, i.e. units of 1/1000 of a CPU: “1000m” represents 1 core and “250m” is 1/4 of a core. Memory and storage values are specified with binary (IEC) suffixes, e.g. “250Mi” is 250 MiB and “3Gi” is 3 GiB.
Most services also include a resourcePreset value which is a simple String representing
some common configurations.
The presets are as follows:
| Preset | Request CPU | Request Memory | Request Storage | Limit CPU | Limit Memory | Limit Storage |
|---|---|---|---|---|---|---|
| nano | 100m | 128Mi | 50Mi | 150m | 192Mi | 2Gi |
| micro | 250m | 256Mi | 50Mi | 375m | 384Mi | 2Gi |
| small | 500m | 512Mi | 50Mi | 750m | 768Mi | 2Gi |
| medium | 500m | 1024Mi | 50Mi | 750m | 1536Mi | 2Gi |
| large | 1.0 | 2048Mi | 50Mi | 1.5 | 3072Mi | 2Gi |
| xlarge | 1.0 | 3072Mi | 50Mi | 3.0 | 6144Mi | 2Gi |
| 2xlarge | 1.0 | 3072Mi | 50Mi | 6.0 | 12288Mi | 2Gi |
When considering resource requests vs. limits, the request values should represent the minimum resource usage necessary to run the service, while the limits represent the maximum resources each pod in the deployment is allowed to consume. Requests and limits apply per pod, so a service using the “large” preset with 3 replicas needs a minimum of 3 full cores and 6 GiB of available memory to start, and may consume up to 4.5 cores and 9 GiB of memory across all nodes in the cluster.
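For reference, an explicit resources block roughly equivalent to the “medium” preset in the table above looks like this:
resources:
  requests:
    cpu: "500m"
    memory: "1024Mi"
    ephemeral-storage: "50Mi"
  limits:
    cpu: "750m"
    memory: "1536Mi"
    ephemeral-storage: "2Gi"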
Security Contexts
Most charts used in the deployment contain configuration for both Pod and Container security contexts. Below is additional information about the parameters therein.
| Key | Type | Description |
|---|---|---|
| podSecurityContext.enabled | Boolean | Enable the Pod Security Context |
| podSecurityContext.fsGroupChangePolicy | String | Set filesystem group change policy for the nodes |
| podSecurityContext.sysctls | Array | Set kernel settings using sysctl interface for the pods |
| podSecurityContext.supplementalGroups | Array | Set filesystem extra groups for the pods |
| podSecurityContext.fsGroup | Number | Set Filesystem Group ID for the pods |
| containerSecurityContext.enabled | Boolean | Enable the container security context |
| containerSecurityContext.seLinuxOptions | Object | Set SELinux options for each container in the Pod |
| containerSecurityContext.runAsUser | Number | Set runAsUser in the containers Security Context |
| containerSecurityContext.runAsGroup | Number | Set runAsGroup in the containers Security Context |
| containerSecurityContext.runAsNonRoot | Boolean | Set runAsNonRoot in the containers Security Context |
| containerSecurityContext.readOnlyRootFilesystem | Boolean | Set readOnlyRootFilesystem in the containers Security Context |
| containerSecurityContext.privileged | Boolean | Set privileged in the container Security Context |
| containerSecurityContext.allowPrivilegeEscalation | Boolean | Set allowPrivilegeEscalation in the container’s security context |
| containerSecurityContext.capabilities.drop | Array | List of capabilities to be dropped in the container |
| containerSecurityContext.seccompProfile.type | String | Set seccomp profile in the container |
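As an illustrative sketch (these values are examples, not the chart defaults), a hardened container security context could be expressed in the values file like this:
containerSecurityContext:
  enabled: true
  runAsNonRoot: true
  runAsUser: 1001                  # illustrative non-root UID
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  seccompProfile:
    type: RuntimeDefault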
Probe Configuration
Each Pod uses healthcheck probes to determine its readiness. Three probe types are defined: startupProbe, readinessProbe, and livenessProbe. They all accept exactly the same configuration options; the only difference between the probe types is when they are executed.
Liveness Probe: Checks if the container is running. If this probe fails, Kubernetes restarts the container, assuming it is stuck or unhealthy.
Readiness Probe: Determines if the container is ready to accept traffic. If it fails, the container is removed from the service load balancer until it becomes ready again.
Startup Probe: Used during container startup to determine if the application has started successfully. It helps to prevent the liveness probe from killing a container that is still starting up.
The following table describes each of these properties:
| Property | Description |
|---|---|
| enabled | Determines whether the probe is active (true) or disabled (false). |
| initialDelaySeconds | Time in seconds to wait after the container starts before performing the first probe. |
| periodSeconds | How often (in seconds) to perform the probe. |
| timeoutSeconds | Number of seconds to wait for a probe response before considering it a failure. |
| failureThreshold | Number of consecutive failed probes before considering the container unhealthy (for liveness) or unavailable (for readiness). |
| successThreshold | Number of consecutive successful probes required to consider the container healthy or ready (usually 1). |
| httpGet | Specifies that the probe performs an HTTP GET request to check container health. |
| httpGet.path | The URL path to request during the HTTP GET probe. |
| httpGet.port | The port number or name where the HTTP GET request is sent. |
| exec | Specifies that the probe runs the specified command inside the container and expects a successful exit code to indicate health. |
| exec.command | An array of strings representing the command to run |
Only one of httpGet or exec may be specified in a single probe. These configurations are mutually exclusive.
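To illustrate the schema, a readiness probe using an HTTP GET check might be configured like this (the path and port shown are placeholders; the actual defaults are defined per chart):
readinessProbe:
  enabled: true
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1
  httpGet:
    path: /api/v1/health/ready   # placeholder; see the Healthcheck API guide
    port: http                   # placeholder port name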
7 - Networking
Port Usage
The following table describes the minimal firewall setup required between each node in the cluster for the Kubernetes cluster to function properly. Unless otherwise specified, these rules must allow traffic to pass between any nodes in the cluster.
| Protocol | Port | Source | Destination | Description |
|---|---|---|---|---|
| TCP | 2379-2380 | Server | Server | Etcd Service |
| TCP | 6443 | Any | Server | K3s Supervisor and Kubernetes API Server |
| UDP | 8472 | Any | Any | Flannel VXLAN |
| TCP | 10250 | Any | Any | Kubelet Metrics |
| TCP | 5001 | Any | Server | Spegel Registry Mirror |
| TCP | 9500 | Any | Any | Longhorn Management API |
| TCP | 8500 | Any | Any | Longhorn Agent |
| Any | N/A | 10.42.0.0/16 | Any | K3s Pods |
| Any | N/A | 10.43.0.0/16 | Any | K3s Services |
| TCP | 80 | Any | Any | Optional Ingress HTTP traffic |
| TCP | 443 | Any | Any | Ingress HTTPS Traffic |
The following table describes the required ports which must be allowed through any firewalls for the manager application. Access to these ports must be allowed from any client which requires access to these services towards any node in the cluster.
| Protocol | Port | Description |
|---|---|---|
| TCP | 443 | Ingress HTTPS Traffic |
| TCP | 3000 | Grafana |
| TCP | 9095 | Kafka |
| TCP | 9093 | Alertmanager |
| TCP | 9090 | Prometheus |
| TCP | 6379 | Redis |
Note: Port 443 appears in both of the above tables. It is used by the internal applications running within the cluster to access Zitadel, so all nodes in the cluster must have access to that port, and it is also used to provide ingress services from outside the cluster for multiple applications.
Firewall Rules
What follows is an example script that can be used to open the required ports using
firewalld. Adjust the commands as necessary to fit the environment.
# Allow Kubernetes cluster ports (between nodes)
firewall-cmd --permanent --add-port=2379-2380/tcp
firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=8472/udp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=5001/tcp
firewall-cmd --permanent --add-port=9500/tcp
firewall-cmd --permanent --add-port=8500/tcp
# Allow all traffic from specific subnets for K3s pods/services
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.42.0.0/16" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.43.0.0/16" accept'
# Allow optional ingress HTTP/HTTPS traffic
firewall-cmd --permanent --add-port=80/tcp
firewall-cmd --permanent --add-port=443/tcp
# Allow ports for the manager application (from anywhere)
firewall-cmd --permanent --add-port=443/tcp
firewall-cmd --permanent --add-port=3000/tcp
firewall-cmd --permanent --add-port=9095/tcp
firewall-cmd --permanent --add-port=9093/tcp
firewall-cmd --permanent --add-port=9090/tcp
firewall-cmd --permanent --add-port=6379/tcp
# Reload firewalld to apply changes
firewall-cmd --reload
IP Routing
Proper IP routing is critical for cluster communication. The network must allow nodes to route traffic to each other’s pod CIDRs (e.g., 10.42.0.0/16, 10.43.0.0/16) and external clients to reach ingress and services. Verify that your network infrastructure permits routing between these subnets; otherwise, nodes may not communicate properly, impacting cluster functionality.
Handling Multiple Zones with Kubernetes Interfaces
Kubernetes creates virtual network interfaces for pods within the node’s network namespace. These interfaces are
typically not associated with any specific firewalld zone by default. Firewalld applies rules to the primary
physical interface (such as eth0), not directly to the pod interfaces.
8 - Storage Guide
Overview
Longhorn is an open-source distributed block storage system designed specifically for Kubernetes. It provides persistent storage for stateful applications by creating and managing storage volumes that are replicated across multiple nodes to ensure high availability. Longhorn integrates seamlessly with Kubernetes, allowing users to dynamically provision, attach, and manage persistent disks through standard Kubernetes PersistentVolumeClaims (PVCs).
Longhorn deploys a set of controller and replica engines as containers on each node, forming a distributed storage system. When a volume is created, Longhorn replicates data across multiple nodes, ensuring durability even in the event of node failures. The system also handles snapshots, backups, and restores, offering robust data protection. Kubernetes automatically mounts these volumes into Pods, providing persistent storage for stateful applications to operate reliably.
graph TD
subgraph Cluster Nodes
Node1["Node 1"]
Node2["Node 2"]
Node3["Node 3"]
end
subgraph Longhorn Components
Controller["Longhorn Controller"]
Replica1["Replica (Node 1)"]
Replica2["Replica (Node 2)"]
Replica3["Replica (Node 3)"]
end
subgraph Storage Volume
Volume["Persistent Volume"]
end
Node1 -->|Runs| Replica1
Node2 -->|Runs| Replica2
Node3 -->|Runs| Replica3
Controller -->|Manages| Volume
Replica1 & Replica2 & Replica3 -->|Replicate Data| Volume
Accessing the configuration GUI
Longhorn provides a web-based frontend for managing storage configurations across the Kubernetes cluster. This UI allows users to configure various aspects of the storage engine, such as the number of replicas, backup settings, snapshot management, and more.
Since this frontend does not include any authentication mechanisms and improper use could lead to significant data loss, access is restricted. To securely access the UI, a manual port-forward must be established.
You can set up a temporary connection to the Longhorn frontend using the following
kubectl port-forward command:
kubectl port-forward -n longhorn-system --address 0.0.0.0 svc/longhorn-frontend 8888:80
This command forwards local port 8888 to the Longhorn frontend service in the cluster. You can then access the UI by navigating to:
http://k3s-server:8888
This connection remains active as long as the port-forward command is running. To stop it, simply press
Ctrl+C. Make sure to run this command only when needed, and avoid leaving the UI accessible without
proper authentication.
9 - Metrics and Monitoring
The ESB3027 AgileTV CDN Manager includes a built-in metrics and monitoring solution based on Telegraf, Prometheus, and Grafana. A set of default Grafana dashboards provides visibility into CDN performance, displaying host metrics such as CPU, memory, network, and disk utilization—collected from the Director and Cache nodes via Telegraf—as well as streaming metrics from each Director instance. These metrics are stored in a Time-Series Database and visualized through Grafana dashboards. Additionally, the system supports custom dashboards using Prometheus as a data source, offering flexibility for customers to monitor all aspects of the CDN according to their specific needs.
Accessing Grafana
To access Grafana, point a browser towards any node in the cluster on port 3000, e.g. http://manager.local:3000/, and log in using the default administrator account credentials listed below.
Known Limitation: Grafana does not currently support Single-Sign-On (SSO) using Zitadel accounts.
Username: admin
Password: edgeware
On the left column, click Dashboards and select the Dashboard you wish to view.
Custom Dashboards
The Grafana instance uses persistent storage within the cluster. Any custom dashboards or modifications to existing dashboards are saved in the persistent storage volume and will persist across software upgrades.
Billing and Licensing
A separate VictoriaMetrics Time-Series Database is included within the metrics component of the manager. It periodically scrapes usage data from Prometheus to calculate aggregated statistics and verify license compliance. This data is retained for at least one year. Grafana can also use this database as a source to display long-term usage metrics.
10 - Operations Guide
Overview
This guide details some of the common commands that are necessary to operate the ESB3027 AgileTV CDN Manager software. Before starting, you will need at least a basic understanding of the command-line tooling described in the following sections.
Getting and Describing Kubernetes Resources
The two most common commands in Kubernetes are get and describe for a specific resource
such as a Pod or Service. Using kubectl get typically lists all resources of a particular
type; for example, kubectl get pods will display all pods in the current namespace. To obtain
more detailed information about a specific resource, use kubectl describe <resource>, such as
kubectl describe pod postgresql-0 to view details about that particular pod.
When describing a pod, the output includes a recent Event history at the bottom. This can be extremely helpful for troubleshooting issues, such as why a pod failed to deploy or was restarted. However, keep in mind that this event history only reflects the most recent events from the past few hours, so it may not provide insights into problems that occurred days or weeks ago.
Obtaining Logs
Each Pod maintains its own logs for each container. To fetch the logs of a specific pod, use
kubectl logs <pod_name>. Adding the -f flag will stream the logs in follow mode, allowing
real-time monitoring. If a pod contains multiple containers, by default, only the logs from the
primary container are shown. To view logs from a different container within the same pod, use
the -c <container_name> flag.
Since each pod maintains its own logs, retrieving logs from all replicas of a Deployment or StatefulSet may be necessary to get a complete view. You can use label selectors to collect logs from all pods associated with the same application. For example, to fetch logs from all pods belonging to the “acd-manager” deployment, run:
kubectl logs -l app.kubernetes.io/name=acd-manager
To find the labels associated with a specific Deployment or ReplicaSet, describe the resource and look for the “Labels” field.
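For example, assuming the Deployment is named acd-manager, either of the following commands will show its labels:
kubectl describe deployment acd-manager
kubectl get deployment acd-manager --show-labels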
The following table describes the common labels currently used by deployments in the cluster.
Component Labels
| Label (key=value) | Description |
|---|---|
| app.kubernetes.io/component=manager | Identifies the ACD Manager service |
| app.kubernetes.io/component=confd | Identifies the confd service |
| app.kubernetes.io/component=frontend | Identifies the GUI (frontend) service |
| app.kubernetes.io/component=gateway | Identifies the API gateway service |
| app.kubernetes.io/component=grafana | Identifies the Grafana monitoring service |
| app.kubernetes.io/component=metrics-aggregator | Identifies the metrics aggregator service |
| app.kubernetes.io/component=mib-frontend | Identifies the MIB frontend service |
| app.kubernetes.io/component=server | Identifies the Prometheus server component |
| app.kubernetes.io/component=selection-input | Identifies the selection input service |
| app.kubernetes.io/component=start | Identifies the Zitadel startup/init component |
| app.kubernetes.io/component=primary | Identifies the PostgreSQL primary node |
| app.kubernetes.io/component=controller-eligible | Identifies the Kafka controller-eligible node |
| app.kubernetes.io/component=alertmanager | Identifies the Prometheus Alertmanager |
| app.kubernetes.io/component=master | Identifies the Redis master node |
| app.kubernetes.io/component=replica | Identifies the Redis replica node |
Instance, Name, and Part-of Labels
| Label (key=value) | Description |
|---|---|
| app.kubernetes.io/instance=acd-manager | Helm release instance name (acd-manager) |
| app.kubernetes.io/instance=acd-cluster | Helm release instance name (acd-cluster) |
| app.kubernetes.io/name=acd-manager | Resource name: acd-manager |
| app.kubernetes.io/name=confd | Resource name: confd |
| app.kubernetes.io/name=grafana | Resource name: grafana |
| app.kubernetes.io/name=mib-frontend | Resource name: mib-frontend |
| app.kubernetes.io/name=prometheus | Resource name: prometheus |
| app.kubernetes.io/name=telegraf | Resource name: telegraf |
| app.kubernetes.io/name=zitadel | Resource name: zitadel |
| app.kubernetes.io/name=postgresql | Resource name: postgresql |
| app.kubernetes.io/name=kafka | Resource name: kafka |
| app.kubernetes.io/name=redis | Resource name: redis |
| app.kubernetes.io/name=victoria-metrics-single | Resource name: victoria-metrics-single |
| app.kubernetes.io/part-of=prometheus | Part of the Prometheus stack |
| app.kubernetes.io/part-of=kafka | Part of the Kafka stack |
Restarting a Pod
Since Kubernetes maintains a fixed number of replicas for each Deployment or ReplicaSet, deleting a
pod will cause Kubernetes to immediately recreate it, effectively restarting the pod. For example,
to restart the pod acd-manager-6c85ddd747-5j5gt, run:
kubectl delete pod acd-manager-6c85ddd747-5j5gt
Kubernetes will automatically detach that pod from any associated Service, preventing new connections from reaching it. It then spawns a new instance, which goes through startup, liveness, and readiness probes. Once the new pod passes the readiness probes and is marked as ready, the Service will start forwarding new traffic to it.
If multiple replicas are running, traffic will be distributed among the existing pods while the new pod is initializing, ensuring a seamless, zero-downtime operation.
Stopping and Starting a Deployment
Unlike traditional services, Kubernetes does not have a concept of stopping a service directly. Instead, you can temporarily scale a Deployment to zero replicas, which has the same effect.
For example, to stop the acd-manager Deployment, run:
kubectl scale deployment acd-manager --replicas=0
To restart it later, scale the deployment back to its original number of replicas, e.g.,
kubectl scale deployment acd-manager --replicas=1
If you want to perform a simple restart of all pods within a deployment, you can delete all pods with a specific label, and Kubernetes will automatically recreate them. For example, to restart all pods with the component label “manager,” use:
kubectl delete pod -l app.kubernetes.io/component=manager
This command causes Kubernetes to delete all matching pods, which are then recreated, effectively restarting the service without changing the deployment configuration.
Running commands inside a pod
Sometimes it is necessary to run a command inside an existing Pod, such as obtaining a bash shell.
The kubectl exec -it <podname> -- <command> command can be used to do just that. Assuming we need to
run the confcli tool inside the confd pod acd-manager-confd-558f49ffb5-n8dmr, that can be accomplished
using the following command:
kubectl exec -it acd-manager-confd-558f49ffb5-n8dmr -- /usr/bin/python3.11 /usr/local/bin/confcli
Note: The confd container does not have a shell, so specifying the python interpreter is necessary on this image.
Monitoring resource usage
Kubernetes includes an internal metrics API which can give some insight into the resource usage of the Pods and of the Nodes.
To list the current usage of the Pods in the cluster issue the following:
kubectl top pods
This will give output similar to the following:
NAME CPU(cores) MEMORY(bytes)
acd-cluster-postgresql-0 3m 44Mi
acd-manager-6c85ddd747-rdlg6 4m 15Mi
acd-manager-confd-558f49ffb5-n8dmr 1m 47Mi
acd-manager-gateway-7594479477-z4bbr 0m 10Mi
acd-manager-grafana-78c76d8c5-c2tl6 18m 144Mi
acd-manager-kafka-controller-0 19m 763Mi
acd-manager-kafka-controller-1 19m 967Mi
acd-manager-kafka-controller-2 25m 1127Mi
acd-manager-metrics-aggregator-f6ff99654-tjbfs 4m 2Mi
acd-manager-mib-frontend-67678c69df-tkklr 1m 26Mi
acd-manager-prometheus-alertmanager-0 2m 25Mi
acd-manager-prometheus-server-768f5d5c-q78xb 5m 53Mi
acd-manager-redis-master-0 12m 18Mi
acd-manager-redis-replicas-0 15m 14Mi
acd-manager-selection-input-844599bc4d-x7dct 3m 3Mi
acd-manager-telegraf-585dfc5ff8-n8m5c 1m 27Mi
acd-manager-victoria-metrics-single-server-0 2m 10Mi
acd-manager-zitadel-69b6546f8f-v9lkp 1m 76Mi
acd-manager-zitadel-69b6546f8f-wwcmx 1m 72Mi
Querying the metrics API for the nodes gives the aggregated totals for each node:
kubectl top nodes
Yields output similar to the following:
NAME CPU(cores) CPU(%) MEMORY(bytes) MEMORY(%)
k3d-local-agent-0 118m 0% 1698Mi 21%
k3d-local-agent-1 120m 0% 661Mi 8%
k3d-local-agent-2 84m 0% 1054Mi 13%
k3d-local-server-0 115m 0% 1959Mi 25%
Taking a node out of service
To temporarily take a node out of service for maintenance, you can do so with minimal downtime, provided there are enough resources on other nodes in the cluster to handle the pods from the target node.
Step 1: Cordon the node.
This prevents new pods from being scheduled on the node:
kubectl cordon <node-name>
Step 2: Drain the node.
This moves existing pods off the node, respecting DaemonSets and local data:
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
- The --ignore-daemonsets flag skips DaemonSet-managed pods, which are typically managed separately.
- The --delete-local-data flag removes any local ephemeral data stored on the node.
Once drained, the node is effectively out of service.
To bring the node back into service:
Uncordon the node with:
kubectl uncordon <node-name>
This allows Kubernetes to schedule new pods on the node. It won’t automatically move existing pods back; you may need to manually restart or reschedule pods if desired. Since the node now has more available resources, Kubernetes will attempt to schedule new pods there to balance the load across the cluster.
Backup and restore of persistent volumes
The Longhorn storage driver, which provides the persistent storage used in the cluster, (See the Storage Guide for more details) provides built-in mechanisms for backup, restore, and snapshotting volumes. This can be performed entirely from within the Longhorn WebUI. See the relevant section of the Storage Guide for details on accessing that UI, since it requires setting up a port forward, which is described there.
See the relevant Longhorn Documentation for how to configure Longhorn and to manage Snapshotting and Backup and Restore.
11 - Post Installation Guide
After installing the cluster, there are a few steps that should be taken to complete the setup.
Create an Admin User
The ESB3027 AgileTV CDN Manager ships with a default user account, but this account is only intended as a way to log in and create an actual user. Attempting to authenticate to other services, such as the MIB Frontend Configuration GUI, may not work using this pre-provisioned account.
You will need the IP address or DNS name specified in the configuration as both the first manager host and the Zitadel External Domain.
global:
  hosts:
    manager:
      - host: manager.local
Using a web browser, connect to the following URL, replacing manager.local with the IP or DNS name
from the configuration above:
https://manager.local/ui/console
You must authenticate using the default credentials:
Username: admin@agiletv.dev
Password: Password1!
It will ask you to set up Multi-Factor Authentication; however, you MUST skip this step for now, as it is not currently supported everywhere in the manager’s APIs.
On the menu bar at the top of the screen, click “Users” and proceed to create a New User. Enter the required information and, for now, ensure the “Email Verified” and “Set Initial Password” boxes are checked. Zitadel will attempt to send a confirmation email if the “Email Verified” box is not checked; however, on initial installation, the SMTP server details have not been configured.
You should now be able to authenticate to the MIB Frontend GUI at https://manager.local/gui using
the credentials for the new user.
Configure an SMTP Server
Zitadel requires an SMTP server to be configured in order to send validation emails and support
communication with users for password resets, etc. If you have an SMTP server, you can configure
it by logging back into the Zitadel Web UI at https://manager.local/ui/console, clicking on
“Default Settings” at the top of the page, and configuring the SMTP provider from the menu on the
left. After this has been performed, if a new user account is created, an email will be sent to
the configured email address with a verification link, which must be clicked before the account
will be valid.
12 - Releases
12.1 - Release esb3027-1.4.0
Build date
2025-10-23
Release status
Type: production
Included components
- ACD Configuration GUI 2.3.9
Compatibility
This release has been tested with the following product versions:
- AgileTV CDN Director, ESB3024-1.22.0
Breaking changes from previous release
A full installation is required for this version
If the field confd.confd.image.tag is set in the present configuration file, it must be removed or updated before upgrading
Change log
- NEW: Monitoring and Metrics support [ESB3027-17]
- NEW: Support for horizontal scaling [ESB3027-63]
- NEW: Deploy GUI container with Manager [ESB3027-67]
- NEW: Support Kafka redundancy [ESB3027-125]
- NEW: Support for Redis high availability [ESB3027-126]
- NEW: Add Prometheus Container [ESB3027-130]
- NEW: Add Grafana Container [ESB3027-131]
- NEW: External DNS Name configuration should be global [ESB3027-180]
- NEW: Deploy hardware metrics services acd-metrics-aggregator and acd-telegraf-metrics-database in k8s cluster [ESB3027-189]
- NEW: REST API Performance Improvements [ESB3027-208]
- NEW: “Star”/Make a Grafana dashboard the home page [ESB3027-243]
- NEW: Support for remote TCP connections for confd subscribers [ESB3027-244]
- NEW: Persist long term usage data [ESB3027-248]
- NEW: New billing dashboard [ESB3027-249]
- NEW: [ANSSI-BP-028] System Settings - Network Configuration and Firewalls [ESB3027-258]
- NEW: [ANSSI-BP-028] System Settings - SELinux [ESB3027-260]
- NEW: Support deploying GUI independently from manager [ESB3027-278]
- NEW: Automatically generate Zitadel secret [ESB3027-280]
- NEW: Deprecate the generate-ssl-secret command [ESB3027-281]
- NEW: Deprecate the generate-zitadel-mastekey command [ESB3027-285]
- FIXED: Access to services restricted with SELinux in Enforcing mode [ESB3027-32]
- FIXED: Authentication token payload contains invalid user details [ESB3027-47]
- FIXED: Unexpected 200 OK response to non-existent confd endpoint [ESB3027-154]
- FIXED: Multiple restarts encountered for selection-input service on startup [ESB3027-155]
- FIXED: Installer script requires case-sensitive hostnames [ESB3027-158]
- FIXED: Installer script does not support configuring additional options [ESB3027-214]
- FIXED: Selection input API accepts keys containing non-urlsafe characters [ESB3027-216]
- FIXED: Installation fails on minimal RHEL installation [ESB3027-287]
- FIXED: Kafka consumer configuration warning logged on startup [ESB3027-294]
Deprecated functionality
None
System requirements
Known limitations
Installation of the software is only supported using a self-hosted configuration.
12.2 - Release esb3027-1.2.1
Build date
2025-05-22
Release status
Type: production
Compatibility
This release is compatible with the following product versions:
- AgileTV CDN Director, ESB3024-1.20.1
Breaking changes from previous release
None
Change log
- FIXED: Installer changes ownership of /var, /etc/ and /usr [ESB3027-146]
- FIXED: K3s installer should not be left on root filesystem [ESB3027-149]
Deprecated functionality
None
System requirements
- A minimum CPU architecture level of x86-64-v2 due to the inclusion of Oracle Linux 9 inside the container. While all modern CPUs support this architecture level, virtual hypervisors may default to a CPU type with greater compatibility with older processors. If this minimum CPU architecture level is not attained, the containers may refuse to start. See Operating System Compatibility and Building Red Hat Enterprise Linux 9 for the x86-64-v2 Microarchitecture Level for more information.
Known limitations
Installation of the software is only supported using a self-hosted configuration.
12.3 - Release esb3027-1.2.0
Build date
2025-05-14
Release status
Type: production
Compatibility
This release is compatible with the following product versions:
- AgileTV CDN Director, ESB3024-1.20.1
Breaking changes from previous release
None
Change log
- NEW: Remove .sh extension from all scripts on the ISO [ESB3027-102]
- NEW: The script load-certificates.sh should be called generate-ssl-secret [ESB3027-104]
- NEW: Add support for High Availability [ESB3027-108]
- NEW: Enable the K3s Registry Mirror [ESB3027-110]
- NEW: Support for Air-Gapped installations [ESB3027-111]
- NEW: Basic hardware monitoring support for nodes in K8s Cluster [ESB3027-122]
- NEW: Separate docker containers from ISO [ESB3027-124]
- FIXED: GUI is unable to make DELETE request on api/v1/selection_input/modules/blocked_referrers [ESB3027-112]
Deprecated functionality
None
System requirements
- A minimum CPU architecture level of x86-64-v2 due to the inclusion of Oracle Linux 9 inside the container. While all modern CPUs support this architecture level, virtual hypervisors may default to a CPU type with greater compatibility with older processors. If this minimum CPU architecture level is not attained, the containers may refuse to start. See Operating System Compatibility and Building Red Hat Enterprise Linux 9 for the x86-64-v2 Microarchitecture Level for more information.
Known limitations
Installation of the software is only supported using a self-hosted configuration.
12.4 - Release esb3027-1.0.0
Build date
2025-04-17
Release status
Type: production
Compatibility
This release is compatible with the following product versions:
- AgileTV CDN Director, ESB3024-1.20.0
Breaking changes from previous release
None
Change log
This is the first production release
Deprecations from previous release
None
System requirements
- A minimum CPU architecture level of x86-64-v2 due to the inclusion of Oracle Linux 9 inside the container. While all modern CPUs support this architecture level, virtual hypervisors may default to a CPU type with greater compatibility with older processors. If this minimum CPU architecture level is not attained, the containers may refuse to start. See Operating System Compatibility and Building Red Hat Enterprise Linux 9 for the x86-64-v2 Microarchitecture Level for more information.
Known limitations
Installation of the software is only supported using a self-hosted, single-node configuration.
13 - API Guides
13.1 - Healthcheck API
This API provides endpoints to verify the liveness and readiness of the service.
Liveness Check
Endpoint: GET /api/v1/health/alive
Purpose:
Ensures that the service is running and accepting connections. This check does not verify
dependencies or internal health, only that the service process is alive and listening.
Response:
- Success (200 OK):
{
"status": "ok"
}
- Failure (503 Service Unavailable):
Indicates the service is not alive, possibly due to a critical failure.
Example Request
GET /api/v1/health/alive HTTP/1.1
Host: your-host
Accept: */*
Example Response
HTTP/1.1 200 OK
Content-Type: application/json
{
"status": "ok"
}
Readiness Check
Endpoint: GET /api/v1/health/ready
Purpose:
Verifies if the service is ready to handle requests, including whether all dependencies (such as
databases or external services) are operational.
Response:
- Success (200 OK):
{
"status": "ok"
}
- Failure (503 Service Unavailable):
Indicates the service or its dependencies are not yet ready.
Example Request
GET /api/v1/health/ready HTTP/1.1
Host: your-host
Accept: */*
Example Response
HTTP/1.1 200 OK
Content-Type: application/json
{
"status": "ok"
}
Notes
- These endpoints are typically used by load balancers, orchestrators like Kubernetes, or monitoring systems to assess service health.
- The liveness endpoint confirms the process is running; the readiness endpoint confirms the service and its dependencies are fully operational and ready to serve traffic.
13.2 - Authentication API
The manager offers a simplified authentication and authorization API that integrates with the Zitadel IAM system. This flow is a streamlined custom OAuth2-inspired process:
Session Establishment: Users authenticate by sending their credentials to the Login endpoint, which returns a session ID and session token.
Token Exchange: The session token is exchanged for a short-lived, signed JWT access token via the Token Grant flow. This access token can be used to authorize API requests, and its scopes determine what resources and actions are permitted. The token should be protected, as it grants the bearer the rights specified by its scopes as long as it is valid.
Login
Send user credentials to initiate a session:
POST /api/v1/auth/login HTTP/1.1
Accept: application/json, */*;q=0.5
Content-Type: application/json
Host: localhost:4464
{
"email": "test@example.com",
"password": "test"
}
Response:
{
"expires_at": "2025-01-29T15:49:47.062354+00:00",
"session_id": "304646367786041347",
"session_token": "12II6yYYfN8UJ5ij-bac6IRRXX6t9qG_Flrlow_fukXKqvo9HFDVZ7a76Exj7Gn-uVRx04_reCaXew",
"verified_at": "2025-01-28T15:49:47.054169+00:00"
}
Logout
To terminate a session, send:
POST /api/v1/auth/logout HTTP/1.1
Accept: application/json
Content-Type: application/json
Host: localhost:4464
{
"session_id": "304646367786041347",
"session_token": "12II6yYYfN8UJ5ij-bac6IRRXX6t9qG_Flrlow_fukXKqvo9HFDVZ7a76Exj7Gn-uVRx04_reCaXew"
}
Response:
{
"status": "Ok"
}
Token Grant
After establishing a session, exchange the session token for a short-lived access token:
POST /api/v1/auth/token HTTP/1.1
Accept: application/json
Content-Type: application/json
Host: localhost:4464
{
"grant_type": "session",
"scope": "foo bar baz",
"session_id": "304646818908602371",
"session_token": "wfCelUhfSb4DKJbLCwg9dr59rTeaC13LF2TXH1tMqXz68ojL8LE9M-dCcwsKgrwjcXkjj9y49wWvdQ"
}
Note: The scope parameter is a space-delimited string defining the permissions requested. The
API responds with an access token, which is a JWT that contains embedded scopes and other claims,
and must be kept secret.
Response example:
{
"access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJFUzI1NiIsImp3ayI6eyJ1c2UiOiJzaWciLCJhbGciOiJFUzI1NiIsImtpZCI6ImFjZC1tYW5hZ2VyLWVzMjU2LWtleSIsImt0eSI6IkVDIiwiY3J2IjoiUC0yNTYiLCJ4IjoiWWxpYVVoSXpnaTk1SjV4NXdaU0tGRUhyWldFUTdwZDZUR2JrTEN6MGxLcyIsInkiOiJDcWNWY1MzQ1pFMjB2enZiWFdxRERRby00UXEzYnFfLUlPZWNPMlZudkFzIn0sImtpZCI6ImFjZC1tYW5hZ2VyLWVzMjU2LWtleSJ9.eyJleHAiOjE3MzgwODAwMjIsImlhdCI6MTczODA3OTcyMiwibmJmIjoxNzM4MDc5NzIyLCJzdWIiOiJ0ZXN0QGV4YW1wbGUuY29tIiwiZ2l2ZW5fbmFtZSI6IiIsImZhbWlseV9uYW1lIjoiVGVzdCBVc2VyIiwiZW1haWwiOiJ0ZXN0QGV4YW1wbGUuY29tIiwic2NvcGUiOiJmb28gYmFyIGJheiJ9.uRmmszZfkrbJpQxIRpxmHf4gL6omvsOQHeuQYd00Bj8PNwQejNA2ZJO3Q_PsE0qb1IrMX5bsCC_k9lWUFMNQ1w",
"expires_in": 300,
"scope": "foo bar baz",
"token_type": "bearer"
}
The access token can then be included in API requests via the Authorization header as Bearer <token>.
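For illustration, a request carrying the access token might look like the following; the endpoint shown is only an example, and whether a given endpoint enforces authorization depends on its configuration and the granted scopes:
GET /api/v1/routing/validate?ip=1.1.1.1 HTTP/1.1
Accept: application/json
Host: localhost:4464
Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJFUzI1NiIs...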
13.3 - Router API
The /api/v1/routing/validate endpoint evaluates routing rules for a specified IP
address. If the IP is blocked according to the configured rules, the endpoint
responds with a 401 Unauthorized.
Limitations
- Supported Classifier Types: Only classifiers of type GeoIP, Anonymous IP, and IPRange are supported. Other classifiers require additional information which is not available to the Manager, so they are assumed not to match.
- Policy Behavior: Since the exact path taken through the rules during the initial
request is unknown, a “default allow” policy is in effect. This means that unless
an IP explicitly matches a rule that denies it, the response will be
200 OK, indicating the IP is allowed.
Request
Method: GET /api/v1/routing/validate?ip=<IP_ADDRESS>
Headers:
Accept: */* (or as needed)
Example:
GET /api/v1/routing/validate?ip=1.1.1.1 HTTP/1.1
Accept: */*
Host: localhost
User-Agent: HTTPie/3.2.4
Response
- Blocked IP: Returns 401 Unauthorized if the IP matches a block rule.
HTTP/1.1 401 Unauthorized
- Allowed IP: Returns 200 OK if the IP does not match a block rule (or if no matching rule is found due to the “default allow” policy).
HTTP/1.1 200 OK
Default-Allow Policy
The routing validation API uses a default-allow policy: if a request does not match any rule, it is allowed. This approach is intentional and designed to prevent valid sessions from being accidentally dropped if your configuration uses advanced features or rule types that are not fully supported by the Manager. Since the Manager only supports a subset of all possible classifier types and rule logic, it cannot always determine the exact path a request would take through the full configuration. By defaulting to allow, the system avoids inadvertently blocking legitimate traffic due to unsupported or unrecognized configuration elements.
To ensure sensitive or restricted IPs are blocked, you must add explicit deny
rules at the top of your ruleset. Rules are evaluated in order, and the first match
applies.
Best Practice: Place your most specific deny rules first, followed by general allow rules. This ensures that deny conditions are always checked before any allow conditions.
Example Ruleset (confd/confcli syntax)
{
"rules": [
{
"name": "deny-restricted",
"type": "deny",
"condition": "in_session_group('Restricted')",
"onMiss": "allow-general"
},
{
"name": "allow-general",
"type": "allow",
"condition": "always()",
"onMatch": "main-host"
}
]
}
- The first rule denies requests from the Restricted session group.
- The second rule allows all other requests.
Note: With a default-allow policy, any request not explicitly denied will be permitted. Always review your ruleset to ensure that deny rules are comprehensive and prioritized.
13.4 - Selection Input API
This API allows you to store arbitrary JSON data and keep it synchronized across all Director instances via Kafka. It is based on the Selection Input API provided by the Director. You can create, delete, and fetch selection input entries at arbitrary paths.
Known Limitations
- Parent Path Access: Accessing a parent path (e.g., /foo) will not return all nested structures under that path.
- Field Access Limitation: It is not possible to query nested fields directly. For example, if /foo/bar contains {"baz": {"bam": "boom"}}, querying /foo/bar/baz/bam will not return "boom". You can only query /foo/bar/baz to retrieve {"bam": "boom"}.
API Usage
Create New Keys
Create multiple entries under a specified path by POSTing a JSON object where each key-value pair corresponds to a key and its associated data.
Request:
POST /api/v1/selection_input/<path>
Body Example:
{
"key1": {...},
"key2": {...}
}
Example:
POST to /api/v1/selection_input/modules/keys with the above body creates:
- /modules/keys/key1 with value {...}
- /modules/keys/key2 with value {...}
Delete a Key
Remove a specific key at a given path.
Request:
DELETE /api/v1/selection_input/<path>/<key>
Example:
To delete key2 under /modules/keys:
DELETE /api/v1/selection_input/modules/keys/key2
Fetch a Key
Retrieve the data stored under a specific key.
Request:
GET /api/v1/selection_input/<path>/<key>
Example:
To fetch key1 under /modules/keys:
GET /api/v1/selection_input/modules/keys/key1
Response:
{
"key1": {...}
}
Fetch All Keys Under a Path
Retrieve all selection input data stored under a parent path.
Request:
GET /api/v1/selection_input/<path>
Example:
To get all keys under /modules/keys:
GET /api/v1/selection_input/modules/keys
Response:
{
"key1": {...},
"key2": {...}
}
Filtering, Sorting, and Limiting Results
You can refine the list of keys returned by adding query parameters:
- search=<string>: Filter results to include only keys matching the search string.
- sort=<asc|desc>: Sort keys in ascending or descending order before filtering.
- limit=<number>: Limit the number of results returned (positive integer).
Note:
- Sorting occurs prior to filtering and limiting.
- The order of query parameters does not affect the request.
Example:
GET /api/v1/selection_input/modules/keys?search=foo&sort=asc&limit=10
13.5 - Operator UI API
This API provides endpoints to retrieve and manage blocked tokens, user agents, and referrers used within the Operator UI.
Endpoints
Retrieve List of Blocked Tokens
GET /api/v1/operator_ui/modules/blocked_tokens/
Fetches a list of blocked tokens, supporting optional filtering, sorting, and limiting.
Query Parameters:
- search (optional): Filter tokens matching the search term.
- limit (optional): Limit number of results.
- sort (optional): Sort order, "asc" or "desc" (default: "asc").
Responses:
- 200 OK with JSON array of blocked tokens.
- 404 Not Found if no tokens found.
- 500 Internal Server Error on failure.
Retrieve a Specific Blocked Token
GET /api/v1/operator_ui/modules/blocked_tokens/{token}
Fetches details of a specific blocked token.
Path Parameter:
token: The token string to retrieve.
Responses:
- 200 OK with JSON object of the token.
- 404 Not Found if token does not exist.
- 500 Internal Server Error on failure.
Retrieve List of Blocked User Agents
GET /api/v1/operator_ui/modules/blocked_user_agents/
Fetches a list of blocked user agents, with optional sorting and limiting.
Query Parameters:
- limit (optional): Limit number of results.
- sort (optional): "asc" or "desc" (default: "asc").
Responses:
- 200 OK with JSON array of user agents.
- 404 Not Found if none found.
- 500 Internal Server Error on failure.
Retrieve a Specific Blocked User Agent
GET /api/v1/operator_ui/modules/blocked_user_agents/{user_agent}
Retrieves details of a specific blocked user agent.
Path Parameter:
user_agent: URL-safe Base64 encoded string (without padding). Decode before use; if decoding fails, the server returns 400 Bad Request.
Responses:
- 200 OK with JSON object of the user agent.
- 404 Not Found if not found.
- 500 Internal Server Error on failure.
Retrieve List of Blocked Referrers
GET /api/v1/operator_ui/modules/blocked_referrers/
Fetches a list of blocked referrers, with optional sorting and limiting.
Query Parameters:
- limit (optional): Limit number of results.
- sort (optional): "asc" or "desc" (default: "asc").
Responses:
- 200 OK with JSON array of referrers.
- 404 Not Found if none found.
- 500 Internal Server Error on failure.
Retrieve a Specific Blocked Referrer
GET /api/v1/operator_ui/modules/blocked_referrers/{referrer}
Retrieves details of a specific blocked referrer.
Path Parameter:
referrer: URL-safe Base64 encoded string (without padding). Decode before use; if decoding fails, the server returns 400 Bad Request. The response includes the decoded referrer.
Responses:
- 200 OK with JSON object containing the referrer.
- 404 Not Found if not found.
- 500 Internal Server Error on failure.
Additional Notes
- For User Agents and Referrers, the path parameters are URL-safe Base64 encoded (per RFC 4648, using - and _ instead of + and /) with padding (=) removed. Clients should remove padding when constructing requests and restore it before decoding; see the encoding sketch after this list.
- All endpoints returning specific items will respond with 404 Not Found if the item does not exist.
- Errors during processing will return 500 Internal Server Error with an error message.
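As a sketch of the client-side encoding step, assuming GNU coreutils basenc is available (openssl or a scripting language works equally well), a user agent can be encoded for use in the request path like this:
# URL-safe Base64 encode an illustrative user agent string and strip the '=' padding
echo -n 'ExamplePlayer/1.0' | basenc --base64url | tr -d '='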
14 - Use Cases
14.1 - Custom Deployments
In some environments, it may not be necessary to run all components of the ESB3027 AgileTV CDN Manager, such as when certain features are not used, or when components like the MIB Frontend Configuration GUI are hosted separately, for example in a public cloud environment. The examples in this guide illustrate common scenarios and the configuration properties needed to achieve them.
Manager Without Metrics and Monitoring Support
If metrics and monitoring are not required—perhaps because an existing monitoring solution is in place—it is possible to disable the deployment of Telegraf, Prometheus, Grafana, and VictoriaMetrics. You can choose to skip the entire metrics suite or disable individual components as needed.
Keep in mind that disabling certain components may require adjustments elsewhere in the configuration. For example, disabling Prometheus will necessitate modifications to the Grafana and VictoriaMetrics configurations, since they depend on Prometheus being available.
To disable all metrics components, set:
acd-metrics.enabled: false
Applying this configuration will prevent the deployment of the entire metrics suite. To disable
individual components within the metrics framework, set their respective enabled flags to false.
For example, to disable only Grafana but keep other metrics components active:
acd-metrics.grafana.enabled: false
Manager Without the MIB Frontend Configuration GUI
If the MIB-Frontend GUI will not be used to configure the ESB3024 AgileTV CDN Director instances, this component can be disabled by setting:
mib-frontend.enabled: false
This is also useful if the frontend is hosted in a separate cluster—such as in a public cloud like AWS—or if the manager is deployed within a customer’s network without the frontend.
15 - Troubleshooting Guide
This guide helps diagnose common issues with the acd-manager deployment and its associated pods.
1. Check Pod Status
Verify all pods are running:
kubectl get pods
Expected:
- Most pods should be in Running state with READY as 1/1 or 2/2.
- Pods marked as 0/1 or 0/2 are not fully ready, indicating potential issues.
2. Investigate Unready or Failed Pods
Example:
kubectl describe pod acd-manager-6c85ddd747-rdlg6
- Look for events such as CrashLoopBackOff, ImagePullBackOff, or ErrImagePull.
- Check container statuses for error messages.
3. Check Pod Logs
Fetch logs for troubleshooting:
kubectl logs acd-manager-6c85ddd747-rdlg6
- For pods with multiple containers:
kubectl logs acd-manager-<pod_name> -c <container_name>
- Focus on recent errors or exceptions.
4. Verify Connectivity and Dependencies
- PostgreSQL: Confirm the acd-cluster-postgresql-0 pod is healthy and accepting connections.
- Kafka: Check that the kafka-controller pods are running and not experiencing issues.
- Redis: Ensure Redis master and replicas are healthy.
- Grafana, Prometheus, VictoriaMetrics: Confirm these services are operational.
5. Check Resource Usage
High CPU or memory can cause pods to crash or become unresponsive:
kubectl top pods
Actions:
- Scale resources if needed.
- Review resource quotas and limits.
6. Check Events in Namespace
kubectl get events --sort-by='.lastTimestamp'
- Look for warnings or errors related to pod scheduling, network issues, or resource constraints.
7. Restart Problematic Pods
Sometimes, restarting pods can resolve transient issues:
kubectl delete pod <pod_name>
Kubernetes will automatically recreate the pod.
8. Verify Configurations and Secrets
- Check ConfigMaps and Secrets for correctness:
kubectl get configmaps
kubectl get secrets
- Confirm environment variables and mounted volumes are correctly configured.
9. Check Cluster Network
- Ensure network policies or firewalls are not blocking communication between pods and external services.
10. Additional Tips
- Upgrade or Rollback: If recent changes caused issues, consider rolling back or upgrading the deployment.
- Monitoring: Use Grafana and VictoriaMetrics dashboards for real-time insights.
- Documentation: Consult application-specific logs and documentation for known issues.
Summary Table
| Issue Type | Common Checks | Commands |
|---|---|---|
| Pod Not Ready | Describe pod, check logs | kubectl describe pod, kubectl logs |
| Connectivity | Verify service endpoints | kubectl get svc, curl from within pods |
| Resource Limits | Monitor resource usage | kubectl top pods |
| Events & Errors | Check cluster events | kubectl get events |
| Configuration | Validate configs and secrets | kubectl get configmaps, kubectl get secrets |
If issues persist, consider scaling down and up components or consulting logs and metrics for deeper analysis.
16 - Glossary
- Access Token
- A credential used to authenticate and authorize access to resources or APIs on behalf of a user, usually issued by an authorization server as part of an OAuth 2.0 flow. It contains the necessary information to verify the user’s identity and define the permissions granted to the token holder.
- Bearer Token
- A type of access token that allows the holder to access
protected resources without needing to provide additional
credentials. It’s typically included in the HTTP Authorization
header as
Authorization: Bearer <token>, and grants access to any resource that recognizes the token. - Chart
- A Helm Chart is a collection of files that describe a related set of Kubernetes resources required to deploy an application, tool, or service. It provides a structured way to package, configure, and manage Kubernetes applications.
- Cluster
- A group of interconnected computers or nodes that work together as a single system to provide high availability, scalability and redundancy for applications or services. In Kubernetes, a cluster usually consists of one primary node, and multiple worker or agent nodes.
- Confd
- An AgileTV backend service that hosts the service configuration. Comes with an API, a CLI and a GUI.
- ConfigMap (Kubernetes)
- A Kubernetes resource used to store non-sensitive configuration data in key-value pairs, allowing applications to access configuration settings without hardcoding them into the container images.
- Containerization
- The practice of packaging applications and their dependencies into lightweight portable containers that can run consistently across different computing environments.
- Deployment (Kubernetes)
- A resource object that provides declarative updates to applications by managing the creation and scaling of a set of Pods.
- Director
- The AgileTV Delivery OTT router and related services.
- ESB
- A software bundle that can be separately installed and upgraded, and is released as one entity with one change log. Each ESB is identified with a number. Over time, features and functions within an ESB can change.
- Helm
- A package manager for Kubernetes that simplifies the development and management of applications by using pre-configured templates called charts. It enables users to define, install, and upgrade complex applications on Kubernetes.
- Ingress
- A Kubernetes resource that manages external access to services within a cluster, typically HTTP. It provides routing rules to manage traffic to various services based on hostnames and paths.
- K3s
- A lightweight Kubernetes distribution developed by Rancher Labs. It is a complete Kubernetes system deployed as a single portable binary.
- K8s
- A common abbreviation for Kubernetes.
- Kafka
- Apache Kafka is an open-source distributed event streaming platform designed for building real-time data pipelines and streaming applications. It enables the publication, subscription, storage, and processing of streams of records in a fault-tolerant and scalable manner.
- Kubectl
- The command-line tool for interacting with Kubernetes clusters, allowing users to deploy applications, manage cluster resources, and inspect logs or configurations.
- Kubernetes
- An open-source container orchestration platform designed to automate the deployment, scaling, and management of containerized applications. It enables developers and operations teams to manage complex applications consistently across various environments.
- LoadBalancer
- A networking tool that distributes network traffic across multiple servers or Pods to ensure no single server becomes overwhelmed, improving reliability and performance.
- Manager
- The AgileTV Management Software and related services.
- Namespace
- A mechanism for isolating resources within a Kubernetes cluster, allowing multiple teams or applications to coexist without conflict by providing a scope for names.
- OAuth2
- An open standard for authorization that allows third-party applications to gain limited access to a user’s resources on a server without exposing the user’s credentials.
- Pod
- The smallest deployable unit in Kubernetes that encapsulates one or more containers, sharing the same network and storage resources. It serves as a logical host for tightly coupled applications, allowing them to communicate and function effectively within a cluster.
- Router
- Unless otherwise specified, an HTTP router that manages an OTT session using HTTP redirects. DNS-based routing can be used as an alternative to HTTP.
- Secret (Kubernetes)
- A resource used to store sensitive information, such as passwords, API keys, or tokens, in a secure manner. Secrets are encoded in base64 and can be made available to Pods as environment variables or mounted as files, ensuring that sensitive data is not exposed in the application code or configuration files.
- Service (Kubernetes)
- An abstraction that defines a logical set of Pods and a policy to access them, enabling stable networking and load balancing to ensure reliable communication among application components.
- Session Token
- A temporary, unique identifier generated by a server and issued to a user upon successful authentication.
- Stateful Set (Kubernetes)
- A Kubernetes workload resource that guarantees ordering and uniqueness of Pods, typically used for applications that require stable network identities and persistent storage, such as databases.
- Topic (Kafka)
- A category or feed name to which records (messages) are published. Messages flow through a topic in the order in which they are produced, and multiple consumers can subscribe to the stream to process the records in real time.
- Volume (Kubernetes)
- A persistent storage resource in Kubernetes that allows data to be stored and preserved beyond the lifecycle of individual Pods, facilitating data sharing and durability.
- Zitadel
- An open-source identity and access management (IAM) platform designed to handle user authentication and authorization for applications. It provides features such as single sign-on (SSO), multi-factor authentication (MFA), and support for various authentication protocols.