Architecture Guide

Detailed system architecture and component overview

Overview

The AgileTV CDN Manager (ESB3027) is a cloud-native Kubernetes application designed for managing CDN operations. This guide provides a detailed description of the system architecture, component interactions, and scaling considerations.

High-Level Architecture

The CDN Manager follows a microservices architecture deployed on Kubernetes. The system is organized into logical layers:

```mermaid
graph LR
    Clients[API Clients] --> Ingress[Ingress Controller]
    Ingress --> Manager[Core Manager]
    Ingress --> Frontend[MIB Frontend]
    Ingress --> Grafana[Grafana]
    Manager --> Redis[(Redis)]
    Manager --> Kafka[(Kafka)]
    Manager --> PostgreSQL[(PostgreSQL)]
    Manager --> Zitadel[Zitadel IAM]
    Manager --> Confd[Configuration Service]
    Grafana --> VM[(VictoriaMetrics)]
    Confd -.-> Gateway[NGinx Gateway]
    Gateway --> Director[CDN Director]
```

Component Architecture

Ingress Layer

The ingress layer manages all incoming traffic to the cluster:

| Component | Role |
| --- | --- |
| Ingress Controller | Primary ingress for all cluster traffic; routes requests to internal services based on path |
| NGinx Gateway | Reverse proxy for routing traffic to external CDN Directors; used by MIB Frontend to communicate with remote Confd instances on CDN Director nodes |

Traffic flow:

  • API clients and Operator UI connect via the Ingress Controller at /api and /gui paths respectively
  • Grafana dashboards are accessed via the Ingress Controller at /grafana
  • Zitadel authentication console is accessed via the Ingress Controller at /ui/console
  • MIB Frontend uses NGinx Gateway when communicating with external Confd instances on CDN Director nodes

Application Services

The application layer contains the core CDN Manager services:

| Component | Role | Scaling |
| --- | --- | --- |
| Core Manager | Main REST API server (v1/v2 endpoints); handles authentication, configuration, routing, and discovery | Horizontally scalable via HPA |
| MIB Frontend | Web-based configuration GUI for operators | Horizontally scalable via HPA |
| Confd | Configuration service for routing configuration; synchronizes with the Core Manager | Single instance |
| Grafana | Monitoring and visualization dashboards | Single instance |
| Selection Input Worker | Consumes selection input events from Kafka and updates configuration | Single instance |
| Metrics Aggregator | Collects and aggregates metrics from CDN components | Single instance |
| Telegraf | System-level metrics collection from cluster nodes | DaemonSet (one per node) |
| Alertmanager | Alert routing and notification management | Single instance |

Data Layer

The data layer provides persistent and ephemeral storage:

| Component | Role | Scaling |
| --- | --- | --- |
| Redis | In-memory caching, session storage, and ephemeral state | Master + replicas (read-only) |
| Kafka | Event streaming for selection input and metrics; provides durable message queue | Controller cluster (odd count) |
| PostgreSQL | Persistent configuration and state storage | 3-node cluster with HA |
| VictoriaMetrics (Analytics) | Real-time and short-term metrics for operational dashboards | Single instance |
| VictoriaMetrics (Billing) | Long-term metrics retention (1+ years) for billing and license compliance | Single instance |

External Integrations

| Component | Role |
| --- | --- |
| Zitadel IAM | Identity and access management; provides OAuth2/OIDC authentication |
| CDN Director (ESB3024) | Edge routing infrastructure; receives configuration from Confd |

Detailed Component Descriptions

Core Manager

The Core Manager is the central application server that exposes the REST API. It is implemented in Rust using the Actix-web framework.

Key Responsibilities:

  • Authentication and session management via Zitadel
  • Configuration document storage and retrieval
  • Selection input CRUD operations
  • Routing rule evaluation and GeoIP lookups
  • Service discovery for CDN Directors and edge servers
  • Operator UI helper endpoints

API Endpoints:

  • /api/v1/auth/* - Authentication (login, token, logout)
  • /api/v1/configuration - Configuration management
  • /api/v1/selection_input/* - Selection input operations
  • /api/v2/selection_input/* - Enhanced selection input with list operations
  • /api/v1/routing/* - Routing evaluation and validation
  • /api/v1/discovery/* - Host and namespace discovery
  • /api/v1/metrics - System metrics
  • /api/v1/health/* - Liveness and readiness probes
  • /api/v1/operator_ui/* - Operator helper endpoints
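As an illustration, an authenticated call to one of the endpoints above can be sketched in Python. The hostname and token are placeholders; the endpoint paths are those listed above, and this helper is not part of the product:

```python
import urllib.request

# Sketch of an authenticated API request. The base URL and token are
# placeholders; the path is one of the documented /api/v1 endpoints.
def build_api_request(base_url: str, token: str, path: str) -> urllib.request.Request:
    req = urllib.request.Request(base_url + path)
    # Access tokens are sent as a Bearer header (see Authentication below)
    req.add_header("Authorization", f"Bearer {token}")
    return req

req = build_api_request("https://manager.example.com", "<access-token>", "/api/v1/metrics")
print(req.full_url)                     # https://manager.example.com/api/v1/metrics
print(req.get_header("Authorization"))  # Bearer <access-token>
```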

Runtime Modes: The Core Manager supports multiple runtime modes, each deployed as a separate container:

  • http-server - Primary HTTP API server (default)
  • metrics-aggregator - Background worker for metrics collection
  • selection-input - Background worker for Kafka selection input consumption

MIB Frontend

The MIB Frontend provides a web-based GUI for configuration management.

Key Features:

  • Intuitive web interface for CDN configuration
  • Real-time configuration validation
  • Integration with Zitadel for SSO authentication
  • Uses NGinx Gateway for external Director communication

Confd (Configuration Service)

Confd provides routing configuration services and synchronizes with the Core Manager application.

Key Responsibilities:

  • Hosts the service configuration for routing decisions
  • Provides API and CLI for configuration management
  • Synchronizes routing configuration with Core Manager
  • Maintains configuration state in PostgreSQL

Selection Input Worker

The Selection Input Worker processes selection input events from the Kafka stream.

Key Responsibilities:

  • Consumes messages from the selection_input Kafka topic
  • Validates and transforms input data
  • Updates configuration in the data store
  • Maintains message ordering within partitions

Scaling Limitation: The Selection Input Worker cannot be scaled beyond a single consumer per Kafka partition, as message ordering must be preserved.
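The ordering constraint can be illustrated with a small Python simulation (not the worker's actual code): a single consumer drains each partition in offset order, so updates to a given key never interleave.

```python
# Simulation of per-partition ordering: one consumer per partition applies
# events in offset order, which a second consumer could not guarantee.
def apply_in_order(partitions: dict) -> list:
    applied = []
    for partition_id in sorted(partitions):
        # Each entry is an (offset, payload) pair; a lone consumer
        # observes them in ascending offset order.
        for _offset, payload in sorted(partitions[partition_id]):
            applied.append(payload)
    return applied

events = {0: [(0, "create A"), (1, "update A")], 1: [(0, "create B")]}
print(apply_in_order(events))  # ['create A', 'update A', 'create B']
```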

Metrics Aggregator

The Metrics Aggregator collects and processes metrics from CDN components.

Key Responsibilities:

  • Polls metrics from Director instances
  • Aggregates usage statistics
  • Writes data to VictoriaMetrics (Analytics) for dashboards
  • Writes long-term data to VictoriaMetrics (Billing) for compliance

Telegraf

Telegraf is deployed as a DaemonSet to collect host-level metrics.

Key Responsibilities:

  • CPU, memory, disk, and network metrics from each node
  • Container-level resource usage
  • Kubernetes cluster metrics
  • Forwards metrics to VictoriaMetrics

Grafana

Grafana provides visualization and dashboard capabilities.

Features:

  • Pre-built dashboards for CDN monitoring
  • Custom dashboard support
  • VictoriaMetrics as data source
  • Alerting integration with Alertmanager

Access: https://<host>/grafana

Alertmanager

Alertmanager handles alert routing and notifications.

Key Responsibilities:

  • Receives alerts from Grafana and other sources
  • Deduplicates and groups alerts
  • Routes to notification channels (email, webhook, etc.)
  • Manages alert silencing and inhibition

Data Storage

Redis

Redis provides in-memory storage for:

  • User sessions and authentication tokens
  • Ephemeral configuration cache
  • Real-time state synchronization

Deployment: Master + read replicas for high availability

Kafka

Kafka provides durable event streaming for:

  • Selection input events
  • Metrics data streams
  • Inter-service communication

Deployment: Controller cluster with 3 replicas for production, 1 replica for lab deployments

Node Affinity: Kafka replicas must be scheduled on separate nodes to ensure high availability. The Helm chart configures pod anti-affinity rules to enforce this distribution.

Topics:

  • selection_input - Selection input events
  • metrics - Metrics data streams

Note: For lab/single-node deployments, the Kafka replica count must be set to 1 in the Helm values. Production deployments require 3 replicas for fault tolerance.
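For example, a lab deployment might pin the replica count in the Helm values. The exact key path below is an assumption for illustration; consult the chart's values file for the authoritative name:

```yaml
# Hypothetical values fragment; key path is illustrative.
kafka:
  replicaCount: 1   # 3 for production, 1 for lab/single-node
```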

PostgreSQL

PostgreSQL provides persistent storage for:

  • Configuration documents
  • User and permission data
  • System state

Deployment: 3-node cluster managed by the CloudNativePG (CNPG) operator

High Availability: The CNPG operator manages automatic failover and ensures high availability:

  • One primary node handles read/write operations
  • Two replica nodes provide redundancy and can be promoted to primary on failure
  • Automatic failover occurs within seconds of primary node failure
  • Synchronous replication ensures data consistency

Note: The PostgreSQL cluster is deployed and managed automatically by the CNPG operator. Manual intervention is typically not required for normal operations.

VictoriaMetrics

Two VictoriaMetrics instances serve different purposes:

VictoriaMetrics (Analytics):

  • Real-time and short-term metrics storage
  • Supports Grafana dashboards
  • Retention: Configurable (typically 30-90 days)

VictoriaMetrics (Billing):

  • Long-term metrics retention
  • Billing and license compliance data
  • Retention: Minimum 1 year

Authentication and Authorization

Zitadel Integration

Zitadel provides identity and access management:

Authentication Flow:

  1. User accesses MIB Frontend or API
  2. Redirected to Zitadel for authentication
  3. Zitadel validates credentials and issues session token
  4. Session token exchanged for access token
  5. Access token included in API requests (Bearer authentication)
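Step 4 follows the standard OAuth2 authorization-code exchange. A minimal sketch of the token request body is shown below; the parameter names are standard OAuth2, while the client_id, code, and redirect URI are placeholders that come from your Zitadel project:

```python
from urllib.parse import urlencode

# Standard OAuth2 authorization_code grant parameters; all values here
# are placeholders, not taken from a real Zitadel configuration.
def token_request_body(code: str, client_id: str, redirect_uri: str) -> str:
    return urlencode({
        "grant_type": "authorization_code",
        "code": code,
        "client_id": client_id,
        "redirect_uri": redirect_uri,
    })

body = token_request_body("abc123", "my-client", "https://manager.example.com/callback")
print(body.split("&")[0])  # grant_type=authorization_code
```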

Default Credentials: See the Glossary for default login credentials.

Access Paths:

  • Zitadel Console: /ui/console
  • API authentication: /api/v1/auth/*

CORS Configuration

Zitadel enforces Cross-Origin Resource Sharing (CORS) policies. The external hostname configured in Zitadel must match the first entry in global.hosts.manager in the Helm values.
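For instance, with the values fragment below (hostnames are placeholders), Zitadel's external hostname must be manager.example.com, since that is the first entry in the list:

```yaml
global:
  hosts:
    manager:
      - manager.example.com   # first entry must match Zitadel's external hostname
      - alias.example.com
```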

Network Architecture

Traffic Flow

```mermaid
graph TB
    External[External Clients] --> Ingress[Ingress Controller]
    External --> Redis[(Redis)]
    External --> Kafka[(Kafka)]
    External --> Telegraf[Telegraf]
    Ingress --> Manager[Core Manager]
    Ingress --> Frontend[MIB Frontend]
    Ingress --> Grafana[Grafana]
    Ingress --> Zitadel[Zitadel]
```

Note: Certain services (Redis, Kafka, Telegraf) can be accessed directly by external clients without traversing the ingress controller. This is typically used for metrics collection, event streaming, and direct data access scenarios.

Internal Communication

All internal services communicate over the Kubernetes overlay network (Flannel VXLAN). Services discover each other via Kubernetes DNS.

External Communication

  • CDN Directors: Accessed via NGinx Gateway for simplified routing
  • MaxMind GeoIP: Local database files (no external calls)

Scaling

Horizontal Pod Autoscaler (HPA)

The following components support automatic horizontal scaling via HPA:

| Component | Minimum | Maximum | Scale Metrics |
| --- | --- | --- | --- |
| Core Manager | 3 | 8 | CPU (50%), Memory (80%) |
| NGinx Gateway | 2 | 4 | CPU (75%), Memory (80%) |
| MIB Frontend | 2 | 4 | CPU (75%), Memory (90%) |

Note: HPA is enabled by default in the Helm chart. The default configuration is tuned for production deployments. Adjust min/max values based on expected load and available cluster capacity.

Manual Scaling

Components can also be scaled manually by setting replica counts in the Helm values:

```yaml
manager:
  replicaCount: 3
mib-frontend:
  replicaCount: 2
```

Important: When manually setting replica counts, you must disable the Horizontal Pod Autoscaler (HPA) for the corresponding component. If HPA remains enabled, it will override manual replica settings. To disable HPA, set autoscaling.hpa.enabled: false for the component in your Helm values.
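Combining the two settings, a values fragment for a manually scaled Core Manager might look like the following (assuming the autoscaling key sits under the component, as the note above indicates):

```yaml
manager:
  replicaCount: 3
  autoscaling:
    hpa:
      enabled: false   # required so HPA does not override replicaCount
```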

Components That Do Not Scale

The following components do not support horizontal scaling:

| Component | Reason |
| --- | --- |
| Confd | Single instance required for configuration consistency |
| PostgreSQL | CloudNativePG cluster; scaled by adding replicas via operator configuration |
| Kafka | Scaled by adding controllers, not via replica count |
| VictoriaMetrics | Stateful; single instance per role |
| Redis | Master is single; replicas are read-only |
| Grafana | Single instance sufficient for dashboard access |
| Alertmanager | Single instance for alert routing |
| Selection Input Worker | Kafka message ordering requires single consumer |
| Metrics Aggregator | Single instance for consistent metrics aggregation |

Node Scaling

Additional Agent nodes can be added to the cluster at any time to increase workload capacity. Kubernetes automatically schedules pods to nodes with available resources.

Cluster Balancing

The CDN Manager deployment includes the Kubernetes Descheduler to maintain balanced resource utilization across cluster nodes:

  • Automatic Rebalancing: The descheduler periodically analyzes pod distribution and evicts pods from overutilized nodes
  • Node Balance: Helps prevent resource hotspots by redistributing workloads across available nodes
  • Integration with HPA: Works in conjunction with Horizontal Pod Autoscaler to optimize both pod count and placement

The descheduler runs as a background process and does not require manual intervention under normal operating conditions.

Resource Configuration

For detailed resource preset configurations and planning guidance, see the Configuration Guide.

High Availability

Server Node Redundancy

Production deployments require a minimum of 3 Server nodes:

  • Survives loss of 1 server node
  • Maintains quorum for etcd and Kafka

For enhanced availability, use 5 Server nodes:

  • Survives loss of 2 server nodes
  • Recommended for critical production environments

For large-scale deployments, 7 or more Server nodes can be used:

  • Survives loss of 3+ server nodes
  • Suitable for high-capacity production environments

Pod Distribution

Kubernetes automatically distributes pods across nodes to maximize availability:

  • Pods with the same deployment are scheduled on different nodes when possible
  • Pod Disruption Budgets (PDB) ensure minimum availability during maintenance

Data Replication

| Component | Replication Strategy |
| --- | --- |
| Redis | Single instance (backup via Longhorn snapshots) |
| Kafka | Replicated partitions (default: 3) |
| PostgreSQL | 3-node cluster via CloudNativePG |
| VictoriaMetrics | Single instance (backup via snapshots) |
| Longhorn | Single replica with pod-node affinity |

Longhorn Storage: Longhorn volumes are configured with a single replica by default. Pod scheduling is configured with node affinity to prefer scheduling pods on the same node as their persistent volume data. This approach optimizes I/O performance while maintaining data locality.

Next Steps

After understanding the architecture:

  1. Installation Guide - Deploy the CDN Manager
  2. Configuration Guide - Configure components for your environment
  3. Operations Guide - Day-to-day operational procedures
  4. Performance Tuning Guide - Optimize system performance
  5. Metrics & Monitoring - Set up monitoring and alerting