Architecture Guide

Detailed system architecture and component overview

Overview

The AgileTV CDN Manager (ESB3027) is a cloud-native Kubernetes application designed for managing CDN operations. This guide provides a detailed description of the system architecture, component interactions, and scaling considerations.

High-Level Architecture

The CDN Manager follows a microservices architecture deployed on Kubernetes. The system is organized into logical layers:

```mermaid
graph LR
    Clients[API Clients] --> Ingress[Ingress Controller]
    Ingress --> Manager[Core Manager]
    Ingress --> Frontend[MIB Frontend]
    Ingress --> Grafana[Grafana]
    Manager --> Redis[(Redis)]
    Manager --> Kafka[(Kafka)]
    Manager --> PostgreSQL[(PostgreSQL)]
    Manager --> Zitadel[Zitadel IAM]
    Manager --> Confd[Configuration Service]
    Grafana --> VM[(VictoriaMetrics)]
    Confd -.-> Gateway[NGinx Gateway]
    Gateway --> Director[CDN Director]
```

Component Architecture

Ingress Layer

The ingress layer manages all incoming traffic to the cluster:

| Component | Role |
| --- | --- |
| Ingress Controller | Primary ingress for all cluster traffic; routes requests to internal services based on path |
| NGinx Gateway | Reverse proxy for routing traffic to external CDN Directors; used by MIB Frontend to communicate with remote Confd instances on CDN Director nodes |

Traffic flow:

  • API clients and Operator UI connect via the Ingress Controller at /api and /gui paths respectively
  • Grafana dashboards are accessed via the Ingress Controller at /grafana
  • Zitadel authentication console is accessed via the Ingress Controller at /ui/console
  • MIB Frontend uses NGinx Gateway when communicating with external Confd instances on CDN Director nodes

Application Services

The application layer contains the core CDN Manager services:

| Component | Role | Scaling |
| --- | --- | --- |
| Core Manager | Main REST API server (v1/v2 endpoints); handles authentication, configuration, routing, and discovery | Horizontally scalable via HPA |
| MIB Frontend | Web-based configuration GUI for operators | Horizontally scalable via HPA |
| Confd | Configuration service for routing configuration; synchronizes with the Core Manager | Single instance |
| Grafana | Monitoring and visualization dashboards | Single instance |
| Selection Input Worker | Consumes selection input events from Kafka and updates configuration | Single instance |
| Metrics Aggregator | Collects and aggregates metrics from CDN components | Single instance |
| Telegraf | System-level metrics collection from cluster nodes | DaemonSet (one per node) |
| Alertmanager | Alert routing and notification management | Single instance |

Data Layer

The data layer provides persistent and ephemeral storage:

| Component | Role | Scaling |
| --- | --- | --- |
| Redis | In-memory caching, session storage, and ephemeral state | Master + replicas (read-only) |
| Kafka | Event streaming for selection input and metrics; provides durable message queue | Controller cluster (odd count) |
| PostgreSQL | Persistent configuration and state storage | 3-node cluster with HA |
| VictoriaMetrics (Analytics) | Real-time and short-term metrics for operational dashboards | Single instance |
| VictoriaMetrics (Billing) | Long-term metrics retention (1+ years) for billing and license compliance | Single instance |

External Integrations

| Component | Role |
| --- | --- |
| Zitadel IAM | Identity and access management; provides OAuth2/OIDC authentication |
| CDN Director (ESB3024) | Edge routing infrastructure; receives configuration from Confd |

Detailed Component Descriptions

Core Manager

The Core Manager is the central application server that exposes the REST API. It is implemented in Rust using the Actix-web framework.

Key Responsibilities:

  • Authentication and session management via Zitadel
  • Configuration document storage and retrieval
  • Selection input CRUD operations
  • Routing rule evaluation and GeoIP lookups
  • Service discovery for CDN Directors and edge servers
  • Operator UI helper endpoints

API Endpoints:

  • /api/v1/auth/* - Authentication (login, token, logout)
  • /api/v1/configuration - Configuration management
  • /api/v1/selection_input/* - Selection input operations
  • /api/v2/selection_input/* - Enhanced selection input with list operations
  • /api/v1/routing/* - Routing evaluation and validation
  • /api/v1/discovery/* - Host and namespace discovery
  • /api/v1/metrics - System metrics
  • /api/v1/health/* - Liveness and readiness probes
  • /api/v1/operator_ui/* - Operator helper endpoints
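As an illustration, an authenticated call to one of the endpoints above can be sketched in Python. The hostname and token are placeholders; the endpoint paths are those listed above, and this helper is not part of the product:

```python
import urllib.request

# Sketch of an authenticated API request. The base URL and token are
# placeholders; the path is one of the documented /api/v1 endpoints.
def build_api_request(base_url: str, token: str, path: str) -> urllib.request.Request:
    req = urllib.request.Request(base_url + path)
    # Access tokens are sent as a Bearer header (see Authentication below)
    req.add_header("Authorization", f"Bearer {token}")
    return req

req = build_api_request("https://manager.example.com", "<access-token>", "/api/v1/metrics")
print(req.full_url)                     # https://manager.example.com/api/v1/metrics
print(req.get_header("Authorization"))  # Bearer <access-token>
```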

Runtime Modes: The Core Manager supports multiple runtime modes, each deployed as a separate container:

  • http-server - Primary HTTP API server (default)
  • metrics-aggregator - Background worker for metrics collection
  • selection-input - Background worker for Kafka selection input consumption

MIB Frontend

The MIB Frontend provides a web-based GUI for configuration management.

Key Features:

  • Intuitive web interface for CDN configuration
  • Real-time configuration validation
  • Integration with Zitadel for SSO authentication
  • Uses NGinx Gateway for external Director communication

Confd (Configuration Service)

Confd provides routing configuration services and synchronizes with the Core Manager application.

Key Responsibilities:

  • Hosts the service configuration for routing decisions
  • Provides API and CLI for configuration management
  • Synchronizes routing configuration with Core Manager
  • Maintains configuration state in PostgreSQL

Selection Input Worker

The Selection Input Worker processes selection input events from the Kafka stream.

Key Responsibilities:

  • Consumes messages from the selection_input Kafka topic
  • Validates and transforms input data
  • Updates configuration in the data store
  • Maintains message ordering within partitions

Scaling Limitation: The Selection Input Worker cannot be scaled beyond a single consumer per Kafka partition, as message ordering must be preserved.
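The ordering constraint can be illustrated with a small Python simulation (not the worker's actual code): a single consumer drains each partition in offset order, so updates to a given key never interleave.

```python
# Simulation of per-partition ordering: one consumer per partition applies
# events in offset order, which a second consumer could not guarantee.
def apply_in_order(partitions: dict) -> list:
    applied = []
    for partition_id in sorted(partitions):
        # Each entry is an (offset, payload) pair; a lone consumer
        # observes them in ascending offset order.
        for _offset, payload in sorted(partitions[partition_id]):
            applied.append(payload)
    return applied

events = {0: [(0, "create A"), (1, "update A")], 1: [(0, "create B")]}
print(apply_in_order(events))  # ['create A', 'update A', 'create B']
```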

Metrics Aggregator

The Metrics Aggregator collects and processes metrics from CDN components.

Key Responsibilities:

  • Polls metrics from Director instances
  • Aggregates usage statistics
  • Writes data to VictoriaMetrics (Analytics) for dashboards
  • Writes long-term data to VictoriaMetrics (Billing) for compliance

Telegraf

Telegraf is deployed as a DaemonSet to collect host-level metrics.

Key Responsibilities:

  • CPU, memory, disk, and network metrics from each node
  • Container-level resource usage
  • Kubernetes cluster metrics
  • Forwards metrics to VictoriaMetrics

Grafana

Grafana provides visualization and dashboard capabilities.

Features:

  • Pre-built dashboards for CDN monitoring
  • Custom dashboard support
  • VictoriaMetrics as data source
  • Alerting integration with Alertmanager

Access: https://<host>/grafana

Alertmanager

Alertmanager handles alert routing and notifications.

Key Responsibilities:

  • Receives alerts from Grafana and other sources
  • Deduplicates and groups alerts
  • Routes to notification channels (email, webhook, etc.)
  • Manages alert silencing and inhibition

Data Storage

Redis

Redis provides in-memory storage for:

  • User sessions and authentication tokens
  • Ephemeral configuration cache
  • Real-time state synchronization

Deployment: Master + read replicas for high availability

Kafka

Kafka provides durable event streaming for:

  • Selection input events
  • Metrics data streams
  • Inter-service communication

Deployment: Controller cluster with 3 replicas for production, 1 replica for lab deployments

Node Affinity: Kafka replicas must be scheduled on separate nodes to ensure high availability. The Helm chart configures pod anti-affinity rules to enforce this distribution.

Topics:

  • selection_input - Selection input events
  • metrics - Metrics data streams

Note: For lab/single-node deployments, the Kafka replica count must be set to 1 in the Helm values. Production deployments require 3 replicas for fault tolerance.
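For example, a lab deployment might pin the replica count in the Helm values. The exact key path below is an assumption for illustration; consult the chart's values file for the authoritative name:

```yaml
# Hypothetical values fragment; key path is illustrative.
kafka:
  replicaCount: 1   # 3 for production, 1 for lab/single-node
```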

PostgreSQL

PostgreSQL provides persistent storage for:

  • Configuration documents
  • User and permission data
  • System state

Deployment: 3-node cluster managed by the CloudNativePG (CNPG) operator

High Availability: The CNPG operator manages automatic failover and ensures high availability:

  • One primary node handles read/write operations
  • Two replica nodes provide redundancy and can be promoted to primary on failure
  • Automatic failover occurs within seconds of primary node failure
  • Synchronous replication ensures data consistency

Note: The PostgreSQL cluster is deployed and managed automatically by the CNPG operator. Manual intervention is typically not required for normal operations.

VictoriaMetrics

Two VictoriaMetrics instances serve different purposes:

VictoriaMetrics (Analytics):

  • Real-time and short-term metrics storage
  • Supports Grafana dashboards
  • Retention: Configurable (typically 30-90 days)

VictoriaMetrics (Billing):

  • Long-term metrics retention
  • Billing and license compliance data
  • Retention: Minimum 1 year

Authentication and Authorization

Zitadel Integration

Zitadel provides identity and access management:

Authentication Flow:

  1. User accesses MIB Frontend or API
  2. Redirected to Zitadel for authentication
  3. Zitadel validates credentials and issues session token
  4. Session token exchanged for access token
  5. Access token included in API requests (Bearer authentication)
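Step 4 follows the standard OAuth2 authorization-code exchange. A minimal sketch of the token request body is shown below; the parameter names are standard OAuth2, while the client_id, code, and redirect URI are placeholders that come from your Zitadel project:

```python
from urllib.parse import urlencode

# Standard OAuth2 authorization_code grant parameters; all values here
# are placeholders, not taken from a real Zitadel configuration.
def token_request_body(code: str, client_id: str, redirect_uri: str) -> str:
    return urlencode({
        "grant_type": "authorization_code",
        "code": code,
        "client_id": client_id,
        "redirect_uri": redirect_uri,
    })

body = token_request_body("abc123", "my-client", "https://manager.example.com/callback")
print(body.split("&")[0])  # grant_type=authorization_code
```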

Default Credentials: See the Glossary for default login credentials.

Access Paths:

  • Zitadel Console: /ui/console
  • API authentication: /api/v1/auth/*

CORS Configuration

Zitadel enforces Cross-Origin Resource Sharing (CORS) policies. The external hostname configured in Zitadel must match the first entry in global.hosts.manager in the Helm values.
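For instance, with the values fragment below (hostnames are placeholders), Zitadel's external hostname must be manager.example.com, since that is the first entry in the list:

```yaml
global:
  hosts:
    manager:
      - manager.example.com   # first entry must match Zitadel's external hostname
      - alias.example.com
```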

Network Architecture

Traffic Flow

```mermaid
graph TB
    External[External Clients] --> Ingress[Ingress Controller]
    External --> Redis[(Redis)]
    External --> Kafka[(Kafka)]
    External --> Telegraf[Telegraf]
    Ingress --> Manager[Core Manager]
    Ingress --> Frontend[MIB Frontend]
    Ingress --> Grafana[Grafana]
    Ingress --> Zitadel[Zitadel]
```

Note: Certain services (Redis, Kafka, Telegraf) can be accessed directly by external clients without traversing the ingress controller. This is typically used for metrics collection, event streaming, and direct data access scenarios.

Internal Communication

All internal services communicate over the Kubernetes overlay network (Flannel VXLAN). Services discover each other via Kubernetes DNS.

External Communication

  • CDN Directors: Accessed via NGinx Gateway for simplified routing
  • MaxMind GeoIP: Local database files (no external calls)

Scaling

Horizontal Pod Autoscaler (HPA)

The following components support automatic horizontal scaling via HPA:

| Component | Minimum | Maximum | Scale Metrics |
| --- | --- | --- | --- |
| Core Manager | 3 | 8 | CPU (50%), Memory (80%) |
| NGinx Gateway | 2 | 4 | CPU (75%), Memory (80%) |
| MIB Frontend | 2 | 4 | CPU (75%), Memory (90%) |

Note: HPA is enabled by default in the Helm chart. The default configuration is tuned for production deployments. Adjust min/max values based on expected load and available cluster capacity.

Manual Scaling

Components can also be scaled manually by setting replica counts in the Helm values:

```yaml
manager:
  replicaCount: 3
mib-frontend:
  replicaCount: 2
```

Important: When manually setting replica counts, you must disable the Horizontal Pod Autoscaler (HPA) for the corresponding component. If HPA remains enabled, it will override manual replica settings. To disable HPA, set autoscaling.hpa.enabled: false for the component in your Helm values.
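Combining the two settings, a values fragment for a manually scaled Core Manager might look like the following (assuming the autoscaling key sits under the component, as the note above indicates):

```yaml
manager:
  replicaCount: 3
  autoscaling:
    hpa:
      enabled: false   # required so HPA does not override replicaCount
```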

Components That Do Not Scale

The following components do not support horizontal scaling:

| Component | Reason |
| --- | --- |
| Confd | Single instance required for configuration consistency |
| PostgreSQL | CloudNativePG cluster; scaled by adding replicas via operator configuration |
| Kafka | Scaled by adding controllers, not via replica count |
| VictoriaMetrics | Stateful; single instance per role |
| Redis | Master is single; replicas are read-only |
| Grafana | Single instance sufficient for dashboard access |
| Alertmanager | Single instance for alert routing |
| Selection Input Worker | Kafka message ordering requires single consumer |
| Metrics Aggregator | Single instance for consistent metrics aggregation |

Node Scaling

Additional Agent nodes can be added to the cluster at any time to increase workload capacity. Kubernetes automatically schedules pods to nodes with available resources.

Cluster Balancing

The CDN Manager deployment includes the Kubernetes Descheduler to maintain balanced resource utilization across cluster nodes:

  • Automatic Rebalancing: The descheduler periodically analyzes pod distribution and evicts pods from overutilized nodes
  • Node Balance: Helps prevent resource hotspots by redistributing workloads across available nodes
  • Integration with HPA: Works in conjunction with Horizontal Pod Autoscaler to optimize both pod count and placement

The descheduler runs as a background process and does not require manual intervention under normal operating conditions.

Resource Configuration

For detailed resource preset configurations and planning guidance, see the Configuration Guide.

High Availability

Server Node Redundancy

Production deployments require a minimum of 3 Server nodes:

  • Survives loss of 1 server node
  • Maintains quorum for etcd and Kafka

For enhanced availability, use 5 Server nodes:

  • Survives loss of 2 server nodes
  • Recommended for critical production environments

For large-scale deployments, 7 or more Server nodes can be used:

  • Survives loss of 3+ server nodes
  • Suitable for high-capacity production environments

Pod Distribution

Kubernetes automatically distributes pods across nodes to maximize availability:

  • Pods with the same deployment are scheduled on different nodes when possible
  • Pod Disruption Budgets (PDB) ensure minimum availability during maintenance

Data Replication

| Component | Replication Strategy |
| --- | --- |
| Redis | Single instance (backup via Longhorn snapshots) |
| Kafka | Replicated partitions (default: 3) |
| PostgreSQL | 3-node cluster via CloudNativePG |
| VictoriaMetrics | Single instance (backup via snapshots) |
| Longhorn | Single replica with pod-node affinity |

Longhorn Storage: Longhorn volumes are configured with a single replica by default. Pod scheduling is configured with node affinity to prefer scheduling pods on the same node as their persistent volume data. This approach optimizes I/O performance while maintaining data locality.

Next Steps

After understanding the architecture:

  1. Installation Guide - Deploy the CDN Manager
  2. Configuration Guide - Configure components for your environment
  3. Operations Guide - Day-to-day operational procedures
  4. Performance Tuning Guide - Optimize system performance
  5. Metrics & Monitoring - Set up monitoring and alerting