Operations Guide

Day-to-day operational procedures and maintenance tasks

Overview

This guide covers day-to-day operational procedures for managing the AgileTV CDN Manager (ESB3027). Topics include routine maintenance, backup procedures, log management, and common operational tasks.

Prerequisites

Before performing operations, ensure you have:

  • kubectl access to the cluster
  • helm CLI installed
  • Access to the node where values.yaml is stored
  • Appropriate RBAC permissions for administrative tasks

Cluster Access

There are two supported methods for accessing the Kubernetes cluster:

  1. SSH to a Server Node (Recommended for operations staff) - SSH into any Server node and run kubectl commands directly
  2. Remote kubectl - Install kubectl on your local machine and configure it to connect to the cluster remotely

Method 1: SSH to a Server Node

The kubectl command-line tool is pre-configured on all Server nodes and can be used directly without additional setup:

# SSH to any Server node
ssh root@<server-ip>

# Run kubectl commands directly
kubectl get nodes
kubectl get pods

This method is recommended for day-to-day operations as it requires no local configuration and provides direct access to the cluster.

Method 2: Remote kubectl from Local Machine

To use kubectl from your local workstation or laptop:

Step 1: Install kubectl

Download and install kubectl for your operating system:

  • Official Documentation: Install kubectl
  • macOS (Homebrew): brew install kubectl
  • Linux: Download from the official Kubernetes release page
  • Windows: Download from the official Kubernetes release page

Step 2: Copy kubeconfig from Server Node

# Create the local kube directory if needed, then copy the kubeconfig from any Server node
mkdir -p ~/.kube
scp root@<server-ip>:/etc/rancher/k3s/k3s.yaml ~/.kube/config

Step 3: Update kubeconfig

Edit the kubeconfig file to point to the correct server address:

# Replace the loopback address (127.0.0.1) with the actual server IP
# macOS/Linux:
sed -i '' 's/127.0.0.1/<server-ip>/g' ~/.kube/config  # macOS
sed -i 's/127.0.0.1/<server-ip>/g' ~/.kube/config    # Linux

# Or manually edit ~/.kube/config and change:
# server: https://127.0.0.1:6443
# to:
# server: https://<server-ip>:6443
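
The rewrite is easy to sanity-check on a scratch copy before touching the real ~/.kube/config. A minimal sketch (GNU sed shown; 192.0.2.10 stands in for your server IP):

```shell
# Scratch kubeconfig fragment with the default k3s loopback address
cat > /tmp/kubeconfig-demo <<'EOF'
clusters:
- cluster:
    server: https://127.0.0.1:6443
EOF

# Rewrite the loopback address to the real server IP (placeholder here)
sed -i 's/127\.0\.0\.1/192.0.2.10/g' /tmp/kubeconfig-demo

# Confirm the server line now points at the cluster
grep 'server:' /tmp/kubeconfig-demo
```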

Step 4: Verify connectivity

kubectl get nodes

Managing Multiple Clusters

If you manage multiple Kubernetes clusters from the same machine, you can maintain multiple kubeconfig files:

# Set KUBECONFIG environment variable to include multiple config files
export KUBECONFIG=~/.kube/config-prod:~/.kube/config-lab

# View all contexts
kubectl config get-contexts

# Switch between clusters
kubectl config use-context <context-name>

# View current context
kubectl config current-context

For more information, see the official Kubernetes documentation: Organizing Cluster Access
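
With more than two clusters, assembling the colon-separated KUBECONFIG value by hand gets error-prone. A sketch that joins every config-* file in a directory (the directory and filenames here are hypothetical):

```shell
# Demo directory holding one kubeconfig per cluster (hypothetical names)
mkdir -p /tmp/kube-demo
touch /tmp/kube-demo/config-lab /tmp/kube-demo/config-prod

# Join all of them into a single colon-separated KUBECONFIG value
KUBECONFIG=$(printf '%s\n' /tmp/kube-demo/config-* | paste -sd: -)
export KUBECONFIG
echo "$KUBECONFIG"
```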

Helm Commands

Helm releases are managed cluster-wide:

# List all releases
helm list

# View release history
helm history acd-manager

# Get deployed values
helm get values acd-manager -o yaml

# Get deployed manifest
helm get manifest acd-manager

Note: If using remote kubectl, ensure helm is installed on your local machine. See Helm Installation for instructions.

Backup Procedures

PostgreSQL Backup

PostgreSQL is managed by the CloudNativePG operator, which provides continuous backup capabilities.

# Check backup status
kubectl get backup

# Create manual backup
kubectl apply -f - <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: manual-backup-$(date +%Y%m%d-%H%M%S)
spec:
  cluster:
    name: acd-cluster-postgresql
EOF

# List available backups
kubectl get backup -o wide

# Restore from backup (requires downtime)
# See Upgrade Guide for restore procedures
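
The heredoc above can be wrapped in a small helper that stamps a unique name. This sketch only renders the manifest to stdout; on a live cluster, pipe it to kubectl apply -f - (the function name is illustrative):

```shell
# Render a timestamped CNPG Backup manifest (hypothetical helper)
render_backup_manifest() {
  local cluster=$1
  local stamp
  stamp=$(date +%Y%m%d-%H%M%S)
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: manual-backup-${stamp}
spec:
  cluster:
    name: ${cluster}
EOF
}

# On a live cluster: render_backup_manifest acd-cluster-postgresql | kubectl apply -f -
render_backup_manifest acd-cluster-postgresql
```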

Longhorn Volume Backups

Longhorn provides snapshot and backup capabilities for persistent volumes:

# List all volumes
kubectl get volumes -n longhorn-system

# Create snapshot via Longhorn UI
# Port-forward to Longhorn UI (do not expose via ingress)
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

# Access: http://localhost:8080
# WARNING: Longhorn UI grants access to sensitive storage information
# and should never be exposed through the ingress controller

Accessing Internal Services

For debugging and troubleshooting, you may need direct access to internal services.

PostgreSQL

PostgreSQL is managed by the CloudNativePG operator. Connection details are stored in the acd-cluster-postgresql-app Secret:

# View connection details
kubectl describe secret acd-cluster-postgresql-app

# Extract individual fields
PG_HOST=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.host}' | base64 -d)
PG_USER=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.username}' | base64 -d)
PG_PASS=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.password}' | base64 -d)
PG_DB=$(kubectl get secret acd-cluster-postgresql-app -o jsonpath='{.data.dbname}' | base64 -d)

# Connect via psql
kubectl exec -it acd-cluster-postgresql-0 -- psql -U $PG_USER -d $PG_DB

Secret fields: The CNPG operator populates the following fields: username, password, host, port, dbname, uri, jdbc-uri, fqdn-uri, fqdn-jdbc-uri, pgpass.
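
The jsonpath-plus-base64 pattern above can be verified without a cluster, since Secret data values are plain base64 text (a typical CNPG read-write service name is used here as a stand-in for the real host field):

```shell
# kubectl returns Secret .data fields still base64-encoded,
# which is why each one is piped through `base64 -d`
encoded=$(printf '%s' 'acd-cluster-postgresql-rw' | base64)

# Decoding recovers the value the application actually sees
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"
```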

Redis

Redis runs on port 6379 with no authentication:

# Connect via redis-cli
kubectl exec -it acd-manager-redis-master-0 -- redis-cli

# Or connect from another pod
kubectl run redis-test --rm -it --image=redis -- redis-cli -h acd-manager-redis-master

Kafka

Kafka is accessible on port 9095 from any cluster node:

# Connect from within cluster
kubectl exec -it acd-manager-kafka-controller-0 -- kafka-topics.sh --bootstrap-server localhost:9092 --list

# Connect from external (via any node IP)
kafka-topics.sh --bootstrap-server <node-ip>:9095 --list

The selection_input topic is pre-configured for selection input events.

Longhorn Storage

Longhorn is a distributed block storage system for Kubernetes that provides persistent volumes for stateful applications such as PostgreSQL and Kafka.

Architecture

Longhorn deploys controller and replica engines on each node, forming a distributed storage system. When a volume is created, Longhorn replicates data across multiple nodes to ensure durability even in the event of node failures.

Storage Protocols:

  • iSCSI: Used for standard Read-Write-Once (RWO) volumes
  • NFS: Used for Read-Write-Many (RWX) volumes that can be mounted by multiple pods simultaneously

Configuration

The CDN Manager deploys Longhorn with a single replica configuration, which differs from the Longhorn default of 3 replicas. This configuration is optimized for the cluster architecture where:

  • Pod-node affinity is configured to schedule pods on the same node as their persistent volume data
  • This optimizes I/O performance by reducing network traffic
  • Data locality is maintained while still providing volume portability

Capacity Planning

Longhorn storage requires an additional 30% capacity headroom for internal operations and scaling. If less than 30% of the total partition capacity is available, Longhorn may mark volumes as “full” and prevent further writes.

For detailed storage requirements and disk partitioning guidance, see the System Requirements Guide.
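
The 30% rule reduces to simple integer arithmetic on df output. A sketch with a pure helper (the function name and sample sizes are illustrative; feed it the real numbers from df -Pk on the Longhorn data partition):

```shell
# Succeed when the free fraction of a partition meets a minimum percentage
pct_free_ok() {
  local total_kb=$1 avail_kb=$2 min_pct=$3
  [ $(( avail_kb * 100 / total_kb )) -ge "$min_pct" ]
}

# Example: a 100 GiB partition with only 25 GiB free is below the 30% headroom
if ! pct_free_ok 104857600 26214400 30; then
  echo "below 30% headroom - Longhorn may mark volumes full"
fi
```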

Configuration Backup

Always backup your Helm values before making changes:

# Export the currently deployed values
helm get values acd-manager -o yaml > ~/deployed-values-$(date +%Y%m%d).yaml

# Backup the custom values file
cp ~/values.yaml ~/values-backup-$(date +%Y%m%d).yaml

Backup Schedule Recommendations

Component            Frequency             Retention
PostgreSQL           Daily                 30 days
Longhorn Snapshots   Before changes        7 days
Configuration        Before each change    Indefinite
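
For the daily PostgreSQL backups, the CNPG operator also offers a ScheduledBackup resource, so the schedule lives in the cluster rather than in cron. This sketch only renders the manifest; verify the schedule syntax (six cron fields, starting with seconds) against your operator version before applying:

```shell
# Render a daily 03:00 ScheduledBackup manifest (apply with kubectl on a real cluster)
cat > /tmp/scheduled-backup.yaml <<'EOF'
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: daily-backup
spec:
  schedule: "0 0 3 * * *"
  cluster:
    name: acd-cluster-postgresql
EOF

# kubectl apply -f /tmp/scheduled-backup.yaml
cat /tmp/scheduled-backup.yaml
```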

Updating MaxMind GeoIP Databases

The MaxMind GeoIP databases (GeoIP2-City, GeoLite2-ASN, GeoIP2-Anonymous-IP) are used for GeoIP-based routing and validation features. These databases should be updated periodically to ensure accurate IP geolocation data.

Prerequisites

  • Updated MaxMind database files (.mmdb format) obtained from MaxMind
  • Access to the cluster via kubectl
  • Helm CLI installed

Update Procedure

Step 1: Create New Volume with Updated Databases

Run the volume generation utility with a unique volume name that includes a revision identifier:

# Mount the installation ISO if not already mounted
mkdir -p /mnt/esb3027
mount -o loop,ro esb3027-acd-manager-X.Y.Z.iso /mnt/esb3027

# Generate new volume with updated databases
/mnt/esb3027/generate-maxmind-volume

When prompted:

  1. Provide the paths to the three database files:
    • GeoIP2-City.mmdb
    • GeoLite2-ASN.mmdb
    • GeoIP2-Anonymous-IP.mmdb
  2. Enter a unique volume name with a revision number or date, for example:
    • maxmind-geoip-2026-04
    • maxmind-geoip-v2

Tip: Using a revision-based naming convention simplifies rollback if needed.

Step 2: Update Helm Configuration

Edit your values.yaml file to reference the new volume:

manager:
  maxmindDbVolume: maxmind-geoip-2026-04

Replace maxmind-geoip-2026-04 with the volume name you specified in Step 1.
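
The edit can also be scripted. This sketch rewrites a scratch copy of the file (GNU sed shown; on a real node, point it at ~/values.yaml). The month-stamped name follows the convention suggested in Step 1:

```shell
# Scratch values file mirroring the snippet above
cat > /tmp/values-demo.yaml <<'EOF'
manager:
  maxmindDbVolume: maxmind-geoip-v1
EOF

# Swap in a month-stamped volume name
new_volume="maxmind-geoip-$(date +%Y-%m)"
sed -i "s/^\(  maxmindDbVolume:\).*/\1 ${new_volume}/" /tmp/values-demo.yaml

grep maxmindDbVolume /tmp/values-demo.yaml
```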

Step 3: Apply Configuration Update

Upgrade the Helm release with the updated configuration:

helm upgrade acd-manager /mnt/esb3027/charts/acd-manager --values ~/values.yaml

Step 4: Rolling Restart (Optional)

To ensure all pods immediately use the new database files, perform a rolling restart of the manager deployment:

kubectl rollout restart deployment acd-manager

Monitor the rollout status:

kubectl rollout status deployment acd-manager

Step 5: Verify Update

Verify the pods are running with the new volume:

kubectl get pods
kubectl describe pod -l app.kubernetes.io/component=manager | grep -A 5 "Volumes"

Step 6: Clean Up Old Volume (Optional)

After verifying the new databases are working correctly, you can delete the old persistent volume:

# List persistent volumes to find the old one
kubectl get pv

# Delete the old volume
kubectl delete pv <old-volume-name>

Caution: Ensure the new volume is functioning correctly before deleting the old volume. Keep the old volume for at least 24-48 hours as a rollback option.

Rollback Procedure

If issues occur after updating the databases:

  1. Revert the maxmindDbVolume value in your values.yaml to the previous volume name
  2. Run helm upgrade with the reverted configuration
  3. Optionally restart the deployment: kubectl rollout restart deployment acd-manager

Update Frequency Recommendations

Database              Recommended Update Frequency
GeoIP2-City           Weekly or monthly
GeoLite2-ASN          Monthly
GeoIP2-Anonymous-IP   Weekly or monthly

MaxMind releases database updates on a regular schedule. Subscribe to MaxMind notifications to stay informed of new releases.

Log Management

Application Logs

# View manager logs
kubectl logs -l app.kubernetes.io/component=manager

# Follow logs in real-time
kubectl logs -l app.kubernetes.io/component=manager -f

# View logs from specific pod
kubectl logs <pod-name>

# View previous instance logs (after crash)
kubectl logs <pod-name> -p

# View logs with timestamps
kubectl logs <pod-name> --timestamps

# View logs from all containers in pod
kubectl logs <pod-name> --all-containers

Component-Specific Logs

# Zitadel logs
kubectl logs -l app.kubernetes.io/name=zitadel

# Gateway logs
kubectl logs -l app.kubernetes.io/component=gateway

# Confd logs
kubectl logs -l app.kubernetes.io/component=confd

# MIB Frontend logs
kubectl logs -l app.kubernetes.io/component=mib-frontend

# PostgreSQL logs
kubectl logs -l app.kubernetes.io/name=postgresql

# Kafka logs
kubectl logs -l app.kubernetes.io/name=kafka

# Redis logs
kubectl logs -l app.kubernetes.io/name=redis

Log Aggregation

Logs are collected by Telegraf and sent to VictoriaMetrics:

# Access Grafana for log visualization
# https://<manager-host>/grafana

# Query logs via Grafana Explore
# Select VictoriaMetrics datasource and use log queries

Log Rotation

Container logs are automatically rotated by Kubernetes:

  • Default max size: 10MB per container log file
  • Default max files: 5 rotated files
  • Total per container: ~50MB maximum

Scaling Operations

Manual Scaling

Note: If an HPA (Horizontal Pod Autoscaler) exists for a deployment, it will override manual scaling changes. To scale manually, first delete the HPA or pin its minimum and maximum replica counts.

# Check if HPA is enabled
kubectl get hpa

# Pin the replica count before manual scaling (maxReplicas is a required
# field, so the HPA cannot be disabled by patching it to null)
kubectl patch hpa acd-manager -p '{"spec": {"minReplicas": 3, "maxReplicas": 3}}'

# Or delete the HPA entirely
kubectl delete hpa acd-manager

# Scale manager replicas
kubectl scale deployment acd-manager --replicas=3

# Scale gateway replicas
kubectl scale deployment acd-manager-gateway --replicas=2

# Scale MIB frontend replicas
kubectl scale deployment acd-manager-mib-frontend --replicas=2

HPA Configuration

# View HPA status
kubectl get hpa

# Describe HPA details
kubectl describe hpa acd-manager

# Edit HPA configuration
kubectl edit hpa acd-manager

Configuration Updates

Updating Helm Values

# Edit values file
vi ~/values.yaml

# Validate with dry-run
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --dry-run

# Apply changes
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml

# Verify rollout
kubectl rollout status deployment/acd-manager

Rolling Back Changes

# View revision history
helm history acd-manager

# Rollback to previous revision
helm rollback acd-manager

# Rollback to specific revision
helm rollback acd-manager <revision>

# Verify rollback
helm history acd-manager

Certificate Management

Checking Certificate Expiration

# Check TLS secret expiration
kubectl get secret acd-manager-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

# Check via Grafana dashboard
# Certificate expiration metrics are available in Grafana
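
For alerting, a days-until-expiry number is more useful than raw dates. A sketch using a throwaway self-signed certificate as a stand-in for the decoded tls.crt (GNU date syntax):

```shell
# Throwaway 30-day self-signed cert standing in for the cluster's tls.crt
openssl req -x509 -newkey rsa:2048 -nodes -days 30 \
  -keyout /tmp/demo.key -out /tmp/demo.crt \
  -subj "/CN=acd-manager.example" 2>/dev/null

# Compute days until expiry
end=$(openssl x509 -enddate -noout -in /tmp/demo.crt | cut -d= -f2)
days_left=$(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
echo "certificate expires in ${days_left} days"
```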

Renewing Certificates

# For Helm-managed self-signed certificates
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
  --values ~/values.yaml \
  --set ingress.selfSigned=true

# For manual certificates, update the secret
kubectl create secret tls acd-manager-tls \
  --cert=new-tls.crt \
  --key=new-tls.key \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart pods to pick up new certificate
kubectl rollout restart deployment acd-manager

Health Checks

Component Health

# Check all pods
kubectl get pods

# Check specific component
kubectl get pods -l app.kubernetes.io/component=manager

# Check persistent volumes
kubectl get pvc

# Check cluster status
kubectl get nodes

# Check ingress
kubectl get ingress

API Health Endpoints

# Liveness check
curl -k https://<manager-host>/api/v1/health/alive

# Readiness check
curl -k https://<manager-host>/api/v1/health/ready
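
After an upgrade or restart, the readiness endpoint can be polled until it answers. wait_ready is a hypothetical helper, not part of the product:

```shell
# Poll a health endpoint until it returns success, failing after a timeout
wait_ready() {
  local url=$1 timeout=${2:-60} elapsed=0
  until curl -ksf "$url" >/dev/null 2>&1; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 2
    elapsed=$((elapsed + 2))
  done
  return 0
}

# Example: wait_ready "https://<manager-host>/api/v1/health/ready" 120
```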

Database Health

# PostgreSQL cluster status
kubectl get clusters -n default

# Check PostgreSQL pods
kubectl get pods -l app.kubernetes.io/name=postgresql

# Kafka cluster status
kubectl get pods -l app.kubernetes.io/name=kafka

# Redis status
kubectl get pods -l app.kubernetes.io/name=redis

Maintenance Windows

Planned Maintenance

Before performing maintenance:

  1. Notify users of potential service impact
  2. Verify backups are current
  3. Document the maintenance procedure
  4. Prepare rollback plan

Node Maintenance

# Cordon node to prevent new pods
kubectl cordon <node-name>

# Drain node (evict pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Perform maintenance

# Uncordon node
kubectl uncordon <node-name>

Cluster Upgrades

See the Upgrade Guide for cluster upgrade procedures.

Troubleshooting Quick Reference

Common Commands

# Describe problematic pod
kubectl describe pod <pod-name>

# View pod events
kubectl get events --sort-by='.lastTimestamp'

# Check resource usage
kubectl top pods
kubectl top nodes

# Exec into container
kubectl exec -it <pod-name> -- /bin/sh

# Check network policies
kubectl get networkpolicies

# Check service endpoints
kubectl get endpoints

Restarting Components

# Restart deployment
kubectl rollout restart deployment/<deployment-name>

# Restart statefulset
kubectl rollout restart statefulset/<statefulset-name>

# Delete pod (auto-recreated)
kubectl delete pod <pod-name>

Security Operations

Rotating Service Account Tokens

# Delete service account secret (auto-regenerated)
kubectl delete secret <service-account-token-secret>

# Tokens are automatically regenerated

Updating RBAC Permissions

# View current roles
kubectl get roles
kubectl get clusterroles

# View role bindings
kubectl get rolebindings
kubectl get clusterrolebindings

# Edit role
kubectl edit role <role-name>

Audit Log Access

# K3s audit logs location
/var/lib/rancher/k3s/server/logs/audit.log

# View recent audit events
tail -f /var/lib/rancher/k3s/server/logs/audit.log

Disaster Recovery

Pod Recovery

Pods are automatically recreated if they fail:

# Check pod status
kubectl get pods

# If pod is stuck in Terminating
kubectl delete pod <pod-name> --force --grace-period=0

# If pod is stuck in Pending, check resources
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'

Node Failure Recovery

When a node fails:

  1. Automatic: Pods are rescheduled on healthy nodes (after timeout)
  2. Manual: Force delete stuck pods
# Force delete pods on failed node
kubectl delete pod --all --force --grace-period=0 \
  --field-selector spec.nodeName=<failed-node>

Data Recovery

For data recovery scenarios, refer to:

  • PostgreSQL: CloudNativePG backup/restore procedures
  • Longhorn: Volume snapshot restoration
  • Kafka: Partition replication handles node failures

Routine Maintenance Checklist

Daily

  • Review Grafana dashboards for anomalies
  • Check alert notifications
  • Verify backup completion

Weekly

  • Review pod restart counts
  • Check certificate expiration dates
  • Review log storage usage
  • Verify HPA is functioning correctly

Monthly

  • Test backup restoration procedure
  • Review and rotate credentials if needed
  • Update documentation if configuration changed
  • Review resource utilization trends

Next Steps

After mastering operations:

  1. Troubleshooting Guide - Deep dive into problem resolution
  2. Performance Tuning Guide - Optimize system performance
  3. Metrics & Monitoring Guide - Comprehensive monitoring setup
  4. API Guide - REST API reference and automation