Troubleshooting Guide
Overview
This guide provides troubleshooting procedures for common issues encountered when operating the AgileTV CDN Manager (ESB3027). Use the diagnostic commands and resolution steps to identify and resolve problems.
Diagnostic Tools
Cluster Status
# Check node status
kubectl get nodes
# Check all pods
kubectl get pods -A
# Check events sorted by time
kubectl get events --sort-by='.lastTimestamp'
# Check resource usage
kubectl top nodes
kubectl top pods
Component Status
# Check deployments
kubectl get deployments
# Check statefulsets
kubectl get statefulsets
# Check persistent volumes
kubectl get pvc
kubectl get pv
# Check services
kubectl get services
# Check ingress
kubectl get ingress
Common Issues
Pods Stuck in Pending State
Symptoms: Pods remain in Pending state indefinitely.
Causes:
- Insufficient cluster resources (CPU/memory)
- No nodes match scheduling constraints
- PersistentVolume not available
Diagnosis:
# Describe the pending pod
kubectl describe pod <pod-name>
# Check events for scheduling failures
kubectl get events --field-selector reason=FailedScheduling
# Check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
# Check available PVs
kubectl get pv
Resolution:
# Free up resources by scaling down non-critical workloads
kubectl scale deployment <deployment> --replicas=0
# Or add additional nodes to the cluster
# If the PV is stuck, delete the claim and pod so they are recreated
# Warning: depending on the reclaim policy, deleting a PVC can permanently delete its data
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>
Pods Stuck in ContainerCreating
Symptoms: Pods remain in ContainerCreating state.
Causes:
- Image pull failures
- Volume mount issues
- Network configuration problems
Diagnosis:
kubectl describe pod <pod-name>
# Check for image pull errors
kubectl get events | grep -i "failed to pull"
# Check volume mount status
kubectl get events | grep -i "mount"
Resolution:
# For image pull issues, verify image exists and credentials
kubectl get secret <pull-secret-name> -o yaml
# For volume issues, check Longhorn volume status
kubectl get volumes -n longhorn-system
# Delete stuck pod to trigger recreation
kubectl delete pod <pod-name> --force --grace-period=0
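Beyond confirming the pull secret exists, the registry credentials embedded in it can be decoded and checked directly; a sketch, where the secret name is a placeholder:

```shell
# Decode the embedded registry credentials to verify the server, username,
# and password are correct; <pull-secret-name> is a placeholder.
# The leading dot in the key name must be escaped in jsonpath.
kubectl get secret <pull-secret-name> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
```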
Persistent Volume Mount Failures
Symptoms: Pod fails to start with an error such as "AttachVolume.Attach failed for volume … is not ready for workloads" or similar volume attachment errors.
Causes:
- Longhorn volume created but unable to be successfully mounted
- Network connectivity issues between nodes (Longhorn requires iSCSI and NFS traffic)
- Longhorn service unhealthy
- Incorrect storage class configuration
Diagnosis:
# Describe the failing pod to see the error
kubectl describe pod <pod-name>
# Check Longhorn volumes status
kubectl get volumes -n longhorn-system
# Check Longhorn UI for detailed volume status
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# Access: http://localhost:8080
Resolution:
# Verify firewall allows Longhorn traffic between nodes
# Ports 9500 and 8500 must be open (see Networking Guide)
# Check Longhorn is healthy
kubectl get pods -n longhorn-system
# If volume is stuck, delete PVC and pod to trigger recreation
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>
Pods in CrashLoopBackOff
Symptoms: Pods repeatedly crash and restart.
Causes:
- Application configuration errors
- Missing dependencies (database not ready)
- Resource limits too low
- Liveness probe failures
Diagnosis:
# View current logs
kubectl logs <pod-name>
# View previous instance logs
kubectl logs <pod-name> -p
# Describe pod for restart reasons
kubectl describe pod <pod-name>
# Check if dependencies are healthy
kubectl get pods | grep -E "(postgres|kafka|redis)"
Resolution:
# For dependency issues, wait for dependencies to be ready
kubectl wait --for=condition=Ready pod/<dependency-pod> --timeout=300s
# For resource issues, increase limits
kubectl edit deployment <deployment-name>
# For configuration issues, check ConfigMaps and Secrets
kubectl get configmap <configmap-name> -o yaml
kubectl get secret <secret-name> -o yaml
# Restart the deployment
kubectl rollout restart deployment/<deployment-name>
Pods in Terminating State
Symptoms: Pods stuck in Terminating state indefinitely.
Causes:
- Volume detachment issues
- Node communication problems
- Finalizer blocking deletion
Diagnosis:
kubectl describe pod <pod-name>
# Check if node is reachable
kubectl get nodes
# Check finalizers
kubectl get pod <pod-name> -o jsonpath='{.metadata.finalizers}'
Resolution:
# Force delete the pod
kubectl delete pod <pod-name> --force --grace-period=0
# If node is unreachable, drain and remove from cluster
kubectl drain <node-name> --ignore-daemonsets --force
kubectl delete node <node-name>
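If a finalizer is what blocks deletion (one of the causes listed above), it can be cleared so the pod is removed; a hedged sketch, with the pod name as a placeholder:

```shell
# Clear the finalizers so deletion can complete. Use with caution:
# skipping finalizers can leave external resources (e.g. attached
# volumes) uncleaned. <pod-name> is a placeholder.
kubectl patch pod <pod-name> -p '{"metadata":{"finalizers":null}}'
```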
Service Unreachable
Symptoms: Service endpoints not accessible.
Causes:
- No ready pods backing the service
- Network policy blocking traffic
- Service port mismatch
Diagnosis:
# Check service endpoints
kubectl get endpoints <service-name>
# Check if pods are ready
kubectl get pods -l app=<label>
# Check network policies
kubectl get networkpolicies
# Test connectivity from within cluster
kubectl run test --rm -it --image=busybox -- wget -O- <service-name>:<port>
Resolution:
# Ensure pods are ready and matching service selector
kubectl get pods --show-labels
# Check service selector matches pod labels
kubectl get service <service-name> -o jsonpath='{.spec.selector}'
# Temporarily disable network policy for testing
kubectl edit networkpolicy <policy-name>
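For the "service port mismatch" cause, the service's targetPort can be compared against the container ports a backing pod actually exposes; a sketch with placeholder names:

```shell
# Compare the service's targetPort with the container ports exposed by a
# backing pod; a mismatch means traffic is forwarded to a closed port.
# <service-name> and <pod-name> are placeholders.
kubectl get service <service-name> -o jsonpath='{.spec.ports[*].targetPort}'
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].ports[*].containerPort}'
```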
Ingress Not Working
Symptoms: External access via ingress fails.
Causes:
- Traefik ingress controller not running
- Ingress configuration errors
- TLS certificate issues
- DNS resolution problems
Diagnosis:
# Check Traefik pods
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik
# Check ingress resources
kubectl get ingress
# Describe ingress for errors
kubectl describe ingress <ingress-name>
# Check Traefik logs
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik
# Test DNS resolution
nslookup <hostname>
Resolution:
# Restart Traefik
kubectl rollout restart deployment -n kube-system traefik
# Fix ingress configuration
kubectl edit ingress <ingress-name>
# Renew or recreate TLS secret
kubectl create secret tls <secret-name> --cert=tls.crt --key=tls.key \
--dry-run=client -o yaml | kubectl apply -f -
# Verify hostname matches certificate
openssl x509 -in tls.crt -noout -subject -issuer
Database Connection Failures
Symptoms: Application cannot connect to PostgreSQL.
Causes:
- PostgreSQL cluster not ready
- Connection pool exhausted
- Network connectivity issues
- Authentication failures
Diagnosis:
# Check PostgreSQL cluster status
kubectl get clusters
# Check PostgreSQL pods
kubectl get pods -l app.kubernetes.io/name=postgresql
# Check PostgreSQL logs
kubectl logs -l app.kubernetes.io/name=postgresql
# Test connectivity
kubectl exec -it <app-pod> -- psql -h <postgres-service> -U <user> -d <database>
Resolution:
# Wait for PostgreSQL to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=postgresql --timeout=300s
# Check the connection string in the application config (decode one key at a time;
# piping the whole .data map into base64 -d does not work)
kubectl get secret <secret-name> -o jsonpath='{.data.<key>}' | base64 -d
# Restart application pods
kubectl rollout restart deployment/<deployment-name>
Kafka Connection Issues
Symptoms: Application cannot connect to Kafka.
Causes:
- Kafka controllers not ready
- Topic not created
- Network connectivity issues
Diagnosis:
# Check Kafka pods
kubectl get pods -l app.kubernetes.io/name=kafka
# Check Kafka logs
kubectl logs -l app.kubernetes.io/name=kafka
# List topics
kubectl exec -it <kafka-pod> -- kafka-topics.sh --bootstrap-server localhost:9092 --list
Resolution:
# Wait for Kafka controllers to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=kafka --timeout=300s
# Create missing topic
kubectl exec -it <kafka-pod> -- kafka-topics.sh --bootstrap-server localhost:9092 \
--create --topic <topic-name> --partitions 3 --replication-factor 3
# Restart application to reconnect
kubectl rollout restart deployment/<deployment-name>
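Where the topic exists but clients still fail, inspecting its partition state can help; a sketch, with the pod and topic names as placeholders:

```shell
# Inspect partition leaders and in-sync replicas (ISR) for a topic;
# a missing leader or shrunken ISR explains client connection errors.
# <kafka-pod> and <topic-name> are placeholders.
kubectl exec -it <kafka-pod> -- kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic <topic-name>
```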
Redis Connection Issues
Symptoms: Application cannot connect to Redis.
Diagnosis:
# Check Redis pods
kubectl get pods -l app.kubernetes.io/name=redis
# Check Redis logs
kubectl logs -l app.kubernetes.io/name=redis
# Test connectivity
kubectl exec -it <redis-pod> -- redis-cli ping
Resolution:
# Wait for Redis to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=redis --timeout=300s
# Restart application
kubectl rollout restart deployment/<deployment-name>
High Memory Usage
Symptoms: Pods approaching or hitting memory limits.
Diagnosis:
# Check memory usage
kubectl top pods
# Check for OOMKilled containers (the termination reason appears in the pod description)
kubectl describe pod <pod-name> | grep -i oomkilled
# List failed pods
kubectl get pods --field-selector=status.phase=Failed
# Check for memory leaks in logs
kubectl logs <pod-name> | grep -i "memory\|oom"
Resolution:
# Temporarily increase memory limit
kubectl edit deployment <deployment-name>
# Or scale horizontally if HPA is enabled
kubectl scale deployment <deployment-name> --replicas=<n>
# Long-term: Update values.yaml and perform helm upgrade
High CPU Usage
Symptoms: Pods consistently using high CPU.
Diagnosis:
# Check CPU usage
kubectl top pods
# Check for runaway processes
kubectl top pods --sort-by=cpu
Resolution:
# Scale horizontally if HPA is enabled
kubectl scale deployment <deployment-name> --replicas=<n>
# Or increase CPU limits
kubectl edit deployment <deployment-name>
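If no HorizontalPodAutoscaler exists yet, one can be created so scaling happens automatically instead of manually; a sketch in which the target utilization and replica bounds are assumptions to tune per workload:

```shell
# Create an HPA targeting 80% average CPU utilization; the min/max
# replica counts here are assumptions, adjust them for the workload.
kubectl autoscale deployment <deployment-name> --cpu-percent=80 --min=2 --max=6
```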
Persistent Volume Issues
Symptoms: PVC not binding or volume errors.
Diagnosis:
# Check PVC status
kubectl get pvc
# Check PV status
kubectl get pv
# Check Longhorn volumes
kubectl get volumes -n longhorn-system
# Check Longhorn UI for details
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
Resolution:
# For stuck PVC, delete and recreate
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>
# For Longhorn issues, check Longhorn UI
# Access via http://localhost:8080
# Recreate Longhorn volume if necessary
Zitadel Authentication Failures
Symptoms: Users cannot authenticate via Zitadel.
Causes:
- CORS configuration mismatch
- External domain misconfigured
- Zitadel pods not healthy
Diagnosis:
# Check Zitadel pods
kubectl get pods -l app.kubernetes.io/name=zitadel
# Check Zitadel logs
kubectl logs -l app.kubernetes.io/name=zitadel
# Verify external domain configuration
helm get values acd-manager -o yaml | grep -A 5 zitadel
Resolution:
# Ensure global.hosts.manager[0].host matches zitadel.zitadel.ExternalDomain
# Update values.yaml if needed
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
--values ~/values.yaml
# Restart Zitadel
kubectl rollout restart deployment -l app.kubernetes.io/name=zitadel
Certificate Errors
Symptoms: TLS/SSL errors in browser or API calls.
Diagnosis:
# Check certificate expiration
kubectl get secret <tls-secret> -o jsonpath='{.data.tls\.crt}' | base64 -d | \
openssl x509 -noout -dates
# Check certificate subject
kubectl get secret <tls-secret> -o jsonpath='{.data.tls\.crt}' | base64 -d | \
openssl x509 -noout -subject -issuer
Resolution:
# Renew self-signed certificate
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
--values ~/values.yaml \
--set ingress.selfSigned=true
# Or update manual certificate
kubectl create secret tls <secret-name> \
--cert=new-cert.crt --key=new-key.key \
--dry-run=client -o yaml | kubectl apply -f -
# Restart pods to pick up new certificate
kubectl rollout restart deployment <deployment-name>
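It can also help to inspect the certificate actually served at the endpoint, rather than the one stored in the secret, to confirm the new certificate was picked up; a sketch, with the hostname as a placeholder:

```shell
# Show validity dates and subject of the certificate served by the
# ingress endpoint; <hostname> is a placeholder.
echo | openssl s_client -connect <hostname>:443 -servername <hostname> 2>/dev/null | \
  openssl x509 -noout -dates -subject
```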
Log Collection
Collecting Logs for Support
# Capture timestamp once to ensure consistency
TS=$(date +%Y%m%d-%H%M%S)
# Create log collection directory
mkdir -p ~/cdn-logs-$TS
cd ~/cdn-logs-$TS
# Collect pod logs
for pod in $(kubectl get pods -o name); do
kubectl logs $pod > ${pod#pod/}.log 2>&1
kubectl logs $pod -p > ${pod#pod/}.previous.log 2>&1 || true
done
# Collect cluster events
kubectl get events --sort-by='.lastTimestamp' > events.log
# Collect pod descriptions
for pod in $(kubectl get pods -o name); do
kubectl describe $pod > ${pod#pod/}.describe.txt
done
# Compress for transfer
tar czf cdn-logs-$TS.tar.gz *.log *.txt
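The `${pod#pod/}` expansions above strip the `pod/` prefix that `kubectl get pods -o name` emits, so each log file is named after the bare pod name; a standalone illustration (the pod name is an example):

```shell
# `kubectl get pods -o name` emits names like "pod/<name>"; the
# #-expansion removes the shortest leading match of "pod/".
pod="pod/acd-manager-7d9f"   # example value in the -o name format
echo "${pod#pod/}"           # → acd-manager-7d9f
```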
Emergency Procedures
Complete Cluster Recovery
If the cluster is completely down:
Assess node status:
kubectl get nodes
Restart K3s on nodes:
# On each node
systemctl restart k3s
If the primary server failed:
- Promote another server node
- Update load balancer/DNS to point to new primary
Restore from backup if necessary:
- See Upgrade Guide for restore procedures
Data Recovery
For data recovery scenarios:
- PostgreSQL: Use CloudNativePG backup/restore
- Longhorn: Restore from volume snapshots
- Kafka: Replication handles most failures
Getting Help
If issues persist:
- Collect logs using the procedure above
- Check release notes for known issues
- Contact support with log bundle and issue description
Next Steps
After resolving issues:
- Operations Guide - Preventive maintenance procedures
- Configuration Guide - Verify configuration is correct
- Architecture Guide - Understand component dependencies