Troubleshooting Guide
Overview
This guide provides troubleshooting procedures for common issues encountered when operating the AgileTV CDN Manager (ESB3027). Use the diagnostic commands and resolution steps to identify and resolve problems.
Diagnostic Tools
Cluster Status
# Check node status
kubectl get nodes
# Check all pods
kubectl get pods -A
# Check events sorted by time
kubectl get events --sort-by='.lastTimestamp'
# Check resource usage
kubectl top nodes
kubectl top pods
Component Status
# Check deployments
kubectl get deployments
# Check statefulsets
kubectl get statefulsets
# Check persistent volumes
kubectl get pvc
kubectl get pv
# Check services
kubectl get services
# Check ingress
kubectl get ingress
Common Issues
Pods Stuck in Pending State
Symptoms: Pods remain in Pending state indefinitely.
Causes:
- Insufficient cluster resources (CPU/memory)
- No nodes match scheduling constraints
- PersistentVolume not available
Diagnosis:
# Describe the pending pod
kubectl describe pod <pod-name>
# Check events for scheduling failures
kubectl get events --field-selector reason=FailedScheduling
# Check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
# Check available PVs
kubectl get pv
Resolution:
# Free up resources by scaling down non-critical workloads
kubectl scale deployment <deployment> --replicas=0
# Or add additional nodes to the cluster
# If the PV is stuck, delete the claim and pod so they are recreated
# Warning: depending on the reclaim policy, deleting a PVC can permanently delete its data
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>
Pods Stuck in ContainerCreating
Symptoms: Pods remain in ContainerCreating state.
Causes:
- Image pull failures
- Volume mount issues
- Network configuration problems
Diagnosis:
kubectl describe pod <pod-name>
# Check for image pull errors
kubectl get events | grep -i "failed to pull"
# Check volume mount status
kubectl get events | grep -i "mount"
Resolution:
# For image pull issues, verify image exists and credentials
kubectl get secret <pull-secret-name> -o yaml
# For volume issues, check Longhorn volume status
kubectl get volumes -n longhorn-system
# Delete stuck pod to trigger recreation
kubectl delete pod <pod-name> --force --grace-period=0
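Beyond confirming the pull secret exists, the registry credentials embedded in it can be decoded and checked directly; a sketch, where the secret name is a placeholder:

```shell
# Decode the embedded registry credentials to verify the server, username,
# and password are correct; <pull-secret-name> is a placeholder.
# The leading dot in the key name must be escaped in jsonpath.
kubectl get secret <pull-secret-name> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
```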
Persistent Volume Mount Failures
Symptoms: Pod fails to start with an error such as "AttachVolume.Attach failed for volume … is not ready for workloads" or similar volume attachment errors.
Causes:
- Longhorn volume created but unable to be successfully mounted
- Network connectivity issues between nodes (Longhorn requires iSCSI and NFS traffic)
- Longhorn service unhealthy
- Incorrect storage class configuration
Diagnosis:
# Describe the failing pod to see the error
kubectl describe pod <pod-name>
# Check Longhorn volumes status
kubectl get volumes -n longhorn-system
# Check Longhorn UI for detailed volume status
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# Access: http://localhost:8080
Resolution:
# Verify firewall allows Longhorn traffic between nodes
# Ports 9500 and 8500 must be open (see Networking Guide)
# Check Longhorn is healthy
kubectl get pods -n longhorn-system
# If volume is stuck, delete PVC and pod to trigger recreation
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>
Pods in CrashLoopBackOff
Symptoms: Pods repeatedly crash and restart.
Causes:
- Application configuration errors
- Missing dependencies (database not ready)
- Resource limits too low
- Liveness probe failures
Diagnosis:
# View current logs
kubectl logs <pod-name>
# View previous instance logs
kubectl logs <pod-name> -p
# Describe pod for restart reasons
kubectl describe pod <pod-name>
# Check if dependencies are healthy
kubectl get pods | grep -E "(postgres|kafka|redis)"
Resolution:
# For dependency issues, wait for dependencies to be ready
kubectl wait --for=condition=Ready pod/<dependency-pod> --timeout=300s
# For resource issues, increase limits
kubectl edit deployment <deployment-name>
# For configuration issues, check ConfigMaps and Secrets
kubectl get configmap <configmap-name> -o yaml
kubectl get secret <secret-name> -o yaml
# Restart the deployment
kubectl rollout restart deployment/<deployment-name>
Pods in Terminating State
Symptoms: Pods stuck in Terminating state indefinitely.
Causes:
- Volume detachment issues
- Node communication problems
- Finalizer blocking deletion
Diagnosis:
kubectl describe pod <pod-name>
# Check if node is reachable
kubectl get nodes
# Check finalizers
kubectl get pod <pod-name> -o jsonpath='{.metadata.finalizers}'
Resolution:
# Force delete the pod
kubectl delete pod <pod-name> --force --grace-period=0
# If node is unreachable, drain and remove from cluster
kubectl drain <node-name> --ignore-daemonsets --force
kubectl delete node <node-name>
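If a finalizer is what blocks deletion (one of the causes listed above), it can be cleared so the pod is removed; a hedged sketch, with the pod name as a placeholder:

```shell
# Clear the finalizers so deletion can complete. Use with caution:
# skipping finalizers can leave external resources (e.g. attached
# volumes) uncleaned. <pod-name> is a placeholder.
kubectl patch pod <pod-name> -p '{"metadata":{"finalizers":null}}'
```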
Service Unreachable
Symptoms: Service endpoints not accessible.
Causes:
- No ready pods backing the service
- Network policy blocking traffic
- Service port mismatch
Diagnosis:
# Check service endpoints
kubectl get endpoints <service-name>
# Check if pods are ready
kubectl get pods -l app=<label>
# Check network policies
kubectl get networkpolicies
# Test connectivity from within cluster
kubectl run test --rm -it --image=busybox -- wget -O- <service-name>:<port>
Resolution:
# Ensure pods are ready and matching service selector
kubectl get pods --show-labels
# Check service selector matches pod labels
kubectl get service <service-name> -o jsonpath='{.spec.selector}'
# Temporarily disable network policy for testing
kubectl edit networkpolicy <policy-name>
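For the "service port mismatch" cause, the service's targetPort can be compared against the container ports a backing pod actually exposes; a sketch with placeholder names:

```shell
# Compare the service's targetPort with the container ports exposed by a
# backing pod; a mismatch means traffic is forwarded to a closed port.
# <service-name> and <pod-name> are placeholders.
kubectl get service <service-name> -o jsonpath='{.spec.ports[*].targetPort}'
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].ports[*].containerPort}'
```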
Ingress Not Working
Symptoms: External access via ingress fails.
Causes:
- Traefik ingress controller not running
- Ingress configuration errors
- TLS certificate issues
- DNS resolution problems
Diagnosis:
# Check Traefik pods
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik
# Check ingress resources
kubectl get ingress
# Describe ingress for errors
kubectl describe ingress <ingress-name>
# Check Traefik logs
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik
# Test DNS resolution
nslookup <hostname>
Resolution:
# Restart Traefik
kubectl rollout restart deployment -n kube-system traefik
# Fix ingress configuration
kubectl edit ingress <ingress-name>
# Renew or recreate TLS secret
kubectl create secret tls <secret-name> --cert=tls.crt --key=tls.key \
--dry-run=client -o yaml | kubectl apply -f -
# Verify hostname matches certificate
openssl x509 -in tls.crt -noout -subject -issuer
Database Connection Failures
Symptoms: Application cannot connect to PostgreSQL.
Causes:
- PostgreSQL cluster not ready
- Connection pool exhausted
- Network connectivity issues
- Authentication failures
Diagnosis:
# Check PostgreSQL cluster status
kubectl get clusters
# Check PostgreSQL pods
kubectl get pods -l app.kubernetes.io/name=postgresql
# Check PostgreSQL logs
kubectl logs -l app.kubernetes.io/name=postgresql
# Test connectivity
kubectl exec -it <app-pod> -- psql -h <postgres-service> -U <user> -d <database>
Resolution:
# Wait for PostgreSQL to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=postgresql --timeout=300s
# Check the connection string in the application config (decode one key at a time;
# piping the whole .data map into base64 -d does not work)
kubectl get secret <secret-name> -o jsonpath='{.data.<key>}' | base64 -d
# Restart application pods
kubectl rollout restart deployment/<deployment-name>
Kafka Connection Issues
Symptoms: Application cannot connect to Kafka.
Causes:
- Kafka controllers not ready
- Topic not created
- Network connectivity issues
Diagnosis:
# Check Kafka pods
kubectl get pods -l app.kubernetes.io/name=kafka
# Check Kafka logs
kubectl logs -l app.kubernetes.io/name=kafka
# List topics
kubectl exec -it <kafka-pod> -- kafka-topics.sh --bootstrap-server localhost:9092 --list
Resolution:
# Wait for Kafka controllers to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=kafka --timeout=300s
# Create missing topic
kubectl exec -it <kafka-pod> -- kafka-topics.sh --bootstrap-server localhost:9092 \
--create --topic <topic-name> --partitions 3 --replication-factor 3
# Restart application to reconnect
kubectl rollout restart deployment/<deployment-name>
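Where the topic exists but clients still fail, inspecting its partition state can help; a sketch, with the pod and topic names as placeholders:

```shell
# Inspect partition leaders and in-sync replicas (ISR) for a topic;
# a missing leader or shrunken ISR explains client connection errors.
# <kafka-pod> and <topic-name> are placeholders.
kubectl exec -it <kafka-pod> -- kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic <topic-name>
```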
Redis Connection Issues
Symptoms: Application cannot connect to Redis.
Diagnosis:
# Check Redis pods
kubectl get pods -l app.kubernetes.io/name=redis
# Check Redis logs
kubectl logs -l app.kubernetes.io/name=redis
# Test connectivity
kubectl exec -it <redis-pod> -- redis-cli ping
Resolution:
# Wait for Redis to be ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=redis --timeout=300s
# Restart application
kubectl rollout restart deployment/<deployment-name>
High Memory Usage
Symptoms: Pods approaching or hitting memory limits.
Diagnosis:
# Check memory usage
kubectl top pods
# Check for OOMKilled containers (the termination reason appears in the pod description)
kubectl describe pod <pod-name> | grep -i oomkilled
# List failed pods
kubectl get pods --field-selector=status.phase=Failed
# Check for memory leaks in logs
kubectl logs <pod-name> | grep -i "memory\|oom"
Resolution:
# Temporarily increase memory limit
kubectl edit deployment <deployment-name>
# Or scale horizontally if HPA is enabled
kubectl scale deployment <deployment-name> --replicas=<n>
# Long-term: Update values.yaml and perform helm upgrade
High CPU Usage
Symptoms: Pods consistently using high CPU.
Diagnosis:
# Check CPU usage
kubectl top pods
# Check for runaway processes
kubectl top pods --sort-by=cpu
Resolution:
# Scale horizontally if HPA is enabled
kubectl scale deployment <deployment-name> --replicas=<n>
# Or increase CPU limits
kubectl edit deployment <deployment-name>
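If no HorizontalPodAutoscaler exists yet, one can be created so scaling happens automatically instead of manually; a sketch in which the target utilization and replica bounds are assumptions to tune per workload:

```shell
# Create an HPA targeting 80% average CPU utilization; the min/max
# replica counts here are assumptions, adjust them for the workload.
kubectl autoscale deployment <deployment-name> --cpu-percent=80 --min=2 --max=6
```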
Persistent Volume Issues
Symptoms: PVC not binding or volume errors.
Diagnosis:
# Check PVC status
kubectl get pvc
# Check PV status
kubectl get pv
# Check Longhorn volumes
kubectl get volumes -n longhorn-system
# Check Longhorn UI for details
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
Resolution:
# For stuck PVC, delete and recreate
kubectl delete pvc <pvc-name>
kubectl delete pod <pod-name>
# For Longhorn issues, check Longhorn UI
# Access via http://localhost:8080
# Recreate Longhorn volume if necessary
Zitadel Authentication Failures
Symptoms: Users cannot authenticate via Zitadel.
Causes:
- CORS configuration mismatch
- External domain misconfigured
- Zitadel pods not healthy
Diagnosis:
# Check Zitadel pods
kubectl get pods -l app.kubernetes.io/name=zitadel
# Check Zitadel logs
kubectl logs -l app.kubernetes.io/name=zitadel
# Verify external domain configuration
helm get values acd-manager -o yaml | grep -A 5 zitadel
Resolution:
# Ensure global.hosts.manager[0].host matches zitadel.zitadel.ExternalDomain
# Update values.yaml if needed
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
--values ~/values.yaml
# Restart Zitadel
kubectl rollout restart deployment -l app.kubernetes.io/name=zitadel
Certificate Errors
Symptoms: TLS/SSL errors in browser or API calls.
Diagnosis:
# Check certificate expiration
kubectl get secret <tls-secret> -o jsonpath='{.data.tls\.crt}' | base64 -d | \
openssl x509 -noout -dates
# Check certificate subject
kubectl get secret <tls-secret> -o jsonpath='{.data.tls\.crt}' | base64 -d | \
openssl x509 -noout -subject -issuer
Resolution:
# Renew self-signed certificate
helm upgrade acd-manager /mnt/esb3027/charts/acd-manager \
--values ~/values.yaml \
--set ingress.selfSigned=true
# Or update manual certificate
kubectl create secret tls <secret-name> \
--cert=new-cert.crt --key=new-key.key \
--dry-run=client -o yaml | kubectl apply -f -
# Restart pods to pick up new certificate
kubectl rollout restart deployment <deployment-name>
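It can also help to inspect the certificate actually served at the endpoint, rather than the one stored in the secret, to confirm the new certificate was picked up; a sketch, with the hostname as a placeholder:

```shell
# Show validity dates and subject of the certificate served by the
# ingress endpoint; <hostname> is a placeholder.
echo | openssl s_client -connect <hostname>:443 -servername <hostname> 2>/dev/null | \
  openssl x509 -noout -dates -subject
```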
Log Collection
Collecting Logs for Support
# Capture timestamp once to ensure consistency
TS=$(date +%Y%m%d-%H%M%S)
# Create log collection directory
mkdir -p ~/cdn-logs-$TS
cd ~/cdn-logs-$TS
# Collect pod logs
for pod in $(kubectl get pods -o name); do
kubectl logs $pod > ${pod#pod/}.log 2>&1
kubectl logs $pod -p > ${pod#pod/}.previous.log 2>&1 || true
done
# Collect cluster events
kubectl get events --sort-by='.lastTimestamp' > events.log
# Collect pod descriptions
for pod in $(kubectl get pods -o name); do
kubectl describe $pod > ${pod#pod/}.describe.txt
done
# Compress for transfer
tar czf cdn-logs-$TS.tar.gz *.log *.txt
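The `${pod#pod/}` expansions above strip the `pod/` prefix that `kubectl get pods -o name` emits, so each log file is named after the bare pod name; a standalone illustration (the pod name is an example):

```shell
# `kubectl get pods -o name` emits names like "pod/<name>"; the
# #-expansion removes the shortest leading match of "pod/".
pod="pod/acd-manager-7d9f"   # example value in the -o name format
echo "${pod#pod/}"           # → acd-manager-7d9f
```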
Emergency Procedures
Complete Cluster Recovery
If the cluster is completely down:
Assess node status:
kubectl get nodes
Restart K3s on nodes:
# On each node
systemctl restart k3s
If the primary server failed:
- Promote another server node
- Update load balancer/DNS to point to new primary
Restore from backup if necessary:
- See Upgrade Guide for restore procedures
Data Recovery
For data recovery scenarios:
- PostgreSQL: Use CloudNativePG backup/restore
- Longhorn: Restore from volume snapshots
- Kafka: Replication handles most failures
Getting Help
If issues persist:
- Collect logs using the procedure above
- Check release notes for known issues
- Contact support with log bundle and issue description
Next Steps
After resolving issues:
- Operations Guide - Preventive maintenance procedures
- Configuration Guide - Verify configuration is correct
- Architecture Guide - Understand component dependencies