Quick Definition
Kyverno is a Kubernetes-native policy engine that validates, mutates, and generates Kubernetes resources declaratively. Analogy: Kyverno is like a security guard and style guide for your cluster objects. Formal: It implements admission control policies using Kubernetes CustomResourceDefinitions and admission webhooks.
What is Kyverno?
Kyverno is a Kubernetes policy engine focused on writing policies as Kubernetes resources. Unlike OPA, which requires the general-purpose Rego language, Kyverno uses YAML-native policies that are easier for Kubernetes operators to author and maintain.
Key properties and constraints:
- Declarative policies authored as Kubernetes CRDs.
- Integrates with admission webhook flow for validate and mutate.
- Can generate resources and mutate requests at admission time.
- Policy scope is Kubernetes resources and metadata, not arbitrary external state (except via webhooks or external data sources in some setups).
- Operates inside the control plane as non-privileged pods with RBAC.
- Does not replace runtime security tools for containers or host-level enforcement.
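The declarative model described above can be illustrated with a minimal validate policy. This is a sketch: the policy name, label key, and the choice of Audit mode are illustrative assumptions, not requirements.

```yaml
# Minimal Kyverno ClusterPolicy sketch (names and label key are illustrative).
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Audit   # report only; switch to Enforce to deny requests
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "The label `team` is required."
        pattern:
          metadata:
            labels:
              team: "?*"   # wildcard: any non-empty value
```

Because the policy is itself a Kubernetes object, it can be stored in Git, reviewed like any other manifest, and applied with `kubectl apply`.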
Where it fits in modern cloud/SRE workflows:
- Shift-left policy enforcement in GitOps pipelines.
- Runtime admission control for preventing unsafe changes.
- Automated resource hygiene and guardrails to reduce human error.
- Integration point for compliance, supply chain security, and configuration drift prevention.
Text-only diagram description:
- Users commit YAML to Git.
- CI runs tests and Kyverno CLI policies locally.
- GitOps controller syncs to cluster.
- Kyverno installed in cluster watches policies CRDs.
- Admission webhook intercepts create/update requests.
- Kyverno validates and mutates requests; may generate resources.
- Events and policy violations stream to logging/monitoring systems.
Kyverno in one sentence
A Kubernetes-native policy engine that validates, mutates, and generates cluster resources using declarative policies written as Kubernetes objects.
Kyverno vs related terms
| ID | Term | How it differs from Kyverno | Common confusion |
|---|---|---|---|
| T1 | OPA Gatekeeper | Uses Rego language and ConstraintTemplates instead of YAML policies | People assume Rego is required for policy |
| T2 | Admission Webhook | A mechanism for intercepting API requests, not a policy engine | Confused as interchangeable with Kyverno |
| T3 | Pod Security Standards | Prescriptive security profiles, not a general policy engine | Viewed as a full policy solution |
| T4 | Helm | A package manager for resources, not a policy runtime | Helm hooks sometimes confused for policies |
| T5 | GitOps controllers | Sync tools that do not enforce admission policies | Assumed to enforce runtime policies |
| T6 | Kubernetes RBAC | Access control, not resource mutation or validation | Mistaken as covering policy validation |
| T7 | Policy-as-Code frameworks | A broad concept; Kyverno is specific to the K8s CRD model | People mix tooling and pattern |
Why does Kyverno matter?
Business impact:
- Reduces risk of misconfiguration leading to outages or security incidents.
- Protects revenue by preventing unauthorized resource changes.
- Maintains customer trust by enforcing compliance and data handling rules.
Engineering impact:
- Prevents common class of human errors, reducing incidents.
- Enables higher velocity by automating repetitive checks and fixes.
- Lowers review overhead by codifying guardrails.
SRE framing:
- SLIs: Policy pass rate, policy evaluation latency, policy generation success.
- SLOs: High availability of policy evaluation, low false positive rate.
- Error budgets: Violations allow controlled exceptions rather than system downtime.
- Toil reduction: Automates labeling, namespace quotas, security annotations.
- On-call: Faster triage when policies block deployments; clearer postmortems.
What breaks in production (realistic examples):
- A deployment is created with privileged containers, bypassing runtime security.
- Resource requests are omitted, causing node pressure and OOM kills.
- Public services are exposed without ingress restrictions, leading to data leaks.
- Image registries are changed to untrusted registries, introducing supply-chain risk.
- Namespaces are created without network policies, enabling lateral movement.
Where is Kyverno used?
| ID | Layer/Area | How Kyverno appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Cluster control plane | Admission webhook policies enforcing rules | Policy evaluation latency | Kubernetes API server |
| L2 | Network layer | Enforce network policy labels and generation | Policy violation count | CNI plugins |
| L3 | Service layer | Validate service types and annotations | Rejection rate | Service meshes |
| L4 | Application config | Mutate and validate deployment YAML | Mutation success rate | GitOps controllers |
| L5 | CI/CD pipeline | Pre-commit or CI policy tests | CI job pass rate | CI systems |
| L6 | Observability | Auto-inject sidecar config or labels | Injection success | Monitoring agents |
| L7 | Security/compliance | Enforce image signing or allowed registries | Violation incidents | Vulnerability scanners |
| L8 | Serverless/PaaS | Validate function resource limits and runtime | Deployment blocks | Serverless platforms |
When should you use Kyverno?
When necessary:
- You need cluster-wide declarative guardrails.
- You must enforce compliance controls at admission time.
- You want to automate resource hygiene like labels or network policies.
When optional:
- Small teams with few clusters and manual checks might delay adoption.
- If your policies are highly dynamic and external-state dependent you may design alternatives.
When NOT to use / overuse it:
- For non-Kubernetes resources outside cluster without clear integration.
- As a band-aid for broken CI/CD processes; fix pipelines first.
- When policy logic is complex and external-state heavy enough to need a fully expressive policy language such as Rego.
Decision checklist:
- If you use Kubernetes and want admission-time guardrails -> Use Kyverno.
- If you already run Rego policies and need YAML-native simplicity -> Consider Kyverno.
- If you require policy across diverse non-K8s systems -> Consider centralized policy systems instead.
Maturity ladder:
- Beginner: Validate basic security and naming conventions.
- Intermediate: Mutate resources, auto-generate network policies, and integrate with CI.
- Advanced: Dynamic policies, external data checks, automated remediation, and telemetry-driven SLOs.
How does Kyverno work?
Components and workflow:
- Policy CRDs: Policy, ClusterPolicy, ClusterPolicyReport, PolicyReport.
- Kyverno controller: Watches policies and resources; enforces policies.
- Admission webhook: Intercepts API server admission requests.
- Background controller: Applies policies to existing resources for generate and mutate.
- CLI and test tooling: kyverno CLI to test policies locally and in CI.
Data flow and lifecycle:
- Admin installs Kyverno and defines policies as CRDs.
- API server sends admission requests to Kyverno webhook.
- Kyverno evaluates matching policies: validate, mutate, generate.
- Mutations are applied inline or as patches; validation may allow or deny.
- Generate can create auxiliary resources in target namespaces.
- Background reconciliation ensures policies are enforced for existing resources.
- PolicyReports are emitted and metrics exposed.
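As a sketch of the mutate step in this flow, the following rule adds a label only when it is absent. The `+()` add-if-absent anchor is standard Kyverno pattern syntax; the label key and policy name are hypothetical examples.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-managed-by-label   # illustrative name
spec:
  rules:
    - name: add-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          metadata:
            labels:
              +(managed-by): kyverno   # +() means "add only if missing"
```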
Edge cases and failure modes:
- Webhook downtime could block API requests depending on failurePolicy.
- Mutations that conflict with controllers like operators may race.
- Generate may produce duplicate resources if not idempotent.
- External data dependencies make policies brittle.
Typical architecture patterns for Kyverno
- Centralized guardrail pattern: Single Kyverno instance enforces cluster-wide policies; use when uniform rules required.
- Namespace delegation pattern: ClusterPolicy for base rules and NamespacePolicy for local overrides; use when tenants need autonomy.
- GitOps preflight pattern: Run kyverno CLI in CI to validate before merge; use for strict pipelines.
- Sidecar injection pattern: Mutate pod templates to inject sidecars or env vars; use for observability/security auto-injection.
- Policy-as-Code CI pattern: Policies tested with unit tests and policy reports in CI; use for mature DevSecOps.
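The GitOps preflight and Policy-as-Code CI patterns above can be wired into a pipeline with the Kyverno CLI. This is a hypothetical CI step (GitHub Actions syntax); the directory paths are placeholders for your repository layout.

```yaml
# Hypothetical CI step; policies/ and manifests/ are placeholder paths.
- name: Kyverno preflight
  run: |
    # Evaluate policies against rendered manifests before merge
    kyverno apply policies/ --resource manifests/
    # Run declarative policy unit tests
    kyverno test policies/
```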
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Webhook unavailable | API requests timeout or blocked | Kyverno pods crashed or network | Set failurePolicy to ignore and restore pods | Increased admission latency |
| F2 | Conflicting mutations | Resources keep flipping between states | Multiple controllers mutating same fields | Coordinate owners and use filters | Resource churn metrics |
| F3 | Policy misconfiguration | Legitimate requests denied | Incorrect policy selectors or conditions | Rollback policy and fix tests | Spike in denied requests |
| F4 | Generate duplication | Duplicate generated resources | Non-idempotent generate policy | Use unique names and conditions | Duplicate resource events |
| F5 | Performance degradation | High policy eval time | Very large number of policies or heavy patterns | Optimize policies and use caching | Policy evaluation latency |
| F6 | External dependency failure | Policies reliant on external data fail | Remote service down or slow | Make policies resilient or cache data | Elevated error rates |
Key Concepts, Keywords & Terminology for Kyverno
Glossary. Each entry follows: term – definition – why it matters – common pitfall.
- Policy – Declarative CRD defining rules – Core of Kyverno – Overly broad policies cause false positives
- ClusterPolicy – Cluster-scoped policy – Enforces rules across the cluster – Overuse blocks tenants
- PolicyRule – A single rule within a policy – Granular enforcement – Misconfigured conditions fail silently
- Match – Selector for resources – Targets specific objects – Incorrect match scope blocks resources
- Exclude – Selector to skip resources – Avoids touching system objects – Forgetting excludes for system namespaces
- Validate – Rule type that allows or denies – Prevents bad changes – Strict schemas may break workflows
- Mutate – Rule type that changes requests – Automates defaults – Conflicts with other mutators
- Generate – Rule type that creates resources – Helps bootstrap configs – Can create duplicates if not idempotent
- Background controller – Applies policies to existing resources – Keeps the cluster consistent – Heavy load on large clusters
- Admission webhook – Intercepts API requests – Enables real-time enforcement – Single point of failure if misconfigured
- CLI – Local kyverno tool – Enables preflight testing – Tests may differ from cluster behavior
- PolicyReport – Resource summarizing results – Used for compliance dashboards – Not always emitted for mutations
- ClusterPolicyReport – Cluster-scoped report – Aggregates across namespaces – Volume can be high
- JSON6902 patch – Patch format used in mutate rules – Precise mutations – Fragile if the resource schema changes
- JMESPath – Query language used in conditions – Enables deep matching – Mistyped expressions cause misses
- DataSources – External data used by policies – Enables dynamic checks – External failures affect policies
- Webhook failurePolicy – Behavior when the webhook fails – Impacts availability – Ignore can silently skip enforcement
- ResourceName – Specific resource targeting – Exact control – Hardcoded names reduce reuse
- NamespaceSelector – Match on namespace labels – Multi-tenant targeting – Missing labels cause no match
- Annotation – Metadata used in policies – Lightweight flags – Overloaded annotations create coupling
- Label – Key/value pair used in matching – Primary selector method – Missing labels break policies
- MutatingAdmissionWebhook – Kubernetes webhook type – Enables mutations – Requires TLS and certs
- ValidatingAdmissionWebhook – Kubernetes webhook type – Enables denies – Also requires certs
- Kyverno controller – Main pod running the logic – Executes policy evaluation – Resource constraints affect throughput
- RBAC – Kubernetes access control – Controls what Kyverno may do – Wrong RBAC causes failures
- Kyverno namespace – Namespace where Kyverno runs – Operational scope – An overprivileged namespace is a risk
- AdmissionReview – K8s object representing a request – Input to policies – Complex payloads may be misread
- Dry-run – Non-blocking policy evaluation – Safe testing – May differ from real admission behavior
- Auto-gen labels – Labels Kyverno can add – Helps organization – Label sprawl can occur
- Resource whitelist – List of allowed exceptions – Enables flexibility – Careless exceptions open security gaps
- Sidecar injection – Mutate rule that attaches containers – Automates setup – May increase pod startup time
- ImagePolicy – Checks for allowed registries – Prevents bad images – Too strict blocks legitimate images
- Immutable fields – Fields that cannot change after creation – Important for safety – Mutation attempts get rejected
- Policy ordering – Which rules run first – Affects predictability – Not strictly ordered; avoid dependencies
- Controller leaders – Leader election for the controller – Ensures a single active reconciler – Leader flaps cause temporary issues
- Policy namespace isolation – Running policies per namespace – Supports tenancy – Increased management overhead
- API priority – Order of admission webhooks – Affects interaction with other webhooks – Misordering leads to conflicts
- Metrics endpoint – Prometheus metrics from Kyverno – Essential for SLOs – Unscraped metrics cause blind spots
- Audit mode – Report only, without deny – Safe rollout – May let dangerous changes through at runtime
- Templates – Reusable policy fragments – Reduce duplication – Overly generic templates become hard to reason about
- ResourceTemplates – Used by generate rules – Create supporting objects – Template drift confuses operators
- Mutation patches – Changes applied by mutate rules – Automate defaults – Complex patches are brittle
- Policy lifecycle – Development, testing, rollout, maintenance – Operational hygiene – Neglecting the lifecycle causes drift
- Policy drift – Gap between desired and actual policies – Causes compliance gaps – Monitor and reconcile
How to Measure Kyverno (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Policy evaluation latency | Time to evaluate policies per request | Histogram of admission eval time | P95 < 100ms | High variance under load |
| M2 | Policy pass rate | Percentage requests allowed | allowed / total requests | 99.9% allowed except exceptions | High pass doesn’t equal safe |
| M3 | Mutation success rate | Mutations applied successfully | applied mutations / attempted | 99.9% | Conflicts with other controllers |
| M4 | Deny rate | Percentage denied by policies | denied / total requests | Low single digit percent | Sudden spikes indicate issues |
| M5 | PolicyReport count | PolicyViolations over time | Count of PolicyReport resources | Trending down over time | Flooding when policies too strict |
| M6 | Webhook error rate | Failed admission webhook calls | 5xx webhook responses / total | < 0.1% | Networking issues cause spikes |
| M7 | Background reconcile time | Time to apply background policies | Time per reconcile job | Depends on cluster size | Large clusters longer times |
| M8 | Generated resource count | Number of resources created by generate | Count of generated CRs | Stable baseline | Duplicates inflate counts |
| M9 | Kyverno pod CPU | Resource usage of Kyverno | Pod CPU usage metrics | Provisioned headroom 30% | Underprovisioning causes latency |
| M10 | Kyverno pod memory | Memory usage of Kyverno | Pod memory metrics | Headroom 30% | Memory leaks cause restarts |
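M1 and M6 can be encoded as Prometheus alerting rules. This is a sketch: the metric name follows Kyverno's exported admission-review histogram, but metric names and labels vary by Kyverno version, so verify against your deployment before relying on it.

```yaml
# Sketch of alerting rules for M1 (eval latency) and M6 (webhook errors).
groups:
  - name: kyverno
    rules:
      - alert: KyvernoAdmissionLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(kyverno_admission_review_duration_seconds_bucket[5m])) by (le)
          ) > 0.1   # P95 above the 100ms starting target
        for: 10m
        labels:
          severity: page
```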
Best tools to measure Kyverno
Tool: Prometheus
- What it measures for Kyverno: Metrics like eval latency, errors, pod resource usage.
- Best-fit environment: Kubernetes clusters with Prometheus operator.
- Setup outline:
- Enable Kyverno metrics endpoint.
- Configure ServiceMonitor for scraping.
- Create Prometheus rules for SLIs.
- Strengths:
- Widely adopted and flexible.
- Good integration with alerting.
- Limitations:
- Requires tuning for cardinality.
- Long-term storage needs separate solution.
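The ServiceMonitor from the setup outline could look like the sketch below. The namespace, label selector, and metrics port name are assumptions that depend on how Kyverno was installed (e.g., Helm chart version), so match them to your cluster.

```yaml
# Sketch of a Prometheus Operator ServiceMonitor for Kyverno.
# Selector labels and port name must match your installation.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kyverno
  namespace: kyverno
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kyverno
  endpoints:
    - port: metrics-port   # port name on the Kyverno metrics Service
      interval: 30s
```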
Tool: Grafana
- What it measures for Kyverno: Visualization dashboards for metrics and trends.
- Best-fit environment: Teams using Prometheus or other TSDBs.
- Setup outline:
- Connect to Prometheus datasource.
- Import Kyverno dashboards or craft panels.
- Set up role-based access for stakeholders.
- Strengths:
- Powerful visualization.
- Supports alerting integrations.
- Limitations:
- Dashboards require maintenance.
- Not a metric collector.
Tool: Loki
- What it measures for Kyverno: Kyverno logs for audit and debugging.
- Best-fit environment: Centralized logging on Kubernetes.
- Setup outline:
- Configure Fluentd/Fluent Bit to collect Kyverno logs.
- Index and query in Loki.
- Correlate logs with request IDs.
- Strengths:
- Efficient log aggregation.
- Good for debugging policy decisions.
- Limitations:
- Query performance depends on retention and indexing.
Tool: Kyverno CLI
- What it measures for Kyverno: Local policy tests, dry-run outputs.
- Best-fit environment: CI and developer workstations.
- Setup outline:
- Install kyverno CLI in CI images.
- Run kyverno test and apply in dry-run.
- Fail builds on policy violations.
- Strengths:
- Fast feedback during development.
- Matches cluster policy semantics in many cases.
- Limitations:
- Cluster differences may exist.
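A `kyverno test` run is driven by a declarative test manifest. The sketch below assumes the CLI's Test resource format; the file names, policy name, and rule name are placeholders that must match your actual policy files.

```yaml
# Sketch of a kyverno CLI test manifest (file and rule names are placeholders).
apiVersion: cli.kyverno.io/v1alpha1
kind: Test
metadata:
  name: disallow-privileged-test
policies:
  - disallow-privileged.yaml
resources:
  - privileged-pod.yaml
results:
  - policy: disallow-privileged
    rule: check-privileged
    resources:
      - privileged-pod
    result: fail   # the privileged pod is expected to be denied
```

Failing the CI build when any expected result does not match gives the fast feedback loop described above.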
Tool: PolicyReport consumers (custom DB)
- What it measures for Kyverno: Aggregated policy violations for reporting.
- Best-fit environment: Compliance dashboards and reporting pipelines.
- Setup outline:
- Export PolicyReport CRs to external DB.
- Build dashboards and scheduled reports.
- Strengths:
- Persistent audit trail.
- Compliance-ready records.
- Limitations:
- Requires ETL pipeline.
Recommended dashboards & alerts for Kyverno
Executive dashboard:
- Panels:
- Cluster-wide policy pass rate โ shows health.
- Top violated policies โ business impact.
- Trend of deny rate over 30d โ compliance posture.
- Generated resource count โ potential drift indicator.
- Why: Quick view for leaders on risk and compliance.
On-call dashboard:
- Panels:
- Recent deny events with user and resource.
- Kyverno pod health and restarts.
- Admission latency P50/P95/P99.
- Webhook error rate.
- Why: Rapid triage for incidents impacting deployments.
Debug dashboard:
- Panels:
- Detailed logs for recent admission requests.
- Policy evaluation traces for specific requests.
- Background reconcile job timings.
- Resource churn and duplicate generation.
- Why: Deep investigation during root cause analysis.
Alerting guidance:
- Page vs ticket:
- Page: Webhook error rate spike, Kyverno pod crashlooping, P99 eval latency large causing blocking.
- Ticket: Increasing deny rate trend without operational impact, policy report growth for low-severity issues.
- Burn-rate guidance:
- When deny rate consumes error budget for deployment velocity, consider rolling back policy or temporary allow list.
- Noise reduction tactics:
- Deduplicate alerts by resource owner.
- Group related violations by policy and namespace.
- Use suppression windows for planned migrations.
Implementation Guide (Step-by-step)
1) Prerequisites
- Kubernetes cluster with admission webhook capability.
- RBAC configured for the Kyverno service account.
- Monitoring stack to collect metrics and logs.
- GitOps or CI workflows for the policy lifecycle.
2) Instrumentation plan
- Expose Kyverno metrics and scrape them with Prometheus.
- Collect logs and route them to centralized logging.
- Export PolicyReports to a compliance DB.
3) Data collection
- Capture admission audit logs.
- Record PolicyReport events.
- Store background reconcile metrics.
4) SLO design
- Define SLIs for policy evaluation latency and success rate.
- Draft SLOs with error budgets for deny rate.
- Align with business needs for deployment velocity.
5) Dashboards
- Create executive, on-call, and debug dashboards.
- Add panels for policy-specific metrics.
6) Alerts & routing
- Configure alerts for webhook errors and high latency.
- Route to platform on-call and security on-call based on policy category.
7) Runbooks & automation
- Create runbooks for policy denial triage and policy rollback.
- Automate remediation for common housekeeping violations.
8) Validation (load/chaos/game days)
- Run chaos tests that simulate Kyverno pod restarts.
- Load test with synthetic admission requests to verify latency.
- Conduct policy game days to exercise denial scenarios.
9) Continuous improvement
- Review PolicyReports weekly.
- Iterate on policies based on false positives and developer feedback.
- Maintain policy tests in CI.
Pre-production checklist
- Test policies in dry-run with kyverno CLI.
- Validate policy behavior in staging cluster.
- Ensure monitoring and alerts configured.
- Run performance tests to validate latency.
Production readiness checklist
- RBAC least privilege for Kyverno.
- Backup of policies and configuration.
- Monitoring, dashboards, and alerting active.
- Runbooks exist and on-call trained.
Incident checklist specific to Kyverno
- Identify impacted namespaces and resources.
- Check Kyverno pod status and logs.
- Determine if webhook is reachable from API server.
- If necessary, set the webhook failurePolicy to Ignore to restore API flow.
- Revert recent policy changes and test.
Use Cases of Kyverno
- Enforce image registry allowlist – Context: Prevent untrusted images. – Problem: Developers pull images from public registries. – Why Kyverno helps: Validates image fields and denies disallowed registries. – What to measure: Deny rate and blocked deployments. – Typical tools: Kyverno, registry scanners.
- Auto-inject sidecars for observability – Context: Ensure consistent telemetry. – Problem: Teams forget to add exporters. – Why Kyverno helps: Mutate pod spec to add sidecar. – What to measure: Injection success rate and pod start time. – Typical tools: Kyverno, Prometheus.
- Enforce resource requests and limits – Context: Prevent noisy neighbor issues. – Problem: Pods without resources cause node pressure. – Why Kyverno helps: Validate or set defaults for CPU and memory. – What to measure: Number of pods missing requests, OOM events. – Typical tools: Kyverno, cluster autoscaler metrics.
- Generate network policies per namespace – Context: Zero trust networking. – Problem: Lack of network policy leaves lateral access open. – Why Kyverno helps: Generate default network policies on namespace creation. – What to measure: Generated policy count and connectivity tests. – Typical tools: Kyverno, CNI plugin.
- Enforce naming and label conventions – Context: Asset management and cost allocation. – Problem: Missing cost center labels. – Why Kyverno helps: Mutate resources to add labels or deny creation. – What to measure: Percentage of resources with required labels. – Typical tools: Kyverno, billing exporters.
- Prevent privileged containers – Context: Security posture improvement. – Problem: Privileged containers escape isolation. – Why Kyverno helps: Validate PodSecurity settings or deny privileged containers. – What to measure: Deny rate for privileged pods. – Typical tools: Kyverno, runtime security tools.
- Enforce Pod Security Standards – Context: Align with security benchmarks. – Problem: Teams bypass pod security profiles. – Why Kyverno helps: Validate against Pod Security Standard profiles. – What to measure: Compliance rate to profile. – Typical tools: Kyverno, compliance dashboards.
- Integrate with CI for preflight checks – Context: Shift-left enforcement. – Problem: Developers get blocked in production. – Why Kyverno helps: Run kyverno tests in CI to catch issues before merge. – What to measure: CI policy failure rate and time to fix. – Typical tools: Kyverno CLI, CI pipelines.
- Automate namespace hygiene – Context: Multi-tenant cluster management. – Problem: Orphaned resources and missing quotas. – Why Kyverno helps: Generate quotas, limits, and labels on namespace creation. – What to measure: Namespace violations and quota exhaustion events. – Typical tools: Kyverno, GitOps controllers.
- Enforce network ingress restrictions – Context: Data exfiltration protection. – Problem: Services exposed publicly by accident. – Why Kyverno helps: Validate Service types and ingress hostnames. – What to measure: Count of public services blocked. – Typical tools: Kyverno, ingress controllers.
Scenario Examples (Realistic, End-to-End)
Scenario #1 – Prevent Privileged Containers (Kubernetes)
Context: Multi-tenant cluster with strict runtime security needs.
Goal: Prevent creation of pods with securityContext.privileged: true.
Why Kyverno matters here: Blocks risky containers at admission time to reduce attack surface.
Architecture / workflow: Kyverno admission webhook intercepts Pod creates and updates. Policy validates securityContext.
Step-by-step implementation:
- Author ClusterPolicy with validate rule matching pods.
- Set deny with message and required fields.
- Deploy policy to cluster.
- Add policy tests in CI.
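The ClusterPolicy from the steps above could look like the following sketch, based on the common disallow-privileged pattern; the policy name, rule name, and message are illustrative.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-privileged
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            # =() applies the check only when the field exists; a similar
            # entry is needed for initContainers and ephemeralContainers.
            containers:
              - =(securityContext):
                  =(privileged): "false"
```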
What to measure: Deny rate, blocked deployment attempts, on-call incidents.
Tools to use and why: Kyverno for enforcement, Prometheus for metrics, Grafana for dashboards.
Common pitfalls: Overly broad match blocking system pods.
Validation: Deploy a privileged pod in staging and ensure admission denial.
Outcome: Privileged pods blocked, reduced runtime risk.
Scenario #2 – Auto-generate Network Policies on Namespace Creation (Serverless/Managed PaaS)
Context: Managed PaaS where developers provision serverless functions as Kubernetes pods.
Goal: Automatically generate default deny network policy per namespace.
Why Kyverno matters here: Ensures consistent network isolation without manual steps.
Architecture / workflow: Generate policy triggers on Namespace create to create NetworkPolicy resources.
Step-by-step implementation:
- Create ClusterPolicy with generate rule targeting namespace creation.
- Define NetworkPolicy template referencing namespace metadata.
- Deploy policy and test in staging.
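A sketch of the generate rule, mirroring the well-known default-deny NetworkPolicy pattern; the policy and resource names are illustrative.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-deny
spec:
  rules:
    - name: default-deny-ingress
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny-ingress
        namespace: "{{request.object.metadata.name}}"  # the new namespace
        synchronize: true   # keep the generated resource in sync with the policy
        data:
          spec:
            podSelector: {}        # selects all pods in the namespace
            policyTypes:
              - Ingress
```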
What to measure: Generated policy count, connectivity tests, function failures.
Tools to use and why: Kyverno, CNI for network enforcement, integration tests.
Common pitfalls: Generated policies blocking required control plane access.
Validation: Create namespace and run connectivity tests to required services.
Outcome: Default network isolation applied consistently.
Scenario #3 – Incident Response: Policy-Caused Outage (Postmortem)
Context: A deny policy accidentally blocked configmap updates causing app failure.
Goal: Triage, mitigate, and prevent recurrence.
Why Kyverno matters here: Policy decisions can impact production behavior and must be part of runbooks.
Architecture / workflow: Kyverno denies update requests; GitOps controller fails to sync.
Step-by-step implementation:
- Detect spike in denied requests via alerts.
- Identify offending policy and namespace.
- Rollback or set policy to audit/dry-run.
- Apply fix and re-enable enforcement.
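The "set policy to audit" step above is a one-field change on the offending policy; a sketch of the edited spec fragment:

```yaml
# Emergency de-escalation: switch the failing policy from blocking to reporting.
spec:
  validationFailureAction: Audit   # was Enforce; violations are now reported, not denied
```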
What to measure: Time to detect, time to restore, number of failed reconciliations.
Tools to use and why: Logs, PolicyReports, GitOps controller logs.
Common pitfalls: Not including stakeholders in policy change approvals.
Validation: Postmortem and game day to rehearse rollback.
Outcome: Improved policy review process and emergency rollback runbook.
Scenario #4 – Cost/Performance Trade-off: Resource Requests Defaulting
Context: Teams do not set requests leading to inefficient bin-packing and OOMs.
Goal: Mutate pods to add default requests or deny missing values.
Why Kyverno matters here: Automate defaults to balance density and stability.
Architecture / workflow: Kyverno mutate rule adds resource requests if missing.
Step-by-step implementation:
- Define mutate policy adding requests based on pod labels.
- Apply in dry-run then enforce.
- Monitor node utilization and OOM rates.
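A sketch of the mutate rule from the steps above. The default values and policy name are illustrative assumptions and should be tuned per workload class.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-requests
spec:
  rules:
    - name: set-container-requests
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            containers:
              - (name): "*"           # conditional anchor: apply to every container
                resources:
                  requests:
                    +(cpu): "100m"    # +() adds the value only when it is missing
                    +(memory): "128Mi"
```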
What to measure: Pod OOM rate, node utilization, denied/mutated pods.
Tools to use and why: Kyverno, Prometheus, cluster autoscaler metrics.
Common pitfalls: Default requests too low or too high causing cost or instability.
Validation: Load test workloads to observe behavior under defaults.
Outcome: Improved stability with monitored cost trade-offs.
Common Mistakes, Anti-patterns, and Troubleshooting
Each entry below follows: symptom -> root cause -> fix.
- Symptom: Legitimate deployments denied. -> Root cause: Overbroad match selectors. -> Fix: Narrow the match scope and add excludes.
- Symptom: API calls blocked cluster-wide. -> Root cause: Webhook crashloop or certs expired. -> Fix: Restore Kyverno pods and rotate webhook certs.
- Symptom: Pod specs flip between values. -> Root cause: Conflicting mutate controllers. -> Fix: Coordinate mutation owners and use immutable fields.
- Symptom: Duplicate generated resources. -> Root cause: Non-idempotent generate rules. -> Fix: Use ownership annotations and conditional checks.
- Symptom: High admission latency. -> Root cause: Too many complex policies or heavy external checks. -> Fix: Simplify policies and cache external data.
- Symptom: PolicyReport explosion. -> Root cause: Policies too strict or running in background without filters. -> Fix: Add severity filters and limit scope.
- Symptom: False negatives in CI tests. -> Root cause: Kyverno CLI version mismatch with cluster. -> Fix: Align CLI and cluster Kyverno versions.
- Symptom: Developers bypass policies. -> Root cause: Lack of CI preflight enforcement. -> Fix: Add kyverno tests to CI and block merges.
- Symptom: Missing metrics for policy evaluation. -> Root cause: Metrics not exposed or scraped. -> Fix: Enable metrics and configure ServiceMonitor.
- Symptom: Silent policy drift. -> Root cause: No lifecycle governance. -> Fix: Establish policy review cadence and GitOps source of truth.
- Symptom: Network policies block control plane. -> Root cause: Generated policies too restrictive. -> Fix: Add required exceptions and validate connectivity.
- Symptom: High false positive denials. -> Root cause: Misunderstood JSON paths or JMESPath queries. -> Fix: Test queries against sample payloads.
- Symptom: Kyverno pod OOMs. -> Root cause: Underprovisioned memory. -> Fix: Increase resource limits and investigate memory usage.
- Symptom: Long background reconcile times. -> Root cause: Large cluster with broad policy scope. -> Fix: Use namespace selectors and optimize filters.
- Symptom: Audit mode never flipped to enforce. -> Root cause: Change management gaps. -> Fix: Define rollout plan and automation for promote to enforce.
- Symptom: Certificate renewal failures. -> Root cause: Incorrect cert manager config. -> Fix: Inspect cert manager logs and rotate certs manually if needed.
- Symptom: Alerts for low severity violations. -> Root cause: No alert grouping thresholds. -> Fix: Use aggregation and suppression for noisy policies.
- Symptom: Confusing policy ownership. -> Root cause: No clear owners for policies. -> Fix: Add labels for owners and maintainers.
- Symptom: Slow CI due to policy tests. -> Root cause: Running full cluster tests in every commit. -> Fix: Run lightweight checks pre-commit and full tests nightly.
- Symptom: Policy changes cause outages. -> Root cause: Lack of staged rollout and canary for policies. -> Fix: Roll out in audit mode, then stagger enforce across namespaces.
Observability pitfalls to avoid:
- Not scraping Kyverno metrics.
- Missing PolicyReport export.
- Logs not correlated with request IDs.
- Dashboards lacking context like recent policy versions.
- Alerting configured on raw counts without grouping.
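The first pitfall, unscraped metrics, is usually a one-object fix when the Prometheus Operator is in use. A sketch of a ServiceMonitor; the service labels, port name, and namespace below match a default Helm install of Kyverno but should be verified against your cluster:

```yaml
# Illustrative ServiceMonitor for Kyverno metrics. Requires the
# Prometheus Operator; selector labels and port name vary by install.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kyverno-metrics
  namespace: kyverno
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kyverno
  endpoints:
    - port: metrics-port   # confirm the port name on the metrics Service
      interval: 30s
```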
Best Practices & Operating Model
Ownership and on-call:
- Platform team owns Kyverno installation and core ClusterPolicies.
- App teams own namespace-scoped policies.
- On-call rotation includes platform on-call for webhook or policy outages.
Runbooks vs playbooks:
- Runbook: Operational steps for restoring API flow when webhook fails.
- Playbook: Policy design and review process for proposing new policies.
Safe deployments:
- Start in audit/dry-run mode.
- Canary policies in a small set of namespaces.
- Auto-rollback if denial or error budget thresholds exceeded.
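The audit-then-enforce progression above can live entirely in Git. One way to sketch it with Kustomize, so the enforce overlay is a small patch rather than a duplicated policy (file paths and the policy name are illustrative):

```yaml
# overlays/enforce/kustomization.yaml (illustrative): promotes a base
# policy from Audit to Enforce via a JSON6902 patch.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      group: kyverno.io
      version: v1
      kind: ClusterPolicy
      name: require-resource-limits
    patch: |-
      - op: replace
        path: /spec/validationFailureAction
        value: Enforce
```

Because the patch is the only difference between stages, a rollback is a Git revert of the overlay, which pairs naturally with the auto-rollback threshold mentioned above.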
Toil reduction and automation:
- Automate label injection, quotas, and network policy generation.
- Automate PolicyReport export and weekly hygiene reports.
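Label injection, listed above as an automation target, is a one-rule mutate policy. A minimal sketch; the label key and default value are illustrative:

```yaml
# Illustrative mutate policy: adds an owner label to Pods that lack one.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-owner-label
spec:
  rules:
    - name: set-default-owner
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          metadata:
            labels:
              +(owner): platform-team   # +() anchor: only added if absent
```

The `+()` conditional anchor means the mutation never overwrites a label a team has already set, which keeps a single clear mutator per field.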
Security basics:
- Least privilege RBAC for Kyverno service account.
- Ensure webhook TLS certificates rotated.
- Harden Kyverno pods with resource limits and PodSecurity.
Weekly/monthly routines:
- Weekly: Review PolicyReports and top violations.
- Monthly: Review policies for relevance and remove stale ones.
- Quarterly: Policy audit and RBAC review.
What to review in postmortems related to Kyverno:
- Timeline of policy changes and approvals.
- Which policies triggered the outage and why.
- Why rollout strategy failed and remediation steps.
- Actions to improve policy testing and deployment.
Tooling & Integration Map for Kyverno
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Monitoring | Collects Kyverno metrics | Prometheus, Grafana | Scrape the metrics endpoint |
| I2 | Logging | Aggregates Kyverno logs | Fluentd, Loki | Correlate with request IDs |
| I3 | CI | Runs policy tests in pipelines | GitHub Actions, GitLab CI | Use the kyverno CLI |
| I4 | GitOps | Stores policies as code | Argo CD, Flux | Policies applied via Git |
| I5 | CNI | Enforces generated network policies | Calico, Cilium | Enforcement depends on CNI support |
| I6 | Cert management | Manages webhook certificates | cert-manager | TLS is required for the webhook |
| I7 | Security scanners | Image and vulnerability scans | Trivy, Clair | Combine with image policies |
| I8 | Incident mgmt | Alert routing and paging | PagerDuty, Opsgenie | Route Kyverno alerts |
| I9 | Policy reports DB | Stores PolicyReport history | External DB pipeline | For compliance reporting |
| I10 | Secrets manager | Validates secret labels or injects secrets | Vault, Sealed Secrets | Integrate for secret policies |
Frequently Asked Questions (FAQs)
What is Kyverno best used for?
Kyverno is best for Kubernetes-native policy enforcement, covering validate, mutate, and generate use cases.
Can Kyverno replace OPA Gatekeeper?
Kyverno can replace many admission-policy use cases but differences in languages and ecosystems matter for complex Rego logic.
Does Kyverno require cluster admin to deploy?
Typically yes for initial installation because it needs webhook configuration and RBAC.
How do I test policies before rollout?
Use kyverno CLI in dry-run mode and run policies in CI against representative manifests.
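The Kyverno CLI's `test` command reads a declarative test file, so CI assertions are themselves YAML. A minimal sketch (the file paths, policy name, and rule name are illustrative, and the test schema varies slightly across CLI versions):

```yaml
# kyverno-test.yaml (illustrative): asserts a known-bad manifest fails.
# Run with: kyverno test .
apiVersion: cli.kyverno.io/v1alpha1
kind: Test
metadata:
  name: require-resource-limits-tests
policies:
  - policies/require-resource-limits.yaml
resources:
  - resources/pod-without-limits.yaml
results:
  - policy: require-resource-limits
    rule: check-limits
    resources:
      - pod-without-limits
    result: fail
```

Checking the negative case (a manifest that should be denied) is what catches the false-negative drift described in the troubleshooting section.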
Will a Kyverno outage block my API server?
If failurePolicy is not set to Ignore, webhook unavailability can block requests; configure it according to your availability needs.
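Failure behavior can be set per policy; a sketch of a fail-open fragment (note the field location has moved across Kyverno releases, so check the docs for your version):

```yaml
# Illustrative fragment: fail-open so webhook downtime does not block
# admission. Trade-off: policies are skipped while Kyverno is unavailable.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  failurePolicy: Ignore   # newer releases nest this under webhookConfiguration
  rules: []               # rules elided for brevity
```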
How do I avoid conflicting mutations?
Coordinate mutation ownership, scope match blocks narrowly, and prefer a single mutator per field.
Can Kyverno generate resources across namespaces?
Generate can create resources in target namespaces; be careful with ownership and idempotency.
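Generate rules name their target namespace explicitly. A sketch that creates a default-deny NetworkPolicy in every new namespace; `synchronize: true` keeps the generated object reconciled, which covers the idempotency concern:

```yaml
# Illustrative generate policy: default-deny ingress per new namespace.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-deny
spec:
  rules:
    - name: default-deny-ingress
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny-ingress
        namespace: "{{request.object.metadata.name}}"
        synchronize: true
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
```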
Is Kyverno suitable for multi-cluster?
Kyverno can be used per cluster; multi-cluster orchestration typically handled by GitOps or central controllers.
How to mitigate policy performance issues?
Optimize policy matches, reduce expensive conditions, and monitor evaluation latency.
How do I handle exceptions?
Use resource whitelists, annotations or labels to exclude certain resources from rules.
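Newer Kyverno releases (1.9+) also provide a dedicated PolicyException resource, which keeps exceptions auditable instead of scattering annotations across workloads. A sketch; the names, namespace, and API version are illustrative and version-dependent:

```yaml
# Illustrative PolicyException: exempts one Deployment from one rule.
# The apiVersion for PolicyException has changed across releases.
apiVersion: kyverno.io/v2
kind: PolicyException
metadata:
  name: legacy-app-limits-exception
  namespace: legacy
spec:
  exceptions:
    - policyName: require-resource-limits
      ruleNames:
        - check-limits
  match:
    any:
      - resources:
          kinds:
            - Deployment
          namespaces:
            - legacy
          names:
            - legacy-app
```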
Can Kyverno read external data during evaluation?
Kyverno can pull external data via context variables (ConfigMaps, Kubernetes API calls, and service calls), but heavy reliance on external services makes policies slower and more brittle.
How should policies be versioned?
Store policies in Git and use GitOps workflows with CI testing and staged rollouts.
What metrics are essential for Kyverno SLOs?
Policy evaluation latency, deny rate, webhook error rate, and mutation success rate.
How do I recover from accidental blocking policy?
Roll back the policy, set it to dry-run, or temporarily set failurePolicy to ignore depending on impact.
Are there any security concerns with Kyverno?
Ensure RBAC least privilege, webhook cert rotation, and audit policy changes.
How to scale Kyverno for large clusters?
Shard policies, narrow match selectors, increase Kyverno replicas and resource allocations.
Does Kyverno support multi-tenant policies?
Yes, via namespace selectors and ClusterPolicies with careful scoping.
What is the best approach for policy lifecycle?
Develop in Git, test in CI, stage in audit mode, promote to enforce with canary.
Conclusion
Kyverno provides a pragmatic, Kubernetes-native path to implement admission-time guardrails, automate configuration hygiene, and improve security posture. When integrated with CI, observability, and a mature operating model, Kyverno reduces incidents and supports velocity. Start small, iterate policies, and instrument everything.
Next 7 days plan:
- Day 1: Install Kyverno in a staging cluster and enable metrics.
- Day 2: Write one validate and one mutate policy and test in dry-run.
- Day 3: Integrate kyverno CLI into your CI preflight checks.
- Day 4: Create basic dashboards for policy pass rate and latency.
- Day 5: Run a policy game day to simulate a denial incident.
- Day 6: Promote one audited policy to enforce in a canary namespace.
- Day 7: Review violations and denials, then document the wider rollout plan.
Appendix โ Kyverno Keyword Cluster (SEO)
- Primary keywords
- Kyverno
- Kyverno policies
- Kyverno tutorial
- Kyverno examples
- Kyverno guide
- Kyverno admission controller
- Kyverno mutate
- Kyverno validate
- Kyverno generate
- Kyverno CRD
Secondary keywords
- Kubernetes policy engine
- admission webhook Kyverno
- Kyverno vs Gatekeeper
- Kyverno best practices
- Kyverno metrics
- Kyverno SLOs
- Kyverno troubleshooting
- Kyverno CI integration
- Kyverno GitOps
- Kyverno CLI
Long-tail questions
- How to write a Kyverno policy for resource limits
- How Kyverno mutates Kubernetes pods
- How to test Kyverno policies in CI
- How to measure Kyverno performance
- What to do when Kyverno blocks deployments
- How to auto-generate network policies with Kyverno
- How to enforce image registries with Kyverno
- How to debug Kyverno admission webhooks
- How to integrate Kyverno with Prometheus
- How to roll out Kyverno policies safely
Related terminology
- ClusterPolicy
- PolicyReport
- Background controller
- JSON6902 patch
- JMESPath queries
- PodSecurity Standards
- MutatingAdmissionWebhook
- ValidatingAdmissionWebhook
- Policy lifecycle
- Dry-run mode
- Audit mode
- PolicyReport exporter
- NamespaceSelector
- Label-based matching
- ResourceTemplates
- Kyverno metrics endpoint
- Webhook certificate rotation
- Policy ownership labels
- Canary policy rollout
- Policy audit cadence
