Building a Multi-DC Zabbix Environment: Rights, Wrongs, and Everything in Between

A few years back I had the opportunity to present at a Zabbix conference about something we'd been grinding through at Sentia (whose NL branch has since been sold off and become part of Accenture): building a large-scale, multi-datacenter Zabbix monitoring environment from scratch, focusing as much as possible on open-source tooling. This post is a write-up of that talk, covering the decisions we made, the problems we hit, and what we'd do differently.

The Goal: 99.99% Uptime

The whole project was driven by one deceptively simple goal: maximum availability. That breaks down into a bunch of concrete requirements:

Uncompromised access (users can always reach monitoring)
Data duplication (no single point of data loss)
Multi-component redundancy (no single point of failure in the stack)
Server duplication across datacenters
A solid failover and failback methodology
Fast inter-component communication

The target was 99.99% uptime. And honestly? We believed it was attainable.
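
To make that target concrete, here is a small arithmetic sketch of the downtime budget a given availability percentage leaves you. The function name and the 365-day year are illustrative choices, not something from the original talk.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the allowed downtime in minutes for an availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100.0)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        per_year = downtime_budget_minutes(target)
        per_month = downtime_budget_minutes(target, period_days=30)
        print(f"{target}% -> {per_year:.2f} min/year, {per_month:.2f} min/30 days")
```

At 99.99% the budget works out to roughly 52.6 minutes per year, which is why every failover step in the design has to be fast and automatic.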

The high-level idea was to have two datacenters — AM6 and GS — where traffic naturally flows to AM6, but if AM6 goes down, everything seamlessly switches to GS.

flowchart TD
    U[User / Device] --> Q{AM6 Online?}
    Q -->|Yes| AM6_LB[AM6-MON-LB]
    AM6_LB --> AM6[AM6 Monitoring Platform Stack]
    Q -->|No| GS_LB[GS-MON-LB]
    GS_LB --> GS[GS Monitoring Platform Stack]
    AM6 -. sync .-> GS

What We Found First: The Discovery Phase

Before building anything, we had to understand what we were replacing. And what we found was... a mess.

No standards whatsoever. The environment had Zabbix in some places, Nagios in others, PRTG, AWS native monitoring, Azure Monitor, IBM Radar, WhatsUp Gold, System Center Operations Manager, custom scripts — you name it, it was there. No ticketing framework, no unified notification system, no centralization at all.

The scale was real though:

4,500+ hosts being monitored
650,000+ items collected
3TB+ of data per year
250+ users depending on this

Consolidating all of that into one coherent platform was the actual challenge. Choosing Zabbix as the backbone wasn't hard — it was already the most capable open-source option in the mix.

What We Wanted: The Wishlist

Once we knew what we were dealing with, we sat down and wrote the wishlist for the new platform:

Start fresh, no legacy cruft
Cover all features people actually use
Open-source as much as possible
Multi-datacenter is non-negotiable
Multi-active databases for failover
Keep 10 years of data (yes, really)
Push notifications to multiple systems
Capture metrics from everything in operations
Dashboards everywhere (Grafana)
Elasticsearch for long-term storage
In-house integration tooling
Automation via Ansible

The stack we landed on: Zabbix + Elasticsearch + Percona MySQL + Grafana + Ansible.
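
To put the ten-year retention wish in perspective against the roughly 3 TB per year found during discovery, a back-of-the-envelope projection helps. This is only a sketch: it assumes flat growth, and the `copies` parameter (one full copy per datacenter) is an illustrative assumption rather than a measured figure.

```python
def retention_footprint_tb(tb_per_year: float, years: int, copies: int = 1) -> float:
    """Estimate raw storage needed to keep `years` of history in `copies` places."""
    if tb_per_year < 0 or years < 0 or copies < 1:
        raise ValueError("tb_per_year and years must be >= 0, copies must be >= 1")
    return tb_per_year * years * copies


if __name__ == "__main__":
    single = retention_footprint_tb(3, 10)
    both_dcs = retention_footprint_tb(3, 10, copies=2)  # assumed: one copy per DC
    print(f"one copy: {single} TB, two datacenters: {both_dcs} TB")
```

Even before indexes, replicas, and growth, that is 30 TB per copy, which explains why long-term storage got its own component in the stack.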

Storage: Elasticsearch or MySQL?

Zabbix at the time (4.x) let you split storage between MySQL and Elasticsearch. The idea was to use MySQL for config/short-term data and offload historical time-series to Elasticsearch.

flowchart LR
    Z[Zabbix] --> ES[Elasticsearch]
    Z --> P[Percona MySQL]

Testing this was straightforward. But production was a different story — there were limitations and broken promises that we had to work around.

The Failover Architecture

We went from a simple two-node failover concept to a more resilient four-node cross-datacenter setup.

Simple version (two nodes):

flowchart LR
    AM6[AM6 - Node Cluster 01] -->|failover| GS[GS - Node Cluster 02]

What we actually built (four nodes, multiple failover paths):

flowchart TB
    subgraph AM6["Datacenter AM6"]
        A1[Node Cluster 01]
        A2[Node Cluster 02]
    end
    subgraph GS["Datacenter GS"]
        G1[Node Cluster 01]
        G2[Node Cluster 02]
    end
    A1 -->|First Failover| A2
    A2 -->|Second Failover| G2
    G2 -->|Third Failover| G1

Running Percona MySQL in Active/Active mode alongside Elasticsearch in a cross-datacenter setup is entirely feasible; it just takes work.
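
The failover order in the four-node diagram can be read as a simple priority list: stay local first, then cross to the other datacenter. The sketch below models only that ordering. The node names are hypothetical labels for the four boxes in the diagram, and in the real setup the decision is made by the cluster manager, not by a script like this.

```python
from typing import Iterable, Optional, Sequence

# Priority order read off the diagram: AM6 node 01, AM6 node 02, GS node 02, GS node 01.
# The names are hypothetical labels, not real hostnames.
FAILOVER_ORDER = ("AM6-01", "AM6-02", "GS-02", "GS-01")


def pick_active(healthy: Iterable[str],
                order: Sequence[str] = FAILOVER_ORDER) -> Optional[str]:
    """Return the highest-priority healthy node, or None if all are down."""
    healthy_set = set(healthy)
    for node in order:
        if node in healthy_set:
            return node
    return None


if __name__ == "__main__":
    print(pick_active({"AM6-01", "AM6-02", "GS-01", "GS-02"}))  # all healthy
    print(pick_active({"GS-01", "GS-02"}))                      # AM6 lost entirely
```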

Data Independence: The Key Design Principle

One of the most important architectural decisions was data independence between datacenters. The idea: if AM6 explodes, GS keeps running without skipping a beat. And when AM6 comes back, re-syncing should be easy.

flowchart LR
    subgraph AM6["Datacenter AM6"]
        ZA[Zabbix] --> ESA[Elasticsearch]
        ZA --> PA[Percona]
    end
    subgraph GS["Datacenter GS"]
        ESG[Elasticsearch] --> ZG[Zabbix]
        PG[Percona] --> ZG
    end
    ESA -->|replicate| ESG
    PA -->|replicate| PG

Each DC runs its own full stack. The databases replicate between them. If the link dies, both sides carry on independently.

Networking: The Stretched VLAN

The network setup was key to making all of this work. Each DC has its own local VLAN, but all internal component communication runs over a stretched VLAN that spans both datacenters.

flowchart TB
    subgraph AM6["Datacenter AM6 - VMware / OpenStack"]
        AM6_IEP[Internet Entry Point]
        AM6_VLAN[Local VLAN 10.x.1.0]
    end
    subgraph GS["Datacenter GS - VMware / OpenStack"]
        GS_IEP[Internet Entry Point]
        GS_VLAN[Local VLAN 10.x.2.0]
    end
    STRETCHED[Stretched VLAN 172.x.99.0]
    AM6_VLAN --- STRETCHED
    GS_VLAN --- STRETCHED

All internal traffic between Zabbix, MySQL, Elasticsearch etc. goes through the stretched VLAN. This is what makes cross-DC clustering and replication work without needing complex routing.

Two Zabbix Servers Per Datacenter

Here's something that took some design thought: we needed two Zabbix server instances per datacenter — one for infrastructure monitoring and one for client/team operations. They share the same Elasticsearch and Percona backends but are logically separated.

flowchart TD
    subgraph "Per Datacenter"
        ZI[Zabbix - Infrastructure]
        ZC[Zabbix - Teams / Clients]
        ES[Elasticsearch]
        P[Percona MySQL]
        ZI --> ES
        ZI --> P
        ZC --> ES
        ZC --> P
    end

MySQL was fine with this — you can point each Zabbix instance to a different database name, port, or host. Elasticsearch, on the other hand, originally only supported one server and one index. That was a problem.

The Elasticsearch Index Prefix Problem (and the Fix)

With two Zabbix servers writing to the same Elasticsearch cluster, the indices would collide. The solution was an index prefix per Zabbix instance:

flowchart TD
    ZI[Zabbix Infrastructure] -->|index: infra-uint-dd-mm-yyyy| ES[Elasticsearch]
    ZC[Zabbix Teams/Clients] -->|index: clients-uint-dd-mm-yyyy| ES

This required a patch to Zabbix itself (tracked as ZBXNEXT-4968). The $HISTORY_PREFIX variable gets added to the frontend config:

$HISTORY_PREFIX = 'infra'; // **## SENTIA PATCH ONLY ##**
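
To illustrate why a prefix is enough to keep the two instances apart, here is a minimal sketch of the naming scheme shown in the diagram labels (prefix, value type, date). It is not the patch itself: the function name is hypothetical, and the date layout simply follows the `dd-mm-yyyy` placeholder used above.

```python
from datetime import date

# Value types Zabbix stores in Elasticsearch.
VALUE_TYPES = ("uint", "dbl", "str", "log", "text")


def history_index(prefix: str, value_type: str, day: date) -> str:
    """Build an index name of the form <prefix>-<type>-dd-mm-yyyy."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    if value_type not in VALUE_TYPES:
        raise ValueError(f"unknown value type: {value_type!r}")
    return f"{prefix}-{value_type}-{day:%d-%m-%Y}"


if __name__ == "__main__":
    today = date.today()
    # Same type and day, different instance: the names no longer collide.
    print(history_index("infra", "uint", today))
    print(history_index("clients", "uint", today))
```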

Elasticsearch in Production: The Problems

Once we had volume going through Elasticsearch, we started seeing errors:

cannot get values from elasticsearch, HTTP status code: 503
cannot get values from elasticsearch, HTTP status code: 429
cannot get values from elasticsearch, HTTP status code: 404

503 = Service Unavailable
429 = Too Many Requests (rate limiting)
404 = Index not found
400 = Bad Request
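
The status codes above fall into two groups: transient ones where waiting and retrying can help, and ones where the request itself or the target index is the problem. The sketch below shows one way a client could treat them. The category names and the backoff parameters are assumptions for illustration; this is not how Zabbix itself handles these responses.

```python
from typing import List

# 429 and 503 indicate overload or temporary unavailability, so retrying may help.
TRANSIENT = {429, 503}


def classify(status: int) -> str:
    """Map an Elasticsearch HTTP status to a coarse handling category."""
    if 200 <= status < 300:
        return "ok"
    if status in TRANSIENT:
        return "retry"
    if status == 404:
        return "missing-index"
    if status == 400:
        return "bad-request"  # retrying the same request will not help
    return "other"


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> List[float]:
    """Exponential backoff delays in seconds, capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


if __name__ == "__main__":
    for code in (503, 429, 404, 400):
        print(code, classify(code))
    print(backoff_delays(5))
```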

The 400 errors were particularly annoying. They came from Elasticsearch's default http.max_initial_line_length being only 4KB. When Zabbix sends a DELETE request to clean up old scroll contexts, the URL can get enormous — easily blowing past 4KB.

Fix: bump it in your elasticsearch.yml:

http.max_initial_line_length: 16kb
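
To see how quickly a scroll cleanup request can outgrow the default, the sketch below approximates the size of the HTTP request line and compares it with a limit. The scroll IDs are made-up placeholders and the byte count is an approximation of what Elasticsearch measures, so treat the numbers as indicative only.

```python
def request_line_bytes(method: str, path: str, http_version: str = "HTTP/1.1") -> int:
    """Approximate size in bytes of the HTTP request line (without CRLF)."""
    return len(f"{method} {path} {http_version}".encode("utf-8"))


def fits_initial_line(method: str, path: str, limit_bytes: int = 4096) -> bool:
    """True if the request line stays within the configured limit."""
    return request_line_bytes(method, path) <= limit_bytes


if __name__ == "__main__":
    # Hypothetical: 30 scroll IDs of 200 characters each, joined in the URL.
    scroll_ids = ["x" * 200] * 30
    path = "/_search/scroll/" + ",".join(scroll_ids)
    print(request_line_bytes("DELETE", path), "bytes")
    print("fits in 4kb: ", fits_initial_line("DELETE", path, 4 * 1024))
    print("fits in 16kb:", fits_initial_line("DELETE", path, 16 * 1024))
```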

Pacemaker: Only One Zabbix Server Active at a Time

This is critical and easy to get wrong. Zabbix cannot run in active/active mode — if two instances are writing to the same database simultaneously, you get data corruption and chaos. So even though we have 4 Zabbix server instances across the two DCs, only one per "role" should be active at any given time.

flowchart LR
    subgraph AM6["Datacenter AM6"]
        Z1[Zabbix 1 - ACTIVE]
        Z2[Zabbix 2 - standby]
    end
    subgraph GS["Datacenter GS"]
        Z3[Zabbix 3 - standby]
        Z4[Zabbix 4 - standby]
    end
    Z2 & Z3 & Z4 -.->|managed by| PC[Pacemaker / Corosync]
    Z1 -->|primary| PC

Pacemaker + Corosync handles which instance is active. Standby nodes are kept ready but not running until needed.
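
The rule "at most one active server per role" is easy to state as an invariant. The sketch below is an external sanity check over an inventory of instances, not Pacemaker configuration. The `Instance` type and the role names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Instance(NamedTuple):
    name: str     # hypothetical instance label
    role: str     # e.g. "infra" or "clients"
    active: bool  # True if the Zabbix server process is running


def conflicting_roles(instances: Iterable[Instance]) -> List[str]:
    """Return the roles that have more than one active Zabbix server."""
    counts = Counter(i.role for i in instances if i.active)
    return sorted(role for role, n in counts.items() if n > 1)


if __name__ == "__main__":
    inventory = [
        Instance("am6-zbx-infra", "infra", True),
        Instance("gs-zbx-infra", "infra", False),
        Instance("am6-zbx-clients", "clients", True),
        Instance("gs-zbx-clients", "clients", True),  # misconfiguration
    ]
    print(conflicting_roles(inventory))
```

A check like this could run from a monitoring item of its own, so a double-active situation raises an alert before it has time to corrupt data.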

The Frontend: Multiple Zabbix UIs on One Server

We needed to serve multiple Zabbix frontends — one for infra monitoring, one for global/client monitoring — from the same web server, through HAProxy, across both DCs.

flowchart TD
    U1[infra-monitoring] --> HAP[HA Proxy]
    U2[global-monitoring] --> HAP
    HAP --> AM6_IFR[AM6: vhost /var/zabbix-inframon]
    HAP --> AM6_GLB[AM6: vhost /var/zabbix-global]
    HAP --> GS_IFR[GS: vhost /var/zabbix-inframon]
    HAP --> GS_GLB[GS: vhost /var/zabbix-global]

Making this work requires copying the Zabbix PHP frontend to separate directories and editing two files in each copy:

# Copy PHP frontend
cp -r /usr/share/zabbix/ /usr/share/zabbix-infra/

# Copy config
cp -r /etc/zabbix /etc/zabbix-infra

Then edit these two files to point to the correct config path:

include/classes/core/ZBase.php:276 — path to maintenance.php
include/classes/core/CConfigFile.php:27 — path to zabbix.conf.php

Each frontend gets its own zabbix.conf.php pointing to the right database and Elasticsearch endpoint.

The Full Frontend Access Layer

The complete access layer looks like this — PowerDNS at the top doing health-aware DNS, HAProxy in the middle load-balancing across both DCs, and then the full Zabbix stack behind it:

flowchart TD
    DNS[PowerDNS with LUA] -->|health-aware routing| HAP["HAProxy (AM6 + GS)"]
    HAP --> AV[Apache Vhost]
    HAP --> LDAP[LDAP Auth]
    HAP --> ZS[Zabbix Server]
    HAP --> MY[MySQL / Percona]
    HAP --> ES[Elasticsearch]

PowerDNS + Lua: Smart DNS Failover

The clever bit at the DNS layer is using PowerDNS Lua records to do health-check-based DNS failover. The ifurlup function checks whether a monitor URL responds and returns the appropriate IP:

inframon-sentia.net  1  IN  LUA A
  "ifurlup('https://infra-monitoring/site-alive', {
    {'185.133.x.x'}, {'213.264.x.x'}
  })"

For external DNS failover:

DNS lives outside both DCs
Uses HAProxy's monitor-uri feature as the health source
ifurlup has orderable targets (AM6 preferred, GS fallback)

For internal DNS failover:

DNS lives inside both DCs
Same monitor-uri health check
Points to internal addresses for replication traffic
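
The ordered-target behaviour of ifurlup described above can be modelled in a few lines. This is a simplified model, not PowerDNS code: it returns the candidate addresses from the first group that has any healthy member, and falls back to all addresses if nothing is up. Real PowerDNS additionally applies a selector to pick from the candidates. The addresses are documentation-range placeholders.

```python
from typing import Callable, List, Sequence


def ifurlup_model(groups: Sequence[Sequence[str]],
                  is_up: Callable[[str], bool]) -> List[str]:
    """Return healthy addresses from the first group that has any."""
    for group in groups:
        alive = [ip for ip in group if is_up(ip)]
        if alive:
            return alive
    # Nothing is healthy: hand back every address rather than an empty answer.
    return [ip for group in groups for ip in group]


if __name__ == "__main__":
    AM6 = "192.0.2.10"     # placeholder address for the preferred DC
    GS = "198.51.100.10"   # placeholder address for the fallback DC
    groups = [[AM6], [GS]]

    print(ifurlup_model(groups, lambda ip: True))      # AM6 preferred
    print(ifurlup_model(groups, lambda ip: ip == GS))  # AM6 down, GS answers
```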

MySQL Percona: Choosing the Right Replication Mode

We evaluated three Percona XtraDB cluster topologies:

The principle: the more automatic the better. Active/Active is the most resilient but also the most complex.

The XtraDB Headaches

With Active/Active at scale, we hit some gnarly problems:

InnoDB: BF-BF X lock conflict, mode: 1027 supremum: 0
Slave SQL: Could not execute Delete_rows event on table zabbix_infra.problem;
Can't find record in 'problem', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND

Key lessons:

Multi-database writes are not an advantage in Galera-based clusters — they create BF-BF (brute-force vs brute-force) lock conflicts
Async failures stop the whole cluster — one node falling behind can stall everything
Watch out for: disk speed, data volume, and latency between nodes

Other Things Worth Knowing

A few operational observations that didn't fit neatly elsewhere:

Percona requires an Arbiter — one per DC, ideally in a separate Pacemaker cluster to avoid split-brain
Pacemaker can stretch across both DCs or run as two separate clusters coordinated by a Booth Cluster Ticket Manager
Elasticsearch can run in Cross-Cluster Search (CCS) mode, or if latency is very low, you can stretch a single cluster
Kibana, Grafana, Zabbix FE and other web services can all co-exist on the same servers as multiple instances
All internal component traffic goes through the stretched VLAN — not the internet entry points
Ansible handles all server provisioning and configuration deployment
HAProxy MySQL load balancing needs custom health check scripts — standard TCP checks don't tell you if the Galera node is read-write capable
Grafana had compatibility issues with Elasticsearch at the time — version pinning was necessary
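
On the HAProxy health check point: a TCP check only proves that the port is open. The sketch below shows the kind of decision a custom check could make from Galera status values. The status variable names are standard Galera ones, but the function names are hypothetical, fetching the values from MySQL is left out, and the original check scripts are not shown in the talk.

```python
from typing import Mapping


def galera_node_writable(status: Mapping[str, str],
                         variables: Mapping[str, str]) -> bool:
    """Decide whether a Galera node should receive write traffic."""
    return (
        status.get("wsrep_cluster_status") == "Primary"           # part of the quorum
        and status.get("wsrep_local_state_comment") == "Synced"   # not joining or donating
        and status.get("wsrep_ready") == "ON"                     # accepting queries
        and variables.get("read_only", "OFF") == "OFF"            # not forced read-only
    )


def http_check_status(status: Mapping[str, str],
                      variables: Mapping[str, str]) -> int:
    """HTTP code an HAProxy httpchk-style endpoint could return."""
    return 200 if galera_node_writable(status, variables) else 503


if __name__ == "__main__":
    healthy = {"wsrep_cluster_status": "Primary",
               "wsrep_local_state_comment": "Synced",
               "wsrep_ready": "ON"}
    donor = dict(healthy, wsrep_local_state_comment="Donor/Desynced")
    print(http_check_status(healthy, {"read_only": "OFF"}))
    print(http_check_status(donor, {"read_only": "OFF"}))
```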

Migration: How We Got There

Moving 4,500 hosts from a chaotic multi-tool environment into this new stack followed a four-phase approach:

timeline
    title Migration Timeline
    First Steps : Export configurations from each client/group
                : Preparation work
    Connectivity : Import and test all config on new server
                 : Firewall rules
    No Historical Data : Proxy config pointing to new destination
                       : Manual work and verification
    Observation : Notification tests
                : ACL cascading
                : Harmonization and finalization

The "no historical data" phase is just a reality you have to accept — historical metrics don't migrate, only the configuration. Users need to be prepared for that.

What We Concluded

After all of this, here's what we'd tell anyone attempting something similar:

Multi-datacenter Zabbix works — it's feasible, we proved it
Elasticsearch for history is better suited to lower data volumes (at least up to Zabbix 4.0.x — things may have improved since)
Percona Active/Active is promising but imperfect — diagnosing where problems originate is hard
Build flexibility into your failover — there are many ways to configure fallback, and having options means small changes can recover from architectural failures
Increase your proxy buffers — bigger buffers protect you during failovers, updates, and unexpected load spikes
Keep each DC as independent as possible — the more self-sufficient each site is, the more resilient the whole system becomes
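
On the proxy buffer advice, a rough sizing sketch can help decide how much headroom to configure. It assumes the buffer is expressed in whole hours, as with the Zabbix proxy ProxyOfflineBuffer setting. The safety factor and the values-per-second figure are made-up inputs for illustration, not numbers from this environment.

```python
import math


def offline_buffer_hours(expected_outage_minutes: float,
                         safety_factor: float = 2.0) -> int:
    """Whole hours of proxy buffering to cover an outage with headroom."""
    if expected_outage_minutes < 0 or safety_factor <= 0:
        raise ValueError("outage must be >= 0 and safety_factor > 0")
    hours = expected_outage_minutes * safety_factor / 60.0
    return max(1, math.ceil(hours))


def buffered_values(values_per_second: float, outage_minutes: float) -> int:
    """Number of values a proxy accumulates while the server is unreachable."""
    return int(values_per_second * outage_minutes * 60)


if __name__ == "__main__":
    # Hypothetical: a 90-minute failover or upgrade window, 500 values/s on one proxy.
    print(offline_buffer_hours(90), "hours of buffer")
    print(buffered_values(500, 90), "values to replay afterwards")
```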

Conference Historical Page

📄 Zabbix-Environment-Conference.pptx.pdf

Zabbix Summit 2019