Once your MongoDB change streams are flowing through Kafka and landing in ClickHouse, you’ve got a live, queryable event history for every state change in your systems.
The obvious next step: start using it immediately — even before you build full-blown dashboards or machine learning models.
1. Detect Missing or Late Events
One of the fastest wins is catching when something doesn’t happen. If you know a collection normally sees events every day, you can query for absences. A plain GROUP BY can only return groups that have at least one row, so the trick is to scan a recent window and flag collections that were active in it but have produced nothing today:
SELECT ns_db, ns_coll,
       countIf(ts >= today()) AS events_today
FROM record_raw
WHERE ts >= today() - 7
GROUP BY ns_db, ns_coll
HAVING events_today = 0;
You can also set up a query to check if an expected supplier or consumer is late:
SELECT JSON_VALUE(tags, '$.consumer_id') AS consumer,
       max(ts) AS last_event
FROM record_raw
WHERE ts >= now() - INTERVAL 2 DAY
GROUP BY consumer
HAVING last_event < now() - INTERVAL 1 HOUR;
2. Monitor Latency Trends
If you store durations as an attribute (e.g. latency_sec), you can plot changes over time:
SELECT toStartOfInterval(ts, INTERVAL 15 MINUTE) AS bucket,
       avg(toFloat64OrZero(JSON_VALUE(attrs, '$.latency_sec'))) AS avg_latency
FROM record_raw
WHERE ts >= now() - INTERVAL 24 HOUR
GROUP BY bucket
ORDER BY bucket;
This is immediately useful for spotting slowdowns before they cause SLA violations.
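Averages can hide tail behavior, so it may be worth keeping a percentile variant of the same query alongside it. A minimal sketch using ClickHouse’s quantile function over the same assumed latency_sec attribute:
SELECT toStartOfInterval(ts, INTERVAL 15 MINUTE) AS bucket,
       -- approximate 95th percentile per 15-minute bucket
       quantile(0.95)(toFloat64OrZero(JSON_VALUE(attrs, '$.latency_sec'))) AS p95_latency
FROM record_raw
WHERE ts >= now() - INTERVAL 24 HOUR
GROUP BY bucket
ORDER BY bucket;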
3. Trace Multi-Step Workflows
Because you kept tags flexible, you can reconstruct sequences of events for a specific job, consumer, or transaction:
SELECT ts, event_type, tags
FROM record_raw
WHERE JSON_VALUE(tags,'$.job_id') = '20250809-abc123'
ORDER BY ts;
This turns ClickHouse into a forensics tool — no extra instrumentation needed.
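To see where the time goes between steps, a window function can attach the gap from the previous event. A sketch, assuming the same job_id tag and a ClickHouse version with window function support; note that the first row of a job shows a meaningless gap because there is no prior event:
SELECT ts,
       event_type,
       dateDiff('second',
                lagInFrame(ts) OVER (ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
                ts) AS secs_since_prev   -- gap from the previous step
FROM record_raw
WHERE JSON_VALUE(tags, '$.job_id') = '20250809-abc123'
ORDER BY ts;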
4. Build Simple SLO Dashboards
Even without Grafana, you can run queries that give you health snapshots:
SELECT event_type,
       count() AS events,
       round(100 * events / (SELECT count() FROM record_raw WHERE ts >= now() - INTERVAL 1 HOUR), 2) AS pct
FROM record_raw
WHERE ts >= now() - INTERVAL 1 HOUR
GROUP BY event_type
ORDER BY pct DESC;
This works for system-wide monitoring until you formalize SLIs/SLOs.
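As a first SLI, the same table can give an hourly error rate. A sketch that assumes failures are recorded with an event_type of 'error' (swap in whatever value your pipeline actually emits):
SELECT toStartOfHour(ts) AS hour,
       countIf(event_type = 'error') AS errors,
       count() AS total,
       round(100 * errors / total, 2) AS error_pct
FROM record_raw
WHERE ts >= now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour;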
5. Train Your First Baseline Models
With historical data accumulating, you can start with lightweight predictive checks:
- ARIMA on event counts to forecast expected volume
- Shallow neural nets to flag unusual patterns
- Simple rules like “if supplier X’s latency > mean + 3σ, alert” (see the SQL sketch after this list)
Because ClickHouse is fast, you can export data for training without disrupting queries.
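That last rule doesn’t need an ML framework at all; it is one ClickHouse query. A sketch assuming a supplier_id tag and the latency_sec attribute (both names are illustrative), comparing the last hour against the prior week per supplier:
SELECT supplier,
       avgIf(lat, ts <  now() - INTERVAL 1 HOUR) AS baseline_mean,
       stddevPopIf(lat, ts <  now() - INTERVAL 1 HOUR) AS baseline_sd,
       avgIf(lat, ts >= now() - INTERVAL 1 HOUR) AS recent_mean
FROM
(
    SELECT ts,
           JSON_VALUE(tags, '$.supplier_id') AS supplier,
           toFloat64OrZero(JSON_VALUE(attrs, '$.latency_sec')) AS lat
    FROM record_raw
    WHERE ts >= now() - INTERVAL 7 DAY
)
GROUP BY supplier
HAVING recent_mean > baseline_mean + 3 * baseline_sd;   -- the mean + 3σ check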
6. Automate Interventions
Once you have a stream of tagged, timestamped events, you can:
- Trigger follow-up jobs if certain sequences are incomplete (see the sketch below)
- Send alerts only when patterns deviate, not on every single event
- Replace some manual SRE “eyes-on-glass” work with automated state checks
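For the first item, a scheduled query can surface jobs that started but never finished, which a small worker can then retry or escalate. A sketch where the event types 'job_started' and 'job_completed' and the job_id tag are assumed names:
SELECT JSON_VALUE(tags, '$.job_id') AS job_id,
       minIf(ts, event_type = 'job_started') AS started_at
FROM record_raw
WHERE ts >= now() - INTERVAL 1 DAY
  AND event_type IN ('job_started', 'job_completed')
GROUP BY job_id
HAVING countIf(event_type = 'job_started') > 0
   AND countIf(event_type = 'job_completed') = 0
   AND started_at < now() - INTERVAL 1 HOUR;   -- started over an hour ago, still no completion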
Bottom line: As soon as CDC data hits ClickHouse, you can monitor, alert, and analyze — before you invest in bigger architecture changes. It’s the shortest path from “we captured the events” to “we’re making better decisions in real time.”