Concept Note: Governance for Self-Managing Event Systems

This note outlines a potential PhD research direction focused on enabling large-scale event-driven systems to self-discover their operational structure, assess risk, and take safe, explainable actions. The work combines temporal modeling, machine learning, and governance principles, with applications in data infrastructure and AI safety.

Problem Statement

Modern data infrastructures (pipelines, schedulers, CDC systems) produce massive streams of events. Operators (SREs, data engineers) currently monitor, correlate, and intervene manually to handle failures or delays. The goal is to formalize this process: can a system learn from its own history to automatically surface what should happen, when, and what to do when things go wrong, without hand-maintained DAGs or crontabs? ...

August 30, 2025 · 2 min · Ted Strall