Active Directory security · pre-registered study

Anatomy of an undetected kill chain

We set out to answer a basic question: which structural feature of an Active Directory network best predicts how exposed it is to attack? We generated 500 synthetic AD topologies and scored each one two ways — with a validated risk model, and with an independent reachability oracle that never consults the model. The density of abusable permissions and delegation rights, not the size of the network, most strongly predicts the reachable attack surface. We also found that the risk model, while it ranks attacks accurately, largely compresses this structural signal. We think that gap is worth understanding before anyone relies on a model's risk scores.

data: 500 synthetic topologies, none excluded · cost: $0, runs locally, no attack executed · model: frozen, validated in earlier work

01 The question

Defenders are often told that bigger networks are more dangerous, or that deep group nesting is the thing to worry about. In practice it is not obvious which structural property of a directory most determines whether an attacker can actually reach a privileged target. We wanted to measure this directly, and to do so in a way that did not depend on trusting any single model's judgment.

We also wanted to pre-register the analysis. Before generating a single graph, we wrote down the predictors we would test, the metrics we would use, and the threshold a correlation would need to clear to count as a finding. This prevents the result from being a post-hoc choice of what to report.

02 How we measured it

We generated 500 BloodHound-shaped graphs spanning three generator families, five size tiers from 500 to 10,000 principals, and three density regimes. For each graph we recorded six structural properties and then scored 24 candidate attack paths in two independent ways.

generate · 500 topologies: procedural AD graphs spanning delegation, permission, nesting, and size regimes.

score · model · composed risk surface: the validated model's reachable-and-undetected risk channel, averaged over eight held-out defensive postures.

score · oracle · ground-truth reachability: an independent breadth-first search over the graph that never consults the model.

figure 1 the two scoring paths. one uses the validated risk model; the other is a model-free reachability oracle. comparing them lets us separate what is true of the network from what the model perceives. the pre-registered threshold for a finding was a Spearman correlation of at least 0.30 (p < 0.01).
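The pre-registered rule in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the study's analysis code: the function names are hypothetical, the p-value uses the large-sample Fisher z approximation (an assumption; the study does not say how its p-values were computed), and the rule is read literally as ρ ≥ 0.30 rather than |ρ| ≥ 0.30.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def approx_p_value(rho, n):
    """Two-sided p-value via the Fisher z approximation (assumed, not the study's method)."""
    if abs(rho) >= 1.0:
        return 0.0  # atanh is undefined at +/-1; treat as the limiting case
    z = math.atanh(rho) * math.sqrt((n - 3) / 1.06)
    return math.erfc(abs(z) / math.sqrt(2))

def is_finding(rho, n, min_rho=0.30, alpha=0.01):
    """Literal reading of the pre-registered rule: rho >= 0.30 and p < 0.01."""
    return rho >= min_rho and approx_p_value(rho, n) < alpha
```

Under these assumptions, `is_finding(0.45, 500)` passes while `is_finding(0.13, 500)` does not, and a correlation of 0.35 on only 20 graphs fails on the p-value even though it clears the ρ threshold.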

The model is the same frozen checkpoint that, in earlier work, ranked real attack outcomes on a live forest at a Spearman correlation of 0.89. The oracle is deliberately simple: it computes which targets are reachable from an assumed foothold and how far away they are, using nothing but the graph. If a structural property matters for real exposure, it should show up against the oracle regardless of what the model believes.
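A reachability oracle of this kind can be sketched as a plain breadth-first search. The code below is an illustration under assumptions, not the study's implementation: the edge-list representation, the function names, and the node names in the usage note are all hypothetical.

```python
from collections import deque

def reachable_with_distance(edges, foothold):
    """Breadth-first search: map every node reachable from the foothold to its hop count."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    distance = {foothold: 0}
    queue = deque([foothold])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in distance:  # first visit is the shortest route in BFS
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance

def oracle_summary(edges, foothold, targets):
    """For each target, report its hop distance if reachable; omit it otherwise."""
    distance = reachable_with_distance(edges, foothold)
    return {t: distance[t] for t in targets if t in distance}

# Hypothetical toy graph: directed edges mean "source can act on destination".
EXAMPLE_EDGES = [
    ("user_a", "group_helpdesk"),
    ("group_helpdesk", "server_1"),
    ("server_1", "domain_admins"),
    ("user_a", "workstation_7"),
    ("user_b", "domain_admins"),
]
```

Calling `oracle_summary(EXAMPLE_EDGES, "user_a", ["domain_admins", "backup_server"])` returns `{"domain_admins": 3}`: the privileged group is three hops from the foothold, and the unreachable target is simply absent. Nothing in the computation refers to a model score.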

03 What we found

Against the ground-truth oracle, two properties stand out, and network size is not among them. The density of abusable permissions and the density of delegation rights are the strongest predictors of the reachable attack surface. The number of principals in the network (its size) is only weakly, and slightly negatively, related to it.

ρ_permissions = +0.45 (p = 5×10⁻²⁶, n = 500) · permission density vs reachable surface · strongest predictor

ρ_delegation = +0.42 · delegation density vs reachable surface · second strongest

ρ_size = −0.13 · principal count (network size) vs reachable surface · weak, slightly negative

ρ_model = +0.13 · same predictors against the model's risk score · the signal the model compresses

Ranking all six structural properties by their correlation with the ground-truth reachable surface makes the ordering clear.

structural property	ρ vs ground-truth reachable surface
permission density	+0.45
delegation density	+0.42
lateral-access density	+0.38
group nesting depth	+0.18
local-admin density	+0.13
principal count (size)	−0.13

The full correlation matrix shows both scoring paths at once. The two right-hand columns are the model-free oracle; the two left-hand columns are the model. Permission, delegation, and lateral-access density correlate with the oracle's reachable surface at +0.38 to +0.45, but with the model's risk score at no more than +0.13.

structure × risk: Spearman ρ (n=500).
structural property	model risk	model reachable surface	oracle reachable	oracle score
delegation density	+0.13	+0.22	+0.42	+0.25
permission density	+0.12	+0.23	+0.45	+0.28
lateral-access density	-0.03	+0.00	+0.38	+0.47
group nesting depth	+0.04	+0.09	+0.18	+0.11
local-admin density	-0.05	+0.15	+0.13	-0.00
principal count (size)	+0.05	-0.15	-0.13	+0.00
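The gap between the two scoring paths can be read straight off the matrix. The sketch below copies the "model risk" and "oracle reachable" columns from the table and checks each against the ρ part of the pre-registered threshold. The helper names are hypothetical, and the p-value half of the rule is left out because the table does not report per-cell p-values.

```python
# Spearman rho values copied from the matrix above:
# property -> (model risk, oracle reachable).
CORRELATIONS = {
    "delegation density": (0.13, 0.42),
    "permission density": (0.12, 0.45),
    "lateral-access density": (-0.03, 0.38),
    "group nesting depth": (0.04, 0.18),
    "local-admin density": (-0.05, 0.13),
    "principal count (size)": (0.05, -0.13),
}

RHO_THRESHOLD = 0.30  # rho part of the pre-registered rule only

def summarize(correlations, threshold=RHO_THRESHOLD):
    """Per property: oracle-minus-model gap and which side clears the rho threshold."""
    rows = []
    for name, (model_rho, oracle_rho) in correlations.items():
        rows.append({
            "property": name,
            "gap": round(oracle_rho - model_rho, 2),
            "oracle_clears": oracle_rho >= threshold,
            "model_clears": model_rho >= threshold,
        })
    # Largest gap first: where the model under-weights structure the most.
    return sorted(rows, key=lambda row: row["gap"], reverse=True)

if __name__ == "__main__":
    for row in summarize(CORRELATIONS):
        print(row)
```

On these numbers, three properties clear 0.30 against the oracle (permission, delegation, and lateral-access density) and none clears it against the model's risk score.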

04 Why the model and the ground truth disagree

The same structural density that strongly drives the ground-truth reachable surface barely moves the model's risk score. This is the most useful finding in the study, and it is worth being precise about its cause.

ground truth · density → reachable surface: 0.45

model · density → composed risk: 0.13

the same 500 networks, the same structural axis (Spearman ρ)

figure 2 structural density correlates with the ground-truth reachable surface at 0.45 but with the model's composed risk score at only 0.13.

The model is a strong ranker of attack outcomes; that is what it was built and validated to do. But internally it reduces each candidate path to a short host-chain, so it effectively sees two things about a target: whether it is reachable, and how many hops away it is. It does not see the fine-grained permission and delegation structure that, in the ground truth, multiplies the number of reachable paths. The result is a model that ranks well but is less sensitive to the structural features that generate the risk it is ranking.
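A toy example makes the blind spot concrete. The sketch below counts distinct shortest paths alongside hop distance. Counting shortest paths is only an illustrative stand-in for "the number of reachable paths", not the study's measure, and the graphs and names are hypothetical.

```python
from collections import deque

def hops_and_path_count(edges, source, target):
    """Return (hop distance, number of distinct shortest paths), or (None, 0) if unreachable."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    distance = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in distance:
                # First discovery: inherit the path count of the discovering node.
                distance[nxt] = distance[node] + 1
                paths[nxt] = paths[node]
                queue.append(nxt)
            elif distance[nxt] == distance[node] + 1:
                # Another equally short route into nxt: add its paths.
                paths[nxt] += paths[node]
    if target not in distance:
        return None, 0
    return distance[target], paths[target]

# Sparse permissions: a single route to the target.
SPARSE = [("foothold", "host_a"), ("host_a", "target")]

# Dense permissions: three intermediate hosts all lead to the target.
DENSE = [
    ("foothold", "host_a"), ("foothold", "host_b"), ("foothold", "host_c"),
    ("host_a", "target"), ("host_b", "target"), ("host_c", "target"),
]
```

Both graphs give the same answer to "is the target reachable, and in how many hops?": `hops_and_path_count(SPARSE, "foothold", "target")` returns `(2, 1)` and the dense graph returns `(2, 3)`. A representation that keeps only the first element of each tuple cannot tell the two networks apart.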

We do not read this as a flaw to hide. A model can be accurate at its task and still under-weight inputs that a defender cares about, and knowing where that happens is a prerequisite for trusting the model's scores. It also suggests a concrete improvement: a risk model that represents permission and delegation density more directly should track the ground truth more closely.

05 The highest-risk topologies

The ten networks the model scored as highest-risk, shown with their structural fingerprints, illustrate the same point from the other direction: size does not order the list. Networks of 504 and 10,004 principals sit side by side near the top.

top-10 highest-risk topologies, by model-predicted risk
topology	family	principals	deleg	perms	nest	reachable
n500 · dense	C	504	0.036	0.298	4	19
n10000 · sparse	C	10004	0.002	0.040	2	18
n10000 · dense	C	10004	0.037	0.300	4	18
n10000 · deep	C	10004	0.012	0.100	12	18
n500 · sparse	C	504	0.002	0.040	2	17
n500 · deep	C	504	0.012	0.099	12	17
n5000 · high	A	5004	0.020	0.180	8	16
n500 · dense	C	504	0.036	0.298	4	20

06 Limitations

We want to be clear about what this study does and does not establish.

scope

The risk surface here is model-predicted, not measured against real detection. This synthetic harness has no logs; "undetected" is the model's internal risk channel, validated for ranking in other work but not against real detection on these graphs. The study measures what a validated ranker weights, not a measured detection outcome.

scope

The structural predictors are partially correlated by construction, because the generator's regimes vary permission, delegation, and nesting density together; only one generator family decouples them. We report the simple per-predictor correlations rather than over-interpreting the joint regression. Cross-domain trust is held constant across all graphs, so we cannot assess it, and we say so rather than imputing a result.

scope

These are synthetic topologies. They capture density and depth regimes but not the organic permission sprawl of a real enterprise forest. The direction of the findings should transfer; the exact magnitudes may not.

reproducibility

The model is treated as a frozen black box, and no result is fit to its own test data. Every number above is reconstructable from the per-graph records, and the full run regenerates deterministically. The analysis plan was committed and tagged before any graph was scored.