
Jess Sullivan

WinRM Quotas, Plugin Confusion, and Why PSRP Has Been the Answer Since 2018

· 5 min read · devops

Find the full demo repo here - link

How this started

(Well, my first error was agreeing to touch Windows VMs, but anyway)

As part of the grind to move the Sun and Moon around a bit and automate an EMS upgrade + deployment on campus (and the plethora of integrations that hang off this crazy thing) I’ve been running batches of pytest runs, molecule tests etc as part of the slog of the ansible fact gathering —> do a bunch of stuff —> evaluate what went wrong arc. Unfortunately, my quest for more ways to parallelize got me locked out here, there, and everywhere on campus. :eyes:

This is what Ansible gives you when WinRM runs out of capacity:

fatal: [win-target]: UNREACHABLE! => {
    "msg": "Task failed: ntlm: the specified credentials were rejected by the server"
}

“Credentials rejected.” My keys are fine, damnit! What’s actually happening is that the Windows Remote Management service ran out of room- but the error it sends back through the NTLM handshake looks identical to a real auth failure. This matters enormously because Active Directory counts each of these as a failed login attempt. The typical lockout threshold is 5 failures in 15 minutes. I sent 41 in a single burst.

It took me a while to piece together what was going on. Most people know WinRM has a quota system. What I didn’t know- and I suspect most Ansible-on-Windows shops don’t know either- is that there are actually three quota layers stacked on top of each other. The effective limit for any connection is the minimum across all of them.

That bottom layer- the plugin layer at WSMan:\localhost\Plugin\microsoft.powershell\Quotas- is the one that got me. It sits underneath the shell-level quotas everyone googles, and its defaults are actually lower:

Setting                 Shell Default   Plugin Default   Effective
MaxShellsPerUser             30              25              25
MaxConcurrentUsers           10               5               5
MaxProcessesPerShell         25              15              15

So you can raise MaxShellsPerUser to 100 at the shell level and still get blocked by the plugin’s MaxConcurrentUsers of 5. These defaults haven’t changed since WinRM 2.0 shipped with Server 2008 R2- the same values across every Windows Server version through 2025.

I put together a fairly thorough quota behavior writeup covering the exact error codes, SOAP faults, and the surprisingly confusing relationship between Set-Item WSMan:\ and service restarts.

Why pywinrm makes this a forkbomb

Ok so this is the big one.

The default ansible.builtin.winrm connection plugin uses pywinrm, and pywinrm creates a new WinRM shell with a fresh NTLM authentication for every single Ansible task. No connection pooling. No session reuse. No buffering or piping.

The math gets bad fast:

parallel_molecule_processes × ansible_forks × tasks_per_role = total_shell_attempts
            4              ×       5        ×       15       = 300

Three hundred shell creation attempts against MaxConcurrentUsers=5. Five get through, the rest come back as “credentials rejected,” and each of those is a real NTLM auth failure against Active Directory. One shared service account across all your managed Windows hosts means one lockout touches everything.
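Stated as code (a hypothetical helper of mine, same numbers as above):

```python
# Back-of-envelope model of the shell-creation load under pywinrm.
# Every Ansible task opens a fresh WinRM shell with its own NTLM handshake.
def shell_attempts(molecule_procs: int, forks: int, tasks_per_role: int) -> int:
    return molecule_procs * forks * tasks_per_role

total = shell_attempts(4, 5, 15)  # 300 shell creation attempts
```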

This per-task connection model also means that any credential plugin you’re using during execution- KeePassXC lookups, SOPS decryption, 1Password CLI- has to resolve on every single connection rather than once per session. Under parallel load that overhead compounds quickly.

Reproducing it

I put together a demo repo to reproduce this in a controlled way. One thing that tripped me up initially (and I am not proud of how long this took)- Ansible’s forks setting only controls parallelism across hosts. With a single target host you can set forks=50 and everything still runs serially.

The trick is a pressure test inventory with 50 entries all pointing at the same machine:

# ansible/inventory/pressure-test.yml
pressure_targets:
  hosts:
    pressure-01: {}
    pressure-02: {}
    # ... 48 more
    pressure-50: {}
  vars:
    ansible_host: localhost
    ansible_connection: winrm
    ansible_port: 15986
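Hand-writing 50 identical host entries gets old fast; a throwaway generator works just as well (this script is my own sketch, not something from the demo repo):

```python
# Sketch: emit the 50-entry pressure-test inventory instead of typing it out.
def pressure_inventory(n: int = 50, port: int = 15986) -> str:
    hosts = "\n".join(f"    pressure-{i:02d}: {{}}" for i in range(1, n + 1))
    return (
        "pressure_targets:\n"
        "  hosts:\n"
        f"{hosts}\n"
        "  vars:\n"
        "    ansible_host: localhost\n"
        "    ansible_connection: winrm\n"
        f"    ansible_port: {port}\n"
    )

# Write it out: open("ansible/inventory/pressure-test.yml", "w").write(pressure_inventory())
```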

With the shell-level quotas set to defaults (MaxConcurrentUsers=10):

Result                  Count
Total connections         50
SUCCESS                    9
UNREACHABLE               41
AD lockout threshold       5

Nine connections got through- roughly matching MaxConcurrentUsers=10. The other 41 went straight to AD as failed auth attempts. That’s 8x the lockout threshold in a single burst.
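An idealized model of that burst (my own sketch; real runs race a little, which is why 9 rather than 10 made it through):

```python
# Idealized burst model: the first `max_concurrent` shells win the race,
# every other attempt fails NTLM auth and lands on AD as a failed login.
def burst_outcome(connections: int, max_concurrent: int, lockout_threshold: int = 5) -> dict:
    successes = min(connections, max_concurrent)
    failures = connections - successes
    return {
        "success": successes,
        "unreachable": failures,
        "locks_out_ad": failures >= lockout_threshold,
    }

result = burst_outcome(50, 10)  # ~10 succeed, 40 failed logins against AD
```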

Finding PSRP

After staring at the forks vs quotas problem for a while, I stumbled onto ansible.builtin.psrp. From the docs-

Run commands or put/fetch on a target via PSRP (WinRM plugin). This is similar to the ansible.builtin.winrm connection plugin which uses the same underlying transport but instead runs in a PowerShell interpreter.

This solves many of my core qualms with ansible.builtin.winrm- specifically, buffering and piping are inherently possible with PSRP. And critically for this problem- it allows for plugin-level connection pooling. One authenticated connection per fork, multiplexing all commands over a persistent PowerShell Runspace Pool. No per-task shell creation, no per-task NTLM handshake.
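The difference in AD-visible auth traffic is easy to model with a toy function (names are mine, purely illustrative):

```python
# Toy model: NTLM handshakes that hit AD under each connection strategy.
# pooled=False ~ ansible.builtin.winrm (fresh shell + auth per task),
# pooled=True  ~ ansible.builtin.psrp (one Runspace Pool per fork, reused).
def ad_auth_events(tasks_per_fork: int, forks: int, pooled: bool) -> int:
    return forks if pooled else tasks_per_fork * forks

per_task = ad_auth_events(15, 5, pooled=False)  # 75 handshakes
per_fork = ad_auth_events(15, 5, pooled=True)   # 5 handshakes
```

Same workload, fifteen times less authentication traffic for AD to count against you.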

Same pressure test, same 50 connections, but with ansible_connection=psrp:

$ ansible -i inventory/pressure-test.yml pressure_targets -m win_ping -f 50 \
    -e ansible_connection=psrp -e ansible_psrp_auth=ntlm

                             pywinrm   pypsrp
Successes                       9        24
UNREACHABLE (auth failure)     41         0
AD lockout risk               HIGH      None

Zero authentication failures. The remaining PSRP failures were TCP timeouts from my SSH tunnel- an infrastructure bottleneck, not an auth problem. The connection pooling also means that credential plugin resolution (my KeePassXC lookups, SOPS, whatever you’re using) happens once per connection rather than once per task.

The psrp plugin has been in ansible.builtin (ansible-core) since Ansible 2.7- October 2018. Same author as pywinrm- Jordan Borean. It’s been sitting there for seven years. Better late to the party than never.

# group_vars/windows.yml
ansible_connection: psrp
ansible_psrp_auth: ntlm
ansible_psrp_protocol: https
ansible_psrp_cert_validation: ignore

The quota limits can still be an issue with enough forks, but since I have admin credentials I can raise those on the fly during development. The authentication flood problem- the thing that actually locks you out of AD- is just gone with PSRP.

The other footgun I found

While resetting quotas to Windows defaults for benchmarking, I tried restarting WinRM over WinRM:

- name: restart winrm
  ansible.windows.win_service:
    name: WinRM
    state: restarted

This is, in retrospect, obviously a bad idea. The service stops (killing the connection I’m using to issue the restart), fails to come back up properly, and I’m left with a box that won’t accept remote management at all. Start-Service WinRM from RDP also failed- the service was genuinely corrupted, not just stopped. Full OS reboot was the only way back.

The good news is that WSMan quota changes take effect immediately on new connections without a restart. I didn’t need the handler at all. Removed it, documented the finding, moved on.

yo check it out

Everything I found during this investigation- the quota research, the benchmark data, the Ansible roles for managing all of this- is in a demo repo with a companion docs site. A few things in there that might be useful if you’re running into similar problems:

  • A winrm_quota_config role that manages both shell-level and plugin-level quotas idempotently
  • A winrm_monitoring role that deploys Prometheus metrics for active shell counts and quota utilization
  • Dhall-typed benchmark profiles for systematic forkbomb reproduction
  • The full quota research covering defaults by Windows version, GPO override behavior, registry paths, and what the DISA STIGs actually say (spoiler- they don’t constrain quota values at all)


-Jess
