Skip to content
Great Blue Heron

Trans: Latin prefix implying “across” or “Beyond”, often used in gender nonconforming situations Scend: Archaic word describing a strong “surge” or “wave”, originating with 15th century english sailors Survival: 15th century english compound word describing an existence only worth transcending

Jess Sullivan

Host Characterization with Chapel, Fancy Testing & Dhall

· 4 min read · hardware

As part of my descent into multiprocessor & timing madness I’ve needed a repeatable characterization of behavior. Using these lovelies I have have screwed together something of a minimally viable EEG, XR stream and audio buffer “timing-friendlyish” characterization bench. 👀

  • quickchpl Hey look, this cool person named Jess Sullivan wrote this handy library for property-based testing in Chapel language, how else would I structure timing invariants?
  • Dhall for typed evidence records (go read Gabby’s blog)
  • Nix for hermetic compiler sourcing, of course ^w^

It needs Chapel Lang

HostNumaProbe

HostNumaTiming

TimingProofs

TestTimingProofs

TestHostNumaTiming

quickchpl

Probing

HostNumaProbe.chpl partitions synthetic channel data across NUMA nodes, then measures serial versus parallel reduction:

use Time;
use HostNumaTiming;
use TimingProofs;

config const numChannels = 100;
config const numSamples = 250 * 10;
config const sampleRateHz = 250;
config const partitions = 2;

The config const declarations are Chapel’s mechanism for runtime-configurable parameters — compile once, vary at invocation. 250 * 10 samples at 250 Hz gives a 10-second synthetic EEG window.

Data generation uses forall to distribute across cores:

var data: [0..<numChannels, 0..<numSamples] real;
forall ch in 0..<numChannels {
  const signal = syntheticSignal(numSamples, ch);
  for s in 0..<numSamples do
    data[ch, s] = signal[s];
}

The timing surface is a serial/parallel comparison. Serial is a single-threaded nested loop; parallel distributes the outer loop via forall:

var serialTimer = new stopwatch();
serialTimer.start();
var serialTotal: real = 0.0;
for ch in 0..<numChannels {
  for s in 0..<numSamples do
    serialTotal += abs(data[ch, s]);
}
serialTimer.stop();

var parallelTimer = new stopwatch();
parallelTimer.start();
var channelTotals: [0..<numChannels] real;
forall ch in 0..<numChannels {
  var total: real = 0.0;
  for s in 0..<numSamples do
    total += abs(data[ch, s]);
  channelTotals[ch] = total;
}

Chapel’s forall is not #pragma omp parallel for. It is a first-class parallel iterator that the compiler can map to the locale model’s task layer. Under the flat locale model (CHPL_LOCALE_MODEL=flat), all 32 threads share one locale, and forall distributes across task-parallel threads. Under a NUMA-aware locale model, channel partitions can be aligned to socket-local memory.

The type surface

The core types in HostNumaTiming.chpl:

record Partition {
  var start: int;
  var count: int;
}

record TimingStats {
  var minSeconds: real;
  var maxSeconds: real;
  var meanSeconds: real;
  var samples: int;
}

partitionChannels distributes channels across partitions with remainder handling:

proc partitionChannels(numChannels: int, partitions: int): [0..<partitions] Partition {
  var parts: [0..<partitions] Partition;
  const base = numChannels / partitions;
  const remainder = numChannels % partitions;
  var offset = 0;
  for i in 0..<partitions {
    const count = base + (if i < remainder then 1 else 0);
    parts[i] = new Partition(start = offset, count = count);
    offset += count;
  }
  return parts;
}

The conformance predicate in TimingProofs.chpl:

record TimingBudget {
  var targetPeriodSeconds: real;
  var maxJitterSeconds: real;
  var maxLatencySeconds: real;
}

proc timingConforms(intervals: list(real), budget: TimingBudget): bool {
  if budget.targetPeriodSeconds < 0.0 then return false;
  if budget.maxJitterSeconds < 0.0 then return false;
  if budget.maxLatencySeconds < 0.0 then return false;
  if intervals.size == 0 then return true;
  return maxAbsDeviation(intervals, budget.targetPeriodSeconds)
           <= budget.maxJitterSeconds + 1e-12
      && worstCase(intervals) <= budget.maxLatencySeconds + 1e-12;
}

This is the kind of predicate that benefits from PBT: the 1e-12 epsilon, the edge cases around empty intervals and negative budgets, the interaction between max-deviation and worst-case constraints.

Property-based testing for greatness

use quickchpl;
use TimingProofs;

var roundTrip = property("monotonic timestamps round-trip to intervals",
  listGen(realGen(0.0, 0.05), 1, 128),
  proc(intervals: list(real)): bool {
    const timestamps = timestampsFromIntervals(intervals);
    const recovered = intervalsFromTimestamps(timestamps);
    if !isMonotonicNondecreasing(timestamps) then return false;
    if recovered.size != intervals.size then return false;
    for i in 0..<intervals.size {
      if !approxEqual(recovered[i], intervals[i]) then return false;
    }
    return true;
  });
var r1 = check(roundTrip, 200);

The generator listGen(realGen(0.0, 0.05), 1, 128) produces lists of 1-128 non-negative reals — synthetic inter-sample intervals. The property verifies that intervalsFromTimestamps is a left inverse of the timestamp constructor.

The scaling invariant is the most interesting:

var scalingInvariant = property("scaling intervals and budget preserves conformance",
  tupleGen(
    realGen(2.0, 256.0),   // sample count
    realGen(0.004, 0.02),  // period
    realGen(0.0, 1.0),     // jitter fraction
    realGen(0.1, 10.0),    // scale factor
    realGen(0.0, 10000.0)  // seed
  ),
  proc(args: (real, real, real, real, real)): bool {
    // ... generate jittered timestamps, scale both intervals and budget ...
    return timingConforms(intervals, budget)
        && timingConforms(scaledIntervals, scaledBudget);
  });

If timingConforms(intervals, budget) holds, then scaling both the intervals and the budget by the same positive factor should preserve conformance. This catches unit-conversion bugs and numerical edge cases in the epsilon handling, which is terrific 👍

From TestHostNumaTiming.chpl, the partition properties:

var partitionCoverage = property("partition coverage equals channel total",
  tupleGen(
    listGen(realGen(-1.0, 1.0), 0, 512),
    realGen(1.0, 16.0)
  ),
  proc(args: (list(real), real)): bool {
    const (channels, groupApprox) = args;
    const groups = max(1, min(16, groupApprox: int));
    const parts = partitionChannels(channels.size, groups);
    return totalCovered(parts) == channels.size;
  });

var partitionContiguous = property("partitions are contiguous and exact",
  // ... same generator shape ...
  proc(args: (list(real), real)): bool {
    // ...
    return coversExactly(parts, channels.size);
  });

coversExactly checks that partitions are contiguous, non-negative, and sum to the total. Together with coverage, these two properties establish that partitionChannels is a correct partition function for any channel count and group count in range.

HostNumaTiming (5 properties)

TimingProofs (4 properties)

round-trip: timestamps ↔ intervals

exact budget: zero-jitter ⊢ conforms

bounded jitter: ε-jitter ⊢ conforms

scaling: λ·intervals, λ·budget ⊢ conforms

coverage: Σ partition = total

contiguous: partitions exact + ordered

deterministic: signal(n,s) = signal(n,s)

ordered: min ≤ mean ≤ max

scaling: λ·samples → λ·stats

timingConforms

partitionChannels

syntheticSignal

summarizeSamples

Projection for cleanliness

Each probe run is projected into a typed Dhall record via ChapelHostProbeRun.dhall. The type carries metadata (date, host, compiler, kernel), configuration (locale count, channels, samples, sample rate, partitions), results (timing conforms, jitter stats), and performance (serial time, parallel time, speedup ratio). A separate KernelValidationRun.dhall type covers kernel validation captures.

The pipeline is one command:

just platform-host-characterization-window 
  target=jess@honey 
  tag=generic-repeat-2026-04-26 
  expect_lane=generic 
  smi_samples=3 smi_duration=120 
  hwlat_duration=120 chapel_samples=5

This captures SMI counts, tracefs hwlat windows, five Chapel probe runs, and projects each into a Dhall record. The Nix derivation for Chapel 2.8.0 pins the compiler, task model (qthreads), locale model (flat), and LLVM version (18).

just platform-host-characterization-window

smi-validate (3× 120s windows)

tracefs hwlat (3× 120s windows)

capture-chapel-host-probe-series (5 samples)

Nix: Chapel 2.8.0 + qthreads + flat locale + LLVM 18

chpl HostNumaProbe.chpl -M src

serial + forall timing on target

raw capture .txt per sample

project-chapel-host-probe-dhall

typed Dhall record per sample

evidence packet

Measurements; kinda meh but is repeatable 🤙

The host runs two kernel lanes from linux-xr (my Rocky 10 kernel carry with DRM patches and a variety of other stuff for the Bigscreen Beyond headset and related toys):

  • Generic: 6.19.5-7.xr.el10 (PREEMPT_DYNAMIC)
  • RT: 6.19.5-rt1-8.xr.el10 (PREEMPT_RT)

Both lanes share the same isolcpus, nohz_full, irqaffinity, and tsc=nowatchdog cmdline posture.

Generic repeat (five samples)

MetricMinMaxMeanStdev
Serial0.021841s0.027597s0.023822s0.002590s
Parallel0.001768s0.002382s0.001962s0.000249s
Ratio9.3375x14.1814x12.2760x1.8093x

SMI: 280, 279, 279 per 120s (~2.3/s). Tracefs hwlat: 0 us max across all windows. Load average rose from ~8 to ~18 across samples — this is characterization under lab load, not an idle benchmark.

RT repeat (five samples)

MetricMinMaxMeanStdev
Serial0.022144s0.023715s0.022862s0.000690s
Parallel0.001726s0.016977s0.005033s0.006683s
Ratio1.3617x12.8297x9.3204x4.6220x

SMI: 279, 279, 278 per 120s (unchanged). Tracefs hwlat: 2, 2, 14 us (one threshold crossing). All five samples conform to the timing predicate, but sample 2 collapsed to 1.3617x — the parallel path took 16.977ms instead of the typical ~1.8ms.

Per-sample comparison

Characterization Ratio (serial / parallel) by SampleS1S2S3S4S51614121086420Ratio

The generic lane holds between 9x and 14x. The RT lane’s sample 2 is the story: 1.36x while every other sample is 9x+. That outlier drove the RT parallel mean to 5.033ms (generic: 1.962ms) and inflated stdev from 1.81x to 4.62x.

Parallel Reduction Time (ms) by SampleS1S2S3S4S5181614121086420Parallel ms

Serial times (not shown) are nearly identical across both lanes (~22-27ms). The parallel path is where PREEMPT_RT’s sleeping-lock conversion costs show up.

Raw CSV.

What the RT result means

PREEMPT_RT converts spinlocks to sleeping mutexes and threads interrupt handlers. The parallel degradation (serial unchanged, parallel slower and more variable) is the expected cost of that conversion on a NUMA-aware forall workload. The RT necessity analysis in the repo has the full reasoning.

RT is still available as insurance for specific deadline-sensitive paths — audio buffer underruns at low sample counts, or BCI callback timing if we ever need sub-millisecond guarantees. The linux-xr kernel carries both lanes, and the host can one-shot boot into RT and return to generic with a validated fallback. But the current evidence says the generic lane with isolcpus + SCHED_FIFO is the stronger default for a mixed GPU + audio + Chapel + Kubernetes workload.

What’s next for me and the Dell ^w^

  • Targeted isolation: SCHED_FIFO on isolated cores for audio/BCI threads, then measure actual deadline behavior on the generic lane
  • publish and manufacture the Dell hardware mods: at somepoint I’ll get around to this, still cleaning up loose ends
  • Audio/BCI I/O packet: period, buffer, quantum, xrun evidence — the real latency-sensitive workload, not a proxy
  • XR frame timing: xrWaitFrame histograms via Monado (this is XoxDWM-owned, not Dell)
  • Fresh quickchpl run: the nine properties need a current pass for paper citation, ofc
  • Fan and enclosure: the T7810 thermal path is its own workstream

The Dell-7810 repo which may or may not be public at the time of reading contains hardware specifics, as well as my OpenSCAD drawings and measurements for the physical modification work. To recap, the Dell 7810 itself has two Xeon E5-2630 v3 (Haswell), 32 threads, two NUMA nodes. It is, as described before, considerably hacked up- (running two power supplies at ~2kw mean draw, an AMD RX 9070 XT, a few hundred channels 96khz AD/DA I/O, a NUT power monitor and an RKE2 cluster) as well as hosting a wide variety of BCI/XR experiments (Open BCI, ganglion, mouth trackers, eye trackers) along with OpenXR, BS2e goggles, my special linux-xr kernel iterations, the xoxdwm stack, Monado, just silly girlie things. ^w^

Cheers all and merry springtime, -Jess

Related Posts

Comments

Loading comments...