<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>null program</title>
  <link rel="alternate" type="text/html" href="https://nullprogram.com"/>
  <link rel="self" type="application/atom+xml" href="https://nullprogram.com/feed/"/>
  <updated>2024-06-10T03:14:45Z</updated>
  <id>urn:uuid:f8b65823-4ec5-3a70-efc8-2b713aa63091</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com/</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  
  <entry>
    <title>Solving "Two Sum" in C with a tiny hash table</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/06/26/"/>
    <id>urn:uuid:5d15318f-6915-4f72-8690-74a84d43d2f7</id>
    <updated>2023-06-26T19:38:18Z</updated>
    <category term="c"/><category term="go"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I came across a question: How does one efficiently solve <a href="https://leetcode.com/problems/two-sum/">Two Sum</a> in C?
There’s a naive quadratic time solution, but also an amortized linear time
solution using a hash table. Without a built-in or standard library hash
table, the latter sounds onerous. However, a <a href="/blog/2022/08/08/">mask-step-index table</a>,
a hash table construction suitable for many problems, requires only a few
lines of code. This approach is useful even when a standard hash table is
available, because by <a href="https://vimeo.com/644068002">exploiting the known problem constraints</a>, it
beats typical generic hash table performance by 1–2 orders of magnitude
(<a href="https://gist.github.com/skeeto/7119cf683662deae717c0d4e79ebf605">demo</a>).</p>

<p>The Two Sum exercise, restated:</p>

<blockquote>
  <p>Given an integer array and target, return the distinct indices of two
elements that sum to the target.</p>
</blockquote>

<p>In particular, the solution doesn’t find elements, but their indices. The
exercise also constrains input ranges — important but easy to overlook:</p>

<ul>
  <li>2 &lt;= <code class="language-plaintext highlighter-rouge">count</code> &lt;= 10<sup>4</sup></li>
  <li>-10<sup>9</sup> &lt;= <code class="language-plaintext highlighter-rouge">nums[i]</code> &lt;= 10<sup>9</sup></li>
  <li>-10<sup>9</sup> &lt;= <code class="language-plaintext highlighter-rouge">target</code> &lt;= 10<sup>9</sup></li>
</ul>

<p>Notably, indices fit in a 16-bit integer with lots of room to spare. In
fact, they fit in a 14-bit address space (16,384) with still plenty of
overhead. Elements fit in a signed 32-bit integer, and we can add and
subtract elements without overflow, if just barely. The last constraint
isn’t redundant, but it’s not readily exploitable either.</p>

<p>The naive solution is to linearly search the array for the complement.
With nested loops, it’s obviously quadratic time. At 10k elements, we
expect an abysmal 25M comparisons on average.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int16_t</span> <span class="n">count</span> <span class="o">=</span> <span class="p">...;</span>
<span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span> <span class="o">=</span> <span class="p">...;</span>

<span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">count</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">target</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// found</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
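<p>For reference, here is a minimal, self-contained sketch of that quadratic
search as a complete function. The <code class="language-plaintext highlighter-rouge">Pair</code> type, the <code class="language-plaintext highlighter-rouge">twosum_naive</code> name,
and the comparison counter are assumptions made for illustration, not part
of the exercise.</p>

```c
#include <stdint.h>

// Hypothetical result type for this sketch: two indices plus a flag.
typedef struct {
    int16_t i, j;
    _Bool ok;
} Pair;

// Quadratic search: test every pair (i, j) with i < j at most once.
// If comparisons is non-null it receives the number of sums tested.
Pair twosum_naive(const int32_t *nums, int16_t count, int32_t target,
                  int32_t *comparisons)
{
    Pair r = {0};
    int32_t tested = 0;
    for (int16_t i = 0; i < count-1 && !r.ok; i++) {
        for (int16_t j = i+1; j < count; j++) {
            tested++;
            if (nums[i]+nums[j] == target) {
                r.i = i;
                r.j = j;
                r.ok = 1;
                break;  // the outer loop stops via !r.ok
            }
        }
    }
    if (comparisons) {
        *comparisons = tested;
    }
    return r;
}
```

<p>For <code class="language-plaintext highlighter-rouge">{2, 7, 11, 15}</code> with target 9 it stops after a single comparison.
When there is no solution it tests all <code class="language-plaintext highlighter-rouge">count*(count-1)/2</code> pairs, which is
just under 50 million at the 10<sup>4</sup> limit.</p>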

<p>The <code class="language-plaintext highlighter-rouge">nums</code> array is “keyed” by index. It would be better to also have the
inverse mapping: key on elements to obtain the <code class="language-plaintext highlighter-rouge">nums</code> index. Then for each
element we could compute the complement and find its index, if any, using
this second mapping.</p>

<p>The input range is finite, so an inverse map is simple. Allocate an array,
one element per integer in range, and store the index there. However, the
input range is 2 billion, and even with 16-bit indices that’s a 4GB array.
Feasible on 64-bit hosts, but wasteful. The exercise is certainly designed
to make it so. This array would be very sparse, with at most 10,000 of its
2 billion elements populated, about 0.0005%. That’s a hint: Associative arrays are
far more appropriate for representing such sparse mappings. That is, a
hash table.</p>
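<p>To put numbers on that, here is a small sketch of the sizing arithmetic
for such a direct-address table, assuming the stated element range of
-10<sup>9</sup> to 10<sup>9</sup> and 16-bit indices. The function names are hypothetical.</p>

```c
#include <stdint.h>

// Number of distinct element values in [-10^9, 10^9].
int64_t value_range(void)
{
    return 2*INT64_C(1000000000) + 1;
}

// Bytes needed for a direct-address table holding one 16-bit index
// per possible element value.
int64_t direct_table_bytes(void)
{
    return value_range() * (int64_t)sizeof(int16_t);
}

// Slot for a value in such a table: shift the range to start at zero.
int64_t direct_slot(int32_t value)
{
    return (int64_t)value + 1000000000;
}
```

<p>That comes to 4,000,000,002 bytes, of which at most 20,000 bytes would
ever hold an index.</p>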

<p>Using Go’s built-in hash table:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">TwoSumWithMap</span><span class="p">(</span><span class="n">nums</span> <span class="p">[]</span><span class="kt">int32</span><span class="p">,</span> <span class="n">target</span> <span class="kt">int32</span><span class="p">)</span> <span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">seen</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">(</span><span class="k">map</span><span class="p">[</span><span class="kt">int32</span><span class="p">]</span><span class="kt">int16</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">num</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">nums</span> <span class="p">{</span>
        <span class="n">complement</span> <span class="o">:=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">num</span>
        <span class="k">if</span> <span class="n">j</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">seen</span><span class="p">[</span><span class="n">complement</span><span class="p">];</span> <span class="n">ok</span> <span class="p">{</span>
            <span class="k">return</span> <span class="kt">int</span><span class="p">(</span><span class="n">j</span><span class="p">),</span> <span class="n">i</span><span class="p">,</span> <span class="no">true</span>
        <span class="p">}</span>
        <span class="n">seen</span><span class="p">[</span><span class="n">num</span><span class="p">]</span> <span class="o">=</span> <span class="kt">int16</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">false</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In essence, the hash table folds the sparse 2 billion element array onto a
smaller array, with collision resolution when elements inevitably land in
the same slot. For this exercise, that small array could be as small as
10,000 elements because that’s the most we’d ever need to track. For
folding the large key space onto the smaller, we could use modulo. For
collision resolution, we could keep walking the table.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int16_t</span> <span class="n">seen</span><span class="p">[</span><span class="mi">10000</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>

<span class="c1">// Find or insert nums[index].</span>
<span class="kt">int16_t</span> <span class="nf">lookup</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span><span class="p">,</span> <span class="kt">int16_t</span> <span class="n">index</span><span class="p">)</span>
<span class="p">{</span>
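    <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">nums</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">%</span> <span class="mi">10000</span><span class="p">;</span>  <span class="c1">// unsigned: no negative remainder</span>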
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// unbias</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>  <span class="c1">// empty slot</span>
            <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// insert biased index</span>
            <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">nums</span><span class="p">[</span><span class="n">index</span><span class="p">])</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">j</span><span class="p">;</span>  <span class="c1">// match found</span>
        <span class="p">}</span>
        <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10000</span><span class="p">;</span>  <span class="c1">// keep looking</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Take note of a few details:</p>

<ol>
  <li>
    <p>An empty slot is zero, and an empty table is a zero-initialized array.
Since zero is a valid value, and all values are non-negative, it biases
values by 1 in the table.</p>
  </li>
  <li>
    <p>The <code class="language-plaintext highlighter-rouge">nums</code> array is part of the table structure, necessary for lookups.
<strong>The two mappings — element-by-index and index-by-element — share
structure.</strong></p>
  </li>
  <li>
    <p>It uses <em>open addressing</em> with <em>linear probing</em>, and so walks the table
until it either finds the element or hits an empty slot.</p>
  </li>
  <li>
    <p>The “hash” function is modulo. If inputs are not random, they’ll tend
to bunch up in the table. Combined with linear probing, that makes for lots
of collisions. For the worst case, imagine sequentially ordered inputs.</p>
  </li>
  <li>
    <p>Sometimes the table will almost completely fill, and lookups will be no
better than the linear scans of the naive solution.</p>
  </li>
  <li>
    <p>Most subtle of all: This hash table is not enough for the exercise. The
keyed-on element may not even be in <code class="language-plaintext highlighter-rouge">nums</code>, and when lookup fails, that
element is not inserted in the table. Instead, a different element is
inserted. The conventional solution has at least two hash table
lookups. <strong>In the Go code, it’s <code class="language-plaintext highlighter-rouge">seen[complement]</code> for lookups and
<code class="language-plaintext highlighter-rouge">seen[num]</code> for inserts.</strong></p>
  </li>
</ol>
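<p>The bunching in (4) is easy to measure. The sketch below is an
illustration only: it assumes non-negative, distinct keys, uses the same
modulo “hash” and linear probing as above, and counts how many slots are
examined while inserting. The <code class="language-plaintext highlighter-rouge">count_probes</code> name is hypothetical.</p>

```c
#include <stdint.h>

#define TABLE_SIZE 10000

// Insert keys[0..count) into an empty modulo/linear-probing table and
// return the total number of slots examined. Keys are assumed to be
// non-negative and distinct, and count must not exceed TABLE_SIZE.
long count_probes(const int32_t *keys, int count)
{
    static int16_t table[TABLE_SIZE];
    for (int i = 0; i < TABLE_SIZE; i++) {
        table[i] = 0;  // reset between calls
    }
    long probes = 0;
    for (int n = 0; n < count; n++) {
        int i = keys[n] % TABLE_SIZE;
        for (;;) {
            probes++;
            if (!table[i]) {
                table[i] = (int16_t)(n + 1);  // biased index
                break;
            }
            i = (i + 1) % TABLE_SIZE;  // occupied, keep walking
        }
    }
    return probes;
}
```

<p>One hundred keys that are all multiples of 10,000 start in slot zero and
cost 5,050 probes in total, while one hundred keys that land in distinct
slots cost only 100.</p>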

<p>To solve (4) we’ll use a hash function to more uniformly distribute
elements in the table. We’ll also probe the table in a random-ish order
that depends on the key. In practice there will be little bunching even
for non-random inputs.</p>

<p>To solve (5) we’ll use a larger table: 2<sup>14</sup> or 16,384 elements.
This has breathing room, and with a power of two we can use a fast mask
instead of a slow division (though in practice, compilers usually
implement division by a constant denominator with modular multiplication).</p>
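<p>As a quick illustration of the mask trick, assuming a 2<sup>14</sup>-slot table
and an unsigned hash, these two hypothetical helpers compute the same
index:</p>

```c
#include <stdint.h>

#define EXP 14  // the table holds 2^14 slots

// Reduce a hash to a table index with a mask.
uint32_t index_by_mask(uint32_t hash)
{
    return hash & ((UINT32_C(1) << EXP) - 1);
}

// The same reduction written as a remainder.
uint32_t index_by_mod(uint32_t hash)
{
    return hash % (UINT32_C(1) << EXP);
}
```

<p>The equivalence only holds for unsigned values. In C, the remainder of a
negative signed operand is negative or zero, so it cannot be used directly
as an index.</p>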

<p>To solve (6) we’ll key complements together under the same key. It looks
for the complement, but on failure it inserts the current element in the
empty slot. In other words, <strong>this solution will only need a single hash
table lookup per element!</strong></p>

<p>Laying down some groundwork:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">int16_t</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">;</span>
    <span class="kt">_Bool</span> <span class="n">ok</span><span class="p">;</span>
<span class="p">}</span> <span class="n">TwoSum</span><span class="p">;</span>

<span class="n">TwoSum</span> <span class="nf">twosum</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span><span class="p">,</span> <span class="kt">int16_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">target</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">TwoSum</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="kt">int16_t</span> <span class="n">seen</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">seen</code> array is a 32KiB hash table large enough for all inputs, small
enough that it can be a local variable. In the loop:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="kt">int32_t</span> <span class="n">complement</span> <span class="o">=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">int32_t</span> <span class="n">key</span> <span class="o">=</span> <span class="n">complement</span><span class="o">&gt;</span><span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">?</span> <span class="n">complement</span> <span class="o">:</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">uint32_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">key</span> <span class="o">*</span> <span class="mi">489183053u</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">mask</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">seen</span><span class="p">)</span><span class="o">/</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">seen</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">step</span> <span class="o">=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="mi">13</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
</code></pre></div></div>

<p>Compute the complement, then apply a “max” operation to derive a key. Any
commutative operation works, though obviously addition would be a poor
choice: an element plus its complement always equals the target, so every
element would get the same key. XOR is similar enough to addition that it
causes many collisions. Multiplication works well, and is probably better
if the ternary produces a branch.</p>
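<p>A small sketch of the candidate key operators, with hypothetical function
names. Each takes an element and the target, and each returns the same
value for an element and for its complement, which is what lets the pair
meet in the table:</p>

```c
#include <stdint.h>

// "max" of element and complement, as in the listing above.
int32_t key_max(int32_t num, int32_t target)
{
    int32_t complement = target - num;
    return complement > num ? complement : num;
}

// Product of element and complement, computed in unsigned arithmetic
// so that wraparound is well defined.
uint32_t key_mul(int32_t num, int32_t target)
{
    return (uint32_t)num * (uint32_t)(target - num);
}

// Sum of element and complement: always the target, so useless as a key.
int32_t key_add(int32_t num, int32_t target)
{
    return num + (target - num);
}
```

<p>With a target of 9, the elements 2 and 7 both get the key 7 from
<code class="language-plaintext highlighter-rouge">key_max</code> and 14 from <code class="language-plaintext highlighter-rouge">key_mul</code>, while <code class="language-plaintext highlighter-rouge">key_add</code> returns 9 for
every element.</p>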

<p>The hash function is multiplication with <a href="/blog/2019/11/19/">a randomly-chosen prime</a>.
As we’ll see in a moment, <code class="language-plaintext highlighter-rouge">step</code> will also add-shift the hash before use.
The initial index will be the bottom 14 bits of this hash. For <code class="language-plaintext highlighter-rouge">step</code>,
recall from the MSI article that it must be odd so that every slot is
eventually probed. I shift out 13 bits and then override the 14th bit, so
<code class="language-plaintext highlighter-rouge">step</code> effectively skips over the 14 bits used for the initial table
index.</p>
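<p>The odd-step requirement can be checked by brute force. This sketch,
assuming the same 2<sup>14</sup>-slot table, walks the probe sequence from a given
start and counts how many distinct slots it reaches before repeating. The
<code class="language-plaintext highlighter-rouge">slots_visited</code> name is hypothetical.</p>

```c
#include <stdint.h>

#define EXP 14  // the table holds 2^14 slots

// Count the distinct slots reached by repeatedly adding step to the
// index, masked to the table size, until a slot is seen twice.
int32_t slots_visited(uint32_t start, uint32_t step)
{
    static uint8_t visited[1 << EXP];
    uint32_t mask = (UINT32_C(1) << EXP) - 1;
    for (uint32_t k = 0; k <= mask; k++) {
        visited[k] = 0;  // reset between calls
    }
    int32_t count = 0;
    uint32_t i = start;
    for (;;) {
        i = (i + step) & mask;
        if (visited[i]) {
            return count;
        }
        visited[i] = 1;
        count++;
    }
}
```

<p>Any odd step reaches all 16,384 slots. A step of 2 reaches only 8,192, so
an insert could spin forever on a table that still has empty slots.</p>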

<p>I used <code class="language-plaintext highlighter-rouge">unsigned</code> because I don’t really care about the width of the hash
table index, but more importantly, I want defined overflow from all the
bit twiddling, even in the face of implicit promotion. As a bonus, it can
help in reasoning about indirection: <code class="language-plaintext highlighter-rouge">seen</code> indices are <code class="language-plaintext highlighter-rouge">unsigned</code>, <code class="language-plaintext highlighter-rouge">nums</code>
indices are <code class="language-plaintext highlighter-rouge">int16_t</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
            <span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// unbias</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// bias and insert</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">complement</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">r</span><span class="p">.</span><span class="n">i</span> <span class="o">=</span> <span class="n">j</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">j</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">ok</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
                <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
</code></pre></div></div>

<p>The step is added before using the index the first time, helping to
scatter the start point and reduce collisions. If it’s an empty slot,
insert the <em>current</em> element, not the complement — which wouldn’t be
possible anyway. Unlike conventional solutions, this doesn’t require
another hash and lookup. If it finds the complement, problem solved,
otherwise keep going.</p>

<p>Putting it all together, it’s only slightly longer than solutions using a
generic hash table:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TwoSum</span> <span class="nf">twosum</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span><span class="p">,</span> <span class="kt">int16_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">target</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">TwoSum</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="kt">int16_t</span> <span class="n">seen</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int32_t</span> <span class="n">complement</span> <span class="o">=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">int32_t</span> <span class="n">key</span> <span class="o">=</span> <span class="n">complement</span><span class="o">&gt;</span><span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">?</span> <span class="n">complement</span> <span class="o">:</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">uint32_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">key</span> <span class="o">*</span> <span class="mi">489183053u</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">mask</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">seen</span><span class="p">)</span><span class="o">/</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">seen</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">step</span> <span class="o">=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="mi">13</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
            <span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// unbias</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// bias and insert</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">complement</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">r</span><span class="p">.</span><span class="n">i</span> <span class="o">=</span> <span class="n">j</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">j</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">ok</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
                <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
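<p>To try it out, the pieces can be collected into one translation unit. In
this sketch the <code class="language-plaintext highlighter-rouge">TwoSum</code> type and <code class="language-plaintext highlighter-rouge">twosum</code> function are copied from the
listings above, while <code class="language-plaintext highlighter-rouge">twosum_check</code> is a hypothetical helper added only to
validate a result against the input.</p>

```c
#include <stdint.h>

typedef struct {
    int16_t i, j;
    _Bool ok;
} TwoSum;

// Copied from the listing above.
TwoSum twosum(int32_t *nums, int16_t count, int32_t target)
{
    TwoSum r = {0};
    int16_t seen[1<<14] = {0};
    for (int16_t n = 0; n < count; n++) {
        int32_t complement = target - nums[n];
        int32_t key = complement>nums[n] ? complement : nums[n];
        uint32_t hash = key * 489183053u;
        unsigned mask = sizeof(seen)/sizeof(*seen) - 1;
        unsigned step = hash>>13 | 1;
        for (unsigned i = hash;;) {
            i = (i + step) & mask;
            int16_t j = seen[i] - 1;  // unbias
            if (j < 0) {
                seen[i] = n + 1;  // bias and insert
                break;
            } else if (nums[j] == complement) {
                r.i = j;
                r.j = n;
                r.ok = 1;
                return r;
            }
        }
    }
    return r;
}

// Hypothetical helper: returns 1 if r names two distinct, in-range
// indices whose elements sum to target, otherwise 0.
int twosum_check(const int32_t *nums, int16_t count, int32_t target, TwoSum r)
{
    if (!r.ok) return 0;
    if (r.i < 0 || r.j < 0 || r.i >= count || r.j >= count) return 0;
    if (r.i == r.j) return 0;
    return nums[r.i] + nums[r.j] == target;
}
```

<p>For <code class="language-plaintext highlighter-rouge">{2, 7, 11, 15}</code> with target 9, the first element is inserted under
the key 7, and the second element probes the same slot and finds it,
giving indices 0 and 1.</p>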

<p>Applying this technique to Go:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">TwoSumWithBespoke</span><span class="p">(</span><span class="n">nums</span> <span class="p">[]</span><span class="kt">int32</span><span class="p">,</span> <span class="n">target</span> <span class="kt">int32</span><span class="p">)</span> <span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">var</span> <span class="n">seen</span> <span class="p">[</span><span class="m">1</span> <span class="o">&lt;&lt;</span> <span class="m">14</span><span class="p">]</span><span class="kt">int16</span>
    <span class="k">for</span> <span class="n">n</span><span class="p">,</span> <span class="n">num</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">nums</span> <span class="p">{</span>
        <span class="n">complement</span> <span class="o">:=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">num</span>
        <span class="n">hash</span> <span class="o">:=</span> <span class="kt">int</span><span class="p">(</span><span class="n">num</span> <span class="o">*</span> <span class="n">complement</span> <span class="o">*</span> <span class="m">489183053</span><span class="p">)</span>
        <span class="n">mask</span> <span class="o">:=</span> <span class="nb">len</span><span class="p">(</span><span class="n">seen</span><span class="p">)</span> <span class="o">-</span> <span class="m">1</span>
        <span class="n">step</span> <span class="o">:=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="m">13</span> <span class="o">|</span> <span class="m">1</span>
        <span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="n">hash</span><span class="p">;</span> <span class="p">;</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span>
            <span class="n">j</span> <span class="o">:=</span> <span class="kt">int</span><span class="p">(</span><span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span> <span class="c">// unbias</span>
            <span class="k">if</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="m">0</span> <span class="p">{</span>
                <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="kt">int16</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="o">+</span> <span class="m">1</span> <span class="c">// bias</span>
                <span class="k">break</span>
            <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">complement</span> <span class="p">{</span>
                <span class="k">return</span> <span class="n">j</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="no">true</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">false</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With Go 1.20 this is an order of magnitude faster than <code class="language-plaintext highlighter-rouge">map[int32]int16</code>,
which isn’t surprising. I used multiplication as the key operator because,
in my first take, Go produced a branch for the “max” operation — at a 25%
performance penalty on random inputs.</p>

<p>A full-featured, generic hash table may be overkill for your problem, and
a bit of hashed indexing with collision resolution over a small array
might be sufficient. The problem constraints might open up such shortcuts.</p>

]]>
    </content>
  </entry>
  
  <entry>
    <title>My ranking of every Shakespeare play</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/06/22/"/>
    <id>urn:uuid:98eae9a1-cd7f-4d1c-be53-85058f1b2649</id>
    <updated>2023-06-22T19:10:25Z</updated>
    <category term="rant"/><category term="meatspace"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=36438620">on Hacker News</a>.</em></p>

<p>A few years ago I set out on a personal journey to study and watch a
performance of each of Shakespeare’s 37 plays. I’ve reached my goal and,
though it’s not a usual topic around here, I wanted to get my thoughts
down while fresh. I absolutely loved some of these plays and performances,
and so I’d like to highlight them, especially because my favorites are,
with one exception, not “popular” plays. Per tradition, I begin with my
least enjoyed plays and work my way up. All performances were either a
recording of a live stage or an adaptation, so they’re also available to
you if you’re interested, though in most cases not for free. I’ll mention
notable performances when applicable. The availability of a great
performance certainly influenced my play rankings.</p>

<!--more-->

<p>Like many of you, I had assigned reading for several Shakespeare plays in
high school. I loathed these assignments. I wasn’t interested at the time,
nor was I mature enough to appreciate the writing. Even revisiting as an
adult, the conventional selections — <em>Romeo and Juliet</em>, <em>Julius Caesar</em>,
etc. — are not highly ranked on my list. For the next couple of decades I
thought that Shakespeare just wasn’t for me.</p>

<p>Then I watched <a href="https://www.youtube.com/watch?v=rbSN4Lv_N4g">the 1993 adaption of <em>Much Ado About Nothing</em></a> and it
instantly became one of my favorite films. Why didn’t we read <em>this</em> in
high school?! Reading <a href="https://shakespeare-navigators.ewu.edu/ado/index.html">the play with footnotes</a> helped to follow the
humor and allusions. Even with the film’s abridging, some of it still went
over my head. I soon discovered <em>Asimov’s Guide to Shakespeare</em> — yes,
<em>that</em> Asimov — which was exactly what I needed, and a perfect companion
while reading and watching the plays. If stumbling upon this turned out so
well, then I’d better keep going.</p>

<p>Wanting a solid set of the plays with good footnotes and editing — there
is no canonical version of the plays — I picked up a copy of <em>The Norton
Shakespeare</em>. Unfortunately it’s part of the college textbook racket, and
it shows. The collection is designed to be sold to students who will lug
them in bookbags, will typically open them face-up on a desk, and are
uninterested in their contents beyond class. It includes a short-term,
digital-only, DRMed component to prevent resale. After all, their target
audience will not read it again anyway. Though at least it’s complete and
compact, better for reference than reading.</p>

<p>In contrast, the Folger Shakespeare Library mass market paperbacks are
better for enthusiasts, both in form and format. They’re clearly built for
casual, comfortable reading. However, they’re not sold as a complete set,
and gathering used copies takes some work.</p>

<p>Also essential was <a href="https://en.wikipedia.org/wiki/BBC_Television_Shakespeare"><em>BBC Television Shakespeare</em></a>, produced between
1978 and 1985. Finding productions of the more obscure plays is tricky,
but it always provided a fallback. In some cases these were the best
performances anyway! When I mention “the BBC production” I mean this
series. Like many collections, they omit <em>The Two Noble Kinsmen</em> due to
unclear authorship, and for this reason I’m omitting it from my list as
well. As with any faithful production, I suggest subtitles on the first
viewing, as it aids with understanding. Shakespeare’s sentence structure
is sometimes difficult for moderns to parse, and on-screen text helps. (By
the way, a couple of handy SHA-1 sums for those who know how to use them:)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0ae909e5444c17183570407bd09a622d2827751e
55c77ed7afb8d377c9626527cc762bda7f3e1d83
</code></pre></div></div>

<p>As my list will show, my favorites are comedic comedies and histories,
particularly the two <a href="https://en.wikipedia.org/wiki/Henriad">Henriads</a>, each a group of four plays. The first —
<em>Richard II</em>, <em>1 Henry IV</em>, <em>2 Henry IV</em>, and <em>Henry V</em> — concerns events
around Henry V, in the late 14th and early 15th century. Those number
prefixes are <em>parts</em>, as in <em>Henry IV</em> has two parts. In my list I combine
parts as though a single play. The second — <em>1 Henry VI</em>, <em>2 Henry VI</em>, <em>3
Henry VI</em>, <em>Richard III</em> — is about the Wars of the Roses, spanning the
15th century. Asimov’s book was essential for filling in the substantial
historical background for these plays, and my journey was also in part a
history study.</p>

<p>I especially enjoy villain monologues, and plays with them rank higher as
a result. It’s said that everyone is the hero of their own story, but
Shakespeare’s villains may know that they’re villains and revel in it,
bragging directly to the audience about all the trouble they’re going to
cause. In some cases they mock the audience’s sacred values, which, in a
way, is like the stand-up comedy of Shakespeare’s time. Notable examples
are Edmund (<em>King Lear</em>), Aaron (<em>Titus Andronicus</em>), Richard III, Iago
(<em>Othello</em>), and Shylock (<em>The Merchant of Venice</em>).</p>

<p>As with literature even today, authors are not experts in moral reasoning
and protagonists are often, on reflection, incredibly evil. Shakespeare is
no different, especially for historical events and people, praising those
who create mass misery (e.g. tyrants waging wars) and vilifying those who
improve everyone’s lives (e.g. anyone who deals with money). Up to and
including Shakespeare’s time, <a href="https://acoup.blog/2022/07/29/collections-logistics-how-did-they-do-it-part-ii-foraging/">a pre-industrial army on the march was a
rolling humanitarian crisis</a>, even in “friendly” territory,
slaughtering and stealing its way through the country in order to keep
going. So, much like <em>suspension of disbelief</em>, there’s a <em>suspension of
morality</em> where I engage with the material on its own moral terms, however
illogical it may be.</p>

<p>Now finally my list. The beginning will be short and negative because, to
be frank, I disliked some of the plays. Even Shakespeare had to work under
constraints. In his time none were regarded as great works. They weren’t
even viewed as literature, but similarly to how we consider television
scripts today. Also, around 20% of plays credited to Shakespeare were
collaborations of some degree, though the collaboration details have been
long lost. For simplicity, I will just refer to the author as Shakespeare.</p>

<h3 id="37-timon-of-athens">(37) Timon of Athens</h3>

<p>I have nothing positive to say about this play. It’s about a man who
borrows and spends recklessly, then learns all the wrong lessons from the
predictable results.</p>

<h3 id="36-the-two-gentlemen-of-verona">(36) The Two Gentlemen of Verona</h3>

<p>Involves a couple of love triangles, a woman disguised as a man — a common
Shakespeare trope — and perhaps the worst ending to a play ever written.
The two “gentlemen” are terrible people and undeserving of their happy
ending. Though I enjoyed the scenes with Launce and Crab, the play’s fool
and his dog.</p>

<h3 id="35-troilus-and-cressida">(35) Troilus and Cressida</h3>

<p>Interesting that it’s set during the <em>Iliad</em> and features legendary
characters such as Achilles, Ajax, and Hector. I have no other positives
to note. Cressida’s abrupt change of character in the Greek camp later in
the play is baffling, as though part of the play has been lost, and ruins
an already dull play for me.</p>

<h3 id="34-the-winters-tale">(34) The Winter’s Tale</h3>

<p>A baby princess is lost, presumed dead, and raised by shepherds. She is
later rediscovered by her father as a young adult. It has a promising
start, but in the final act the main plot is hastily resolved off-stage
and seemingly replaced with a hastily rewritten ending that nonsensically
resolves a secondary story line.</p>

<h3 id="33-cymbeline">(33) Cymbeline</h3>

<p>The title refers to a legendary early King of Britain and is set in the
first century, but it is primarily about his daughter. The plot is
complicated so I won’t summarize it here. It’s long and I just didn’t
enjoy it. This is the second play in the list to feature a woman disguised
as a man.</p>

<h3 id="32-the-tempest">(32) The Tempest</h3>

<p>A political exile stranded on an island in the Mediterranean gains magical
powers through study and, with the help of a spirit, creates a tempest that
strands his enemies on his island, then gently torments them until he’s
satisfied that he’s had his revenge. It’s an okay play.</p>

<p>More interesting is the historical context behind the play. It’s based
loosely on events around the founding of Jamestown, Virginia. Until this
play, Shakespeare and Jamestown were, in my mind, unrelated historical
events. In fact, Pocahontas very nearly met Shakespeare, missing him by
just a couple of years, but she did meet his rival, Ben Jonson. I spent
far more time catching up on real history, including reading the
fascinating <a href="https://en.wikipedia.org/wiki/True_Reportory"><em>True Reportory</em></a>, than I did on the play.</p>

<h3 id="31-the-taming-of-the-shrew">(31) The Taming of the Shrew</h3>

<p>About a man courting and “taming” an ill-tempered woman, the shrew. The
seeming moral of the play was outdated even in Shakespeare’s time, and
it’s unclear what was intended. Technically it’s a play within a play, and
an outer frame presents the play as part of an elaborate prank. However,
the outer frame is dropped and never revisited, indicating that perhaps
this part of the play was lost. The BBC production skips this framing
entirely and plays it straight.</p>

<h3 id="30-alls-well-that-ends-well">(30) All’s Well That Ends Well</h3>

<p>Helena, a low-born enterprising young woman, saves a king’s life. She’s in
love with a nobleman, Bertram, and the king orders him to marry her as
repayment. He spurns her solely due to her low upbringing and flees the
country. She gives chase, and eventually wins him over. Helena is a great
character, and Bertram is utterly undeserving of her, which ruins the play
for me in an unearned ending.</p>

<h3 id="29-antony-and-cleopatra">(29) Antony and Cleopatra</h3>

<p>A tragedy about people who we know for sure existed, the first such on the
list so far. The sequel to <em>Julius Caesar</em>, completing the story of the
Second Triumvirate. Historically interesting, but the title characters
were terrible, selfish people, including in the play, and they aren’t
interesting enough to make up for it.</p>

<p>I enjoyed the portrayal of Octavian as a shrewd politician.</p>

<h3 id="28-julius-caesar">(28) Julius Caesar</h3>

<p>A classic school reading assignment. Caesar’s death in front of the Statue
of Pompey is obviously poetic, and so every performance loves playing it
up. Antony’s speech is my favorite part of the play. I didn’t dislike this
play, but nor did I find it interesting revisiting it as an adult.</p>

<h3 id="27-coriolanus">(27) Coriolanus</h3>

<p>About the career of a legendary Roman general and war hero who attempts to
enter politics. He despises the plebeians, which gets him into trouble,
but all he really wants is to please his mother. Stratford Festival has <a href="https://www.youtube.com/watch?v=06tR1wMWV_o">a
worthy adaption in a contemporary setting</a>.</p>

<h3 id="26-henry-viii">(26) Henry VIII</h3>

<p>He reigned from 1509 to 1547, but the play only covers Henry VIII’s first
divorce. It paved the way for the English Reformation, though the play has
surprisingly little to say about it, or his murder spree. It’s set a few decades
after the events of <em>Richard III</em> — too distant to truly connect with the
second Henriad.</p>

<p>While I appreciate its historical context — with liberal dramatic license
— it’s my least favorite of the English histories. It’s not part of an
epic tetralogy, and the subject matter is mundane. My favorite scene is
Katherine (Catherine in the history books) firmly rejecting the court’s
jurisdiction and walking out. My favorite line: “No man’s pie is freed
from his ambitious finger.”</p>

<h3 id="25-romeo-and-juliet">(25) Romeo and Juliet</h3>

<p>Another classic reading assignment that requires no description. A
beautiful play, but I just don’t connect with its romantic core.</p>

<h3 id="24-the-merchant-of-venice">(24) The Merchant of Venice</h3>

<p>An infamously antisemitic play where a Jewish moneylender, Shylock, loans
to the titular merchant of Venice where the collateral is the original
“pound of flesh,” providing the source for that cliche. Though even in his
prejudice, Shakespeare can’t help but write multifaceted characters,
particularly with Shylock’s famous “If you prick us, do we not bleed?”
speech.</p>

<h3 id="23-twelfth-night">(23) Twelfth Night</h3>

<p>Twins, a young man and a woman, are separated by a shipwreck. The woman
disguises herself as a man and takes employment with a local duke and
falls in love with him, but her employment requires her to carry love
letters to the duke’s love interest. In the meantime the brother arrives,
unaware his sister is in town in disguise, and everyone gets the twins
mixed up leading to comedy. It’s a fun play. The title has nothing to do
with the play, but refers to the holiday when the play was first
performed.</p>

<p>The play is the source of the famous quote, “Some are born great, some
achieve greatness, and some have greatness thrust upon them.” It’s used as
part of a joke, and when I heard it, I had thought the play was mocking
some original source.</p>

<h3 id="22-pericles">(22) Pericles</h3>

<p>A Greek play about a royal family — father, mother, daughter — separated
by unfortunate — if contrived — circumstances, each thinking the others
dead, but all tearfully reunited in a happy ending. My favorite part is
the daughter, Marina, talking her way out of trouble: “She’s able to
freeze the god Priapus and undo a whole generation.”</p>

<p>The BBC production stirred me, particularly the scene where Pericles and
Marina are reunited.</p>

<h3 id="21-richard-ii">(21) Richard II</h3>

<p>Richard II, grandson of the famed Edward III, was a young King of England
from 1377 to 1399. At least in the play, he carelessly makes dangerous
enemies of his friends, and so is deposed by Henry Bolingbroke, who goes
on to become Henry IV. The play is primarily about this abrupt transition
of power, and it is the first play of the first Henriad. The conflict in
this play creates tensions that will not be resolved until 1485, the end
of the Wars of the Roses. Shakespeare spends <em>seven</em> additional plays on
this huge, interesting subject.</p>

<p>For me, Richard II is the most dull of the Henriad plays. It’s a slow
start, but establishes the groundwork for the greater plays that follow.
The BBC production of the first Henriad has “linked” casting where the
same actors play the same roles through the four plays, which makes this
an even more important watch.</p>

<h3 id="20-othello">(20) Othello</h3>

<p>Another of the famous tragedies. Othello, an important Venetian general and
“the Moor of Venice,” is dispatched to Venice-controlled Cyprus to defend
against an attack by the Ottoman Turks. Iago, who has been overlooked for
promotion by Othello, treacherously seeks revenge, secretly sabotaging all
involved while they call him “honest Iago.” Though his schemes quickly go
well beyond revenge, and he continues sowing chaos just for his own fun.</p>

<p>I watched a few adaptions, and I most enjoyed the <a href="https://www.youtube.com/watch?v=4dcwVLGyTkk">2015 Royal Shakespeare
Company <em>Othello</em></a>, which
places it in a modern setting and requires few changes to do so.</p>

<h3 id="19-the-comedy-of-errors">(19) The Comedy of Errors</h3>

<p>A fun, short play about a highly contrived situation: Two pairs of twins,
where each pair of brothers has been given the same name, are separated at
birth. As adults they all end up in the same town, and everyone mixes them
up leading to comedy. It’s the lightest of Shakespeare’s plays, but also
lacks depth.</p>

<h3 id="18-hamlet">(18) Hamlet</h3>

<p>Another common, more senior, high school reading assignment. Shakespeare’s
longest play, and probably the most subtle. In everything spoken between
Hamlet and his murderous uncle, Claudius, one must read between the lines.
Their real meanings are obscured by courtly language — familiar to
Shakespeare’s audience, but not moderns. Asimov is great for understanding
the political maneuvering, which is a lot like a game of chess. It made me
appreciate the play more than I would have otherwise.</p>

<p>You’d be hard-pressed to find something that beats the faithful,
star-studded <a href="https://www.youtube.com/watch?v=Tt_QkXy3uuQ">1996 major film adaption</a>.</p>

<h3 id="17-richard-iii">(17) Richard III</h3>

<p>The final play of the second Henriad. Much of the play is Richard III
winking at the audience, monologuing about his villainous plans, then
executing those plans without remorse. Makes cheering for the bad guy fun.
If you want to see an evil schemer get away with it, at least right up
until the end when he gets his comeuppance, this is the play for you. This
play is the source of the famous “My kingdom for a horse.”</p>

<p>I liked two different performances for different reasons. The <a href="https://www.youtube.com/watch?v=k20svFhRI44">1995 major
film</a> puts the play in the World War II era. It’s solid and does
well standing alone. The BBC production has linked casting with the three
parts of Henry VI, which allows one to enjoy it in full in its broader
context. It’s also well-performed, but obviously has less spectacle and a
lower budget.</p>

<h3 id="16-the-merry-wives-of-windsor">(16) The Merry Wives of Windsor</h3>

<p>The comedy spin-off of Henry IV. Allegedly, Elizabeth I liked the
character of John Falstaff from Henry IV so much — I can’t blame her! —
that she demanded another play with the character, and so Shakespeare
wrote this play. The play brings over several characters from Henry IV.
Unfortunately it’s in name only and they hardly behave like the same
characters. Despite this, it’s still fun and does not require knowledge of
Henry IV.</p>

<p>Falstaff ineptly attempts to seduce two married women, the titular wives,
who play along in order to get revenge on him. However, their husbands are
not in on the prank. One suspects infidelity and hatches his own plans.
The confusion leads to the comedy.</p>

<p>The <a href="https://www.youtube.com/watch?v=RA7j9XDu8F8">2018 Royal Shakespeare Company production</a> aptly puts it in
a modern suburban setting.</p>

<h3 id="15-titus-andronicus">(15) Titus Andronicus</h3>

<p>A play about a legendary Roman general committed to duty above all else,
even the lives of his own sons. He and his family become brutal victims of
political rivals, and in return he gets his own brutal revenge. It’s by far
Shakespeare’s most violent and disturbing play. It’s a bit too violent
even for me, but it ranks this highly because Aaron the Moor is such a
fantastic character, another villain that loves winking at the audience.
His lines throughout the play make me smile: “If one good deed in all my
life I did, I do repent it from my very soul.”</p>

<p>I enjoyed the <a href="https://www.youtube.com/watch?v=OvZRvKf78yY">1999 major film</a>, which puts it in a contemporary
setting.</p>

<h3 id="14-king-lear">(14) King Lear</h3>

<p>The titular, mythological king of pre-Roman Britain wants to retire, and
so he divides his kingdom between his three daughters. However, after
petty selfishness on Lear’s part, he disowns the most deserving daughter,
while the other two scheme against one another.</p>

<p>Some of the scenes in this play are my favorite among Shakespeare, such as
Edmund’s monologue on bastards where he criticizes the status quo and
mocks the audience’s beliefs. It also has one of the best fools, who while
playing dumb, is both observant and wise. That’s most of Shakespeare’s
fools, but it’s especially true in <em>King Lear</em> (“This is not altogether
fool, my lord.”). This fool uses this “tenure” to openly mock the king to
his face, the only character that can do so without repercussions.</p>

<p>My favorite performance was <a href="https://www.youtube.com/watch?v=1PkmXMHHOxQ">the 2015 Stratford Festival stage
production</a>, especially for its Edmund, Lear, and Fool.</p>

<h3 id="13-macbeth">(13) Macbeth</h3>

<p>The shortest tragedy, a common reading assignment, and a perfect example
of literature I could not appreciate without more maturity. Even the plays
I dislike have beautiful poetry, but I especially love it in <em>Macbeth</em>.</p>

<p>The history behind <em>Macbeth</em> is itself fascinating. The play was written
custom for the newly-crowned King James I — of <em>King James Version</em> fame —
and even calls him out in the audience. James I was obsessed with witch
hunts, so the play includes witchcraft. The character Banquo was by
tradition considered to be his ancestor.</p>

<p>My favorite production by far — I watched a number of them! — was <a href="https://www.youtube.com/watch?v=HM3hsVrBMA4">the
2021 film</a>. It should be an approachable introduction for Shakespeare
newcomers more interested in drama than comedy. Notably for me, it departs
from typical productions in that Macbeth and Lady Macbeth do not scream at
each other — perhaps normally a side effect of speaking loudly for stage
performance. Particularly in Act 1, Scene 7 (“screw your courage to the
sticking place”). In the film they argue calmly, like a couple in a
genuine, healthy relationship, making the tragedy that much more tragic.</p>

<p>That being said, it drops the ball with the porter scene — a bit of comic
relief just after Macbeth murders Duncan. There’s knocking at the gate,
and the porter, charged with attending it, is hungover and takes his time.
In a monologue he imagines himself porter to Hell, and on each impatient
knock considers the different souls he would be greeting. Of all the
porter scenes I watched, the best porter was in the <a href="https://www.youtube.com/watch?v=oGZV-KwW4ZE">2017 Stratford Festival
production</a>, where he is both charismatic and hilarious. I wish I
could share a clip.</p>

<h3 id="12-king-john">(12) King John</h3>

<p>King John, brother of “<em>Coeur de Lion</em>” Richard I, ruled in the early 13th
century. His reign led to the Magna Carta, and he’s also the Prince John
of the Robin Hood legend, though because it’s a history, and paints John
in a positive light, that legend isn’t included. It depicts fascinating,
real historical events and people, including <a href="https://en.wikipedia.org/wiki/Eleanor_of_Aquitaine">Eleanor of Aquitaine</a>.
It also has one of my favorite Shakespeare characters, Philip the
Bastard, who gets all the coolest lines. I especially love his
introductory scene where his lineage is disputed by his half-brother and
Eleanor, impressed, essentially adopts him on the spot.</p>

<p><a href="https://www.youtube.com/watch?v=YkRBRoh_0QQ">The 2015 Stratford Festival stage performance</a> is wonderful, and
I’ve re-watched it a few times. The performances are all great.</p>

<h3 id="119-henry-vi">(11–9) Henry VI</h3>

<p>As previously noted, this is actually three plays. At 3–4 hours apiece,
it’s about the length of a modern television season. I thought it might
take a while to consume, but I was completely sucked in, watching and
studying the whole trilogy in a single weekend.</p>

<p>Henry V died young in 1422, and his infant son became Henry VI, leaving
England ruled by his uncles. As an adult he was a weak king, which allowed
the conflicts of the previously-mentioned <em>Richard II</em> to bubble up into
the Wars of the Roses, a bloody power conflict between the Lancasters and
Yorks. The play features historical people including Joan la Pucelle
(“Joan of Arc”), English war hero John Talbot, and <a href="https://en.wikipedia.org/wiki/Jack_Cade%27s_Rebellion">Jack Cade</a>.
<em>Richard III</em> wraps up the conflicts of <em>Henry VI</em>, forming the second
Henriad. When watching/reading the play, keep in mind that the play is
anti-French, anti-York, and (implicitly) pro-Tudor.</p>

<p>Most of the first part was probably not written by Shakespeare, but rather
adapted from an existing play to fill out the backstory. I think I can see
the “seams” between the original and the edits that introduce the roses.</p>

<p>I <em>loved</em> the BBC production of the second Henriad. Producing such an epic
story must be daunting, and it’s amazing what they could convey with such
limited budget and means. It has hilarious and clever cinematography for
the scene where the Countess of Auvergne attempts to trap Talbot (Part 1,
Act 2, Scene 3). Again, I wish I could share a clip!</p>

<h3 id="8-henry-v">(8) Henry V</h3>

<p>Due to his amazing victories, most notably <a href="https://en.wikipedia.org/wiki/Battle_of_Agincourt">at Agincourt</a> where, for
once, Shakespeare isn’t exaggerating the odds, Henry V is one of the great
kings of English history. This play is a followup to <em>Richard II</em> and
<em>Henry IV</em>, completing the first Henriad, and depicts Henry V’s war with
France. Outside of the classroom, this is one of Shakespeare’s most
popular plays.</p>

<p>The obvious choice for viewing is <a href="https://www.youtube.com/watch?v=okxEzUlnn_0">the 1989 major film</a>, which, by
borrowing a few scenes from <em>Henry IV</em>, attempts a standalone experience,
though with limited success. I watched it before <em>Henry IV</em>, and I could
not understand why the film was so sentimental about a character that
hadn’t even appeared yet. It probably has <a href="https://www.youtube.com/watch?v=A-yZNMWFqvM">the best Saint Crispin’s Day
Speech ever performed</a>, in part because it’s placed in a broader
context than originally intended. The <a href="https://www.youtube.com/watch?v=HS7OG9zcV-M">introduction is bold</a> as is
<a href="https://www.youtube.com/watch?v=mKHihAPr2Rc">Exeter’s ultimatum delivery</a>. It cleverly, and without changing his
lines, also depicts Montjoy, the French messenger, as sympathetic to the
English, also not originally intended. I didn’t realize this until I
watched other productions.</p>

<p>The BBC production is also worthy, in large part because of its linked
casting with <em>Richard II</em> and <em>Henry IV</em>. It’s also unabridged, including
the whole glove thing, for better or worse.</p>

<h3 id="76-henry-iv">(7–6) Henry IV</h3>

<p>People will think I’m crazy, but yes, I’m placing <em>Henry IV</em> above <em>Henry
V</em>. My reason is just two words: John Falstaff. This character is one of
Shakespeare’s greatest creations, and really makes these plays for me. As
previously noted, this is two plays mainly because John Falstaff was such
a huge hit. The sequel mostly retreads the same ground, but that’s fine!
I’ve read and re-read all the Falstaff scenes because they’re so fun. I
now have a habit of quoting Falstaff, and it drives my wife nuts.</p>

<p>The Falstaff role makes or breaks a <em>Henry IV</em> production, and my love for
this play is in large part thanks to the phenomenal BBC production. It has
a warm, charismatic Falstaff that <a href="https://www.youtube.com/watch?v=ImVoqdZPPak">perfectly nails the role</a>. It’s
great even beyond Falstaff, of course. At the end of part 2, I tear up
seeing Henry V test the chief justice. I adore this production. What a
masterpiece.</p>

<h3 id="5-a-midsummer-nights-dream">(5) A Midsummer Night’s Dream</h3>

<p>A popular, fun, frivolous play that I enjoyed even more than I expected,
where faeries interfere with Athenians who wander into their forest. The
“rude mechanicals” are charming, especially the naive earnestness of Nick
Bottom, making them my favorite part of the play.</p>

<p>My enjoyment is largely thanks to <a href="https://www.youtube.com/watch?v=v9GhqXz7EVw">a 2014 stage production</a> with
great performances all around, great cinematography, and incredible
effects. Highly recommended. Honorable mention goes to the great Nick
Bottom performances of the BBC production and the 1999 major film.</p>

<h3 id="4-as-you-like-it">(4) As You Like It</h3>

<p>A pastoral comedy about idyllic rural life, and the source of the famous
quote “All the world’s a stage.” A duke has deposed his duke brother,
exiling him and his followers to the forest where the rest of the play
takes place. The main character, Rosalind, is one of the exiles, and,
disguised as a man named Ganymede, flees into the forest with her cousin.
There she runs into her also-exiled love interest, Orlando. While still
disguised as Ganymede, she roleplays as Rosalind — that is, <em>herself</em> — to
help him practice wooing herself. Crazy and fun.</p>

<p>A couple of my favorite lines are “There’s no clock in the forest” and
“falser than vows made in wine.” It’s an unusually musical play, and has a
big, happy ending. The fool, Touchstone, is one of my favorite fools,
named such because he tests the character of everyone with whom he comes
in contact.</p>

<p>It ranks so highly because of <a href="https://www.pbs.org/video/as-you-like-it-8yykc1/">an endearing 2019 production by Kentucky
Shakespeare</a>, which sets the story in 19th century Kentucky. This is
the most amateur production I’ve shared so far — literally Shakespeare in
the park — but it’s just so enjoyable. Their Rosalind is fantastic and
really makes the play work. I’ve listened to just the audio of the play,
like a podcast, many times now.</p>

<h3 id="3-measure-for-measure">(3) Measure for Measure</h3>

<p>A comedy about justice and mercy. The duke of Vienna announces he will be
away on a trip to Poland, but secretly poses as a monk in order to keep his
finger on the pulse of his city. Unfortunately the man running the city in
his stead is corrupt, and the softhearted duke can’t help but pull strings
behind the scenes to undo the damage, and more. He sets up a scheme such
that, after his dramatic return as duke, the plot is unraveled while
simultaneously testing the character of all involved.</p>

<p>I love so many of the characters and elements of this play. I smile when
the duke jumps into action, my heart wrenches at <a href="https://www.youtube.com/watch?v=paAYJUx9MfQ">Isabella’s impassioned
speech for mercy</a> (“it is excellent to have a giant’s strength,
but it is tyrannous to use it like a giant”), I admire the provost’s
selfless loyalty to the duke, I laugh when Lucio the “fantastic” keeps
putting his foot in his mouth, and I cry when Mariana begs Isabella to
forgive. All around a wonderful play.</p>

<p>Like so many already, a big part of my love for the play is <a href="https://www.crackle.com/watch/f70e0859-c7fa-4dae-961f-130bed2980eb/bbc-television-shakespeare:-measure-for-measure">the BBC
production</a>, which is full of great performances, particularly
the duke, Isabella, and Lucio.</p>

<h3 id="2-much-ado-about-nothing">(2) Much Ado About Nothing</h3>

<p>As the play that finally got me interested in Shakespeare, of course it’s
near the top of the list. Forget Romeo and Juliet: Benedick and Beatrice
are Shakespeare’s greatest romantic pairing!</p>

<p>Don Pedro, Prince of Aragon, stops in Messina with his soldiers while
returning from a military action. While in town there’s a matchmaking plot
and lots of eavesdropping, and then chaos created by the wicked Don John,
brother to Don Pedro. It’s a fun, light, hilarious play. It also features
another of Shakespeare’s great comic characters, Dogberry, famous for his
malapropisms.</p>

<p>This is a very popular play with tons of productions, though I only
watched a few of them. The previously-mentioned 1993 adaption remains my
favorite. It does some abridging, but honestly, it makes the play better
and improves the comedic beats.</p>

<h3 id="1-loves-labours-lost">(1) Love’s Labour’s Lost</h3>

<p>Finally, my favorite play of all, and an unusual one to be at the top of
the list. Much of the play is subtle parody and so makes for a poor first
play for newcomers, who would not be familiar enough with Shakespeare’s
language to distinguish parody from genuine.</p>

<p>The King of Navarre and three lords swear an oath to seclude themselves
in study, swearing off the company of women. Then the French princess and
her court arrive, the four men secretly write love letters in violation
of their oaths, and comedy ensues. There are also various eccentric side
characters mixed into the plot to spice it up. It’s all a ton of fun and
ends with an inept play within a play about the “nine worthies.”</p>

<p>The major reason I love this play so much is <a href="https://www.youtube.com/watch?v=VAotbh5CVqM">a <em>literally perfect</em> 2017
production by Stratford Festival</a>. I love every aspect of this
production such that I can’t even pick a favorite element. I was hooked
within the first minute.</p>

]]>
    </content>
  </entry>
  
  <entry>
    <title>Hand-written Windows API prototypes: fast, flexible, and tedious</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/05/31/"/>
    <id>urn:uuid:35b44114-7ad2-422b-9eaf-dc37e7eaaf97</id>
    <updated>2023-05-31T01:38:31Z</updated>
    <category term="win32"/><category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>I love fast builds, and for years I’ve been bothered by the build penalty
for translation units including <code class="language-plaintext highlighter-rouge">windows.h</code>. This header has an enormous
number of definitions and declarations and so, for C programs, it tends to
dominate the build time of those translation units. Most programs,
especially systems software, only need a tiny portion of it. For example,
when compiling <a href="/blog/2023/01/18/">u-config</a> with GCC, two thirds of the debug build was
spent processing <code class="language-plaintext highlighter-rouge">windows.h</code> just for <a href="https://github.com/skeeto/u-config/blob/e6ebb9b/miniwin32.h">4 types, 16 definitions, and 16
prototypes</a>.</p>

<p>To give a sense of the numbers, here’s <code class="language-plaintext highlighter-rouge">empty.c</code>, which does nothing but
include <code class="language-plaintext highlighter-rouge">windows.h</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span></code></pre></div></div>

<p>With the current Mingw-w64 headers, that’s ~82kLOC (non-blank):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -E empty.c | grep -vc '^$'
82041
</code></pre></div></div>

<p>With <a href="https://github.com/skeeto/w64devkit">w64devkit</a> this takes my system ~450ms to compile with GCC:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time gcc -c empty.c
real    0m 0.45s
user    0m 0.00s
sys     0m 0.00s
</code></pre></div></div>

<p>Compiling an actually empty source file takes ~10ms, so it really is
spending practically all that time processing headers. MSVC is a faster
compiler, and this extends to processing an even larger <code class="language-plaintext highlighter-rouge">windows.h</code> that
crosses over 100kLOC (VS2022). It clocks in at 120ms on the same system:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /nologo /E empty.c | grep -vc '^$'
empty.c
100944
$ time cl /nologo /c empty.c
empty.c
real    0m 0.12s
user    0m 0.09s
sys     0m 0.01s
</code></pre></div></div>

<p>That’s just low enough to be tolerable, but I’d like the situation with
GCC to be better. Defining <code class="language-plaintext highlighter-rouge">WIN32_LEAN_AND_MEAN</code> reduces the number of
included headers, which has a significant effect:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -E -DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
55025
$ time gcc -c -DWIN32_LEAN_AND_MEAN empty.c
real    0m 0.30s
user    0m 0.00s
sys     0m 0.00s

$ cl /nologo /E /DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
empty.c
41436
$ time cl /nologo /c /DWIN32_LEAN_AND_MEAN empty.c
empty.c
real    0m 0.07s
user    0m 0.01s
sys     0m 0.01s
</code></pre></div></div>

<h3 id="precompiled-headers">Precompiled headers</h3>

<p>The official solution is precompiled headers. Put all the system header
includes, <a href="/blog/2023/01/08/">or similar</a>, into a dedicated header, then compile that
header into a special format. For example, <code class="language-plaintext highlighter-rouge">headers.h</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define WIN32_LEAN_AND_MEAN
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span></code></pre></div></div>

<p>Then <code class="language-plaintext highlighter-rouge">main.c</code> includes <code class="language-plaintext highlighter-rouge">windows.h</code> through this header:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"headers.h"</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I ask <a href="https://gcc.gnu.org/onlinedocs/gcc/Precompiled-Headers.html">GCC to compile <code class="language-plaintext highlighter-rouge">headers.h</code></a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc headers.h
</code></pre></div></div>

<p>It produces <code class="language-plaintext highlighter-rouge">headers.h.gch</code>. When a source includes <code class="language-plaintext highlighter-rouge">headers.h</code>, GCC first
searches for an appropriate <code class="language-plaintext highlighter-rouge">.gch</code>. Not only must the name match, but so
must all the definitions at the moment of inclusion: <code class="language-plaintext highlighter-rouge">headers.h</code> should
always be the first included header, otherwise it may not work. Now when I
compile <code class="language-plaintext highlighter-rouge">main.c</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time gcc -c main.c
real    0m 0.04s
user    0m 0.00s
sys     0m 0.00s
</code></pre></div></div>

<p>Much better! MSVC has a conventional name for this header recognizable to
every Visual Studio user: <code class="language-plaintext highlighter-rouge">stdafx.h</code>. It works a bit differently, and I’ve
never used it myself, but I trust it has similar results.</p>

<p>Precompiled headers require some extra steps that vary by toolchain. Can
we do better? That depends on your definition of “better!”</p>

<h3 id="artisan-handcrafted-prototypes">Artisan, handcrafted prototypes</h3>

<p>As mentioned, systems software tends to need only a few declarations:
open, read, write, stat, etc. What if I wrote these out manually? A bit
tedious, but it doesn’t require special precompiled header handling. It
also creates some new possibilities. To illustrate, a <a href="/blog/2023/02/15/">CRT-free</a>
“hello world” program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">stdout</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
    <span class="kt">char</span> <span class="n">message</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Hello, world!</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="n">DWORD</span> <span class="n">len</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">!</span><span class="n">WriteFile</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="n">message</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">message</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">len</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This takes my system half a second to compile — quite long to produce just
26 assembly instructions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time cc -nostartfiles -o hello.exe hello.c
real    0m 0.50s
user    0m 0.00s
sys     0m 0.00s
$ ./hello.exe
Hello, world!
</code></pre></div></div>

<p>The program requires prototypes only for GetStdHandle and WriteFile, a
definition for <code class="language-plaintext highlighter-rouge">STD_OUTPUT_HANDLE</code>, and some typedefs. Starting with the
easy stuff, the definition and <a href="https://learn.microsoft.com/en-us/windows/win32/winprog/windows-data-types">types look like this</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define STD_OUTPUT_HANDLE ((DWORD)-11)
</span>
<span class="k">typedef</span> <span class="kt">int</span> <span class="n">BOOL</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="n">HANDLE</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">DWORD</span><span class="p">;</span>
</code></pre></div></div>

<p>By the way, here’s a cheat code for quickly finding preprocessor
definitions, faster than looking them up elsewhere:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo '#include &lt;windows.h&gt;' | gcc -E -dM - | grep 'STD_\w*_HANDLE'
#define STD_INPUT_HANDLE ((DWORD)-10)
#define STD_ERROR_HANDLE ((DWORD)-12)
#define STD_OUTPUT_HANDLE ((DWORD)-11)
</code></pre></div></div>

<p>Did you catch the pattern? It’s <code class="language-plaintext highlighter-rouge">-10 - fd</code>, where <code class="language-plaintext highlighter-rouge">fd</code> is the conventional
unix file descriptor number: a kind of mnemonic.</p>
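
<p>To make the mnemonic concrete, here’s a minimal sketch. The helper name
<code class="language-plaintext highlighter-rouge">std_handle_id</code> is hypothetical, not part of any API. It only reproduces
the three constants above from a file descriptor number.</p>

```c
// Hypothetical helper, not part of the Windows API: maps a conventional
// unix file descriptor (0, 1, 2) to the matching STD_*_HANDLE value.
static int std_handle_id(int fd)
{
    return -10 - fd;
}
```

<p>So standard output, descriptor 1, comes out to -11, matching
<code class="language-plaintext highlighter-rouge">STD_OUTPUT_HANDLE</code>.</p>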

<p>Prototypes are a little trickier, especially if you care about 32-bit. The
Windows API uses the “stdcall” calling convention, which is distinct from
the “cdecl” calling convention on x86, though the same on x64. Of course,
you must already be aware of this merely using the API, as your own
callbacks must usually be stdcall themselves. Further, API functions are
<a href="/blog/2021/05/31/">DLL imports</a> and should be declared as such. Putting it together,
here’s GetStdHandle:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="n">HANDLE</span> <span class="kr">__stdcall</span> <span class="nf">GetStdHandle</span><span class="p">(</span><span class="n">DWORD</span><span class="p">);</span>
</code></pre></div></div>

<p>This works with both Mingw-w64 and MSVC. MSVC requires <code class="language-plaintext highlighter-rouge">__stdcall</code> between
the return type and function name, so don’t get clever about it. If you
only care about GCC then you can declare both using attributes, which I
think is a bit nicer:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">HANDLE</span> <span class="nf">GetStdHandle</span><span class="p">(</span><span class="n">DWORD</span><span class="p">)</span>
    <span class="n">__attribute__</span><span class="p">((</span><span class="n">dllimport</span><span class="p">,</span><span class="n">stdcall</span><span class="p">));</span>
</code></pre></div></div>

<p>The prototype for <a href="https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile">WriteFile</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="n">BOOL</span> <span class="kr">__stdcall</span> <span class="nf">WriteFile</span><span class="p">(</span><span class="n">HANDLE</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="n">DWORD</span><span class="p">,</span> <span class="n">DWORD</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>You may have noticed I’m taking some shortcuts. The “official” definition
uses an ugly pointer typedef, <code class="language-plaintext highlighter-rouge">LPCVOID</code>, instead of pointer syntax, but I
skipped that type definition. I also replaced the last argument, an
<code class="language-plaintext highlighter-rouge">OVERLAPPED</code> pointer, with a generic pointer. I only need to pass null. I
can keep sanding it down to something more ergonomic:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">int</span> <span class="kr">__stdcall</span> <span class="nf">WriteFile</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>That’s how I typically write these prototypes. I dropped the <code class="language-plaintext highlighter-rouge">const</code>
because it doesn’t help me. I used signed sizes because I like them better
and it’s <a href="/blog/2023/02/13/">what I’m usually holding</a> at the call site. But doesn’t
changing the signedness potentially break compatibility? It makes no
difference to any practical ABI: It’s passed the same way. In general,
signedness is a matter for <em>operators</em>, and only some of them — mainly
comparisons (<code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>, etc.) and division. It’s a similar story for
pointers starting with the 32-bit era, so I can choose whatever pointer
types are convenient.</p>
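
<p>A small sketch of that claim, using fixed-width types so the bit patterns
are well defined. The helper names are made up for illustration. Addition
and multiplication produce the same bits either way, while comparison and
division depend on the sign:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Reinterpret the same 32 bits as signed or unsigned without changing them.
static int32_t as_signed(uint32_t u)
{
    int32_t s;
    memcpy(&s, &u, sizeof(s));
    return s;
}

static uint32_t as_unsigned(int32_t s)
{
    uint32_t u;
    memcpy(&u, &s, sizeof(u));
    return u;
}

static void signedness_demo(void)
{
    uint32_t ua = 0xfffffff0, ub = 16;
    int32_t  sa = as_signed(ua), sb = as_signed(ub);  // -16 and 16

    // Same bits in, same bits out: the sign doesn't matter here.
    assert((uint32_t)(ua + ub) == as_unsigned(sa + sb));
    assert((uint32_t)(ua * ub) == as_unsigned(sa * sb));

    // Here the operator needs to know the sign.
    assert(ua > ub);                // 4294967280 > 16
    assert(sa < sb);                // -16 < 16
    assert(ua / ub == 0x0fffffff);  // unsigned division
    assert(sa / sb == -1);          // signed division
}
```

<p>A function call only moves the bits into place, so the callee sees the
same value regardless of which signedness the prototype names.</p>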

<p>In general, I can do anything I want so long as I know my compiler will
produce an appropriate function call. These are not standard functions,
like <code class="language-plaintext highlighter-rouge">printf</code> or <code class="language-plaintext highlighter-rouge">memcpy</code>, which are implemented in part by the compiler
itself, but foreign functions. It’s no different than teaching <a href="/blog/2018/05/27/">an
FFI</a> how to make a call. This is also, in essence, how OpenGL and
Vulkan work, with applications <a href="https://www.khronos.org/opengl/wiki/OpenGL_Loading_Library">defining the API for themselves</a>.</p>

<p>Considering all this, my new hello world:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">int</span> <span class="kr">__stdcall</span> <span class="nf">WriteFile</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">void</span> <span class="o">*</span><span class="kr">__stdcall</span> <span class="nf">GetStdHandle</span><span class="p">(</span><span class="kt">int</span><span class="p">);</span>

<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">stdout</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="o">-</span><span class="mi">10</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="kt">char</span> <span class="n">message</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Hello, world!</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">len</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">!</span><span class="n">WriteFile</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="n">message</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">message</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">len</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>You know, there’s a kind of beauty to a program that requires no external
definitions. It builds quickly and produces a binary bit-for-bit identical
to the original:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time cc -nostartfiles -o hello.exe hello.c
real    0m 0.04s
user    0m 0.00s
sys     0m 0.00s

$ time cl /nologo hello.c /link /subsystem:console kernel32.lib
hello.c
real    0m 0.03s
user    0m 0.00s
sys     0m 0.00s
</code></pre></div></div>

<p>I’ve also been using this to patch over API rough edges. For example,
<a href="https://learn.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsarecvfrom">WSARecvFrom</a> takes <a href="https://learn.microsoft.com/en-us/windows/win32/api/winsock2/ns-winsock2-wsaoverlapped">WSAOVERLAPPED</a>, but <a href="https://learn.microsoft.com/en-us/windows/win32/api/ioapiset/nf-ioapiset-getqueuedcompletionstatus">GetQueuedCompletionStatus</a>
takes <a href="https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-overlapped">OVERLAPPED</a>. These types are explicitly compatible, and only
defined separately for annoying technical reasons. I must use the same
overlapped object with both APIs at once, meaning I would normally need
ugly pointer casts on my Winsock calls, or vice versa with I/O completion
ports. But because I’m writing all these definitions myself, I can define
a common overlapped structure for both!</p>
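
<p>As a rough sketch of what that could look like: the type name
<code class="language-plaintext highlighter-rouge">Overlapped</code>, its field names, and the simplified parameter types are
all my own assumptions here, based on the documented layout and parameter
order. Check them against the official documentation before relying on
them.</p>

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical shared type, assumed to match the documented OVERLAPPED
// layout: two pointer-sized internals, a 64-bit offset, an event handle.
typedef struct {
    uintptr_t internal;
    uintptr_t internal_high;
    uint32_t  offset;
    uint32_t  offset_high;
    void     *event;
} Overlapped;

#ifdef _WIN32
// Hand-written prototypes that both accept the shared type. Parameters:
// (port, bytes transferred, completion key, overlapped out, timeout ms)
__declspec(dllimport)
int __stdcall GetQueuedCompletionStatus(
    void *, int *, uintptr_t *, Overlapped **, int);
// (socket, buffers, buffer count, bytes received, flags, from address,
//  from length, overlapped, completion routine)
__declspec(dllimport)
int __stdcall WSARecvFrom(
    uintptr_t, void *, int, int *, int *, void *, int *, Overlapped *, void *);
#endif

// Sanity check on the assumed layout: no padding before the event handle.
static int overlapped_layout_ok(void)
{
    return offsetof(Overlapped, event) == 2*sizeof(uintptr_t) + 8 &&
           sizeof(Overlapped) == 3*sizeof(void *) + 8;
}
```

<p>With one type in both prototypes, the same object passes to either call
without a cast.</p>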

<p>Perhaps you’re worried that this would be too fragile. Well, as a legacy
software aficionado, I enjoy <a href="/blog/2018/04/13/">building and running my programs on old
platforms</a>. So far these programs still work properly <a href="https://winworldpc.com/library/">going back
30 years</a> to Windows NT 3.5 and Visual C++ 4.2. When I do hit a snag,
it’s always been a bug (now long fixed) in the old operating system, not
in my programs or these prototypes. So, in effect, this technique has
worked well for the past 30 years!</p>

<p>Writing out these definitions is a bit of a chore, but after paying that
price I’ve been quite happy with the results. I will likely continue doing
it in the future, at least for non-graphical applications.</p>

]]>
    </content>
  </entry>
  
  <entry>
    <title>My favorite C compiler flags during development</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/04/29/"/>
    <id>urn:uuid:a90f3f5b-b4c3-4153-ac8e-6cdbf235f44b</id>
    <updated>2023-04-29T22:55:25Z</updated>
    <category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=35758898">on Hacker News</a> and <a href="https://old.reddit.com/r/C_Programming/comments/133bjlp">on reddit</a>.</em></p>

<p>The major compilers have an <a href="https://man7.org/linux/man-pages/man1/gcc.1.html">enormous number of knobs</a>. Most are
highly specialized, but others are generally useful even if uncommon. For
warnings, the venerable <code class="language-plaintext highlighter-rouge">-﻿Wall -﻿Wextra</code> is a good start, but
circumstances improve by tweaking this warning set. This article covers
high-hitting development-time options in GCC, Clang, and MSVC that ought
to get more consideration.</p>

<!--more-->

<p>There’s an irony that the more you use these options, the less useful they
become. Given a reasonable workflow, they are a harsh mistress in a fast,
tight feedback loop, quickly breaking the habits that cause warnings and
errors. It’s a kind of self-improvement, where eventually most findings
will be false positives. With heuristics internalized, you will be able
to spot the same issues just by reading code, a handy skill during code review.</p>

<h3 id="static-warnings">Static warnings</h3>

<p>Traditionally, C and C++ compilers are by default conservative with
warnings. Unless configured otherwise, they only warn about the most
egregious issues where they’re highly confident. That’s too conservative. For
<code class="language-plaintext highlighter-rouge">gcc</code> and <code class="language-plaintext highlighter-rouge">clang</code>, the first order of business is turning on more warnings
with <strong><code class="language-plaintext highlighter-rouge">-﻿Wall</code></strong>. Despite the name, this doesn’t actually enable all
warnings. (<code class="language-plaintext highlighter-rouge">clang</code> has <code class="language-plaintext highlighter-rouge">-﻿Weverything</code> which does literally this, but
trust me, you don’t want it.) However, that still falls short, and you’re
better served enabling <em>extra</em> warnings with <strong><code class="language-plaintext highlighter-rouge">-﻿Wextra</code></strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -Wall -Wextra ...
</code></pre></div></div>

<p>That should be the baseline on any new project, and closer to what these
compilers should do by default. Not using these means leaving value on the
table. If you come across such a project, there’s a good chance you can
find bugs statically just by using this baseline. Some warnings only occur
at higher optimization levels, so leave these on for your release builds,
too.</p>

<p>For MSVC, including <code class="language-plaintext highlighter-rouge">clang-cl</code>, a similar baseline is <strong><code class="language-plaintext highlighter-rouge">/W4</code></strong>. It
goes a bit far, though, warning about use of unary minus on unsigned types
(C4146) and sign conversions (C4245), so I disable both. If you’re <a href="/blog/2023/02/15/">using a CRT</a>, also
disable the bogus and irresponsible “security” warnings. Putting it
together, the warning baseline becomes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /W4 /wd4146 /wd4245 /D_CRT_SECURE_NO_WARNINGS ...
</code></pre></div></div>

<p>As for <code class="language-plaintext highlighter-rouge">gcc</code> and <code class="language-plaintext highlighter-rouge">clang</code>, I dislike unused parameter warnings, so I often
turn them off, at least while I’m working: <strong><code class="language-plaintext highlighter-rouge">-﻿Wno-unused-parameter</code></strong>.
Rarely is it a defect to not use a parameter. It’s common for a function
to fit a fixed prototype but not need all its parameters (e.g. <code class="language-plaintext highlighter-rouge">WinMain</code>).
Were it up to me, this would not be part of <code class="language-plaintext highlighter-rouge">-﻿Wextra</code>.</p>

<p>I also dislike unused function warnings: <strong><code class="language-plaintext highlighter-rouge">-﻿Wno-unused-function</code></strong>.
I can’t say this is wrong for the baseline since, in most cases, ultimately
I do want to know if there are unused functions, e.g. to be deleted. But
while I’m working it’s usually noise.</p>

<p>If I’m <a href="/blog/2017/03/01/">working with OpenMP</a>, I may also disable warnings about
unknown pragmas: <strong><code class="language-plaintext highlighter-rouge">-﻿Wno-unknown-pragmas</code></strong>. One cool feature of
OpenMP is that the typical case gracefully degrades to single-threaded
behavior when not enabled, that is, when compiling without <code class="language-plaintext highlighter-rouge">-﻿fopenmp</code>.
I’ll test both ways to ensure I get deterministic results, or just to ease
debugging, and I don’t want warnings when it’s disabled. It’s fine for the
baseline to have this warning, but sometimes it’s a poor match.</p>
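
<p>As a minimal sketch of that graceful degradation, the loop below builds
with or without <code class="language-plaintext highlighter-rouge">-fopenmp</code>. The function name and workload are arbitrary
choices for illustration.</p>

```c
// Sum of squares of 0..n-1. The pragma is honored when OpenMP is
// enabled and ignored otherwise, so the same source builds both ways.
static long long sum_squares(int n)
{
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += (long long)i * i;
    }
    return sum;
}
```

<p>Without <code class="language-plaintext highlighter-rouge">-fopenmp</code> the loop simply runs serially, and the unknown pragma
is what triggers the warning. Because this is an integer reduction, both
builds should produce the same sum.</p>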

<p>When working with single-precision floats, perhaps on games or graphics,
it’s easy to accidentally introduce promotion to double precision, which
can hurt performance. It could be neglecting an <code class="language-plaintext highlighter-rouge">f</code> suffix on a constant
or using <code class="language-plaintext highlighter-rouge">sin</code> instead of <code class="language-plaintext highlighter-rouge">sinf</code>. Use <strong><code class="language-plaintext highlighter-rouge">-﻿Wdouble-promotion</code></strong> to
catch such mistakes. Honestly, this is important enough that it should go
into the baseline.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PI 3.141592653589793
</span><span class="kt">float</span> <span class="n">degs</span> <span class="o">=</span> <span class="p">...;</span>
<span class="kt">float</span> <span class="n">rads</span> <span class="o">=</span> <span class="n">degs</span> <span class="o">*</span> <span class="n">PI</span> <span class="o">/</span> <span class="mi">180</span><span class="p">;</span>  <span class="c1">// warns about promotion</span>
</code></pre></div></div>

<p>It can be awkward around variadic functions, particularly <code class="language-plaintext highlighter-rouge">printf</code>, which
cannot receive <code class="language-plaintext highlighter-rouge">float</code> arguments, and so implicitly converts. You’ll need
an explicit cast to disable the warning. I imagine this is the main reason
the warning is not part of <code class="language-plaintext highlighter-rouge">-﻿Wextra</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="n">x</span> <span class="o">=</span> <span class="p">...;</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%.17g</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">x</span><span class="p">);</span>
</code></pre></div></div>
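
<p>Going back to the degrees-to-radians case, a promotion-free version
might look like the sketch below. The names <code class="language-plaintext highlighter-rouge">PI_F</code> and <code class="language-plaintext highlighter-rouge">deg2rad</code> are
made up for illustration.</p>

```c
// Hypothetical single-precision constant: note the f suffix.
#define PI_F 3.14159265f

static float deg2rad(float degs)
{
    // Every operand is a float, so nothing is promoted to double.
    return degs * PI_F / 180.0f;
}
```

<p>The same idea applies to math functions: calling <code class="language-plaintext highlighter-rouge">sinf</code> on a <code class="language-plaintext highlighter-rouge">float</code>
keeps the whole expression in single precision.</p>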

<p>Finally, an advanced option: <strong><code class="language-plaintext highlighter-rouge">-﻿Wconversion -Wno-sign-conversion</code></strong>.
It warns about implicit conversions that may result in data loss. Sign
conversions do not have data loss, the implicit conversions are useful,
and in my experience they’re not a source of defects, so I disable that
part using the second flag (like MSVC <code class="language-plaintext highlighter-rouge">/wd4245</code>). The important warning
here is truncation of size values, warning about unsound uses of sizes and
subscripts. For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// NOTE: would be declared/defined via windows.h</span>
<span class="k">typedef</span> <span class="kt">uint32_t</span> <span class="n">DWORD</span><span class="p">;</span>
<span class="n">BOOL</span> <span class="nf">WriteFile</span><span class="p">(</span><span class="n">HANDLE</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="n">DWORD</span><span class="p">,</span> <span class="n">DWORD</span> <span class="o">*</span><span class="p">,</span> <span class="n">OVERLAPPED</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">void</span> <span class="nf">logmsg</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">msg</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">err</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_ERROR_HANDLE</span><span class="p">);</span>
    <span class="n">DWORD</span> <span class="n">out</span><span class="p">;</span>
    <span class="n">WriteFile</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">out</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// len truncation warning</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On 64-bit targets, it will warn about truncating the 64-bit <code class="language-plaintext highlighter-rouge">len</code> for the
32-bit parameter. To dismiss the warning, you must either address it by
using a loop to <a href="/blog/2023/02/13/">call <code class="language-plaintext highlighter-rouge">WriteFile</code> multiple times</a>, or acknowledge the
truncation with an explicit cast and accept the consequences. In this case
I may know from context it’s impossible for the program to even construct
such a large message, so I’d use an assertion and truncate.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">logmsg</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">msg</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">err</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_ERROR_HANDLE</span><span class="p">);</span>
    <span class="n">DWORD</span> <span class="n">out</span><span class="p">;</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">len</span> <span class="o">&lt;=</span> <span class="mh">0xffffffff</span><span class="p">);</span>
    <span class="n">WriteFile</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="p">(</span><span class="n">DWORD</span><span class="p">)</span><span class="n">len</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">out</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
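
<p>For comparison, the loop alternative might look something like the
following sketch. To keep it portable it does not call <code class="language-plaintext highlighter-rouge">WriteFile</code>
directly: <code class="language-plaintext highlighter-rouge">write32</code>, <code class="language-plaintext highlighter-rouge">writeall</code>, and the <code class="language-plaintext highlighter-rouge">tally</code> sink are hypothetical
names standing in for any interface with a 32-bit length, and the callback
is assumed to write the whole chunk or fail.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

// Hypothetical stand-in for an API like WriteFile whose length parameter
// is 32 bits wide. Returns non-zero on success.
typedef int (*write32)(void *ctx, const char *buf, uint32_t len);

// Feed a size_t-length buffer through the 32-bit interface in chunks, so
// that every narrowing cast is lossless.
static int writeall(write32 fn, void *ctx, const char *buf, size_t len)
{
    while (len) {
        uint32_t n = len &gt; 0xffffffff ? 0xffffffff : (uint32_t)len;
        if (!fn(ctx, buf, n)) {
            return 0;
        }
        buf += n;
        len -= n;
    }
    return 1;
}

// Example sink that only tallies what it receives
struct tally { size_t calls, bytes; };

static int tallywrite(void *ctx, const char *buf, uint32_t len)
{
    struct tally *t = ctx;
    (void)buf;
    t-&gt;calls++;
    t-&gt;bytes += len;
    return 1;
}
</code></pre></div></div>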

<p>You might consider changing the interface instead:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">logmsg</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">msg</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">len</span><span class="p">);</span>
</code></pre></div></div>

<p>That probably passes the buck and doesn’t solve the underlying problem.
The caller may be holding a <code class="language-plaintext highlighter-rouge">size_t</code> length, so the truncation happens
there instead. Or maybe you keep propagating this change backwards until
it, say, dissipates on a known constant. <code class="language-plaintext highlighter-rouge">-﻿Wconversion</code> leads to
these ripple effects that improve the overall program, which is why I
like it.</p>

<p>The catch is that the above warning only happens for 64-bit targets. So
you might miss it. The inverse is true in other cases. This is one area
where <a href="/blog/2021/08/21/">cross-architecture testing</a> can pay off.</p>

<p>Unfortunately since this warning is off the beaten path, it seems like it
doesn’t quite get the attention it could use. It warns about simple cases
where truncation has been explicitly handled/avoided. For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="p">...;</span>
<span class="kt">char</span> <span class="n">digit</span> <span class="o">=</span> <span class="sc">'0'</span> <span class="o">+</span> <span class="n">x</span><span class="o">%</span><span class="mi">10</span><span class="p">;</span>  <span class="c1">// false warning</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">'0'</code> is a known constant. The operation <code class="language-plaintext highlighter-rouge">x%10</code> has a known range (-9
to 9). Therefore the addition result has a known range, and all results
can be represented in a <code class="language-plaintext highlighter-rouge">char</code>. Yet it still warns. This often comes up
dealing with character data like this.</p>
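
<p>A minimal sketch of one way to quiet it, assuming a hypothetical
<code class="language-plaintext highlighter-rouge">lastdigit</code> helper that only ever receives non-negative values: wrap
the whole expression in an explicit cast, which tells the compiler the
narrowing is deliberate.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical helper: ASCII digit for the ones place of a non-negative x
static char lastdigit(int x)
{
    return (char)('0' + x%10);  // explicit cast: the narrowing is intentional
}
</code></pre></div></div>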

<p>In my <code class="language-plaintext highlighter-rouge">logmsg</code> fix I had used an assertion to check that no truncation
actually occurred. But wouldn’t it be nice if the compiler could generate
that for us somehow? That brings us to dynamic checks.</p>

<h3 id="dynamic-run-time-checks">Dynamic run-time checks</h3>

<p>Sanitizers have been around for nearly a decade but are still criminally
underused. They insert run-time assertions into programs at the flip of a
switch typically at a modest performance cost — less than the cost of a
debug build. All three major compilers support at least one sanitizer on
all targets. In most cases, failing to use them is practically the same as
not even trying to find defects. Every beginner tutorial ought to be using
sanitizers <em>from page 1</em> where they teach how to compile a program with
<code class="language-plaintext highlighter-rouge">gcc</code>. (That this is universally <em>not</em> the case, and that these same
tutorials also do not begin with teaching a debugger, is a major, on-going
education failure.)</p>

<p>There are multiple different sanitizers with lots of overlap, but Address
Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan) are the most
general. They are compatible with each other and form a solid, general
baseline. To use address sanitizer, at both compile and link time do:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc ... -fsanitize=address ...
</code></pre></div></div>

<p>It’s even spelled the same way in MSVC. It’s needed at link time because
it includes a runtime component. When working properly it’s aware of all
allocations and checks all memory accesses that might be out of bounds,
producing a run-time error if that occurs. It’s not always appropriate,
but most projects that can use it probably should.</p>

<p>UBSan is enabled similarly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc ... -fsanitize=undefined ...
</code></pre></div></div>

<p>It adds checks around operations that might be undefined, emitting a
run-time error if undefined behavior occurs. It has an optional runtime component to
produce a helpful diagnostic. You can instead insert a trap instruction,
which is how I prefer to use it: <strong><code class="language-plaintext highlighter-rouge">-﻿fsanitize-trap=undefined</code></strong>.
(Until recently it was <strong><code class="language-plaintext highlighter-rouge">-﻿fsanitize-undefined-trap-on-error</code></strong>.)
This works on platforms where the UBSan runtime is unsupported. Some
instrumentation is only inserted at higher optimization levels.</p>

<p>For me, the most useful UBSan check is signed overflow — e.g. computing
the wrong result — and it’s instrumentation I miss when not working in C.
In programs where this might be an issue, combine it <a href="/blog/2019/01/25/">with a fuzzer</a>
to search for inputs that cause overflows. This is yet another argument in
favor of <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0.pdf">signed sizes</a>, as UBSan can detect such overflows. (Yes,
UBSan optionally instruments unsigned overflow, too, but then you must
somehow distinguish <a href="/blog/2019/11/19/">intentional</a> from <a href="/blog/2017/07/19/">unintentional</a>
overflow.)</p>
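
<p>As a rough illustration, the plain <code class="language-plaintext highlighter-rouge">add</code> below is the kind of operation
UBSan instruments, and an out-of-range sum is caught at run time. Code
that must handle overflow rather than trap has to test the range first.
Both function names are hypothetical, and the range test is just one
common way to write it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;limits.h&gt;

// Plain signed addition: instrumented under -fsanitize=undefined, so an
// out-of-range result is reported (or traps) instead of going unnoticed.
static int add(int a, int b)
{
    return a + b;
}

// Hypothetical checked variant: test the range first so the addition
// itself can never overflow. Returns 1 on success, 0 on overflow.
static int checked_add(int a, int b, int *sum)
{
    if ((b &gt; 0 &amp;&amp; a &gt; INT_MAX - b) || (b &lt; 0 &amp;&amp; a &lt; INT_MIN - b)) {
        return 0;
    }
    *sum = a + b;
    return 1;
}
</code></pre></div></div>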

<p>On Linux, ASan and UBSan strangely do not have <a href="/blog/2022/06/26/">debugger-oriented
defaults</a>. Fortunately that’s easy to address with a couple of
environment variables, which cause them to break on error instead of
uselessly exiting:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">ASAN_OPTIONS</span><span class="o">=</span><span class="nv">abort_on_error</span><span class="o">=</span>1:halt_on_error<span class="o">=</span>1
<span class="nb">export </span><span class="nv">UBSAN_OPTIONS</span><span class="o">=</span><span class="nv">abort_on_error</span><span class="o">=</span>1:halt_on_error<span class="o">=</span>1
</code></pre></div></div>

<p>Also, when compiling you can combine sanitizers like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc ... -fsanitize=address,undefined ...
</code></pre></div></div>

<p>As of this writing, MSVC does not have UBSan, but it does have a similar
feature, <a href="https://learn.microsoft.com/en-us/cpp/build/reference/rtc-run-time-error-checks">run-time error checks</a>. Three sub-flags (<code class="language-plaintext highlighter-rouge">c</code>, <code class="language-plaintext highlighter-rouge">s</code>, <code class="language-plaintext highlighter-rouge">u</code>)
enable different checks, and <strong><code class="language-plaintext highlighter-rouge">/RTCcsu</code></strong> turns them all on. The <code class="language-plaintext highlighter-rouge">c</code> flag
generates the assertion I had manually written with <code class="language-plaintext highlighter-rouge">-﻿Wconversion</code>,
and traps any truncation at run time. There’s nothing quite like this in
UBSan! It’s so extreme that it’s compatible with neither standard runtime
libraries (fortunately <a href="/blog/2023/02/11/">not a big deal</a>) nor with ASan.</p>

<p>Caveat: Explicit casts aren’t enough; you must actually truncate variables
using a mask in order to pass the check. For example, to accept truncation
in the <code class="language-plaintext highlighter-rouge">logmsg</code> function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">WriteFile</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="n">len</span><span class="o">&amp;</span><span class="mh">0xffffffff</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">out</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>Thread Sanitizer (TSan) is occasionally useful for finding — or, more
often, <em>proving</em> the presence of — data races. It has a runtime component
and so must be used at compile time and link time.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc ... -fsanitize=thread ...
</code></pre></div></div>

<p>Unfortunately it only works in a narrow context. The target must use
pthreads, not C11 threads, OpenMP, nor <a href="/blog/2023/03/23/">direct cloning</a>. It must
only synchronize through code that was compiled with TSan. That means no
synchronization <a href="/blog/2022/10/03/">through system calls</a>, especially no futexes. Most
non-trivial programs do not meet the criteria.</p>

<h3 id="debug-information">Debug information</h3>

<p>Another common mistake in tutorials is using plain old <code class="language-plaintext highlighter-rouge">-﻿g</code> instead
of <strong><code class="language-plaintext highlighter-rouge">-﻿g3</code></strong> (read: “debug level 3”). That’s like using <code class="language-plaintext highlighter-rouge">-﻿O</code>
instead of <code class="language-plaintext highlighter-rouge">-﻿O3</code>. It adds a lot more debug information to the
output, particularly enums and macros. The extra information is useful and
you’re better off having it!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc ... -g3 ...
</code></pre></div></div>
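
<p>To get a feel for the difference, consider a macro like the hypothetical
<code class="language-plaintext highlighter-rouge">COUNTOF</code> in this sketch. Built with plain debug information, GDB has no
record of the macro. Built at debug level 3, it should be visible to GDB’s
<code class="language-plaintext highlighter-rouge">info macro COUNTOF</code> and <code class="language-plaintext highlighter-rouge">macro expand COUNTOF(names)</code> commands.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical macro: element count of an array, as an int
#define COUNTOF(a) (int)(sizeof(a) / sizeof(*(a)))

static const char *const names[] = {"red", "green", "blue"};

// Hypothetical lookup: returns "?" for an out-of-range index
static const char *colorname(int i)
{
    return i &gt;= 0 &amp;&amp; i &lt; COUNTOF(names) ? names[i] : "?";
}
</code></pre></div></div>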

<p>All the major build systems — CMake, Autotools, Meson, etc. — get this
wrong in their standard debug configurations. Producing a fully-featured
debug build from these systems is a constant battle for me. Often it’s
easier to ignore the build system entirely and <code class="language-plaintext highlighter-rouge">cc -g3 **/*.c</code> (plus
sanitizers, etc.).</p>

<p>(Short term note: GCC 11, released in March 2021, switched to DWARF5 by
default. However, GDB could not access the extra <code class="language-plaintext highlighter-rouge">-﻿g3</code> debug
information in DWARF5 until GDB 13, released February 2023. If you have a
toolchain from that two year window — except <a href="https://github.com/skeeto/w64devkit">mine</a> because I patched
it — then you may also need <code class="language-plaintext highlighter-rouge">-﻿gdwarf-4</code> to switch back to DWARF4.)</p>

<p>What about <code class="language-plaintext highlighter-rouge">-﻿Og</code>? In theory it enables optimizations that do not
interfere with debugging, and potentially some additional warnings. In
practice I still get far too many “optimized out” messages from GDB when I
use it, so I don’t bother. Fortunately C is such a simple language that
debug builds are nearly as fast as release builds anyway.</p>

<p>On MSVC I like having debug information embedded in binaries, as GCC does,
which is done using <strong><code class="language-plaintext highlighter-rouge">/Z7</code></strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl ... /Z7 ...
</code></pre></div></div>

<p>Though I certainly understand the value of separate debug information,
<code class="language-plaintext highlighter-rouge">/Zi</code>, in some cases. Sometimes I wish the GNU toolchain made this easier.</p>

<h3 id="summary">Summary</h3>

<p>My personal rigorous baseline for development using <code class="language-plaintext highlighter-rouge">gcc</code> and <code class="language-plaintext highlighter-rouge">clang</code>
looks like this (all platforms):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -g3 -Wall -Wextra -Wconversion -Wdouble-promotion
     -Wno-unused-parameter -Wno-unused-function -Wno-sign-conversion
     -fsanitize=undefined -fsanitize-trap ...
</code></pre></div></div>

<p>While ASan is great for quickly reviewing and evaluating other people’s
projects, I don’t find it useful for my own programs. I avoid that class
of defects through smarter paradigms (region-based allocation, no null
terminated strings, etc.). I also prefer the behavior of trap instruction
UBSan versus a diagnostic, as it behaves better under debuggers.</p>

<p>For <code class="language-plaintext highlighter-rouge">cl</code> and <code class="language-plaintext highlighter-rouge">clang-cl</code>, my personal baseline looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /Z7 /W4 /wd4146 /wd4245 /RTCcsu ...
</code></pre></div></div>

<p>I don’t normally need <code class="language-plaintext highlighter-rouge">/D_CRT_SECURE_NO_WARNINGS</code> since I don’t use a CRT
anyway.</p>

]]>
    </content>
  </entry>
  
  <entry>
    <title>Practical libc-free threading on Linux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/03/23/"/>
    <id>urn:uuid:631a8107-2eef-420b-9594-752e6f013048</id>
    <updated>2023-03-23T05:32:41Z</updated>
    <category term="c"/><category term="optimization"/><category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Suppose you’re <a href="/blog/2023/02/15/">not using a C runtime</a> on Linux, and instead you’re
programming against its system call API. It’s long-term and stable after
all. <a href="https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator">Memory management</a> and <a href="/blog/2023/02/13/">buffered I/O</a> are easily
solved, but a lot of software benefits from concurrency. It would be nice
to also have thread spawning capability. This article will demonstrate a
simple, practical, and robust approach to spawning and managing threads
using only raw system calls. It only takes about a dozen lines of C,
including a few inline assembly instructions.</p>

<p>The catch is that there’s no way to avoid using a bit of assembly. Neither
the <code class="language-plaintext highlighter-rouge">clone</code> nor <code class="language-plaintext highlighter-rouge">clone3</code> system calls have threading semantics compatible
with C, so you’ll need to paper over it with a bit of inline assembly per
architecture. This article will focus on x86-64, but the basic concept
should work on all architectures supported by Linux. The <a href="https://man7.org/linux/man-pages/man2/clone.2.html">glibc <code class="language-plaintext highlighter-rouge">clone(2)</code>
wrapper</a> fits a C-compatible interface on top of the raw system call,
but we won’t be using it here.</p>

<p>Before diving in, the complete, working demo: <a href="https://github.com/skeeto/scratch/blob/master/misc/stack_head.c"><strong><code class="language-plaintext highlighter-rouge">stack_head.c</code></strong></a></p>

<h3 id="the-clone-system-call">The clone system call</h3>

<p>On Linux, threads are spawned using the <code class="language-plaintext highlighter-rouge">clone</code> system call with semantics
like the classic unix <code class="language-plaintext highlighter-rouge">fork(2)</code>. One process goes in, two processes come
out in nearly the same state. For threads, those processes share almost
everything and differ only by two registers: the return value — zero in
the new thread — and stack pointer. Unlike typical thread spawning APIs,
the application does not supply an entry point. It only provides a stack
for the new thread. The simple form of the raw clone API looks something
like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">clone</span><span class="p">(</span><span class="kt">long</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">stack</span><span class="p">);</span>
</code></pre></div></div>

<p>Sounds kind of elegant, but it has an annoying problem: The new thread
begins life in the <em>middle</em> of a function without any established stack
frame. Its stack is a blank slate. It’s not ready to do anything except
jump to a function prologue that will set up a stack frame. So besides the
assembly for the system call itself, it also needs more assembly to get
the thread into a C-compatible state. In other words, <strong>a generic system
call wrapper cannot reliably spawn threads</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">brokenclone</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">threadentry</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="kt">long</span> <span class="n">r</span> <span class="o">=</span> <span class="n">syscall</span><span class="p">(</span><span class="n">SYS_clone</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">stack</span><span class="p">);</span>
    <span class="c1">// DANGER: new thread may access non-existent stack frame here</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">r</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">threadentry</span><span class="p">(</span><span class="n">arg</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For odd historical reasons, each architecture’s <code class="language-plaintext highlighter-rouge">clone</code> has a slightly
different interface. The newer <code class="language-plaintext highlighter-rouge">clone3</code> unifies these differences, but it
suffers from the same thread spawning issue above, so it’s not helpful
here.</p>

<h3 id="the-stack-header">The stack “header”</h3>

<p>I <a href="/blog/2015/05/15/">figured out a neat trick eight years ago</a> which I continue to use
today. The parent and child threads are in nearly identical states when
the new thread starts, but the immediate goal is to diverge. As noted, one
difference is their stack pointers. To diverge their execution, we could
make their execution depend on the stack. An obvious choice is to push
different return pointers on their stacks, then let the <code class="language-plaintext highlighter-rouge">ret</code> instruction
do the work.</p>

<p>Carefully preparing the new stack ahead of time is the key to everything,
and there’s a straightforward technique that I like to call the <code class="language-plaintext highlighter-rouge">stack_head</code>,
a structure placed at the high end of the new stack. Its first element
must be the entry point pointer, and this entry point will receive a
pointer to its own <code class="language-plaintext highlighter-rouge">stack_head</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">__attribute</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">16</span><span class="p">)))</span> <span class="n">stack_head</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">entry</span><span class="p">)(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">);</span>
    <span class="c1">// ...</span>
<span class="err">}</span><span class="p">;</span>
</code></pre></div></div>

<p>The structure must have 16-byte alignment on all architectures. I used an
attribute to help keep this straight, and it can help when using <code class="language-plaintext highlighter-rouge">sizeof</code>
to place the structure, as I’ll demonstrate later.</p>

<p>Now for the cool part: The <code class="language-plaintext highlighter-rouge">...</code> can be anything you want! Use that area
to seed the new stack with whatever thread-local data is necessary. It’s a
neat feature you don’t get from standard thread spawning interfaces. If I
plan to “join” a thread later — wait until it’s done with its work — I’ll
put a join futex in this space:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">__attribute</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">16</span><span class="p">)))</span> <span class="n">stack_head</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">entry</span><span class="p">)(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">join_futex</span><span class="p">;</span>
    <span class="c1">// ...</span>
<span class="err">}</span><span class="p">;</span>
</code></pre></div></div>

<p>More details on that futex shortly.</p>

<h3 id="the-clone-wrapper">The clone wrapper</h3>

<p>I call the <code class="language-plaintext highlighter-rouge">clone</code> wrapper <code class="language-plaintext highlighter-rouge">newthread</code>. It has the inline assembly for the
system call, and since it includes a <code class="language-plaintext highlighter-rouge">ret</code> to diverge the threads, it’s a
“naked” function <a href="/blog/2023/02/12/">just like with <code class="language-plaintext highlighter-rouge">setjmp</code></a>. The compiler will
generate no prologue or epilogue, and the function body is limited to
inline assembly without input/output operands. It cannot even reliably
reference its parameters by name. Like <code class="language-plaintext highlighter-rouge">clone</code>, it doesn’t accept a thread
entry point. Instead it accepts a <code class="language-plaintext highlighter-rouge">stack_head</code> seeded with the entry
point. The whole wrapper is just six instructions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="kr">naked</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">long</span> <span class="nf">newthread</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"mov  %%rdi, %%rsi</span><span class="se">\n</span><span class="s">"</span>     <span class="c1">// arg2 = stack</span>
        <span class="s">"mov  $0x50f00, %%edi</span><span class="se">\n</span><span class="s">"</span>  <span class="c1">// arg1 = clone flags</span>
        <span class="s">"mov  $56, %%eax</span><span class="se">\n</span><span class="s">"</span>       <span class="c1">// SYS_clone</span>
        <span class="s">"syscall</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov  %%rsp, %%rdi</span><span class="se">\n</span><span class="s">"</span>     <span class="c1">// entry point argument</span>
        <span class="s">"ret</span><span class="se">\n</span><span class="s">"</span>
        <span class="o">:</span> <span class="o">:</span> <span class="o">:</span> <span class="s">"rax"</span><span class="p">,</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"rsi"</span><span class="p">,</span> <span class="s">"rdi"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On x86-64, both function calls and system calls use <code class="language-plaintext highlighter-rouge">rdi</code> and <code class="language-plaintext highlighter-rouge">rsi</code> for
their first two parameters. Per the reference <code class="language-plaintext highlighter-rouge">clone(2)</code> prototype above:
the first system call argument is <code class="language-plaintext highlighter-rouge">flags</code> and the second argument is the
new <code class="language-plaintext highlighter-rouge">stack</code>, which will point directly at the <code class="language-plaintext highlighter-rouge">stack_head</code>. However, the
stack pointer arrives in <code class="language-plaintext highlighter-rouge">rdi</code>. So I copy <code class="language-plaintext highlighter-rouge">stack</code> into the second argument
register, <code class="language-plaintext highlighter-rouge">rsi</code>, then load the flags (<code class="language-plaintext highlighter-rouge">0x50f00</code>) into the first argument
register, <code class="language-plaintext highlighter-rouge">rdi</code>. The system call number goes in <code class="language-plaintext highlighter-rouge">rax</code>.</p>

<p>Where does that <code class="language-plaintext highlighter-rouge">0x50f00</code> come from? That’s the bare minimum thread spawn
flag set in hexadecimal. If any flag is missing then threads will not
spawn reliably — as discovered the hard way by trial and error across
different system configurations, not from documentation. It’s computed
normally like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">long</span> <span class="n">flags</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_FILES</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_FS</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_SIGHAND</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_SYSVSEM</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_THREAD</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_VM</span><span class="p">;</span>
</code></pre></div></div>
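
<p>As a sanity check on that constant, the flag values can be spelled out
and verified at compile time. This sketch assumes the values from the
kernel’s <code class="language-plaintext highlighter-rouge">linux/sched.h</code>, redeclared locally since no libc headers are
in play, and <code class="language-plaintext highlighter-rouge">threadflags</code> is a hypothetical name.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Flag values as defined by the Linux UAPI header linux/sched.h
enum {
    CLONE_VM      = 0x00000100,
    CLONE_FS      = 0x00000200,
    CLONE_FILES   = 0x00000400,
    CLONE_SIGHAND = 0x00000800,
    CLONE_THREAD  = 0x00010000,
    CLONE_SYSVSEM = 0x00040000,
};

// Fails the build if the flags ever drift from the hard-coded constant
_Static_assert(
    (CLONE_FILES | CLONE_FS | CLONE_SIGHAND |
     CLONE_SYSVSEM | CLONE_THREAD | CLONE_VM) == 0x50f00,
    "thread flags must match the constant in the clone wrapper"
);

// Hypothetical helper: the same flag set computed at run time
static long threadflags(void)
{
    long flags = 0;
    flags |= CLONE_FILES;
    flags |= CLONE_FS;
    flags |= CLONE_SIGHAND;
    flags |= CLONE_SYSVSEM;
    flags |= CLONE_THREAD;
    flags |= CLONE_VM;
    return flags;
}
</code></pre></div></div>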

<p>When the system call returns, the wrapper copies the stack pointer into <code class="language-plaintext highlighter-rouge">rdi</code>, the
first argument for the entry point. In the new thread the stack pointer
will be the same value as <code class="language-plaintext highlighter-rouge">stack</code>, of course. In the old thread this is a
harmless no-op because <code class="language-plaintext highlighter-rouge">rdi</code> is a volatile register in this ABI. Finally,
<code class="language-plaintext highlighter-rouge">ret</code> pops the address at the top of the stack and jumps. In the old
thread this returns to the caller with the system call result, either an
error (<a href="/blog/2016/09/23/">negative errno</a>) or the new thread ID. In the new thread
<strong>it pops the first element of <code class="language-plaintext highlighter-rouge">stack_head</code></strong> which, of course, is the
entry point. That’s why it must be first!</p>

<p>The thread has nowhere to return from the entry point, so when it’s done
it must either block indefinitely or use the <code class="language-plaintext highlighter-rouge">exit</code> (<em>not</em> <code class="language-plaintext highlighter-rouge">exit_group</code>)
system call to terminate itself.</p>

<h3 id="caller-point-of-view">Caller point of view</h3>

<p>The caller side looks something like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">threadentry</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ... do work ...</span>
    <span class="n">__atomic_store_n</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">__ATOMIC_SEQ_CST</span><span class="p">);</span>
    <span class="n">futex_wake</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">);</span>
    <span class="n">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span> <span class="o">=</span> <span class="n">newstack</span><span class="p">(</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">16</span><span class="p">);</span>
    <span class="n">stack</span><span class="o">-&gt;</span><span class="n">entry</span> <span class="o">=</span> <span class="n">threadentry</span><span class="p">;</span>
    <span class="c1">// ... assign other thread data ...</span>
    <span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">newthread</span><span class="p">(</span><span class="n">stack</span><span class="p">);</span>

    <span class="c1">// ... do work ...</span>

    <span class="n">futex_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">exit_group</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Despite the minimalist, 6-instruction clone wrapper, this is taking the
shape of a conventional threading API. It would only take a bit more to
hide the futex, too. Speaking of which, what’s going on there? The <a href="/blog/2022/10/05/">same
principle as a WaitGroup</a>. The futex, an integer, is zero-initialized,
indicating the thread is running (“not done”). The joiner tells the kernel
to wait until the integer is non-zero, which it may already be since I
don’t bother to check first. When the child thread is done, it atomically
sets the futex to non-zero and wakes all waiters, which might be nobody.</p>
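
<p>A slightly more defensive joiner is easy to sketch. This is an illustration under assumptions, not the code from this article: it calls libc’s <code class="language-plaintext highlighter-rouge">syscall()</code> instead of the custom wrappers, it is Linux-only, and the names <code class="language-plaintext highlighter-rouge">join_wait</code> and <code class="language-plaintext highlighter-rouge">join_signal</code> are made up here. It checks the integer before sleeping and loops, so a spurious or signal-interrupted wakeup just goes back to waiting.</p>

```c
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: block until *futex becomes non-zero.
static void join_wait(int *futex)
{
    while (!__atomic_load_n(futex, __ATOMIC_ACQUIRE)) {
        // Sleeps only if *futex is still 0, otherwise returns at once.
        syscall(SYS_futex, futex, FUTEX_WAIT, 0, (void *)0, (void *)0, 0);
    }
}

// Hypothetical helper: mark the work as done, then wake every waiter.
static void join_signal(int *futex)
{
    __atomic_store_n(futex, 1, __ATOMIC_RELEASE);
    syscall(SYS_futex, futex, FUTEX_WAKE, INT_MAX, (void *)0, (void *)0, 0);
}
```

<p>If the signal arrives before the joiner ever waits, <code class="language-plaintext highlighter-rouge">join_wait</code> sees the non-zero value and returns without entering the kernel.</p>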

<p>Caveat: It’s not safe to free/reuse the stack after a successful join. It
only indicates the thread is done with its work, not that it exited. You’d
need to wait for its <code class="language-plaintext highlighter-rouge">SIGCHLD</code> (or use <code class="language-plaintext highlighter-rouge">CLONE_CHILD_CLEARTID</code>). If this
sounds like a problem, consider <a href="https://vimeo.com/644068002">your context</a> more carefully: Why do
you feel the need to free the stack? It will be freed when the process
exits. Worried about leaking stacks? Why are you starting and exiting an
unbounded number of threads? In the worst case park the thread in a thread
pool until you need it again. Only worry about this sort of thing if
you’re building a general purpose threading API like pthreads. I know it’s
tempting, but avoid doing that unless you absolutely must.</p>

<p>What’s with the <code class="language-plaintext highlighter-rouge">force_align_arg_pointer</code>? Linux doesn’t align the stack
for the process entry point like a System V ABI function call. Processes
begin life with an unaligned stack. This attribute tells GCC to fix up the
stack alignment in the entry point prologue, <a href="/blog/2023/02/15/#stack-alignment-on-32-bit-x86">just like on Windows</a>.
If you want to access <code class="language-plaintext highlighter-rouge">argc</code>, <code class="language-plaintext highlighter-rouge">argv</code>, and <code class="language-plaintext highlighter-rouge">envp</code> you’ll need <a href="/blog/2022/02/18/">more
assembly</a>. (I wish doing <em>really basic things</em> without libc on Linux
didn’t require so much assembly.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__asm</span> <span class="p">(</span>
    <span class="s">".global _start</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"_start:</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  (%rsp), %edi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   lea   8(%rsp), %rsi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   lea   8(%rsi,%rdi,8), %rdx</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   call  main</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  %eax, %edi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  $60, %eax</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   syscall</span><span class="se">\n</span><span class="s">"</span>
<span class="p">);</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">envp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Getting back to the example usage, it has some regular-looking system call
wrappers. Where do those come from? Start with this 6-argument generic
system call wrapper.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">syscall6</span><span class="p">(</span><span class="kt">long</span> <span class="n">n</span><span class="p">,</span> <span class="kt">long</span> <span class="n">a</span><span class="p">,</span> <span class="kt">long</span> <span class="n">b</span><span class="p">,</span> <span class="kt">long</span> <span class="n">c</span><span class="p">,</span> <span class="kt">long</span> <span class="n">d</span><span class="p">,</span> <span class="kt">long</span> <span class="n">e</span><span class="p">,</span> <span class="kt">long</span> <span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">ret</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r10</span> <span class="n">asm</span><span class="p">(</span><span class="s">"r10"</span><span class="p">)</span> <span class="o">=</span> <span class="n">d</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r8</span>  <span class="n">asm</span><span class="p">(</span><span class="s">"r8"</span><span class="p">)</span>  <span class="o">=</span> <span class="n">e</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r9</span>  <span class="n">asm</span><span class="p">(</span><span class="s">"r9"</span><span class="p">)</span>  <span class="o">=</span> <span class="n">f</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">ret</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="s">"S"</span><span class="p">(</span><span class="n">b</span><span class="p">),</span> <span class="s">"d"</span><span class="p">(</span><span class="n">c</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r10</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r8</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r9</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I could define <code class="language-plaintext highlighter-rouge">syscall5</code>, <code class="language-plaintext highlighter-rouge">syscall4</code>, etc. but instead I’ll just wrap it
in macros. The former would be more efficient since the latter wastes
instructions zeroing registers for no reason, but for now I’m focused on
compacting the implementation source.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define SYSCALL1(n, a) \
    syscall6(n,(long)(a),0,0,0,0,0)
#define SYSCALL2(n, a, b) \
    syscall6(n,(long)(a),(long)(b),0,0,0,0)
#define SYSCALL3(n, a, b, c) \
    syscall6(n,(long)(a),(long)(b),(long)(c),0,0,0)
#define SYSCALL4(n, a, b, c, d) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),0,0)
#define SYSCALL5(n, a, b, c, d, e) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),0)
#define SYSCALL6(n, a, b, c, d, e, f) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),(long)(f))
</span></code></pre></div></div>
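
<p>For comparison, here is roughly what one of the dedicated wrappers might look like. It is a sketch for x86-64 Linux only, and <code class="language-plaintext highlighter-rouge">syscall3</code> and <code class="language-plaintext highlighter-rouge">sys_write</code> are names assumed for this example. With only three arguments there are no <code class="language-plaintext highlighter-rouge">r10</code>, <code class="language-plaintext highlighter-rouge">r8</code>, or <code class="language-plaintext highlighter-rouge">r9</code> inputs, so the compiler has nothing extra to zero.</p>

```c
// x86-64 Linux only. syscall3 is a hypothetical dedicated 3-argument wrapper.
static long syscall3(long n, long a, long b, long c)
{
    long ret;
    __asm volatile (
        "syscall"
        : "=a"(ret)
        : "a"(n), "D"(a), "S"(b), "d"(c)
        : "rcx", "r11", "memory"
    );
    return ret;
}

// Example use: write(2) is system call number 1 on x86-64 Linux.
static long sys_write(int fd, const void *buf, long len)
{
    return syscall3(1, fd, (long)buf, len);
}
```

<p>As with the 6-argument version, the result is the raw kernel return value: a byte count on success or a negative errno on failure.</p>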

<p>Now we can have some exits:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">noreturn</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">exit</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL1</span><span class="p">(</span><span class="n">SYS_exit</span><span class="p">,</span> <span class="n">status</span><span class="p">);</span>
    <span class="n">__builtin_unreachable</span><span class="p">();</span>
<span class="p">}</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">noreturn</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">exit_group</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL1</span><span class="p">(</span><span class="n">SYS_exit_group</span><span class="p">,</span> <span class="n">status</span><span class="p">);</span>
    <span class="n">__builtin_unreachable</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Simplified futex wrappers:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">futex_wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">futex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">expect</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL4</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">futex</span><span class="p">,</span> <span class="n">FUTEX_WAIT</span><span class="p">,</span> <span class="n">expect</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">futex_wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">futex</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL3</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">futex</span><span class="p">,</span> <span class="n">FUTEX_WAKE</span><span class="p">,</span> <span class="mh">0x7fffffff</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And so on.</p>

<p>Finally I can talk about that <code class="language-plaintext highlighter-rouge">newstack</code> function. It’s just a wrapper
around an anonymous memory map allocating pages from the kernel. I’ve
hardcoded the constants for the standard mmap allocation since they’re
nothing special or unusual. The return value check is a little tricky
since a large portion of the negative range is valid, so I only want to
check for a small range of negative errnos. (Allocating an arena looks
basically the same.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="nf">newstack</span><span class="p">(</span><span class="kt">long</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">p</span> <span class="o">=</span> <span class="n">SYSCALL6</span><span class="p">(</span><span class="n">SYS_mmap</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x22</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">p</span> <span class="o">&gt;</span> <span class="o">-</span><span class="mi">4096UL</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="kt">long</span> <span class="n">count</span> <span class="o">=</span> <span class="n">size</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span><span class="p">);</span>
    <span class="k">return</span> <span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span> <span class="o">+</span> <span class="n">count</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
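
<p>The range check generalizes to any raw system call result. As a sketch, with helper names invented for this example: the kernel reports errors as values from -4095 to -1, so anything above <code class="language-plaintext highlighter-rouge">-4096UL</code> as an unsigned value is an error, and everything else, including very high addresses, is a valid result.</p>

```c
// Hypothetical helper: non-zero if r is a raw system call error return.
static int syscall_failed(unsigned long r)
{
    return r > -4096UL;
}

// Hypothetical helper: recover the positive errno from a failed return.
static int syscall_errno(unsigned long r)
{
    return (int)-(long)r;
}
```

<p>For instance, a return of <code class="language-plaintext highlighter-rouge">-12</code> is a failure carrying errno 12, while <code class="language-plaintext highlighter-rouge">-4096</code> reinterpreted as unsigned is treated as a legitimate value.</p>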

<p>The <code class="language-plaintext highlighter-rouge">aligned</code> attribute comes into play here: I treat the result like an
array of <code class="language-plaintext highlighter-rouge">stack_head</code> and return the last element. The attribute ensures
each individual element is aligned.</p>
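
<p>To see why that works, here is the pointer arithmetic in isolation. The struct fields and the 16-byte alignment are assumptions for the sake of the example, and <code class="language-plaintext highlighter-rouge">stack_last</code> is a hypothetical name. Because the attribute rounds <code class="language-plaintext highlighter-rouge">sizeof</code> up to a multiple of the alignment, every element of an aligned base is itself aligned, and the last one ends exactly at the top of the region.</p>

```c
#include <stddef.h>

// Assumed layout for illustration; the real struct holds the thread's data.
struct __attribute((aligned(16))) stack_head {
    void (*entry)(struct stack_head *);
    int join_futex;
};

// Treat [base, base+size) as an array of stack_head and return the last one.
static struct stack_head *stack_last(void *base, size_t size)
{
    size_t count = size / sizeof(struct stack_head);
    return (struct stack_head *)base + count - 1;
}
```

<p>This assumes the size is a multiple of the struct size, which holds for page-sized mappings.</p>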

<p>That’s it! There’s not much to it other than a few thoughtful assembly
instructions. It took doing this a few times in a few different programs
before I noticed how simple it can be.</p>

]]>
    </content>
  </entry>
  
  <entry>
    <title>CRT-free in 2023: tips and tricks</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/02/15/"/>
    <id>urn:uuid:025441bf-084e-4c3e-9a37-269e2ac1a4d6</id>
    <updated>2023-02-15T02:12:00Z</updated>
    <category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Seven years ago I wrote about <a href="/blog/2016/01/31/">“freestanding” Windows executables</a>.
After an additional seven years of practical experience both writing and
distributing such programs, half using <a href="https://github.com/skeeto/w64devkit">a custom-built toolchain</a>,
it’s time to revisit these cabalistic incantations and otherwise scant
details. I’ve tweaked my older article over the years as I’ve learned, but
this is a full replacement and does not assume you’ve read it. The <a href="/blog/2023/02/11/">“why”
has been covered</a> and the focus will be on the “how”. Both the GNU
and MSVC toolchains will be considered.</p>

<p>I no longer call these “freestanding” programs since that term is, at
best, inaccurate. In fact, we will be actively avoiding GCC features
associated with that label. Instead I call these <em>CRT-free</em> programs,
where CRT stands for the <em>C runtime</em>, the Windows-oriented term for
<em>libc</em>. This term communicates both intent and scope.</p>

<h3 id="entry-point">Entry point</h3>

<p>You should already know that <code class="language-plaintext highlighter-rouge">main</code> is not the program’s entry point, but
a C application’s entry point. The CRT provides the entry point, where it
initializes the CRT, including <a href="/blog/2022/02/18/">parsing command line options</a>, then
calls the application’s <code class="language-plaintext highlighter-rouge">main</code>. The real entry point doesn’t have a name.
It’s just the address of the function to be called by the loader without
arguments.</p>

<p>You might naively assume you could continue using the name <code class="language-plaintext highlighter-rouge">main</code> and tell
the linker to use it as the entry point. You would be wrong. <strong>Avoid the
name <code class="language-plaintext highlighter-rouge">main</code>!</strong> It has a special meaning in C and gets special treatment. Using
it without a conventional CRT will confuse your tools and may cause build
issues.</p>

<p>While you can use almost any other name you like, the conventional names
are <code class="language-plaintext highlighter-rouge">mainCRTStartup</code> (console subsystem) and <code class="language-plaintext highlighter-rouge">WinMainCRTStartup</code> (windows
subsystem). It’s easy to remember: Append <code class="language-plaintext highlighter-rouge">CRTStartup</code> to the name you’d
use in a normal CRT-linking application. I strongly recommend using these
names because it reduces friction. Your tools are already familiar with
them, so you won’t need to do anything special.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>     <span class="c1">// console subsystem</span>
<span class="kt">int</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>  <span class="c1">// windows subsystem</span>
</code></pre></div></div>

<p>The MSVC linker documentation says the entry point uses the <code class="language-plaintext highlighter-rouge">__stdcall</code>
calling convention. Ignore this and <strong>do not use <code class="language-plaintext highlighter-rouge">__stdcall</code> for your
entry point!</strong> Since entry points take no arguments, there is no practical
difference from the <code class="language-plaintext highlighter-rouge">__cdecl</code> calling convention, so it does not actually
matter. Rather, the goal is to avoid <code class="language-plaintext highlighter-rouge">__stdcall</code> <em>function decorations</em>.
In particular, the GNU linker <code class="language-plaintext highlighter-rouge">--entry</code> option does not understand them,
nor can it find decorated entry points on its own. If you use <code class="language-plaintext highlighter-rouge">__stdcall</code>,
then the 32-bit GNU linker will silently (!) choose the beginning of your
<code class="language-plaintext highlighter-rouge">.text</code> section as the entry point.</p>

<p>If you’re using C++, then of course you will also need to use <code class="language-plaintext highlighter-rouge">extern "C"</code>
so that it’s not name-mangled. Otherwise the results are similarly bad.</p>

<p>If using <code class="language-plaintext highlighter-rouge">-fwhole-program</code>, you will need to mark your entry point as
externally visible for GCC so that it knows it’s an entry point. While
linkers are familiar with conventional entry point names, GCC the
<em>compiler</em> is not. Normally you do not need to worry about this.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">externally_visible</span><span class="p">))</span>  <span class="c1">// for -fwhole-program</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The entry point returns <code class="language-plaintext highlighter-rouge">int</code>. <em>If there are no other threads</em> then the
process will exit with the returned value as its exit status. In practice
this is only useful for console programs. Windows subsystem programs have
threads started automatically, without warning, and it’s almost certain
your main thread is not the last thread. You probably want to use
<code class="language-plaintext highlighter-rouge">ExitProcess</code> or even <code class="language-plaintext highlighter-rouge">TerminateProcess</code> instead of returning. The latter
exits more abruptly and can avoid issues with certain subsystems, like
DirectSound, not shutting down gracefully: It doesn’t even let them try.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">TerminateProcess</span><span class="p">(</span><span class="n">GetCurrentProcess</span><span class="p">(),</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="compilation">Compilation</h3>

<p>Starting with the GNU toolchain, you have two ways to get into “CRT-free
mode”: <code class="language-plaintext highlighter-rouge">-nostartfiles</code> and <code class="language-plaintext highlighter-rouge">-nostdlib</code>. The former is more dummy-proof,
and it’s what I use in build documentation. The latter can be more
complicated, but when it succeeds you get guarantees about the result. I
use it in build scripts I intend to run myself, which I want to fail if
they don’t do exactly what I expect. To illustrate, consider this trivial
program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">ExitProcess</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This program uses <code class="language-plaintext highlighter-rouge">ExitProcess</code> from <code class="language-plaintext highlighter-rouge">kernel32.dll</code>. Compiling is easy:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostartfiles example.c
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-nostartfiles</code> prevents it from linking the CRT entry point, but it
still implicitly passes other “standard” linker flags, including libraries
<code class="language-plaintext highlighter-rouge">-lmingw32</code> and <code class="language-plaintext highlighter-rouge">-lkernel32</code>. Programs can use <code class="language-plaintext highlighter-rouge">kernel32.dll</code> functions
without explicitly linking that DLL. But, hey, isn’t <code class="language-plaintext highlighter-rouge">-lmingw32</code> the CRT,
the thing we’re avoiding? It is, but it wasn’t actually linked because the
program didn’t reference it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -p a.exe | grep -Fi .dll
        DLL Name: KERNEL32.dll
</code></pre></div></div>

<p>However, <code class="language-plaintext highlighter-rouge">-nostdlib</code> does not pass any of these libraries, so you need to
do so explicitly.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostdlib example.c -lkernel32
</code></pre></div></div>

<p>The MSVC toolchain behaves a little like <code class="language-plaintext highlighter-rouge">-nostartfiles</code>, not linking a
CRT unless you need it, semi-automatically. However, you’ll need to list
<code class="language-plaintext highlighter-rouge">kernel32.dll</code> and tell it which subsystem you’re using.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl example.c /link /subsystem:console kernel32.lib
</code></pre></div></div>

<p>However, MSVC has a handy little feature to list these arguments in the
source file.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef _MSC_VER
</span>  <span class="cp">#pragma comment(linker, "/subsystem:console")
</span>  <span class="cp">#pragma comment(lib, "kernel32.lib")
#endif
</span></code></pre></div></div>

<p>This information must go somewhere, and I prefer the source file rather
than a build script. Then anyone can point MSVC at the source without
worrying about options.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl example.c
</code></pre></div></div>

<p>I try to make all my Windows programs so simply built.</p>

<h3 id="stack-probes">Stack probes</h3>

<p>On Windows, it’s expected that stacks will commit dynamically. That is,
the stack is merely <em>reserved</em> address space, and it’s only committed when
the stack actually grows into it. This made sense 30 years ago as a memory
saving technique, but today it no longer makes sense. However, programs
are still built to use this mechanism.</p>

<p>To function properly, programs must touch each stack page for the first
time in order. Normally that’s not an issue, but if your stack frame
exceeds the page size, there’s a chance it might step over a page. When a
function has a large stack frame, GCC inserts a call to a “stack probe” in
<code class="language-plaintext highlighter-rouge">libgcc</code> that touches its pages in the prologue. It’s not unlike <a href="/blog/2017/06/21/">stack
clash protection</a>.</p>
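
<p>Conceptually a probe is just a loop. The sketch below is not the <code class="language-plaintext highlighter-rouge">libgcc</code> routine, only a portable model of it with an assumed 4096-byte page size and a made-up name, <code class="language-plaintext highlighter-rouge">probe_down</code>. It accesses one byte per page, highest address first, which is the order a downward-growing stack needs.</p>

```c
#include <stddef.h>

#define PAGE 4096  // assumed page size for this illustration

// Access one byte in each page of the len bytes below top, from the highest
// address down. Returns the number of accesses made.
static size_t probe_down(volatile char *top, size_t len)
{
    size_t touched = 0;
    while (len >= PAGE) {
        top -= PAGE;
        len -= PAGE;
        (void)*top;     // volatile read forces a real memory access
        touched++;
    }
    if (len) {
        top -= len;     // final partial page
        (void)*top;
        touched++;
    }
    return touched;
}
```

<p>A frame smaller than a page needs at most one access, which is why small functions never call a probe at all.</p>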

<p>For example, if I have a 4kiB local variable:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">12</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When I compile with <code class="language-plaintext highlighter-rouge">-nostdlib</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostdlib example.c
ld: ... undefined reference to `___chkstk_ms'
</code></pre></div></div>

<p>It’s trying to link the CRT stack probe. You can disable this behavior
with <code class="language-plaintext highlighter-rouge">-mno-stack-arg-probe</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -mno-stack-arg-probe -nostdlib example.c
</code></pre></div></div>

<p>Or you can just link <code class="language-plaintext highlighter-rouge">-lgcc</code> to provide a definition:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostdlib example.c -lgcc
</code></pre></div></div>

<p>Had you used <code class="language-plaintext highlighter-rouge">-nostartfiles</code>, you wouldn’t have noticed because it passes
<code class="language-plaintext highlighter-rouge">-lgcc</code> automatically. It’s “dummy-proof” because this sort of issue goes
away before it comes up, though for the same reason it’s harder to tell
exactly what went into a program.</p>

<p>If you disable the probe altogether — my preference — you’ve only solved
the linker problem, but the underlying stack commit problem remains and
your program may crash. You can solve that by telling the linker to ask
the loader to commit a larger stack up front rather than grow it at run
time. Say, 2MiB:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -mno-stack-arg-probe -Xlinker --stack=0x200000,0x200000 example.c
</code></pre></div></div>

<p>Of course, I wish that this was simply the default behavior because it’s
far more sensible! Another option is to avoid large stack frames in the
first place. Allocate locals larger than 4kiB in, say, a scratch arena
instead of on the stack.</p>
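
<p>A scratch arena does not need to be elaborate. Here is a minimal sketch, assuming a caller-supplied buffer and 16-byte alignment; the <code class="language-plaintext highlighter-rouge">arena</code> type and <code class="language-plaintext highlighter-rouge">arena_alloc</code> are hypothetical names for this example. A function that wanted a large local buffer takes an arena and allocates from it, keeping its own stack frame small.</p>

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical minimal arena: a bump allocator over a caller-supplied buffer.
typedef struct {
    char *beg;
    char *end;
} arena;

// Returns size bytes aligned to 16, or a null pointer if the arena is full.
static void *arena_alloc(arena *a, size_t size)
{
    size_t pad   = -(uintptr_t)a->beg & 15;    // bytes to reach alignment
    size_t avail = (size_t)(a->end - a->beg);
    if (avail < pad || size > avail - pad) {
        return 0;
    }
    void *p = a->beg + pad;
    a->beg += pad + size;
    return p;
}
```

<p>Copying the <code class="language-plaintext highlighter-rouge">arena</code> struct before a call and discarding the copy afterward frees everything allocated in between, which is what makes it suitable for scratch space.</p>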

<p>MSVC doesn’t have <code class="language-plaintext highlighter-rouge">libgcc</code> of course, but it still generates stack probes
both for growing the stack and for security checks. The latter requires
<code class="language-plaintext highlighter-rouge">kernel32.dll</code>, so if I compile the same program with MSVC, I get a bunch
of linker failures:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl example.c /link /subsystem:console
... unresolved external symbol __imp_RtlCaptureContext ...
... and 7 more ...
</code></pre></div></div>

<p>Using <code class="language-plaintext highlighter-rouge">/Gs1000000000</code> turns off the stack probes, <code class="language-plaintext highlighter-rouge">/GS-</code> turns off the
checks, <code class="language-plaintext highlighter-rouge">/stack</code> commits a larger stack:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /GS- /Gs1000000000 example.c /link
     /subsystem:console /stack:0x200000,0x200000
</code></pre></div></div>

<p>Though, as before, you could also avoid large stack frames in the first
place.</p>

<h3 id="built-in-functions-ugh">Built-in functions… ugh</h3>

<p>The three major C and C++ compilers — GCC, MSVC, Clang — share a common,
evil weakness: “built-in” functions. <em>No matter what</em>, they each assume
you will supply definitions for standard string functions at link time,
particularly <code class="language-plaintext highlighter-rouge">memset</code> and <code class="language-plaintext highlighter-rouge">memcpy</code>. They do this no matter how many
“seriously now, do not use standard C functions” options you pass. When
you don’t link a CRT, you may need to define them yourself.</p>

<p>In case that sounds easy, there’s a catch-22: The compiler will transform
your <code class="language-plaintext highlighter-rouge">memset</code> definition — that is, <em>in a function named <code class="language-plaintext highlighter-rouge">memset</code></em> — into
a call to itself. After all, it looks an awful lot like <code class="language-plaintext highlighter-rouge">memset</code>! This
typically manifests as an infinite loop. This will even compile and
<em>appear</em> to work until your program hangs. It’s amazing that each of the
major compilers has this crummy behavior.</p>

<p>No matter what you may have read, <strong><code class="language-plaintext highlighter-rouge">-fno-builtin</code> is not a solution</strong>.
It’s merely a sometimes-honored request, and both GCC and Clang will
continue inserting calls to built-in functions you said do not exist. For
example, making an especially large local variable (and using <code class="language-plaintext highlighter-rouge">volatile</code>
to prevent it from being optimized out):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">volatile</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As of this writing, the latest GCC and Clang will generate a <code class="language-plaintext highlighter-rouge">memset</code> call
despite <code class="language-plaintext highlighter-rouge">-fno-builtin</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -mno-stack-arg-probe -fno-builtin -nostdlib example.c
ld: ... undefined reference to `memset' ...
</code></pre></div></div>

<p>If you want to be absolutely pure, you will need to address this in just
about any non-trivial program. On the other hand, <code class="language-plaintext highlighter-rouge">-nostartfiles</code> will
grab a definition from <code class="language-plaintext highlighter-rouge">msvcrt.dll</code> for you:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostartfiles example.c
$ objdump -p a.exe | grep -Fi .dll
        DLL Name: msvcrt.dll
</code></pre></div></div>

<p>To be clear, <em>this is a completely legitimate and pragmatic route!</em> You
get the benefits of both worlds: the CRT is still out of the way, but
there’s also no hassle from misbehaving compilers. If this sounds like a
good deal, then do it! (For on-lookers feeling smug: there is no such
easy, general solution for this problem on Linux.)</p>

<p>But me, I want that CRT-free purity, damnit! There are a few options.
Option 1, make it unoptimizable. Here I’ve added fake-out inline assembly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">memset</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">p</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kr">__asm</span><span class="p">(</span><span class="s">""</span><span class="p">);</span>
        <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">c</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Alternatively use <code class="language-plaintext highlighter-rouge">volatile</code>. The downside is your program may be slower
since it prevents optimizations you <em>do</em> want. Option 2, disable the
particular troublesome optimization.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">optimize</span><span class="p">(</span><span class="s">"no-tree-loop-distribute-patterns"</span><span class="p">)))</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">memset</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Or for the whole program:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -fno-tree-loop-distribute-patterns ...
</code></pre></div></div>

<p>But will that work reliably in the future? Option 3, implement it with
<a href="https://www.felixcloutier.com/documents/gcc-asm.html">inline assembly</a> since it’s opaque to optimization.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">memset</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">r</span> <span class="o">=</span> <span class="n">d</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"rep stosb"</span>
        <span class="o">:</span> <span class="s">"=D"</span><span class="p">(</span><span class="n">d</span><span class="p">),</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">c</span><span class="p">),</span> <span class="s">"=c"</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"0"</span><span class="p">(</span><span class="n">d</span><span class="p">),</span> <span class="s">"1"</span><span class="p">(</span><span class="n">c</span><span class="p">),</span> <span class="s">"2"</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Normally this option could be severe since you’d need assembly for every
target architecture, but Windows (currently) supports few architectures.
You probably only care about x86 and x64, and the inline assembly above is
a polyglot! Important: Be wary of copy-pasting such inline assembly from
Stack Overflow because it’s often wrong.</p>

<p>Regardless, I suggest putting each definition in its own section so that
they can be discarded via <code class="language-plaintext highlighter-rouge">-Wl,--gc-sections</code> when unused:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">section</span><span class="p">(</span><span class="s">".text.memset"</span><span class="p">)))</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">memset</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In the past I’ve needed to provide definitions for <code class="language-plaintext highlighter-rouge">memcmp</code>, <code class="language-plaintext highlighter-rouge">memset</code>,
<code class="language-plaintext highlighter-rouge">memcpy</code>, <code class="language-plaintext highlighter-rouge">memmove</code>, and even <code class="language-plaintext highlighter-rouge">strlen</code>.</p>
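
<p>For a sense of what such definitions involve, here is a minimal sketch of
three of them using the Option 1 barrier trick. The <code class="language-plaintext highlighter-rouge">x</code>-prefixed names are
hypothetical, chosen so the sketch can be compiled and tested alongside a
normal libc; a CRT-free build would use the standard names. It assumes a
GCC-compatible compiler and is an illustration, not a tuned implementation.</p>

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical names: the "x" prefix avoids colliding with a hosted libc
// while testing. In a CRT-free build, drop the prefix.

void *xmemmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if ((uintptr_t)d < (uintptr_t)s) {
        // Destination begins before source: copy forward.
        for (size_t i = 0; i < n; i++) {
            __asm("");  // barrier, as in Option 1, to block idiom recognition
            d[i] = s[i];
        }
    } else {
        // Destination may overlap the tail of source: copy backward.
        for (size_t i = n; i > 0; i--) {
            __asm("");
            d[i-1] = s[i-1];
        }
    }
    return dst;
}

int xmemcmp(const void *a, const void *b, size_t n)
{
    const unsigned char *x = a;
    const unsigned char *y = b;
    for (size_t i = 0; i < n; i++) {
        __asm("");
        if (x[i] != y[i]) {
            return x[i] < y[i] ? -1 : +1;
        }
    }
    return 0;
}

size_t xstrlen(const char *s)
{
    size_t n = 0;
    while (s[n]) {
        __asm("");
        n++;
    }
    return n;
}
```

<p>The only subtle part is <code class="language-plaintext highlighter-rouge">xmemmove</code>: the copy direction depends on which
buffer starts first, so overlapping regions are never clobbered before
they are read.</p>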

<p>Unfortunately the MSVC situation is mostly worse. When it inserts such a
CRT call it will not automatically pick up a CRT like <code class="language-plaintext highlighter-rouge">-nostartfiles</code>.
There’s no inline assembly, and it’s harder to selectively disable the
troublesome optimizations. Instead I’ve been using intrinsics like
<code class="language-plaintext highlighter-rouge">__stosb</code>. MSVC has a larger variety of them, which makes up a bit for its
lack of inline assembly.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#pragma function(memset)
</span><span class="kt">void</span> <span class="o">*</span><span class="nf">memset</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">__stosb</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">d</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I don’t quite understand the purpose of the <code class="language-plaintext highlighter-rouge">#pragma</code>, but this works.</p>

<h3 id="stack-alignment-on-32-bit-x86">Stack alignment on 32-bit x86</h3>

<p>GCC expects a 16-byte aligned stack and generates code accordingly. Such
is dictated by the x64 ABI, so that’s a given on 64-bit Windows. However,
the x86 ABIs only guarantee 4-byte alignment. If no care is taken to deal
with it, there will likely be unaligned loads. Some may not be valid (e.g.
SIMD) leading to a crash. UBSan disapproves, too. Fortunately there’s a
function attribute for this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>GCC will now align the stack in this function’s prologue. Adjustment is
only necessary at entry points, as GCC will maintain alignment through its
own frames. This includes <em>all</em> entry points, not just the program entry
point, particularly thread start functions. Rule of thumb for i686 GCC:
<strong>If <code class="language-plaintext highlighter-rouge">WINAPI</code> or <code class="language-plaintext highlighter-rouge">__stdcall</code> appears in a definition, the stack likely
requires alignment</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="n">DWORD</span> <span class="n">WINAPI</span> <span class="nf">mythread</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s harmless to use this attribute on x64. The prologue will just be a
smidge larger. If you’re worried about it, use <code class="language-plaintext highlighter-rouge">#ifdef __i686__</code> to limit
it to 32-bit builds.</p>

<h3 id="putting-it-all-together">Putting it all together</h3>

<p>If I’ve written a graphical application with <code class="language-plaintext highlighter-rouge">WinMainCRTStartup</code>, used
large stack frames, marked my entry point as externally visible, plan to
support 32-bit builds, and defined a couple of needed string functions, my
optimal entry point may look something like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef __GNUC__
</span><span class="n">__attribute</span><span class="p">((</span><span class="n">externally_visible</span><span class="p">))</span>
<span class="cp">#endif
#ifdef __i686__
</span><span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="cp">#endif
</span><span class="kt">int</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then my “optimize all the things” release build may look something like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -mno-stack-arg-probe -Xlinker --stack=0x200000,0x200000
     -O3 -fwhole-program -Wl,--gc-sections -s -nostdlib -mwindows
     -fno-asynchronous-unwind-tables -o app.exe app.c -lkernel32
</code></pre></div></div>

<p>Or with MSVC:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /O2 /GS- /Gs1000000000 app.c /link kernel32.lib
     /subsystem:windows /stack:0x200000,0x200000
</code></pre></div></div>

<p>Or if I’m taking it easy maybe just:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -O3 -s -nostartfiles -mwindows -o app.exe app.c
</code></pre></div></div>

<p>Or with MSVC (linker flags in source):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /O2 app.c
</code></pre></div></div>

]]>
    </content>
  </entry>
  
  <entry>
    <title>Let's implement buffered, formatted output</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/02/13/"/>
    <id>urn:uuid:4a4af83f-4fd8-4b3b-99aa-089d01f90fad</id>
    <updated>2023-02-13T00:00:00Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://old.reddit.com/r/C_Programming/comments/111238u/lets_implement_buffered_formatted_output/">on reddit</a>.</em></p>

<p>When <a href="/blog/2023/02/11/">not using the C standard library</a>, how does one deal with
formatted output? Re-implementing the entirety of <code class="language-plaintext highlighter-rouge">printf</code> from scratch
seems like a lot of work, and indeed it would be. Fortunately it’s rarely
necessary. With the right mindset, and considering your program’s <em>actual</em>
formatting needs, it’s not as difficult as it might appear. Since it goes
hand-in-hand with buffering, I’ll cover both topics at once, including
<code class="language-plaintext highlighter-rouge">sprintf</code>-like capabilities, which is where we’ll start.</p>

<!--more-->

<h3 id="the-print-is-append-mindset">The print-is-append mindset</h3>

<p>Buffering amortizes the costs of write (and read) system calls. Many small
writes are queued via the buffer into a few large writes. This isn’t just
an implementation detail. It’s key in the mindset to tackle formatted
output: <strong>Printing is appending.</strong></p>

<p>The mindset includes the reverse: <em>Appending is like printing</em>. Consider
this next time you reach for <code class="language-plaintext highlighter-rouge">strcat</code> or similar. Is this the appropriate
destination for this data, or am I just going to print it — i.e. append it
to another, different buffer — afterward?</p>

<p>This concept may sound obvious, but consider that there are major, popular
programming paradigms where the norm is otherwise. I’ll pick on Python to
illustrate, but it’s not alone.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"found </span><span class="si">{</span><span class="n">count</span><span class="si">}</span><span class="s"> items"</span><span class="p">)</span>
</code></pre></div></div>

<p>This line of code allocates a buffer; formats the value of the variable
<code class="language-plaintext highlighter-rouge">count</code> into it; allocates a second buffer; copies into it the prefix
(<code class="language-plaintext highlighter-rouge">"found "</code>), the first buffer, and the suffix (<code class="language-plaintext highlighter-rouge">" items"</code>); copies the
contents of this second buffer into the standard output buffer; then
discards the two temporary buffers. To see for yourself, use the <a href="https://docs.python.org/3/library/dis.html">CPython
bytecode disassembler</a> on it. (It <em>is</em> pretty neat that string
formatting is partially implemented in the compiler and partially parsed
at compile time.)</p>

<p>With the print-is-append mindset, you know it’s ultimately being copied
into the standard output buffer, and that you can skip the intermediate
appending and copying. Avoiding that pessimization isn’t just about the
computer’s time, it’s even more about saving your own time implementing
formatted output.</p>

<p>In C that line looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">printf</span><span class="p">(</span><span class="s">"found %d items</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
</code></pre></div></div>

<p>The format string is a domain-specific language (DSL) that is (usually)
parsed and evaluated at run time. In essence it’s a little program that
says:</p>

<ol>
  <li>Append <code class="language-plaintext highlighter-rouge">"found "</code> to the output buffer</li>
  <li>Format the given integer into the output buffer</li>
  <li>Append <code class="language-plaintext highlighter-rouge">" items\n"</code> to the output buffer</li>
</ol>

<p>For <code class="language-plaintext highlighter-rouge">sprintf</code> the output buffer is caller-supplied instead of a buffered
stream.</p>

<p>In this implementation we’re going to skip the DSL and express such
“format programs” in C itself. It’s more verbose at the call site, but it
simplifies the implementation. As a bonus, it’s also faster since the
format program is itself compiled by the C compiler. In your own formatted
output implementation you could write a <code class="language-plaintext highlighter-rouge">printf</code> that, following the
format string, calls the append primitives we’ll build below.</p>

<h3 id="buffer-implementation">Buffer implementation</h3>

<p>Let’s begin by defining an output buffer. An output buffer tracks the
total capacity and how much has been written. I’ll include a sticky error
flag to simplify error checks. For a first pass we’ll start with a
<code class="language-plaintext highlighter-rouge">sprintf</code> rather than full-blown <code class="language-plaintext highlighter-rouge">printf</code> because there’s nowhere yet for
the data to go.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define MEMBUF(buf, cap) {buf, cap, 0, 0}
</span><span class="k">struct</span> <span class="n">buf</span> <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">cap</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">_Bool</span> <span class="n">error</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>I’m using <code class="language-plaintext highlighter-rouge">unsigned char</code> since these are <em>bytes</em>, best understood as
unsigned (0–255), particularly important when dealing with encodings. I
also wrote a “constructor” macro, <code class="language-plaintext highlighter-rouge">MEMBUF</code>, to help with initialization.
Next we need a function to append bytes — the core operation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">append</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">src</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">avail</span> <span class="o">=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">cap</span> <span class="o">-</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">amount</span> <span class="o">=</span> <span class="n">avail</span><span class="o">&lt;</span><span class="n">len</span> <span class="o">?</span> <span class="n">avail</span> <span class="o">:</span> <span class="n">len</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">amount</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">[</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">+</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">src</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">+=</span> <span class="n">amount</span><span class="p">;</span>
    <span class="n">b</span><span class="o">-&gt;</span><span class="n">error</span> <span class="o">|=</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If there isn’t room, it copies as much as possible and sets the error
flag to indicate truncation. It doesn’t return the error. Rather than
check after each append, the caller will check after multiple appends,
effectively batching the checks into one check. The typical, expected case
is that there is no error, so make that path fast.</p>

<p>Since it’s an easy point to miss: <code class="language-plaintext highlighter-rouge">append</code> is the only place in the entire
implementation where bounds checking comes into play. Everything else can
confidently throw bytes at the buffer without worrying if it fits. If
it doesn’t, the sticky error flag will indicate such at a more appropriate
time.</p>

<p>I could have used <code class="language-plaintext highlighter-rouge">memcpy</code> for the loop, but the goal is not to use libc.
Besides, not using <code class="language-plaintext highlighter-rouge">memcpy</code> means we can pass a null pointer without
making it a special exception.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">append</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// append nothing (no-op)</span>
</code></pre></div></div>

<p>I expect that static strings are common sources for append, so I’ll add a
helper macro which gets the length as a compile-time constant. The null
terminator will not be used.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define APPEND_STR(b, s) append(b, s, sizeof(s)-1)
</span></code></pre></div></div>

<p>If that’s not clear yet, it will be once you see an example. It’s also
useful to append single bytes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">append_byte</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">append</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">c</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With primitive appends done, we can build ever “higher-level” appends. For
example, to append a formatted <code class="language-plaintext highlighter-rouge">long</code> to the buffer:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">append_long</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">tmp</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">end</span> <span class="o">=</span> <span class="n">tmp</span> <span class="o">+</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">tmp</span><span class="p">);</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span> <span class="o">=</span> <span class="n">end</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">t</span> <span class="o">=</span> <span class="n">x</span><span class="o">&gt;</span><span class="mi">0</span> <span class="o">?</span> <span class="o">-</span><span class="n">x</span> <span class="o">:</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="o">*--</span><span class="n">beg</span> <span class="o">=</span> <span class="sc">'0'</span> <span class="o">-</span> <span class="n">t</span><span class="o">%</span><span class="mi">10</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">t</span> <span class="o">/=</span> <span class="mi">10</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*--</span><span class="n">beg</span> <span class="o">=</span> <span class="sc">'-'</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">append</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">beg</span><span class="p">,</span> <span class="n">end</span><span class="o">-</span><span class="n">beg</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>By working from the negative end — recall that the negative range is
larger than the positive — it supports the full range of signed <code class="language-plaintext highlighter-rouge">long</code>,
whatever it happens to be on this host. With fewer than 50 lines of code we
now have enough to format the example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">message</span><span class="p">[</span><span class="mi">256</span><span class="p">];</span>
<span class="k">struct</span> <span class="n">buf</span> <span class="n">b</span> <span class="o">=</span> <span class="n">MEMBUF</span><span class="p">(</span><span class="n">message</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">message</span><span class="p">));</span>

<span class="n">APPEND_STR</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="s">"found "</span><span class="p">);</span>
<span class="n">append_long</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
<span class="n">APPEND_STR</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="s">" items</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// truncated</span>
<span class="p">}</span>
</code></pre></div></div>
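
<p>To see the sticky error flag at work, here is a small self-contained
sketch. It repeats <code class="language-plaintext highlighter-rouge">struct buf</code> and <code class="language-plaintext highlighter-rouge">append</code> from above, with two
assumptions of its own: a <code class="language-plaintext highlighter-rouge">const</code>-qualified source pointer and a cast inside
<code class="language-plaintext highlighter-rouge">APPEND_STR</code> to quiet pointer-sign warnings. The <code class="language-plaintext highlighter-rouge">greet</code> helper is
hypothetical.</p>

```c
struct buf {
    unsigned char *buf;
    int cap;
    int len;
    _Bool error;
};

void append(struct buf *b, const unsigned char *src, int len)
{
    int avail = b->cap - b->len;
    int amount = avail<len ? avail : len;
    for (int i = 0; i < amount; i++) {
        b->buf[b->len+i] = src[i];
    }
    b->len += amount;
    b->error |= amount < len;  // sticky: never cleared by later appends
}

// Assumption: cast added so string literals match the unsigned char source.
#define APPEND_STR(b, s) append(b, (const unsigned char *)(s), sizeof(s)-1)

// Hypothetical helper: three appends, one error check at the end.
// Returns 1 if everything fit, 0 if the output was truncated.
int greet(struct buf *b)
{
    APPEND_STR(b, "hello");
    APPEND_STR(b, ", ");
    APPEND_STR(b, "world\n");
    return !b->error;
}
```

<p>With an 8-byte buffer the third append is cut short and <code class="language-plaintext highlighter-rouge">greet</code> reports
failure; with a roomier buffer all 13 bytes land and it reports success.
Either way the caller checks exactly once.</p>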

<p>We can continue defining append functions for whatever types we need.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">append_ptr</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">APPEND_STR</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="s">"0x"</span><span class="p">);</span>
    <span class="kt">uintptr_t</span> <span class="n">u</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">p</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">2</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="n">u</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">append_byte</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="s">"0123456789abcdef"</span><span class="p">[(</span><span class="n">u</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="mi">4</span><span class="o">*</span><span class="n">i</span><span class="p">))</span><span class="o">&amp;</span><span class="mi">15</span><span class="p">]);</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">vec2</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">;</span> <span class="p">};</span>

<span class="kt">void</span> <span class="nf">append_vec2</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="k">struct</span> <span class="n">vec2</span> <span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">APPEND_STR</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="s">"vec2{"</span><span class="p">);</span>
    <span class="n">append_long</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">v</span><span class="p">.</span><span class="n">x</span><span class="p">);</span>
    <span class="n">APPEND_STR</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="s">", "</span><span class="p">);</span>
    <span class="n">append_long</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">v</span><span class="p">.</span><span class="n">y</span><span class="p">);</span>
    <span class="n">append_byte</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="sc">'}'</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Perhaps you want features like field width? Add a parameter for it… but
only if you need it!</p>
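
<p>As a rough sketch of what that could look like, here is a hypothetical
<code class="language-plaintext highlighter-rouge">append_long_width</code> that right-justifies a number within a minimum
field width. The function name, the space padding, and the condensed
<code class="language-plaintext highlighter-rouge">struct buf</code> and <code class="language-plaintext highlighter-rouge">append</code> stand-ins are assumptions made so the example
is self-contained.</p>

```c
// Condensed stand-ins for the truncating buffer (assumed shape).
struct buf {
    unsigned char *buf;
    int cap;
    int len;
    _Bool error;
};

static void append(struct buf *b, unsigned char *src, int len)
{
    int avail = b->cap - b->len;
    int amount = avail<len ? avail : len;
    for (int i = 0; i < amount; i++) {
        b->buf[b->len+i] = src[i];
    }
    b->len += amount;
    b->error |= amount < len;  // truncated
}

// Hypothetical variant of append_long: right-justify the number in at
// least `width` columns, padding on the left with spaces.
static void append_long_width(struct buf *b, long x, int width)
{
    unsigned char tmp[64];
    unsigned char *end = tmp + sizeof(tmp);
    unsigned char *beg = end;
    long t = x>0 ? -x : x;  // work in negatives so the minimum long is safe
    do {
        *--beg = (unsigned char)('0' - t%10);
    } while (t /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    int len = (int)(end - beg);
    for (int i = len; i < width; i++) {
        unsigned char space = ' ';
        append(b, &space, 1);
    }
    append(b, beg, len);
}
```

<p>A width smaller than the number’s length adds no padding, so the value is
never cut short by the width itself.</p>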

<h3 id="float-formatting">Float formatting</h3>

<p>As mentioned before, <a href="https://netlib.org/fp/dtoa.c">precise float formatting is challenging</a>
because it’s full of edge cases. However, if you only need to output a
simple format at reduced precision, it’s not difficult. To illustrate,
this nearly matches <code class="language-plaintext highlighter-rouge">%f</code>, built atop <code class="language-plaintext highlighter-rouge">append_long</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">append_double</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">double</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">prec</span> <span class="o">=</span> <span class="mi">1000000</span><span class="p">;</span>  <span class="c1">// i.e. 6 decimals</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">append_byte</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="sc">'-'</span><span class="p">);</span>
        <span class="n">x</span> <span class="o">=</span> <span class="o">-</span><span class="n">x</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">x</span> <span class="o">+=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">5</span> <span class="o">/</span> <span class="n">prec</span><span class="p">;</span>  <span class="c1">// round last decimal</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;=</span> <span class="p">(</span><span class="kt">double</span><span class="p">)(</span><span class="o">-</span><span class="mi">1UL</span><span class="o">&gt;&gt;</span><span class="mi">1</span><span class="p">))</span> <span class="p">{</span>  <span class="c1">// out of long range?</span>
        <span class="n">APPEND_STR</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="s">"inf"</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="kt">long</span> <span class="n">integral</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
        <span class="kt">long</span> <span class="n">fractional</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">integral</span><span class="p">)</span><span class="o">*</span><span class="n">prec</span><span class="p">;</span>
        <span class="n">append_long</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">integral</span><span class="p">);</span>
        <span class="n">append_byte</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="sc">'.'</span><span class="p">);</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">long</span> <span class="n">i</span> <span class="o">=</span> <span class="n">prec</span><span class="o">/</span><span class="mi">10</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">/=</span> <span class="mi">10</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">&gt;</span> <span class="n">fractional</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">append_byte</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="sc">'0'</span><span class="p">);</span>
            <span class="p">}</span>
        <span class="p">}</span>
        <span class="n">append_long</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">fractional</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="output-to-a-handle">Output to a handle</h3>

<p>So far this writes output to a buffer and truncates when it runs out of
space. Usually we want output to go to a sink: a kernel object such as a
file, pipe, or socket, to which we hold a handle like a file descriptor.
Instead of truncating, we <em>flush</em> the buffer to this sink, at which
point there’s room for more output. The error flag is set if the flush
fails, but otherwise this is the same concept as before.</p>

<p>In these examples I will use a file descriptor <code class="language-plaintext highlighter-rouge">int</code>, but you can use
whatever sort of handle is appropriate. I’ll add an <code class="language-plaintext highlighter-rouge">fd</code> field to the
buffer and a new constructor macro:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define MEMBUF(buf, cap) {buf, cap, 0, -1, 0}
#define FDBUF(fd, buf, cap) {buf, cap, 0, fd, 0}
</span>
<span class="k">struct</span> <span class="n">buf</span> <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">cap</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">fd</span><span class="p">;</span>
    <span class="kt">_Bool</span> <span class="n">error</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The buffered stream will be polymorphic: Output can go to a memory buffer
or to an operating system handle using the same append interface. This is
a handy feature standard C doesn’t even have, though POSIX does in the
form of <a href="https://man7.org/linux/man-pages/man3/fmemopen.3.html"><code class="language-plaintext highlighter-rouge">fmemopen</code></a>. Nothing else changes except <code class="language-plaintext highlighter-rouge">append</code>,
which, if given a valid handle, will flush when full. Attempting to flush
a memory buffer sets the error flag.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">_Bool</span> <span class="nf">os_write</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>

<span class="kt">void</span> <span class="nf">flush</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">b</span><span class="o">-&gt;</span><span class="n">error</span> <span class="o">|=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">fd</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">error</span> <span class="o">&amp;&amp;</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">b</span><span class="o">-&gt;</span><span class="n">error</span> <span class="o">|=</span> <span class="o">!</span><span class="n">os_write</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">fd</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">);</span>
        <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’ve arranged it so that output stops when there’s an error. Also I’m using a
hypothetical <code class="language-plaintext highlighter-rouge">os_write</code> in the platform layer as a full, unbuffered write.
Note that unix <code class="language-plaintext highlighter-rouge">write(2)</code> may write fewer bytes than requested, a partial
write, and so must be used in a loop. Win32 <code class="language-plaintext highlighter-rouge">WriteFile</code> doesn’t have partial writes, so on Windows an
<code class="language-plaintext highlighter-rouge">os_write</code> could pass its arguments directly to the operating system.</p>
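
<p>For illustration, a POSIX <code class="language-plaintext highlighter-rouge">os_write</code> might look something like the
following. It’s only a sketch: retrying on <code class="language-plaintext highlighter-rouge">EINTR</code> is my own assumption
about the desired behavior, and a real platform layer may want to treat
other conditions differently.</p>

```c
#include <errno.h>
#include <unistd.h>

// Full, unbuffered write: loop until every byte has been written or an
// unrecoverable error occurs. Returns 1 on success, 0 on failure.
_Bool os_write(int fd, void *buf, int len)
{
    unsigned char *p = buf;
    while (len > 0) {
        ssize_t r = write(fd, p, (size_t)len);
        if (r < 0) {
            if (errno == EINTR) {
                continue;  // interrupted before writing anything: retry
            }
            return 0;
        }
        p += r;          // partial write: advance past what was accepted
        len -= (int)r;
    }
    return 1;
}
```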

<p>The program will need to call <code class="language-plaintext highlighter-rouge">flush</code> directly when it’s done writing
output, or to display output early, e.g. line buffering. In <code class="language-plaintext highlighter-rouge">append</code> we’ll
use a loop to continue appending and flushing until the input is consumed
or an error occurs.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">append</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">src</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">end</span> <span class="o">=</span> <span class="n">src</span> <span class="o">+</span> <span class="n">len</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">error</span> <span class="o">&amp;&amp;</span> <span class="n">src</span><span class="o">&lt;</span><span class="n">end</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">left</span> <span class="o">=</span> <span class="n">end</span> <span class="o">-</span> <span class="n">src</span><span class="p">;</span>
        <span class="kt">int</span> <span class="n">avail</span> <span class="o">=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">cap</span> <span class="o">-</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
        <span class="kt">int</span> <span class="n">amount</span> <span class="o">=</span> <span class="n">avail</span><span class="o">&lt;</span><span class="n">left</span> <span class="o">?</span> <span class="n">avail</span> <span class="o">:</span> <span class="n">left</span><span class="p">;</span>

        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">amount</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">[</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">+</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">src</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">+=</span> <span class="n">amount</span><span class="p">;</span>
        <span class="n">src</span> <span class="o">+=</span> <span class="n">amount</span><span class="p">;</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">amount</span> <span class="o">&lt;</span> <span class="n">left</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">flush</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
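
<p>The line buffering mentioned above could be layered on top as a small
helper. This is a sketch under some assumptions: <code class="language-plaintext highlighter-rouge">append_linebuffered</code> is a
hypothetical name, it only checks whether the appended data <em>ends</em> with
a newline, and <code class="language-plaintext highlighter-rouge">os_write</code> is replaced by a test double that records
writes in memory so the behavior is easy to observe. The other definitions
are condensed copies so the example stands alone.</p>

```c
struct buf {
    unsigned char *buf;
    int cap;
    int len;
    int fd;
    _Bool error;
};

// Test double for the platform layer (an assumption for this sketch):
// records each write in memory instead of calling the operating system.
static unsigned char sink[256];
static int sink_len;
static int sink_calls;

static _Bool os_write(int fd, void *src, int len)
{
    (void)fd;
    if (len > (int)sizeof(sink) - sink_len) {
        return 0;
    }
    for (int i = 0; i < len; i++) {
        sink[sink_len+i] = ((unsigned char *)src)[i];
    }
    sink_len += len;
    sink_calls++;
    return 1;
}

static void flush(struct buf *b)
{
    b->error |= b->fd < 0;
    if (!b->error && b->len) {
        b->error |= !os_write(b->fd, b->buf, b->len);
        b->len = 0;
    }
}

static void append(struct buf *b, unsigned char *src, int len)
{
    unsigned char *end = src + len;
    while (!b->error && src<end) {
        int left = (int)(end - src);
        int avail = b->cap - b->len;
        int amount = avail<left ? avail : left;
        for (int i = 0; i < amount; i++) {
            b->buf[b->len+i] = src[i];
        }
        b->len += amount;
        src += amount;
        if (amount < left) {
            flush(b);
        }
    }
}

// Hypothetical helper: append, then flush if the data ended a line.
static void append_linebuffered(struct buf *b, unsigned char *src, int len)
{
    append(b, src, len);
    if (len > 0 && src[len-1] == '\n') {
        flush(b);
    }
}
```

<p>With an 8-byte buffer, appending “hi” and then “ there\n” produces two
writes: one when the buffer fills, and one for the trailing newline.</p>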

<p>That completes formatted output! We can now do stuff like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">mem</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">10</span><span class="p">];</span>  <span class="c1">// arbitrarily-chosen 1kB buffer</span>
    <span class="k">struct</span> <span class="n">buf</span> <span class="n">stdout</span> <span class="o">=</span> <span class="n">FDBUF</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">mem</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">mem</span><span class="p">));</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">long</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">1000000</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">APPEND_STR</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stdout</span><span class="p">,</span> <span class="s">"iteration "</span><span class="p">);</span>
        <span class="n">append_long</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stdout</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
        <span class="n">append_byte</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stdout</span><span class="p">,</span> <span class="sc">'\n'</span><span class="p">);</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
    <span class="n">flush</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stdout</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">stdout</span><span class="p">.</span><span class="n">error</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Except for the lack of format DSL, this should feel familiar.</p>

]]>
    </content>
  </entry>
  
  <entry>
    <title>Let's write a setjmp</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/02/12/"/>
    <id>urn:uuid:ab83cc5d-7877-4cba-98e4-d36059297ead</id>
    <updated>2023-02-12T02:23:11Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=34760828">on Hacker News</a>.</em></p>

<p>Yesterday I wrote that <a href="/blog/2023/02/11/"><code class="language-plaintext highlighter-rouge">setjmp</code> is handy</a> and that it would be nice
to have without linking the C standard library. It’s conceptually simple,
after all. Today let’s explore some differently-portable implementation
possibilities with distinct trade-offs. At the very least it should
illuminate why <code class="language-plaintext highlighter-rouge">setjmp</code> sometimes requires the use of <code class="language-plaintext highlighter-rouge">volatile</code>.</p>

<!--more-->

<p>First, a quick review: <code class="language-plaintext highlighter-rouge">setjmp</code> and <code class="language-plaintext highlighter-rouge">longjmp</code> are a form of <em>non-local
goto</em>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="kt">jmp_buf</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
<span class="kt">int</span> <span class="nf">setjmp</span><span class="p">(</span><span class="kt">jmp_buf</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">longjmp</span><span class="p">(</span><span class="kt">jmp_buf</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>
</code></pre></div></div>

<p>Calling <code class="language-plaintext highlighter-rouge">setjmp</code> saves the execution context in a <code class="language-plaintext highlighter-rouge">jmp_buf</code>, and <code class="language-plaintext highlighter-rouge">longjmp</code>
restores this context, returning the thread to this previous point of
execution. This means <code class="language-plaintext highlighter-rouge">setjmp</code> returns twice: (1) after saving the
context, and (2) from <code class="language-plaintext highlighter-rouge">longjmp</code>. To distinguish these cases, the first
time it returns zero and the second time it returns the value passed to
<code class="language-plaintext highlighter-rouge">longjmp</code>.</p>
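
<p>To make the double return concrete, here’s a minimal sketch using the
standard library’s <code class="language-plaintext highlighter-rouge">&lt;setjmp.h&gt;</code>. The names <code class="language-plaintext highlighter-rouge">demo</code>, <code class="language-plaintext highlighter-rouge">jump_back</code>,
<code class="language-plaintext highlighter-rouge">returns</code>, and the value 42 are arbitrary choices for illustration.</p>

```c
#include <setjmp.h>

enum { CODE = 42 };   // arbitrary nonzero value for this sketch

static jmp_buf ctx;
static int returns;   // counts how many times setjmp has returned

// Restores the saved context, so setjmp in demo() returns again.
static void jump_back(void)
{
    longjmp(ctx, CODE);
}

// Returns the value observed on the second return from setjmp.
int demo(void)
{
    returns = 0;
    switch (setjmp(ctx)) {
    case 0:      // first return, directly from setjmp
        returns++;
        jump_back();  // does not return here
        return -1;
    case CODE:   // second return, by way of longjmp
        returns++;
        return CODE;
    }
    return -1;
}
```

<p>The counter lives in static storage on purpose: a non-volatile local
modified between the two returns would have an indeterminate value after
the jump.</p>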

<p><code class="language-plaintext highlighter-rouge">jmp_buf</code> is an array of some platform-specific type and length. I’ll be
using void pointers in this article because it’s a register-sized type
that isn’t behind a typedef. Plus they print nicely in GDB as hexadecimal
addresses, which made working all this out easier.</p>

<h3 id="using-gcc-intrinsics">Using GCC intrinsics</h3>

<p>Let’s start with the easiest option. <a href="https://gcc.gnu.org/onlinedocs/gcc/Nonlocal-Gotos.html">GCC has two intrinsics</a> doing
all the hard work for us: <code class="language-plaintext highlighter-rouge">__builtin_setjmp</code> and <code class="language-plaintext highlighter-rouge">__builtin_longjmp</code>. Its
worst case <code class="language-plaintext highlighter-rouge">jmp_buf</code> is length 5, but the most popular architectures only
use the first 3 elements. Clang supports these intrinsics as well for GCC
compatibility.</p>

<p>Be mindful that the semantics are slightly different from the standard C
definition, namely that you cannot use <code class="language-plaintext highlighter-rouge">longjmp</code> from the same function as
<code class="language-plaintext highlighter-rouge">setjmp</code>. It also doesn’t touch the signal mask. However, it’s easier to
use and you don’t need to worry about <code class="language-plaintext highlighter-rouge">volatile</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// NOTE to copy-pasters: semantics differ slightly from standard C</span>
<span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="kt">jmp_buf</span><span class="p">[</span><span class="mi">5</span><span class="p">];</span>
<span class="cp">#define setjmp __builtin_setjmp
#define longjmp __builtin_longjmp
</span></code></pre></div></div>

<p>If you only care about GCC and/or Clang, then that’s it! It works as-is on
every supported target and nothing more is needed. As a bonus, it will be
more efficient than the libc version, though I should hope that won’t
matter in practice. These are so awesome and convenient that I’m already
second-guessing myself: “Do I <em>really</em> need to support other compilers…?”</p>
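
<p>As a usage sketch for the intrinsics, assuming GCC or Clang, here’s a
small error-recovery pattern. The names <code class="language-plaintext highlighter-rouge">attempt</code> and <code class="language-plaintext highlighter-rouge">fail</code> are
hypothetical. Note the two restrictions: the jump originates in a
different function than the one that called <code class="language-plaintext highlighter-rouge">setjmp</code>, and the second
argument to <code class="language-plaintext highlighter-rouge">__builtin_longjmp</code> must be the constant 1.</p>

```c
typedef void *jmp_buf[5];
#define setjmp __builtin_setjmp
#define longjmp __builtin_longjmp

static jmp_buf ctx;

// Kept out of line so the jump never originates in the same function
// that called setjmp.
__attribute__((noinline))
static void fail(void)
{
    longjmp(ctx, 1);  // the builtin accepts only the value 1
}

// Returns 1 if fail() jumped back here, 0 if the work finished normally.
int attempt(int should_fail)
{
    if (setjmp(ctx)) {
        return 1;
    }
    if (should_fail) {
        fail();
    }
    return 0;
}
```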

<h3 id="using-assembly">Using assembly</h3>

<p>If I want to support more compilers I’ll need to write it myself. It’s
also an excuse to dig into the details. The execution context is no more
than an array of saved registers, and <code class="language-plaintext highlighter-rouge">longjmp</code> is merely restoring those
registers. One of the registers is the instruction pointer, and setting
the instruction pointer is called a jump.</p>

<p>Since we’re talking about registers, that means assembly. We’ll also need
to know the target’s calling convention, so this really narrows things
down. This implementation will target x86-64, a.k.a. x64, Windows, <em>but</em> it
will support MSVC as an additional compiler. So it’s a different kind of
portability. I’ll start with GCC via <a href="https://github.com/skeeto/w64devkit">w64devkit</a> then massage it into
something MSVC can use.</p>

<p>I mentioned before that <code class="language-plaintext highlighter-rouge">setjmp</code> returns twice. So to return a second time
we just need to <em>simulate</em> a normal function return. Obviously that
includes restoring the stack pointer like the <code class="language-plaintext highlighter-rouge">ret</code> instruction, but it
also means restoring all the non-volatile registers a callee is supposed
to preserve. These will all go in the execution context.</p>

<p>The <a href="https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention">x64 calling convention</a> specifies 9 non-volatile registers: <code class="language-plaintext highlighter-rouge">rsp</code>, <code class="language-plaintext highlighter-rouge">rbp</code>,
<code class="language-plaintext highlighter-rouge">rbx</code>, <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">r12</code>, <code class="language-plaintext highlighter-rouge">r13</code>, <code class="language-plaintext highlighter-rouge">r14</code>, and <code class="language-plaintext highlighter-rouge">r15</code>. We’ll also need the
instruction pointer, <code class="language-plaintext highlighter-rouge">rip</code>, making it 10 total.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="kt">jmp_buf</span><span class="p">[</span><span class="mi">10</span><span class="p">];</span>
</code></pre></div></div>

<h4 id="setjmp-assembly">setjmp assembly</h4>

<p>The tricky issue is that we need to save the registers immediately inside
<code class="language-plaintext highlighter-rouge">setjmp</code> before the compiler has manipulated them in a function prologue.
That will take more than mere inline assembly. We’ll start with a <em>naked</em>
function, which means that GCC will not create a prologue or epilogue.
However, that means no local variables, and the function body will be
limited to inline assembly, including a <code class="language-plaintext highlighter-rouge">ret</code> instruction for the
epilogue.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span><span class="p">((</span><span class="kr">naked</span><span class="p">))</span>
<span class="kt">int</span> <span class="nf">setjmp</span><span class="p">(</span><span class="kt">jmp_buf</span> <span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kr">__asm</span><span class="p">(</span>
        <span class="c1">// ...</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The x64 calling convention uses <code class="language-plaintext highlighter-rouge">rcx</code> for the first pointer argument, so
that’s where we’ll find <code class="language-plaintext highlighter-rouge">buf</code>. I’ve arbitrarily decided to store <code class="language-plaintext highlighter-rouge">rip</code>
first, then the other registers in order. However, the current value of
<code class="language-plaintext highlighter-rouge">rip</code> isn’t the one we need. The <code class="language-plaintext highlighter-rouge">rip</code> we need was just pushed on top of
the stack by the caller. I’ll read that off the stack into a scratch
register, <code class="language-plaintext highlighter-rouge">rax</code>, and then store it in the first element of <code class="language-plaintext highlighter-rouge">buf</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span> <span class="p">(</span><span class="o">%</span><span class="nb">rsp</span><span class="p">),</span> <span class="o">%</span><span class="nb">rax</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nb">rax</span><span class="p">,</span>  <span class="mi">0</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
</code></pre></div></div>

<p>The stack pointer, <code class="language-plaintext highlighter-rouge">rsp</code>, is also indirect since I want the pointer just
before <code class="language-plaintext highlighter-rouge">rip</code> was pushed, as it would be just after a <code class="language-plaintext highlighter-rouge">ret</code>. I use a <code class="language-plaintext highlighter-rouge">lea</code>,
<em>load effective address</em>, to add 8 bytes (recall: stack grows down),
placing the result in a scratch register, then write it into the second
element of <code class="language-plaintext highlighter-rouge">buf</code> (i.e. 8 bytes into <code class="language-plaintext highlighter-rouge">%rcx</code>).</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">lea</span> <span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="nb">rsp</span><span class="p">),</span> <span class="o">%</span><span class="nb">rax</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nb">rax</span><span class="p">,</span>  <span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
</code></pre></div></div>

<p>Everything else is a matter of elbow grease.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span> <span class="o">%</span><span class="nb">rbp</span><span class="p">,</span> <span class="mi">16</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nb">rbx</span><span class="p">,</span> <span class="mi">24</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nb">rdi</span><span class="p">,</span> <span class="mi">32</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nb">rsi</span><span class="p">,</span> <span class="mi">40</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nv">r12</span><span class="p">,</span> <span class="mi">48</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nv">r13</span><span class="p">,</span> <span class="mi">56</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nv">r14</span><span class="p">,</span> <span class="mi">64</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nv">r15</span><span class="p">,</span> <span class="mi">72</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
</code></pre></div></div>
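
<p>To keep the offsets straight, it can help to name the slots. The
following is only an illustrative sketch: the <code class="language-plaintext highlighter-rouge">JB_*</code> names are hypothetical
and aren’t used by the assembly, which works with raw byte offsets.</p>

```c
typedef void *jmp_buf[10];

// Hypothetical slot names, in the storage order chosen above.
enum {
    JB_RIP, JB_RSP, JB_RBP, JB_RBX, JB_RDI,
    JB_RSI, JB_R12, JB_R13, JB_R14, JB_R15,
    JB_COUNT
};

// On x64 each slot is 8 bytes, so slot index times 8 gives the offset
// used in the assembly, e.g. JB_RBP -> 16(%rcx) and JB_R15 -> 72(%rcx).
_Static_assert(JB_COUNT == sizeof(jmp_buf)/sizeof(void *),
               "one slot per saved register");
```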

<p>With all work complete, return zero to the caller.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">xor</span> <span class="o">%</span><span class="nb">eax</span><span class="p">,</span> <span class="o">%</span><span class="nb">eax</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Putting it all together, and avoiding a <code class="language-plaintext highlighter-rouge">-Wunused-parameter</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span><span class="p">((</span><span class="kr">naked</span><span class="p">,</span><span class="n">returns_twice</span><span class="p">))</span>
<span class="kt">int</span> <span class="nf">setjmp</span><span class="p">(</span><span class="kt">jmp_buf</span> <span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
    <span class="kr">__asm</span><span class="p">(</span>
        <span class="s">"mov (%rsp), %rax</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %rax,  0(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"lea 8(%rsp), %rax</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %rax,  8(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %rbp, 16(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %rbx, 24(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %rdi, 32(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %rsi, 40(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %r12, 48(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %r13, 56(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %r14, 64(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %r15, 72(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"xor %eax, %eax</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"ret</span><span class="se">\n</span><span class="s">"</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Also take note of the <code class="language-plaintext highlighter-rouge">returns_twice</code> attribute. It informs GCC of this
function’s unusual nature, saying the function <em>doesn’t</em> preserve most
non-volatile registers, and induces <code class="language-plaintext highlighter-rouge">-Wclobbered</code> diagnostics. Technically
this means we could get away with saving only <code class="language-plaintext highlighter-rouge">rip</code>, <code class="language-plaintext highlighter-rouge">rsp</code>, and <code class="language-plaintext highlighter-rouge">rbp</code> —
exactly as <code class="language-plaintext highlighter-rouge">__builtin_setjmp</code> does — but we’ll need the others for MSVC
anyway.</p>
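
<p>To make the buffer layout concrete, here’s a small sketch that names each slot the assembly above writes. The <code class="language-plaintext highlighter-rouge">xjmp_buf</code> typedef, the <code class="language-plaintext highlighter-rouge">SLOT_*</code> names, and <code class="language-plaintext highlighter-rouge">slot_offset</code> are hypothetical labels for illustration only, and it assumes the buffer is an array of ten pointers, one per saved register.</p>

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical name: an array of ten pointers, one per saved register.
typedef void *xjmp_buf[10];

// Slot order matches the stores in the setjmp assembly: slot 0 holds
// the return address (rip), slot 1 the caller's rsp, and so on.
enum {
    SLOT_RIP, SLOT_RSP, SLOT_RBP, SLOT_RBX, SLOT_RDI,
    SLOT_RSI, SLOT_R12, SLOT_R13, SLOT_R14, SLOT_R15,
    SLOT_COUNT
};

// The buffer must have exactly one slot per saved register.
static_assert(sizeof(xjmp_buf) / sizeof(void *) == SLOT_COUNT,
              "one slot per saved register");

// Byte offset of a slot from the start of the buffer.
size_t slot_offset(int slot)
{
    return (size_t)slot * sizeof(void *);
}
```

<p>On x64, where pointers are 8 bytes, <code class="language-plaintext highlighter-rouge">slot_offset(SLOT_R15)</code> works out to 72, matching the last store in the assembly.</p>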

<h4 id="longjmp-assembly">longjmp assembly</h4>

<p>In <code class="language-plaintext highlighter-rouge">longjmp</code> we need to restore all those registers. For purely aesthetic
reasons I’ve decided to do it in reverse order. Everything but <code class="language-plaintext highlighter-rouge">rip</code> is
easy.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span> <span class="mi">72</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r15</span>
    <span class="nf">mov</span> <span class="mi">64</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r14</span>
    <span class="nf">mov</span> <span class="mi">56</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r13</span>
    <span class="nf">mov</span> <span class="mi">48</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r12</span>
    <span class="nf">mov</span> <span class="mi">40</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rsi</span>
    <span class="nf">mov</span> <span class="mi">32</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rdi</span>
    <span class="nf">mov</span> <span class="mi">24</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rbx</span>
    <span class="nf">mov</span> <span class="mi">16</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rbp</span>
    <span class="nf">mov</span>  <span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rsp</span>
</code></pre></div></div>

<p>The instruction set doesn’t have direct access to <code class="language-plaintext highlighter-rouge">rip</code>. It will be a
<code class="language-plaintext highlighter-rouge">jmp</code> instead of <code class="language-plaintext highlighter-rouge">mov</code>, but before jumping we’ll need to prepare the
return value. The x64 calling convention says the second argument is
passed in <code class="language-plaintext highlighter-rouge">rdx</code>, so move that to <code class="language-plaintext highlighter-rouge">rax</code>, then <code class="language-plaintext highlighter-rouge">jmp</code> to the caller. It’s
only a 32-bit operand, C <code class="language-plaintext highlighter-rouge">int</code>, so <code class="language-plaintext highlighter-rouge">edx</code> instead of <code class="language-plaintext highlighter-rouge">rdx</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span> <span class="o">%</span><span class="nb">edx</span><span class="p">,</span> <span class="o">%</span><span class="nb">eax</span>
    <span class="nf">jmp</span> <span class="o">*</span><span class="mi">0</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
</code></pre></div></div>

<p>Putting it all together, and adding the <code class="language-plaintext highlighter-rouge">noreturn</code> attribute:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span><span class="p">((</span><span class="kr">naked</span><span class="p">,</span><span class="n">noreturn</span><span class="p">))</span>
<span class="kt">void</span> <span class="nf">longjmp</span><span class="p">(</span><span class="kt">jmp_buf</span> <span class="n">buf</span><span class="p">,</span> <span class="kt">int</span> <span class="n">ret</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">ret</span><span class="p">;</span>
    <span class="kr">__asm</span><span class="p">(</span>
        <span class="s">"mov 72(%rcx), %r15</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov 64(%rcx), %r14</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov 56(%rcx), %r13</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov 48(%rcx), %r12</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov 40(%rcx), %rsi</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov 32(%rcx), %rdi</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov 24(%rcx), %rbx</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov 16(%rcx), %rbp</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov  8(%rcx), %rsp</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %edx, %eax</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"jmp *0(%rcx)</span><span class="se">\n</span><span class="s">"</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The C standard says that if <code class="language-plaintext highlighter-rouge">ret</code> is zero then <code class="language-plaintext highlighter-rouge">longjmp</code> will return 1
from <code class="language-plaintext highlighter-rouge">setjmp</code> instead. I leave that detail as a reader exercise. Otherwise
this is a complete, working <code class="language-plaintext highlighter-rouge">setjmp</code>. It works perfectly when I swap it in
for <code class="language-plaintext highlighter-rouge">setjmp.h</code> in <a href="https://github.com/skeeto/u-config/blob/master/test_main.c">my u-config test suite</a>.</p>
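
<p>For reference, here’s a minimal sketch of that zero-to-one rule as observed from C. It uses the standard library’s <code class="language-plaintext highlighter-rouge">setjmp.h</code>, not the assembly above, and <code class="language-plaintext highlighter-rouge">zero_becomes_one</code> is a hypothetical name chosen for this example.</p>

```c
#include <setjmp.h>

// Returns 1 if longjmp(env, 0) makes setjmp return 1, as the C
// standard requires, or 0 if setjmp returned anything else.
int zero_becomes_one(void)
{
    jmp_buf env;
    volatile int jumped = 0;  // volatile: modified between setjmp and longjmp

    if (setjmp(env) == 1) {
        return 1;  // second return: the zero was replaced by 1
    }
    if (!jumped) {
        jumped = 1;
        longjmp(env, 0);  // ask for a zero return value
    }
    return 0;  // second return yielded something other than 1
}
```

<p>A conforming implementation returns 1 here. The assembly version above would need the same substitution before its <code class="language-plaintext highlighter-rouge">jmp</code> to pass this check.</p>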

<h3 id="considering-volatile">Considering volatile</h3>

<p>Now that you’ve seen the guts, let’s talk about <code class="language-plaintext highlighter-rouge">volatile</code> and why it’s
necessary. Consider this function, <code class="language-plaintext highlighter-rouge">example</code>, which calls a <code class="language-plaintext highlighter-rouge">work</code>
function that may exit through <code class="language-plaintext highlighter-rouge">longjmp</code> (e.g. on failure).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">work</span><span class="p">(</span><span class="kt">jmp_buf</span><span class="p">);</span>

<span class="kt">int</span> <span class="nf">example</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">jmp_buf</span> <span class="n">buf</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">setjmp</span><span class="p">(</span><span class="n">buf</span><span class="p">))</span> <span class="p">{</span>
        <span class="c1">// first return</span>
        <span class="n">r</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">work</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="c1">// second return</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It stores to <code class="language-plaintext highlighter-rouge">r</code> after the first <code class="language-plaintext highlighter-rouge">setjmp</code> return, then loads <code class="language-plaintext highlighter-rouge">r</code> after the
second <code class="language-plaintext highlighter-rouge">setjmp</code> return. However, <code class="language-plaintext highlighter-rouge">r</code> may have been stored in the execution
context. Since it’s used across function calls, it would be reasonable to
store this variable in a non-volatile register like <code class="language-plaintext highlighter-rouge">ebx</code>. If so, it will be
restored to its value at the moment of the first call to <code class="language-plaintext highlighter-rouge">setjmp</code>, in
which case the <em>old</em> <code class="language-plaintext highlighter-rouge">r</code> would be read after restoration by <code class="language-plaintext highlighter-rouge">longjmp</code>. If
it’s not stored in a register, but on the stack, then on the second return
the function will read the latest value out of the stack. In practice, if
<code class="language-plaintext highlighter-rouge">work</code> returns through <code class="language-plaintext highlighter-rouge">longjmp</code>, this function may return either 0 or 1,
probably determined by the optimization level.</p>

<p>The solution is to qualify <code class="language-plaintext highlighter-rouge">r</code> with <code class="language-plaintext highlighter-rouge">volatile</code>, which forces the compiler
to store the variable on the stack and never cache it in a register.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">volatile</span> <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>
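
<p>As a complete sketch, here’s the same function with the qualifier applied, built on the standard <code class="language-plaintext highlighter-rouge">setjmp.h</code>. The body of <code class="language-plaintext highlighter-rouge">work</code> is a hypothetical stand-in that always fails, so the second return is always taken.</p>

```c
#include <setjmp.h>

// Hypothetical stand-in for real work: always "fails" by jumping
// back to the setjmp call in example().
static void work(jmp_buf buf)
{
    longjmp(buf, 1);
}

int example(void)
{
    volatile int r = 0;  // volatile: the store below must reach memory
    jmp_buf buf;
    if (!setjmp(buf)) {
        // first return
        r = 1;
        work(buf);
    } else {
        // second return: r is read from memory, not a restored register
    }
    return r;
}
```

<p>With the qualifier, the store of 1 survives the jump at any optimization level, so <code class="language-plaintext highlighter-rouge">example</code> returns 1.</p>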

<p>Though since our <code class="language-plaintext highlighter-rouge">setjmp</code> is marked <code class="language-plaintext highlighter-rouge">returns_twice</code>, GCC will never store
<code class="language-plaintext highlighter-rouge">r</code> in a register across <code class="language-plaintext highlighter-rouge">setjmp</code> calls. This potentially hides a bug in
the program that would occur under some other compilers, but GCC will
(usually) warn about it.</p>

<h3 id="pure-assembly-and-msvc">Pure assembly and MSVC</h3>

<p>MSVC doesn’t understand <code class="language-plaintext highlighter-rouge">__attribute__</code> nor the inline assembly, so it
cannot compile these functions. I could compile my <code class="language-plaintext highlighter-rouge">setjmp</code> with GCC and
the rest of the program with MSVC, which means I need two compilers.
Instead, I’ll move to pure assembly and assemble with GNU <code class="language-plaintext highlighter-rouge">as</code> (TODO: port
to MASM?), so we’ll only need a tiny piece of the GNU toolchain.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>	<span class="nf">.global</span> <span class="nv">setjmp</span>
<span class="nl">setjmp:</span>
	<span class="nf">mov</span> <span class="p">(</span><span class="o">%</span><span class="nb">rsp</span><span class="p">),</span> <span class="o">%</span><span class="nb">rax</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nb">rax</span><span class="p">,</span>  <span class="mi">0</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">lea</span> <span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="nb">rsp</span><span class="p">),</span> <span class="o">%</span><span class="nb">rax</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nb">rax</span><span class="p">,</span>  <span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nb">rbp</span><span class="p">,</span> <span class="mi">16</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nb">rbx</span><span class="p">,</span> <span class="mi">24</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nb">rdi</span><span class="p">,</span> <span class="mi">32</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nb">rsi</span><span class="p">,</span> <span class="mi">40</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nv">r12</span><span class="p">,</span> <span class="mi">48</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nv">r13</span><span class="p">,</span> <span class="mi">56</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nv">r14</span><span class="p">,</span> <span class="mi">64</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nv">r15</span><span class="p">,</span> <span class="mi">72</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">xor</span> <span class="o">%</span><span class="nb">eax</span><span class="p">,</span> <span class="o">%</span><span class="nb">eax</span>
	<span class="nf">ret</span>

	<span class="nf">.globl</span> <span class="nv">longjmp</span>
<span class="nl">longjmp:</span>
	<span class="nf">mov</span> <span class="mi">72</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r15</span>
	<span class="nf">mov</span> <span class="mi">64</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r14</span>
	<span class="nf">mov</span> <span class="mi">56</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r13</span>
	<span class="nf">mov</span> <span class="mi">48</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r12</span>
	<span class="nf">mov</span> <span class="mi">40</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rsi</span>
	<span class="nf">mov</span> <span class="mi">32</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rdi</span>
	<span class="nf">mov</span> <span class="mi">24</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rbx</span>
	<span class="nf">mov</span> <span class="mi">16</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rbp</span>
	<span class="nf">mov</span>  <span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rsp</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nb">edx</span><span class="p">,</span> <span class="o">%</span><span class="nb">eax</span>
	<span class="nf">jmp</span> <span class="o">*</span><span class="mi">0</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
</code></pre></div></div>

<p>Then some declarations in C:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="kt">jmp_buf</span><span class="p">[</span><span class="mi">10</span><span class="p">];</span>
<span class="kt">int</span> <span class="nf">setjmp</span><span class="p">(</span><span class="kt">jmp_buf</span><span class="p">);</span>
<span class="k">_Noreturn</span> <span class="kt">void</span> <span class="nf">longjmp</span><span class="p">(</span><span class="kt">jmp_buf</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>
</code></pre></div></div>

<p>I’ll need to enable C11 for that <code class="language-plaintext highlighter-rouge">_Noreturn</code> in MSVC. Assemble, compile,
and link:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ as -o setjmp.obj setjmp.s
$ cl /std:c11 program.c setjmp.obj
</code></pre></div></div>

<p>That generally works! If I rename to <code class="language-plaintext highlighter-rouge">xsetjmp</code> and <code class="language-plaintext highlighter-rouge">xlongjmp</code> to avoid
conflicting with the CRT definitions, drop them into the u-config test
suite in place of <code class="language-plaintext highlighter-rouge">setjmp.h</code>, then compile with MSVC, it passes all tests
using my alternate implementation in MSVC as well as GCC. Pretty cool!</p>

<h3 id="takeaway">Takeaway</h3>

<p>I’m not sure if I’ll ever use the assembly, but writing this article led
me to try the GCC intrinsics, and I’m so impressed I’m still thinking
about ways I can use them. My main thought is out-of-memory situations in
arena allocators, using a non-local exit to roll back to a savepoint, even
if just to return an error. This is nicer than either terminating the
program or handling OOM errors on every allocation. Very roughly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">cap</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">off</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="kt">jmp_buf</span><span class="p">[</span><span class="mi">5</span><span class="p">];</span>
<span class="p">}</span> <span class="n">Arena</span><span class="p">;</span>

<span class="c1">// Place an arena and savepoint an out-of-memory jump.</span>
<span class="cp">#define OOM(a, m, n) __builtin_setjmp((a = place(m, n))-&gt;jmp_buf)
</span>
<span class="c1">// Place a new arena at the front of the buffer.</span>
<span class="n">Arena</span> <span class="o">*</span><span class="nf">place</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">mem</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">size</span> <span class="o">&gt;=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">Arena</span><span class="p">));</span>
    <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="n">mem</span><span class="p">;</span>
    <span class="n">a</span><span class="o">-&gt;</span><span class="n">cap</span> <span class="o">=</span> <span class="n">size</span><span class="p">;</span>
    <span class="n">a</span><span class="o">-&gt;</span><span class="n">off</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">Arena</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">alloc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">avail</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">cap</span> <span class="o">-</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">off</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">avail</span> <span class="o">&lt;</span> <span class="n">size</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">__builtin_longjmp</span><span class="p">(</span><span class="n">a</span><span class="o">-&gt;</span><span class="kt">jmp_buf</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">a</span> <span class="o">+</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">off</span><span class="p">;</span>
    <span class="n">a</span><span class="o">-&gt;</span><span class="n">off</span> <span class="o">+=</span> <span class="n">size</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Usage would look like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">compute</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">workmem</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">memsize</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Arena</span> <span class="o">*</span><span class="n">arena</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">OOM</span><span class="p">(</span><span class="n">arena</span><span class="p">,</span> <span class="n">workmem</span><span class="p">,</span> <span class="n">memsize</span><span class="p">))</span> <span class="p">{</span>
        <span class="c1">// jumps here when out of memory</span>
        <span class="k">return</span> <span class="n">COMPUTE_OOM</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">Thing</span> <span class="o">*</span><span class="n">t</span> <span class="o">=</span> <span class="n">PUSHSTRUCT</span><span class="p">(</span><span class="n">arena</span><span class="p">,</span> <span class="n">Thing</span><span class="p">);</span>
    <span class="c1">// ...</span>

    <span class="k">return</span> <span class="n">COMPUTE_OK</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
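
<p>To fill in the gaps, here’s a self-contained variant of the sketch that should build with any C11 compiler. It swaps the GCC builtins for the standard <code class="language-plaintext highlighter-rouge">setjmp</code> and <code class="language-plaintext highlighter-rouge">longjmp</code>, so the buffer is a regular <code class="language-plaintext highlighter-rouge">jmp_buf</code>. The <code class="language-plaintext highlighter-rouge">PUSHSTRUCT</code> macro, the <code class="language-plaintext highlighter-rouge">Thing</code> type, the result codes, and the <code class="language-plaintext highlighter-rouge">count</code> parameter are assumptions made up for this example.</p>

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

typedef struct {
    size_t  cap;
    size_t  off;
    jmp_buf oom;  // standard jmp_buf standing in for the builtin buffer
} Arena;

// Place an arena and savepoint an out-of-memory jump.
#define OOM(a, m, n) setjmp(((a) = place(m, n))->oom)

// Place a new arena at the front of the buffer.
Arena *place(void *mem, size_t size)
{
    assert(size >= sizeof(Arena));
    Arena *a = mem;
    a->cap = size;
    a->off = sizeof(Arena);
    return a;
}

// Bump-allocate size bytes, or jump to the savepoint if they don't fit.
void *alloc(Arena *a, size_t size)
{
    size_t avail = a->cap - a->off;
    if (avail < size) {
        longjmp(a->oom, 1);
    }
    void *p = (char *)a + a->off;
    a->off += size;
    return p;
}

// Hypothetical helper, type, and result codes for illustration.
#define PUSHSTRUCT(a, T) ((T *)alloc(a, sizeof(T)))
typedef struct { int values[64]; } Thing;
enum { COMPUTE_OK, COMPUTE_OOM };

// Allocates count Things out of workmem without checking any
// individual allocation for failure.
int compute(void *workmem, size_t memsize, int count)
{
    Arena *arena;
    if (OOM(arena, workmem, memsize)) {
        // jumps here when out of memory
        return COMPUTE_OOM;
    }
    for (int i = 0; i < count; i++) {
        Thing *t = PUSHSTRUCT(arena, Thing);
        t->values[0] = i;
    }
    return COMPUTE_OK;
}
```

<p>The caller supplies a suitably aligned buffer. A small <code class="language-plaintext highlighter-rouge">count</code> returns <code class="language-plaintext highlighter-rouge">COMPUTE_OK</code>, and one too large for the buffer returns <code class="language-plaintext highlighter-rouge">COMPUTE_OOM</code> through the savepoint. Note that <code class="language-plaintext highlighter-rouge">alloc</code> does no alignment of its own, which only works here because every size involved is a multiple of the strictest alignment in use.</p>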

<p>More granular snapshots can be made further down the stack by allocating
subarenas out of the main arena. I have yet to try this out in a practical
program.</p>

]]>
    </content>
  </entry>
  

</feed>
