null program

Solving "Two Sum" in C with a tiny hash table

2023-06-26T19:38:18Z

I came across a question: How does one efficiently solve Two Sum in C? There’s a naive quadratic time solution, but also an amortized linear time solution using a hash table. Without a built-in or standard library hash table, the latter sounds onerous. However, a mask-step-index table, a hash table construction suitable for many problems, requires only a few lines of code. This approach is useful even when a standard hash table is available, because by exploiting the known problem constraints, it beats typical generic hash table performance by 1–2 orders of magnitude (demo).

The Two Sum exercise, restated:

Given an integer array and target, return the distinct indices of two elements that sum to the target.

In particular, the solution doesn’t find elements, but their indices. The exercise also constrains input ranges — important but easy to overlook:

2 <= count <= 10⁴
-10⁹ <= nums[i] <= 10⁹
-10⁹ <= target <= 10⁹

Notably, indices fit in a 16-bit integer with lots of room to spare. In fact, they fit in a 14-bit address space (16,384) with still plenty of overhead. Elements fit in a signed 32-bit integer, and we can add and subtract elements without overflow, if just barely. The last constraint isn’t redundant, but it’s not readily exploitable either.
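The “just barely” is easy to check by hand. Here is a minimal sketch of that arithmetic; the names COUNT_MAX, VALUE_MAX, worst_sum, headroom, and index_fits are mine, made up for illustration, and only the limits themselves come from the exercise:

```c
#include <stdint.h>

// Illustrative names only; the limits come from the exercise constraints.
enum {
    COUNT_MAX = 10000,       // most elements in nums
    VALUE_MAX = 1000000000   // largest magnitude of an element or target
};

// Largest magnitude of nums[i]+nums[j] or target-nums[i].
static int32_t worst_sum(void)
{
    return VALUE_MAX + VALUE_MAX;  // 2,000,000,000
}

// Room left below INT32_MAX (2,147,483,647) in the worst case.
static int32_t headroom(void)
{
    return INT32_MAX - worst_sum();  // 147,483,647
}

// Every index fits in both an int16_t and a 14-bit address space.
static int index_fits(void)
{
    return COUNT_MAX <= INT16_MAX && COUNT_MAX < (1 << 14);
}
```

So the worst-case sum or difference leaves under 7% of the positive int32_t range unused, while indices use well under a third of what int16_t can hold.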

The naive solution is to linearly search the array for the complement. With nested loops, it’s obviously quadratic time. At 10k elements, we expect an abysmal 25M comparisons on average.

int16_t count = ...;
int32_t *nums = ...;

for (int16_t i = 0; i < count-1; i++) {
    for (int16_t j = i+1; j < count; j++) {
        if (nums[i]+nums[j] == target) {
            // found
        }
    }
}
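For reference, the same loops can be wrapped into a complete function, handy as a baseline when checking a faster version. This is only a sketch: the Pair result type and the name twosum_naive are my own placeholders, not part of the exercise.

```c
#include <stdint.h>

// Hypothetical result type for this sketch: two indices plus a success flag.
typedef struct {
    int16_t i, j;
    _Bool ok;
} Pair;

// Quadratic-time baseline: try every pair (i, j) with i < j.
static Pair twosum_naive(const int32_t *nums, int16_t count, int32_t target)
{
    Pair r = {0};
    for (int16_t i = 0; i < count-1; i++) {
        for (int16_t j = i+1; j < count; j++) {
            // Cannot overflow: both operands are within +/-10^9.
            if (nums[i]+nums[j] == target) {
                r.i = i;
                r.j = j;
                r.ok = 1;
                return r;
            }
        }
    }
    return r;  // r.ok is 0: no pair sums to target
}
```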

The nums array is “keyed” by index. It would be better to also have the inverse mapping: key on elements to obtain the nums index. Then for each element we could compute the complement and find its index, if any, using this second mapping.

The input range is finite, so an inverse map is simple. Allocate an array, one element per integer in range, and store the index there. However, the input range is 2 billion, and even with 16-bit indices that’s a 4GB array. Feasible on 64-bit hosts, but wasteful. The exercise is certainly designed to make it so. This array would be very sparse, with at most about 0.0005% of its elements populated (10,000 out of 2 billion). That’s a hint: Associative arrays are far more appropriate for representing such sparse mappings. That is, a hash table.

Using Go’s built-in hash table:

func TwoSumWithMap(nums []int32, target int32) (int, int, bool) {
    seen := make(map[int32]int16)
    for i, num := range nums {
        complement := target - num
        if j, ok := seen[complement]; ok {
            return int(j), i, true
        }
        seen[num] = int16(i)
    }
    return 0, 0, false
}

In essence, the hash table folds the sparse 2 billion element array onto a smaller array, with collision resolution when elements inevitably land in the same slot. For this exercise, that small array could be as small as 10,000 elements because that’s the most we’d ever need to track. For folding the large key space onto the smaller, we could use modulo. For collision resolution, we could keep walking the table.

int16_t seen[10000] = {0};

// Find or insert nums[index].
int16_t lookup(int32_t *nums, int16_t index)
{
    int i = (uint32_t)nums[index] % 10000;  // unsigned: negative elements must not yield a negative slot
    for (;;) {
        int16_t j = seen[i] - 1;  // unbias
        if (j < 0) {  // empty slot
            seen[i] = index + 1;  // insert biased index
            return -1;
        } else if (nums[j] == nums[index]) {
            return j;  // match found
        }
        i = (i + 1) % 10000;  // keep looking
    }
}

Take note of a few details:

1. An empty slot is zero, and an empty table is a zero-initialized array. Since zero is a valid value, and all values are non-negative, the table biases values by 1.
2. The nums array is part of the table structure, necessary for lookups. The two mappings — element-by-index and index-by-element — share structure.
3. It uses open addressing with linear probing, and so walks the table until it either finds the element or hits an empty slot.
4. The “hash” function is modulo. If inputs are not random, they’ll tend to bunch up in the table. Combined with linear probing, that makes for lots of collisions. For the worst case, imagine sequentially ordered inputs.
5. Sometimes the table will almost completely fill, and lookups will be no better than the linear scans of the naive solution.
6. Most subtle of all: This hash table is not enough for the exercise. The keyed-on element may not even be in nums, and when lookup fails, that element is not inserted in the table. Instead, a different element is inserted. The conventional solution has at least two hash table lookups. In the Go code, it’s seen[complement] for lookups and seen[num] for inserts.
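To make the last point concrete, here is a rough sketch of the conventional two-lookup shape in C over the same modulo table. The names slot_of, find, insert, and twosum_two_lookups are hypothetical, and the unsigned cast in slot_of is my assumption for keeping negative elements from producing a negative slot:

```c
#include <stdint.h>

enum { SLOTS = 10000 };  // same capacity as the table above

// Fold a possibly-negative element onto a slot.
static int slot_of(int32_t key)
{
    return (uint32_t)key % SLOTS;
}

// Return the nums index stored for key, or -1 if absent.
static int16_t find(const int16_t *seen, const int32_t *nums, int32_t key)
{
    for (int i = slot_of(key);; i = (i + 1) % SLOTS) {
        int16_t j = seen[i] - 1;  // unbias
        if (j < 0) {
            return -1;  // empty slot: key is absent
        } else if (nums[j] == key) {
            return j;
        }
    }
}

// Store index in the first empty slot on the probe path for nums[index].
static void insert(int16_t *seen, const int32_t *nums, int16_t index)
{
    int i = slot_of(nums[index]);
    while (seen[i]) {
        i = (i + 1) % SLOTS;
    }
    seen[i] = index + 1;  // bias
}

// Returns 1 and fills *a, *b on success, 0 if no pair sums to target.
static int twosum_two_lookups(const int32_t *nums, int16_t count,
                              int32_t target, int16_t *a, int16_t *b)
{
    int16_t seen[SLOTS] = {0};
    for (int16_t n = 0; n < count; n++) {
        int16_t j = find(seen, nums, target - nums[n]);  // lookup 1
        if (j >= 0) {
            *a = j;
            *b = n;
            return 1;
        }
        insert(seen, nums, n);  // lookup 2
    }
    return 0;
}
```

Each element costs one probe walk keyed on the complement and, on a miss, a second walk keyed on the element itself. Because find always runs before insert, the table holds at most 9,999 entries during any search, so an empty slot always ends the walk.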

To solve (4) we’ll use a hash function to more uniformly distribute elements in the table. We’ll also probe the table in a random-ish order that depends on the key. In practice there will be little bunching even for non-random inputs.

To solve (5) we’ll use a larger table: 2¹⁴ or 16,384 elements. This has breathing room, and with a power of two we can use a fast mask instead of a slow division (though in practice, compilers usually implement division by a constant denominator with modular multiplication).

To solve (6) we’ll key complements together under the same key. It looks for the complement, but on failure it inserts the current element in the empty slot. In other words, this solution will only need a single hash table lookup per element!

Laying down some groundwork:

typedef struct {
    int16_t i, j;
    _Bool ok;
} TwoSum;

TwoSum twosum(int32_t *nums, int16_t count, int32_t target)
{
    TwoSum r = {0};
    int16_t seen[1<<14] = {0};
    for (int16_t n = 0; n < count; n++) {
        // ...
    }
    return r;
}

The seen array is a 32KiB hash table large enough for all inputs, small enough that it can be a local variable. In the loop:

        int32_t complement = target - nums[n];
        int32_t key = complement>nums[n] ? complement : nums[n];
        uint32_t hash = key * 489183053u;
        unsigned mask = sizeof(seen)/sizeof(*seen) - 1;
        unsigned step = hash>>13 | 1;

Compute the complement, then apply a “max” operation to derive a key. Any commutative operation works, though obviously addition would be a poor choice. XOR is similar enough to cause many collisions. Multiplication works well, and is probably better if the ternary produces a branch.
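A small sketch of why “max” works as a key and addition doesn’t. The helper names key_max and key_add are hypothetical, introduced only for this comparison:

```c
#include <stdint.h>

// "max" key: identical for an element and its complement, so both
// members of a matching pair follow the same probe sequence.
static int32_t key_max(int32_t num, int32_t target)
{
    int32_t complement = target - num;
    return complement > num ? complement : num;
}

// Addition is commutative too, but num + complement is always target,
// so every element would get the same key and pile onto one probe sequence.
static int32_t key_add(int32_t num, int32_t target)
{
    int32_t complement = target - num;
    return num + complement;
}
```

With a target of 9, the elements 2 and 7 both get key 7 under key_max, while 11 gets key 11 and lands elsewhere. Under key_add all three get key 9.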

The hash function is multiplication with a randomly-chosen prime. As we’ll see in a moment, step will also add-shift the hash before use. The initial index will be the bottom 14 bits of this hash. For step, recall from the MSI article that it must be odd so that every slot is eventually probed. I shift out 13 bits and then override the 14th bit, so step effectively skips over the 14 bits used for the initial table index.
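The odd-step requirement can be checked empirically. The following is a sketch rather than part of the solution, with hypothetical helper names step_of and slots_visited: it walks a 2¹⁴-slot table for as many probes as there are slots and counts how many distinct slots were touched.

```c
#include <stdint.h>

enum { EXP = 14, SLOTS = 1 << EXP };

// Derive the probe step from the hash as in the loop setup above.
static uint32_t step_of(uint32_t hash)
{
    return hash>>13 | 1;
}

// Probe SLOTS times from start and count the distinct slots touched.
static int slots_visited(uint32_t start, uint32_t step)
{
    static unsigned char hit[SLOTS];
    uint32_t mask = SLOTS - 1;
    for (int k = 0; k < SLOTS; k++) {
        hit[k] = 0;  // reset between calls
    }
    int distinct = 0;
    uint32_t i = start;
    for (int k = 0; k < SLOTS; k++) {
        i = (i + step) & mask;  // step first, then use the slot
        distinct += !hit[i];
        hit[i] = 1;
    }
    return distinct;
}
```

An odd step shares no factor with the power-of-two table size, so the walk reaches all 16,384 slots before repeating. An even step such as 2 only ever reaches half of them, and a step of 4 only a quarter.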

I used unsigned because I don’t really care about the width of the hash table index, but more importantly, I want defined overflow from all the bit twiddling, even in the face of implicit promotion. As a bonus, it can help in reasoning about indirection: seen indices are unsigned, nums indices are int16_t.

        for (unsigned i = hash;;) {
            i = (i + step) & mask;
            int16_t j = seen[i] - 1;  // unbias
            if (j < 0) {
                seen[i] = n + 1;  // bias and insert
                break;
            } else if (nums[j] == complement) {
                r.i = j;
                r.j = n;
                r.ok = 1;
                return r;
            }
        }

The step is added before using the index the first time, helping to scatter the start point and reduce collisions. If it’s an empty slot, insert the current element, not the complement — which wouldn’t be possible anyway. Unlike conventional solutions, this doesn’t require another hash and lookup. If it finds the complement, problem solved, otherwise keep going.

Putting it all together, it’s only slightly longer than solutions using a generic hash table:

TwoSum twosum(int32_t *nums, int16_t count, int32_t target)
{
    TwoSum r = {0};
    int16_t seen[1<<14] = {0};
    for (int16_t n = 0; n < count; n++) {
        int32_t complement = target - nums[n];
        int32_t key = complement>nums[n] ? complement : nums[n];
        uint32_t hash = key * 489183053u;
        unsigned mask = sizeof(seen)/sizeof(*seen) - 1;
        unsigned step = hash>>13 | 1;
        for (unsigned i = hash;;) {
            i = (i + step) & mask;
            int16_t j = seen[i] - 1;  // unbias
            if (j < 0) {
                seen[i] = n + 1;  // bias and insert
                break;
            } else if (nums[j] == complement) {
                r.i = j;
                r.j = n;
                r.ok = 1;
                return r;
            }
        }
    }
    return r;
}

Applying this technique to Go:

func TwoSumWithBespoke(nums []int32, target int32) (int, int, bool) {
    var seen [1 << 14]int16
    for n, num := range nums {
        complement := target - num
        hash := int(num * complement * 489183053)
        mask := len(seen) - 1
        step := hash>>13 | 1
        for i := hash; ; {
            i = (i + step) & mask
            j := int(seen[i] - 1) // unbias
            if j < 0 {
                seen[i] = int16(n) + 1 // bias
                break
            } else if nums[j] == complement {
                return j, n, true
            }
        }
    }
    return 0, 0, false
}

With Go 1.20 this is an order of magnitude faster than map[int32]int16, which isn’t surprising. I used multiplication as the key operator because, in my first take, Go produced a branch for the “max” operation — at a 25% performance penalty on random inputs.

A full-featured, generic hash table may be overkill for your problem, and a bit of hashed indexing with collision resolution over a small array might be sufficient. The problem constraints might open up such shortcuts.

My ranking of every Shakespeare play

2023-06-22T19:10:25Z

This article was discussed on Hacker News.

A few years ago I set out on a personal journey to study and watch a performance of each of Shakespeare’s 37 plays. I’ve reached my goal and, though it’s not a usual topic around here, I wanted to get my thoughts down while fresh. I absolutely loved some of these plays and performances, and so I’d like to highlight them, especially because my favorites are, with one exception, not “popular” plays. Per tradition, I begin with my least enjoyed plays and work my way up. All performances were either a recording of a live stage or an adaptation, so they’re also available to you if you’re interested, though in most cases not for free. I’ll mention notable performances when applicable. The availability of a great performance certainly influenced my play rankings.

Like many of you, I had assigned reading for several Shakespeare plays in high school. I loathed these assignments. I wasn’t interested at the time, nor was I mature enough to appreciate the writing. Even revisiting as an adult, the conventional selections — Romeo and Juliet, Julius Caesar, etc. — are not highly ranked on my list. For the next couple of decades I thought that Shakespeare just wasn’t for me.

Then I watched the 1993 adaption of Much Ado About Nothing and it instantly became one of my favorite films. Why didn’t we read this in high school?! Reading the play with footnotes helped to follow the humor and allusions. Even with the film’s abridging, some of it still went over my head. I soon discovered Asimov’s Guide to Shakespeare — yes, that Asimov — which was exactly what I needed, and a perfect companion while reading and watching the plays. If stumbling upon this turned out so well, then I’d better keep going.

Wanting a solid set of the plays with good footnotes and editing — there is no canonical version of the plays — I picked up a copy of The Norton Shakespeare. Unfortunately it’s part of the college textbook racket, and it shows. The collection is designed to be sold to students who will lug them in bookbags, will typically open them face-up on a desk, and are uninterested in their contents beyond class. It includes a short-term, digital-only, DRMed component to prevent resale. After all, their target audience will not read it again anyway. Though at least it’s complete and compact, better for reference than reading.

In contrast, the Folger Shakespeare Library mass market paperbacks are better for enthusiasts, both in form and format. They’re clearly built for casual, comfortable reading. However, they’re not sold as a complete set, and gathering used copies takes some work.

Also essential was BBC Television Shakespeare, produced between 1978 and 1985. Finding productions of the more obscure plays is tricky, but it always provided a fallback. In some cases these were the best performances anyway! When I mention “the BBC production” I mean this series. Like many collections, they omit The Two Noble Kinsmen due to unclear authorship, and for this reason I’m omitting it from my list as well. As with any faithful production, I suggest subtitles on the first viewing, as they aid understanding. Shakespeare’s sentence structure is sometimes difficult for moderns to parse, and on-screen text helps. (By the way, a couple of handy SHA-1 sums for those who know how to use them:)

0ae909e5444c17183570407bd09a622d2827751e
55c77ed7afb8d377c9626527cc762bda7f3e1d83

As my list will show, my favorites are comedic comedies and histories, particularly the two Henriads, each a group of four plays. The first — Richard II, 1 Henry IV, 2 Henry IV, and Henry V — concerns events around Henry V, in the late 14th and early 15th century. Those number prefixes are parts, as in Henry IV has two parts. In my list I combine parts as though a single play. The second — 1 Henry VI, 2 Henry VI, 3 Henry VI, Richard III — is about the Wars of the Roses, spanning the 15th century. Asimov’s book was essential for filling in the substantial historical background for these plays, and my journey was also in part a history study.

I especially enjoy villain monologues, and plays with them rank higher as a result. It’s said that everyone is the hero of their own story, but Shakespeare’s villains may know that they’re villains and revel in it, bragging directly to the audience about all the trouble they’re going to cause. In some cases they mock the audience’s sacred values, which in a way, is like the stand-up comedy of Shakespeare’s time. Notable examples are Edmund (King Lear), Aaron (Titus Andronicus), Richard III, Iago (Othello), and Shylock (The Merchant of Venice).

As with literature even today, authors are not experts in moral reasoning and protagonists are often, on reflection, incredibly evil. Shakespeare is no different, especially for historical events and people, praising those who create mass misery (e.g. tyrants waging wars) and vilifying those who improve everyone’s lives (e.g. anyone who deals with money). Up to and including Shakespeare’s time, a pre-industrial army on the march was a rolling humanitarian crisis, even in “friendly” territory, slaughtering and stealing its way through the country in order to keep going. So, much like suspension of disbelief, there’s a suspension of morality where I engage with the material on its own moral terms, however illogical it may be.

Now finally my list. The beginning will be short and negative because, to be frank, I disliked some of the plays. Even Shakespeare had to work under constraints. In his time none were regarded as great works. They weren’t even viewed as literature, but similarly to how we consider television scripts today. Also, around 20% of plays credited to Shakespeare were collaborations of some degree, though the collaboration details have been long lost. For simplicity, I will just refer to the author as Shakespeare.

(37) Timon of Athens

I have nothing positive to say about this play. It’s about a man who borrows and spends recklessly, then learns all the wrong lessons from the predictable results.

(36) The Two Gentlemen of Verona

Involves a couple of love triangles, a woman disguised as a man — a common Shakespeare trope — and perhaps the worst ending to a play ever written. The two “gentlemen” are terrible people and undeserving of their happy ending. Though I enjoyed the scenes with Launce and Crab, the play’s fool and his dog.

(35) Troilus and Cressida

Interesting that it’s set during the Iliad and features legendary characters such as Achilles, Ajax, and Hector. I have no other positives to note. Cressida’s abrupt change of character in the Greek camp later in the play is baffling, as though part of the play has been lost, and ruins an already dull play for me.

(34) The Winter’s Tale

A baby princess is lost, presumed dead, and raised by shepherds. She is later rediscovered by her father as a young adult. It has a promising start, but in the final act the main plot is hastily resolved off-stage and seemingly replaced with a hastily rewritten ending that nonsensically resolves a secondary story line.

(33) Cymbeline

The title refers to a legendary early King of Britain and is set in the first century, but it is primarily about his daughter. The plot is complicated so I won’t summarize it here. It’s long and I just didn’t enjoy it. This is the second play in the list to feature a woman disguised as a man.

(32) The Tempest

A political exile stranded on an island in the Mediterranean gains magical powers through study, with the help of a spirit creates a tempest that strands his enemies on his island, then gently torments them until he’s satisfied that he’s had his revenge. It’s an okay play.

More interesting is the historical context behind the play. It’s based loosely on events around the founding of Jamestown, Virginia. Until this play, Shakespeare and Jamestown were, in my mind, unrelated historical events. In fact, Pocahontas very nearly met Shakespeare, missing him by just a couple of years, but she did meet his rival, Ben Jonson. I spent far more time catching up on real history, including reading the fascinating True Reportory, than I did on the play.

(31) The Taming of the Shrew

About a man courting and “taming” an ill-tempered woman, the shrew. The seeming moral of the play was outdated even in Shakespeare’s time, and it’s unclear what was intended. Technically it’s a play within a play, and an outer frame presents the play as part of an elaborate prank. However, the outer frame is dropped and never revisited, indicating that perhaps this part of the play was lost. The BBC production skips this framing entirely and plays it straight.

(30) All’s Well That Ends Well

Helena, a low-born enterprising young woman, saves a king’s life. She’s in love with a nobleman, Bertram, and the king orders him to marry her as repayment. He spurns her solely due to her low upbringing and flees the country. She gives chase, and eventually wins him over. Helena is a great character, and Bertram is utterly undeserving of her, which ruins the play for me in an unearned ending.

(29) Antony and Cleopatra

A tragedy about people who we know for sure existed, the first such on the list so far. The sequel to Julius Caesar, completing the story of the Second Triumvirate. Historically interesting, but the title characters were terrible, selfish people, including in the play, and they aren’t interesting enough to make up for it.

I enjoyed the portrayal of Octavian as a shrewd politician.

(28) Julius Caesar

A classic school reading assignment. Caesar’s death in front of the Statue of Pompey is obviously poetic, and so every performance loves playing it up. Antony’s speech is my favorite part of the play. I didn’t dislike this play, but nor did I find it interesting revisiting it as an adult.

(27) Coriolanus

About the career of a legendary Roman general and war hero who attempts to enter politics. He despises the plebeians, which gets him into trouble, but all he really wants is to please his mother. Stratford Festival has a worthy adaption in a contemporary setting.

(26) Henry VIII

He reigned from 1509 to 1547, but the play only covers Henry VIII’s first divorce. It paved the way for the English Reformation, though the play has surprisingly little to say about it, or his murder spree. It’s set a few decades after the events of Richard III — too distant to truly connect with the second Henriad.

While I appreciate its historical context — with liberal dramatic license — it’s my least favorite of the English histories. It’s not part of an epic tetralogy, and the subject matter is mundane. My favorite scene is Katherine (Catherine in the history books) firmly rejecting the court’s jurisdiction and walking out. My favorite line: “No man’s pie is freed from his ambitious finger.”

(25) Romeo and Juliet

Another classic reading assignment that requires no description. A beautiful play, but I just don’t connect with its romantic core.

(24) The Merchant of Venice

An infamously antisemitic play where a Jewish moneylender, Shylock, loans to the titular merchant of Venice where the collateral is the original “pound of flesh,” providing the source for that cliche. Though even in his prejudice, Shakespeare can’t help but write multifaceted characters, particularly with Shylock’s famous “If you prick us, do we not bleed?” speech.

(23) Twelfth Night

Twins, a young man and a woman, are separated by a shipwreck. The woman disguises herself as a man and takes employment with a local duke and falls in love with him, but her employment requires her to carry love letters to the duke’s love interest. In the meantime the brother arrives, unaware his sister is in town in disguise, and everyone gets the twins mixed up leading to comedy. It’s a fun play. The title has nothing to do with the play, but refers to the holiday when the play was first performed.

The play is the source of the famous quote, “Some are born great, some achieve greatness, and some have greatness thrust upon them.” It’s used as part of a joke, and when I heard it, I had thought the play was mocking some original source.

(22) Pericles

A Greek play about a royal family — father, mother, daughter — separated by unfortunate — if contrived — circumstances, each thinking the others dead, but all tearfully reunited in a happy ending. My favorite part is the daughter, Marina, talking her way out of trouble: “She’s able to freeze the god Priapus and undo a whole generation.”

The BBC production stirred me, particularly the scene where Pericles and Marina are reunited.

(21) Richard II

Richard II, grandson of the famed Edward III, was a young King of England from 1377 to 1399. At least in the play, he carelessly makes dangerous enemies of his friends, and so is deposed by Henry Bolingbroke, who goes on to become Henry IV. The play is primarily about this abrupt transition of power, and it is the first play of the first Henriad. The conflict in this play creates tensions that will not be resolved until 1485, the end of the Wars of the Roses. Shakespeare spends seven additional plays on this huge, interesting subject.

For me, Richard II is the most dull of the Henriad plays. It’s a slow start, but establishes the groundwork for the greater plays that follow. The BBC production of the first Henriad has “linked” casting where the same actors play the same roles through the four plays, which makes this an even more important watch.

(20) Othello

Another of the famous tragedies. Othello, an important Venetian general and “the Moor of Venice,” is dispatched to Venice-controlled Cyprus to defend against an attack by the Ottoman Turks. Iago, who has been overlooked for promotion by Othello, treacherously seeks revenge, secretly sabotaging all involved while they call him “honest Iago.” Though his schemes quickly go well beyond revenge, and he continues sowing chaos just for his own fun.

I watched a few adaptions, and I most enjoyed the 2015 Royal Shakespeare Company Othello, which places it in a modern setting and requires few changes to do so.

(19) The Comedy of Errors

A fun, short play about a highly contrived situation: Two pairs of twins, where each pair of brothers has been given the same name, are separated at birth. As adults they all end up in the same town, and everyone mixes them up leading to comedy. It’s the lightest of Shakespeare’s plays, but also lacks depth.

(18) Hamlet

Another common, more senior, high school reading assignment. Shakespeare’s longest play, and probably the most subtle. In everything spoken between Hamlet and his murderous uncle, Claudius, one must read between the lines. Their real meanings are obscured by courtly language — familiar to Shakespeare’s audience, but not moderns. Asimov is great for understanding the political maneuvering, which is a lot like a game of chess. It made me appreciate the play more than I would have otherwise.

You’d be hard-pressed to find something that beats the faithful, star-studded 1996 major film adaption.

(17) Richard III

The final play of the second Henriad. Much of the play is Richard III winking at the audience, monologuing about his villainous plans, then executing those plans without remorse. Makes cheering for the bad guy fun. If you want to see an evil schemer get away with it, at least right up until the end when he gets his comeuppance, this is the play for you. This play is the source of the famous “My kingdom for a horse.”

I liked two different performances for different reasons. The 1995 major film puts the play in the World War II era. It’s solid and does well standing alone. The BBC production has linked casting with the three parts of Henry VI, which allows one to enjoy it in full in its broader context. It’s also well-performed, but obviously has less spectacle and a lower budget.

(16) The Merry Wives of Windsor

The comedy spin-off of Henry IV. Allegedly, Elizabeth I liked the character of John Falstaff from Henry IV so much — I can’t blame her! — that she demanded another play with the character, and so Shakespeare wrote this play. The play brings over several characters from Henry IV. Unfortunately it’s in name only and they hardly behave like the same characters. Despite this, it’s still fun and does not require knowledge of Henry IV.

Falstaff ineptly attempts to seduce two married women, the titular wives, who play along in order to get revenge on him. However, their husbands are not in on the prank. One suspects infidelity and hatches his own plans. The confusion leads to the comedy.

The 2018 Royal Shakespeare Company production aptly puts it in a modern suburban setting.

(15) Titus Andronicus

A play about a legendary Roman general committed to duty above all else, even the lives of his own sons. He and his family become brutal victims of political rivals, and in return he gets his own brutal revenge. It’s by far Shakespeare’s most violent and disturbing play. It’s a bit too violent even for me, but it ranks this highly because Aaron the Moor is such a fantastic character, another villain that loves winking at the audience. His lines throughout the play make me smile: “If one good deed in all my life I did, I do repent it from my very soul.”

I enjoyed the 1999 major film, which puts it in a contemporary setting.

(14) King Lear

The titular, mythological king of pre-Roman Britain wants to retire, and so he divides his kingdom between his three daughters. However, after petty selfishness on Lear’s part, he disowns the most deserving daughter, while the other two scheme against one another.

Some of the scenes in this play are my favorite among Shakespeare, such as Edmund’s monologue on bastards where he criticizes the status quo and mocks the audience’s beliefs. It also has one of the best fools, who while playing dumb, is both observant and wise. That’s most of Shakespeare’s fools, but it’s especially true in King Lear (“This is not altogether fool, my lord.”). This fool uses this “tenure” to openly mock the king to his face, the only character that can do so without repercussions.

My favorite performance was the 2015 Stratford Festival stage production, especially for its Edmund, Lear, and Fool.

(13) Macbeth

The shortest tragedy, a common reading assignment, and a perfect example of literature I could not appreciate without more maturity. Even the plays I dislike have beautiful poetry, but I especially love it in Macbeth.

The history behind Macbeth is itself fascinating. The play was written custom for the newly-crowned King James I — of King James Version fame — and even calls him out in the audience. James I was obsessed with witch hunts, so the play includes witchcraft. The character Banquo was by tradition considered to be his ancestor.

My favorite production by far — I watched a number of them! — was the 2021 film. It should be an approachable introduction for Shakespeare newcomers more interested in drama than comedy. Notably for me, it departs from typical productions in that Macbeth and Lady Macbeth do not scream at each other — perhaps normally a side effect of speaking loudly for stage performance. Particularly in Act 1, Scene 7 (“screw your courage to the sticking place”). In the film they argue calmly, like a couple in a genuine, healthy relationship, making the tragedy that much more tragic.

That being said, it drops the ball with the porter scene — a bit of comic relief just after Macbeth murders Duncan. There’s knocking at the gate, and the porter, charged with attending it, is hungover and takes his time. In a monologue he imagines himself porter to Hell, and on each impatient knock considers the different souls he would be greeting. Of all the porter scenes I watched, the best porter was in the 2017 Stratford Festival production, where he is both charismatic and hilarious. I wish I could share a clip.

(12) King John

King John, brother of “Coeur de Lion” Richard I, ruled in the early 13th century. His reign led to the Magna Carta, and he’s also the Prince John of the Robin Hood legend, though because it’s a history, and paints John in a positive light, that legend isn’t included. It depicts fascinating, real historical events and people, including Eleanor of Aquitaine. It also has one of my favorite Shakespeare characters, Philip the Bastard, who gets all the coolest lines. I especially love his introductory scene where his lineage is disputed by his half-brother and Eleanor, impressed, essentially adopts him on the spot.

The 2015 Stratford Festival stage performance is wonderful, and I’ve re-watched it a few times. The performances are all great.

(11–9) Henry VI

As previously noted, this is actually three plays. At 3–4 hours apiece, it’s about the length of a modern television season. I thought it might take awhile to consume, but I was completely sucked in, watching and studying the whole trilogy in a single weekend.

Henry V died young in 1422, and his infant son became Henry VI, leaving England ruled by his uncles. As an adult he was a weak king, which allowed the conflicts of the previously-mentioned Richard II to bubble up into the Wars of the Roses, a bloody power conflict between the Lancasters and Yorks. The play features historical people including Joan la Pucelle (“Joan of Arc”), English war hero John Talbot, and Jack Cade. Richard III wraps up the conflicts of Henry VI, forming the second Henriad. When watching/reading the play, keep in mind that the play is anti-French, anti-York, and (implicitly) pro-Tudor.

Most of the first part was probably not written by Shakespeare, but rather adapted from an existing play to fill out the backstory. I think I can see the “seams” between the original and the edits that introduce the roses.

I loved the BBC production of the second Henriad. Producing such an epic story must be daunting, and it’s amazing what they could convey with such limited budget and means. It has hilarious and clever cinematography for the scene where the Countess of Auvergne attempts to trap Talbot (Part 1, Act 2, Scene 3). Again, I wish I could share a clip!

(8) Henry V

Due to his amazing victories, most notably at Agincourt where, for once, Shakespeare isn’t exaggerating the odds, Henry V is one of the great kings of English history. This play is a followup to Richard II and Henry IV, completing the first Henriad, and depicts Henry V’s war with France. Outside of the classroom, this is one of Shakespeare’s most popular plays.

The obvious choice for viewing is the 1989 major film, which, by borrowing a few scenes from Henry IV, attempts a standalone experience, though with limited success. I watched it before Henry IV, and I could not understand why the film was so sentimental about a character that hadn’t even appeared yet. It probably has the best Saint Crispin’s Day Speech ever performed, in part because it’s placed in a broader context than originally intended. The introduction is bold as is Exeter’s ultimatum delivery. It cleverly, and without changing his lines, also depicts Montjoy, the French messenger, as sympathetic to the English, also not originally intended. I didn’t realize this until I watched other productions.

The BBC production is also worthy, in large part because of its linked casting with Richard II and Henry IV. It’s also unabridged, including the whole glove thing, for better or worse.

(7–6) Henry IV

People will think I’m crazy, but yes, I’m placing Henry IV above Henry V. My reason is just two words: John Falstaff. This character is one of Shakespeare’s greatest creations, and really makes these plays for me. As previously noted, this is two plays mainly because John Falstaff was such a huge hit. The sequel mostly retreads the same ground, but that’s fine! I’ve read and re-read all the Falstaff scenes because they’re so fun. I now have a habit of quoting Falstaff, and it drives my wife nuts.

The Falstaff role makes or breaks a Henry IV production, and my love for this play is in large part thanks to the phenomenal BBC production. It has a warm, charismatic Falstaff that perfectly nails the role. It’s great even beyond Falstaff, of course. At the end of part 2, I tear up seeing Henry V test the chief justice. I adore this production. What a masterpiece.

(5) A Midsummer Night’s Dream

A popular, fun, frivolous play that I enjoyed even more than I expected, where faeries interfere with Athenians who wander into their forest. The “rude mechanicals” are charming, especially the naive earnestness of Nick Bottom, making them my favorite part of the play.

My enjoyment is largely thanks to a 2014 stage production with great performances all around, great cinematography, and incredible effects. Highly recommended. Honorable mention goes to the great Nick Bottom performances of the BBC production and the 1999 major film.

(4) As You Like It

A pastoral comedy about idyllic rural life, and the source of the famous quote “All the world’s a stage.” A duke has deposed his duke brother, exiling him and his followers to the forest where the rest of the play takes place. The main character, Rosalind, is one of the exiles, and, disguised as a man named Ganymede, flees into the forest with her cousin. There she runs into her also-exiled love interest, Orlando. While still disguised as Ganymede, she roleplays as Rosalind — that is, herself — to help him practice wooing herself. Crazy and fun.

A couple of my favorite lines are “There’s no clock in the forest” and “falser than vows made in wine.” It’s an unusually musical play, and has a big, happy ending. The fool, Touchstone, is one of my favorite fools, named such because he tests the character of everyone with whom he comes in contact.

It ranks so highly because of an endearing 2019 production by Kentucky Shakespeare, which sets the story in a 19th century Kentucky. This is the most amateur production I’ve shared so far — literally Shakespeare in the park — but it’s just so enjoyable. Their Rosalind is fantastic and really makes the play work. I’ve listened to just the audio of the play, like a podcast, many times now.

(3) Measure for Measure

A comedy about justice and mercy. The duke of Vienna announces he will be away on a trip to Poland, but secretly poses as a monk in order to get his thumb on the pulse of his city. Unfortunately the man running the city in his stead is corrupt, and the softhearted duke can’t help but pull strings behind the scenes to undo the damage, and more. He sets up a scheme such that, after his dramatic return as duke, the plot is unraveled while simultaneously testing the character of all involved.

I love so many of the characters and elements of this play. I smile when the duke jumps into action, my heart wrenches at Isabella’s impassioned speech for mercy (“it is excellent to have a giant’s strength, but it is tyrannous to use it like a giant”), I admire the provost’s selfless loyalty to the duke, I laugh when Lucio the “fantastic” keeps putting his foot in his mouth, and I cry when Mariana begs Isabella to forgive. All around a wonderful play.

Like so many already, a big part of my love for the play is the BBC production, which is full of great performances, particularly the duke, Isabella, and Lucio.

(2) Much Ado About Nothing

As the play that finally got me interested in Shakespeare, of course it’s near the top of the list. Forget Romeo and Juliet: Benedick and Beatrice are Shakespeare’s greatest romantic pairing!

Don Pedro, Prince of Aragon, stops in Messina with his soldiers while returning from a military action. While in town there’s a matchmaking plot and lots of eavesdropping, and then chaos created by the wicked Don John, brother to Don Pedro. It’s a fun, light, hilarious play. It also features another of Shakespeare’s great comic characters, Dogberry, famous for his malapropisms.

This is a very popular play with tons of productions, though I only watched a few of them. The previously-mentioned 1993 adaption remains my favorite. It does some abridging, but honestly, it makes the play better and improves the comedic beats.

(1) Love’s Labour’s Lost

Finally, my favorite play of all, and an unusual one to be at the top of the list. Much of the play is subtle parody and so makes for a poor first play for newcomers, who would not be familiar enough with Shakespeare’s language to distinguish parody from genuine.

The King of Navarre and three lords swear an oath to seclude themselves in study, swearing off the company of women. Then the French princess and her court arrive, the four men secretly write love letters in violation of their oaths, and comedy ensues. There are also various eccentric side characters mixed into the plot to spice it up. It’s all a ton of fun and ends with an inept play within a play about the “nine worthies.”

The major reason I love this play so much is a literally perfect 2017 production by Stratford Festival. I love every aspect of this production such that I can’t even pick a favorite element. I was hooked within the first minute.

Hand-written Windows API prototypes: fast, flexible, and tedious

2023-05-31T01:38:31Z

I love fast builds, and for years I’ve been bothered by the build penalty for translation units including windows.h. This header has an enormous number of definitions and declarations and so, for C programs, it tends to dominate the build time of those translation units. Most programs, especially systems software, only need a tiny portion of it. For example, when compiling u-config with GCC, two thirds of the debug build was spent processing windows.h just for 4 types, 16 definitions, and 16 prototypes.

To give a sense of the numbers, here’s empty.c, which does nothing but include windows.h.

#include <windows.h>

With the current Mingw-w64 headers, that’s ~82kLOC (non-blank):

$ gcc -E empty.c | grep -vc '^$'
82041

With w64devkit this takes my system ~450ms to compile with GCC:

$ time gcc -c empty.c
real    0m 0.45s
user    0m 0.00s
sys     0m 0.00s

Compiling an actually empty source file takes ~10ms, so it really is spending practically all that time processing headers. MSVC is a faster compiler, and this extends to processing an even larger windows.h that crosses over 100kLOC (VS2022). It clocks in at 120ms on the same system:

$ cl /nologo /E empty.c | grep -vc '^$'
empty.c
100944
$ time cl /nologo /c empty.c
empty.c
real    0m 0.12s
user    0m 0.09s
sys     0m 0.01s

That’s just low enough to be tolerable, but I’d like the situation with GCC to be better. Defining WIN32_LEAN_AND_MEAN reduces the number of included headers, which has a significant effect:

$ gcc -E -DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
55025
$ time gcc -c -DWIN32_LEAN_AND_MEAN empty.c
real    0m 0.30s
user    0m 0.00s
sys     0m 0.00s

$ cl /nologo /E /DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
empty.c
41436
$ time cl /nologo /c /DWIN32_LEAN_AND_MEAN empty.c
empty.c
real    0m 0.07s
user    0m 0.01s
sys     0m 0.01s

Precompiled headers

The official solution is precompiled headers. Put all the system header includes, or similar, into a dedicated header, then compile that header into a special format. For example, headers.h:

#define WIN32_LEAN_AND_MEAN
#include <windows.h>

Then main.c includes windows.h through this header:

#include "headers.h"

int mainCRTStartup(void)
{
    return 0;
}

If I ask GCC to compile headers.h:

$ gcc headers.h

It produces headers.h.gch. When a source includes headers.h, GCC first searches for an appropriate .gch. Not only must the name match, but so must all the definitions at the moment of inclusion: headers.h should always be the first included header, otherwise it may not work. Now when I compile main.c:

$ time gcc -c main.c
real    0m 0.04s
user    0m 0.00s
sys     0m 0.00s

Much better! MSVC has a conventional name for this header recognizable to every Visual Studio user: stdafx.h. It works a bit differently, and I’ve never used it myself, but I trust it has similar results.

Precompiled headers require some extra steps that vary by toolchain. Can we do better? That depends on your definition of “better!”

Artisan, handcrafted prototypes

As mentioned, systems software tends to need only a few declarations: open, read, write, stat, etc. What if I wrote these out manually? A bit tedious, but it doesn’t require special precompiled header handling. It also creates some new possibilities. To illustrate, a CRT-free “hello world” program:

#include <windows.h>

int mainCRTStartup(void)
{
    HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
    char message[] = "Hello, world!\n";
    DWORD len;
    return !WriteFile(stdout, message, sizeof(message)-1, &len, 0);
}

This takes my system half a second to compile — quite long to produce just 26 assembly instructions:

$ time cc -nostartfiles -o hello.exe hello.c
real    0m 0.50s
user    0m 0.00s
sys     0m 0.00s
$ ./hello.exe
Hello, world!

The program requires prototypes only for GetStdHandle and WriteFile, a definition for STD_OUTPUT_HANDLE, and some typedefs. Starting with the easy stuff, the definition and types look like this:

#define STD_OUTPUT_HANDLE ((DWORD)-11)

typedef int BOOL;
typedef void *HANDLE;
typedef unsigned long DWORD;

By the way, here’s a cheat code for quickly finding preprocessor definitions, faster than looking them up elsewhere:

$ echo '#include <windows.h>' | gcc -E -dM - | grep 'STD_\w*_HANDLE'
#define STD_INPUT_HANDLE ((DWORD)-10)
#define STD_ERROR_HANDLE ((DWORD)-12)
#define STD_OUTPUT_HANDLE ((DWORD)-11)

Did you catch the pattern? It’s -10 - fd, where fd is the conventional unix file descriptor number: a kind of mnemonic.
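
A minimal sketch of that mnemonic. The names STD_HANDLE_FROM_FD and std_handle_from_fd are hypothetical, not part of the Windows API, and uint32_t stands in for DWORD so the snippet builds without windows.h:

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: maps a conventional unix file descriptor (0, 1, 2)
// to the matching STD_*_HANDLE constant. uint32_t stands in for DWORD.
#define STD_HANDLE_FROM_FD(fd) ((uint32_t)(-10 - (fd)))

static uint32_t std_handle_from_fd(int fd)
{
    assert(fd >= 0 && fd <= 2);  // only stdin, stdout, stderr map this way
    return STD_HANDLE_FROM_FD(fd);
}
```

With this, std_handle_from_fd(1) yields the same value as STD_OUTPUT_HANDLE in the listing above.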

Prototypes are a little trickier, especially if you care about 32-bit. The Windows API uses the “stdcall” calling convention, which is distinct from the “cdecl” calling convention on x86, though the same on x64. Of course, you must already be aware of this merely using the API, as your own callbacks must usually be stdcall themselves. Further, API functions are DLL imports and should be declared as such. Putting it together, here’s GetStdHandle:

__declspec(dllimport)
HANDLE __stdcall GetStdHandle(DWORD);

This works with both Mingw-w64 and MSVC. MSVC requires __stdcall between the return type and function name, so don’t get clever about it. If you only care about GCC then you can declare both using attributes, which I think is a bit nicer:

HANDLE GetStdHandle(DWORD)
    __attribute__((dllimport,stdcall));

The prototype for WriteFile:

__declspec(dllimport)
BOOL __stdcall WriteFile(HANDLE, const void *, DWORD, DWORD *, void *);

You may have noticed I’m taking some shortcuts. The “official” definition uses an ugly pointer typedef, LPCVOID, instead of pointer syntax, but I skipped that type definition. I also replaced the last argument, an OVERLAPPED pointer, with a generic pointer. I only need to pass null. I can keep sanding it down to something more ergonomic:

__declspec(dllimport)
int __stdcall WriteFile(void *, void *, int, int *, void *);

That’s how I typically write these prototypes. I dropped the const because it doesn’t help me. I used signed sizes because I like them better and it’s what I’m usually holding at the call site. But doesn’t changing the signedness potentially break compatibility? It makes no difference to any practical ABI: It’s passed the same way. In general, signedness is a matter for operators, and only some of them — mainly comparisons (<, >, etc.) and division. It’s a similar story for pointers starting with the 32-bit era, so I can choose whatever pointer types are convenient.
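
To make the signedness point concrete, here is a small sketch with hypothetical helper names. It shows that a signed argument and its unsigned reinterpretation carry the same bits, that addition agrees either way, and that comparison is where the two interpretations diverge:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Reinterpret the bits of a signed 32-bit value as unsigned, the way a
// callee declared with an unsigned parameter would see the argument.
static uint32_t bits_of(int32_t x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof(u));
    return u;
}

// Addition is sign-agnostic: the result has the same bit pattern either
// way (inputs here are assumed not to overflow the signed addition).
static int add_matches(int32_t a, int32_t b)
{
    return bits_of(a + b) == bits_of(a) + bits_of(b);
}

// Comparison is one of the operators where signedness changes the answer.
static int less_signed(int32_t a, int32_t b)     { return a < b; }
static int less_unsigned(uint32_t a, uint32_t b) { return a < b; }
```

For example, -1 is less than 1 as a signed value, but the same bits read as unsigned are the largest 32-bit value.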

In general, I can do anything I want so long as I know my compiler will produce an appropriate function call. These are not standard functions, like printf or memcpy, which are implemented in part by the compiler itself, but foreign functions. It’s no different than teaching an FFI how to make a call. This is also, in essence, how OpenGL and Vulkan work, with applications defining the API for themselves.

Considering all this, my new hello world:

__declspec(dllimport)
int __stdcall WriteFile(void *, void *, int, int *, void *);
__declspec(dllimport)
void *__stdcall GetStdHandle(int);

int mainCRTStartup(void)
{
    void *stdout = GetStdHandle(-10 - 1);
    char message[] = "Hello, world!\n";
    int len;
    return !WriteFile(stdout, message, sizeof(message)-1, &len, 0);
}

You know, there’s a kind of beauty to a program that requires no external definitions. It builds quickly and produces a binary bit-for-bit identical to the original:

$ time cc -nostartfiles -o hello.exe main.c
real    0m 0.04s
user    0m 0.00s
sys     0m 0.00s

$ time cl /nologo hello.c /link /subsystem:console kernel32.lib
hello.c
real    0m 0.03s
user    0m 0.00s
sys     0m 0.00s

I’ve also been using this to patch over API rough edges. For example, WSARecvFrom takes WSAOVERLAPPED, but GetQueuedCompletionStatus takes OVERLAPPED. These types are explicitly compatible, and only defined separately for annoying technical reasons. I must use the same overlapped object with both APIs at once, meaning I would normally need ugly pointer casts on my Winsock calls, or vice versa with I/O completion ports. But because I’m writing all these definitions myself, I can define a common overlapped structure for both!

Perhaps you’re worried that this would be too fragile. Well, as a legacy software aficionado, I enjoy building and running my programs on old platforms. So far these programs still work properly going back 30 years to Windows NT 3.5 and Visual C++ 4.2. When I do hit a snag, it’s always been a bug (now long fixed) in the old operating system, not in my programs or these prototypes. So, in effect, this technique has worked well for the past 30 years!

Writing out these definitions is a bit of a chore, but after paying that price I’ve been quite happy with the results. I will likely continue doing it in the future, at least for non-graphical applications.

My favorite C compiler flags during development

2023-04-29T22:55:25Z

This article was discussed on Hacker News and on reddit.

The major compilers have an enormous number of knobs. Most are highly specialized, but others are generally useful even if uncommon. For warnings, the venerable -Wall -Wextra is a good start, but circumstances improve by tweaking this warning set. This article covers high-hitting development-time options in GCC, Clang, and MSVC that ought to get more consideration.

There’s an irony that the more you use these options, the less useful they become. Given a reasonable workflow, they are a harsh mistress in a fast, tight feedback loop quickly breaking the habits that cause warnings and errors. It’s a kind of self-improvement, where eventually most findings will be false positives. With heuristics internalized, you will be able to spot the same issues just reading code, a handy skill during code review.

Static warnings

Traditionally, C and C++ compilers are by default conservative with warnings. Unless configured otherwise, they only warn about the most egregious issues where they’re highly confident. That’s too conservative. For gcc and clang, the first order of business is turning on more warnings with -Wall. Despite the name, this doesn’t actually enable all warnings. (clang has -Weverything which does literally this, but trust me, you don’t want it.) However, that still falls short, and you’re better served enabling extra warnings with -Wextra.

$ cc -Wall -Wextra ...

That should be the baseline on any new project, and closer to what these compilers should do by default. Not using these means leaving value on the table. If you come across such a project, there’s a good chance you can find bugs statically just by using this baseline. Some warnings only occur at higher optimization levels, so leave these on for your release builds, too.

For MSVC, including clang-cl, a similar baseline is /W4. Though it goes a bit far, warning about use of unary minus on unsigned types (C4146), and sign conversions (C4245). If you’re using a CRT, also disable the bogus and irresponsible “security” warnings. Putting it together, the warning baseline becomes:

$ cl /W4 /wd4146 /wd4245 /D_CRT_SECURE_NO_WARNINGS ...

As for gcc and clang, I dislike unused parameter warnings, so I often turn them off, at least while I’m working: -Wno-unused-parameter. Rarely is it a defect to not use a parameter. It’s common for a function to fit a fixed prototype but not need all its parameters (e.g. WinMain). Were it up to me, this would not be part of -Wextra.

I also dislike unused function warnings: -Wno-unused-function. I can’t say this is wrong for the baseline since, in most cases, ultimately I do want to know if there are unused functions, e.g. to be deleted. But while I’m working it’s usually noise.

If I’m working with OpenMP, I may also disable warnings about unknown pragmas: -Wno-unknown-pragmas. One cool feature of OpenMP is that the typical case gracefully degrades to single-threaded behavior when not enabled. That is, compiling without -fopenmp. I’ll test both ways to ensure I get deterministic results, or just to ease debugging, and I don’t want warnings when it’s disabled. It’s fine for the baseline to have this warning, but sometimes it’s a poor match.

When working with single-precision floats, perhaps on games or graphics, it’s easy to accidentally introduce promotion to double precision, which can hurt performance. It could be neglecting an f suffix on a constant or using sin instead of sinf. Use -Wdouble-promotion to catch such mistakes. Honestly, this is important enough that it should go into the baseline.

#define PI 3.141592653589793
float degs = ...;
float rads = degs * PI / 180;  // warns about promotion

It can be awkward around variadic functions, particularly printf, which cannot receive float arguments, and so implicitly converts. You’ll need an explicit cast to disable the warning. I imagine this is the main reason the warning is not part of -Wextra.

float x = ...;
printf("%.17g\n", (double)x);
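
For the degrees-to-radians case above, a float-clean version might look like the following sketch. PI_F and deg2rad are hypothetical names chosen for illustration. The static assertion checks that the expression type stays float, which is the property -Wdouble-promotion is guarding:

```c
#include <assert.h>

#define PI_F 3.14159265f  // f suffix keeps the constant single precision

// With a double constant this expression would have type double.
static_assert(sizeof(1.0f * PI_F / 180) == sizeof(float),
              "expression must stay single precision");

// Degrees to radians without leaving single precision, so
// -Wdouble-promotion has nothing to report.
static float deg2rad(float degs)
{
    return degs * PI_F / 180;
}
```

The same idea applies to library calls: prefer sinf over sin when the argument is a float.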

Finally, an advanced option: -Wconversion -Wno-sign-conversion. It warns about implicit conversions that may result in data loss. Sign conversions do not have data loss, the implicit conversions are useful, and in my experience they’re not a source of defects, so I disable that part using the second flag (like MSVC /wd4245). The important warning here is truncation of size values, warning about unsound uses of sizes and subscripts. For example:

// NOTE: would be declared/defined via windows.h
typedef uint32_t DWORD;
BOOL WriteFile(HANDLE, const void *, DWORD, DWORD *, OVERLAPPED *);

void logmsg(char *msg, size_t len)
{
    HANDLE err = GetStdHandle(STD_ERROR_HANDLE);
    DWORD out;
    WriteFile(err, msg, len, &out, 0);  // len truncation warning
}

On 64-bit targets, it will warn about truncating the 64-bit len for the 32-bit parameter. To dismiss the warning, you must either address it by using a loop to call WriteFile multiple times, or acknowledge the truncation with an explicit cast and accept the consequences. In this case I may know from context it’s impossible for the program to even construct such a large message, so I’d use an assertion and truncate.

void logmsg(char *msg, size_t len)
{
    HANDLE err = GetStdHandle(STD_ERROR_HANDLE);
    DWORD out;
    assert(len <= 0xffffffff);
    WriteFile(err, msg, (DWORD)len, &out, 0);
}
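
When truncation is not acceptable, the looping option mentioned above could be sketched as follows. To keep it self-contained and portable, sink_write is a hypothetical stand-in for WriteFile that only counts calls and bytes, and write_all takes the chunk limit as a parameter so the behavior is easy to observe. A real caller would presumably pass 0xffffffff:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for WriteFile: accepts at most a 32-bit length
// and reports how many bytes it consumed. Here it only keeps counters.
static int    sink_calls;
static size_t sink_total;

static int sink_write(const void *buf, uint32_t len, uint32_t *out)
{
    (void)buf;
    sink_calls++;
    sink_total += len;
    *out = len;
    return 1;
}

// Write len bytes in chunks no larger than max, so each 32-bit length
// argument is exact and nothing is silently truncated.
static int write_all(const char *buf, size_t len, uint32_t max)
{
    assert(max > 0);
    while (len) {
        uint32_t n = len < max ? (uint32_t)len : max;  // cast is lossless
        uint32_t out;
        if (!sink_write(buf, n, &out) || !out) {
            return 0;  // error or no progress
        }
        buf += out;
        len -= out;
    }
    return 1;
}
```

The explicit cast remains, but it is guarded by the comparison, so it never discards bits.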

You might consider changing the interface instead:

void logmsg(char *msg, uint32_t len);

That probably passes the buck and doesn’t solve the underlying problem. The caller may be holding a size_t length, so the truncation happens there instead. Or maybe you keep propagating this change backwards until it, say, dissipates on a known constant. -Wconversion leads to these ripple effects that improve the overall program, which is why I like it.

The catch is that the above warning only happens for 64-bit targets. So you might miss it. The inverse is true in other cases. This is one area where cross-architecture testing can pay off.

Unfortunately since this warning is off the beaten path, it seems like it doesn’t quite get the attention it could use. It warns about simple cases where truncation has been explicitly handled/avoided. For example:

int x = ...;
char digit = '0' + x%10;  // false warning

The '0' is a known constant. The operation x%10 has a known range (-9 to 9). Therefore the addition result has a known range, and all results can be represented in a char. Yet it still warns. This often comes up dealing with character data like this.
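
One way to quiet this particular false positive is an explicit cast. A minimal sketch, where last_digit is a hypothetical helper restricted to non-negative inputs:

```c
#include <assert.h>

// Convert the last decimal digit of a non-negative x to its character.
// The explicit cast acknowledges the int-to-char narrowing that
// -Wconversion reports, even though the value always fits.
static char last_digit(int x)
{
    assert(x >= 0);  // x%10 would be negative for negative x
    return (char)('0' + x%10);
}
```

The cast adds noise without adding safety, which is the cost of the warning's imprecision.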

In my logmsg fix I had used an assertion to check that no truncation actually occurred. But wouldn’t it be nice if the compiler could generate that for us somehow? That brings us to dynamic checks.

Dynamic run-time checks

Sanitizers have been around for nearly a decade but are still criminally underused. They insert run-time assertions into programs at the flip of a switch typically at a modest performance cost — less than the cost of a debug build. All three major compilers support at least one sanitizer on all targets. In most cases, failing to use them is practically the same as not even trying to find defects. Every beginner tutorial ought to be using sanitizers from page 1 where they teach how to compile a program with gcc. (That this is universally not the case, and that these same tutorials also do not begin with teaching a debugger, is a major, on-going education failure.)

There are multiple different sanitizers with lots of overlap, but Address Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan) are the most general. They are compatible with each other and form a solid, general baseline. To use address sanitizer, at both compile and link time do:

$ cc ... -fsanitize=address ...

It’s even spelled the same way in MSVC. It’s needed at link time because it includes a runtime component. When working properly it’s aware of all allocations and checks all memory accesses that might be out of bounds, producing a run-time error if that occurs. It’s not always appropriate, but most projects that can use it probably should.

UBSan is enabled similarly:

$ cc ... -fsanitize=undefined ...

It adds checks around operations that might be undefined, emitting a run-time error if it occurs. It has an optional runtime component to produce a helpful diagnostic. You can instead insert a trap instruction, which is how I prefer to use it: -fsanitize-trap=undefined. (Until recently it was -fsanitize-undefined-trap-on-error.) This works on platforms where the UBSan runtime is unsupported. Some instrumentation is only inserted at higher optimization levels.

For me, the most useful UBSan check is signed overflow — e.g. computing the wrong result — and it’s instrumentation I miss when not working in C. In programs where this might be an issue, combine it with a fuzzer to search for inputs that cause overflows. This is yet another argument in favor of signed sizes, as UBSan can detect such overflows. (Yes, UBSan optionally instruments unsigned overflow, too, but then you must somehow distinguish intentional from unintentional overflow.)
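
As an illustration of the kind of code this instruments, consider a signed size computation. The function names below are hypothetical. Built with -fsanitize=undefined, an overflowing multiplication in total_size is reported at run time. The second function is an optional pre-check for callers that would rather reject such inputs than trap:

```c
#include <assert.h>
#include <limits.h>

// Total bytes for count items of the given size, using signed sizes.
// Under UBSan, an overflowing multiplication here is caught at run time
// instead of silently producing a wrong result.
static int total_size(int count, int size)
{
    return count * size;
}

// Hypothetical pre-check (non-negative operands only): true when
// count * size would not fit in an int.
static int would_overflow(int count, int size)
{
    assert(count >= 0 && size >= 0);
    return size && count > INT_MAX / size;
}
```

A fuzzer driving total_size with a sanitized build would find the overflowing inputs that the pre-check is meant to reject.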

On Linux, ASan and UBSan strangely do not have debugger-oriented defaults. Fortunately that’s easy to address with a couple of environment variables, which cause them to break on error instead of uselessly exiting:

export ASAN_OPTIONS=abort_on_error=1:halt_on_error=1
export UBSAN_OPTIONS=abort_on_error=1:halt_on_error=1

Also, when compiling you can combine sanitizers like so:

$ cc ... -fsanitize=address,undefined ...

As of this writing, MSVC does not have UBSan, but it does have a similar feature, run-time error checks. Three sub-flags (c, s, u) enable different checks, and /RTCcsu turns them all on. The c flag generates the assertion I had manually written with -Wconversion, and traps any truncation at run time. There’s nothing quite like this in UBSan! It’s so extreme that it’s compatible with neither standard runtime libraries (fortunately not a big deal) nor with ASan.

Caveat: Explicit casts aren’t enough, you must actually truncate variables using a mask in order to pass the check. For example, to accept truncation in the logmsg function:

    WriteFile(err, msg, len&0xffffffff, &out, 0);

Thread Sanitizer (TSan) is occasionally useful for finding — or, more often, proving the presence of — data races. It has a runtime component and so must be used at compile time and link time.

$ cc ... -fsanitize=thread ...

Unfortunately it only works in a narrow context. The target must use pthreads, not C11 threads, OpenMP, nor direct cloning. It must only synchronize through code that was compiled with TSan. That means no synchronization through system calls, especially no futexes. Most non-trivial programs do not meet the criteria.

Debug information

Another common mistake in tutorials is using plain old -g instead of -g3 (read: “debug level 3”). That’s like using -O instead of -O3. It adds a lot more debug information to the output, particularly enums and macros. The extra information is useful and you’re better off having it!

$ cc ... -g3 ...

All the major build systems — CMake, Autotools, Meson, etc. — get this wrong in their standard debug configurations. Producing a fully-featured debug build from these systems is a constant battle for me. Often it’s easier to ignore the build system entirely and cc -g3 **/*.c (plus sanitizers, etc.).

(Short term note: GCC 11, released in March 2021, switched to DWARF5 by default. However, GDB could not access the extra -g3 debug information in DWARF5 until GDB 13, released February 2023. If you have a toolchain from that two year window — except mine because I patched it — then you may also need -gdwarf-4 to switch back to DWARF4.)

What about -Og? In theory it enables optimizations that do not interfere with debugging, and potentially some additional warnings. In practice I still get far too many “optimized out” messages from GDB when I use it, so I don’t bother. Fortunately C is such a simple language that debug builds are nearly as fast as release builds anyway.

On MSVC I like having debug information embedded in binaries, as GCC does, which is done using /Z7.

$ cl ... /Z7 ...

Though I certainly understand the value of separate debug information, /Zi, in some cases. Sometimes I wish the GNU toolchain made this easier.

Summary

My personal rigorous baseline for development using gcc and clang looks like this (all platforms):

$ cc -g3 -Wall -Wextra -Wconversion -Wdouble-promotion
     -Wno-unused-parameter -Wno-unused-function -Wno-sign-conversion
     -fsanitize=undefined -fsanitize-trap ...

While ASan is great for quickly reviewing and evaluating other people’s projects, I don’t find it useful for my own programs. I avoid that class of defects through smarter paradigms (region-based allocation, no null terminated strings, etc.). I also prefer the behavior of trap instruction UBSan versus a diagnostic, as it behaves better under debuggers.

For cl and clang-cl, my personal baseline looks like this:

$ cl /Z7 /W4 /wd4146 /wd4245 /RTCcsu ...

I don’t normally need /D_CRT_SECURE_NO_WARNINGS since I don’t use a CRT anyway.

Practical libc-free threading on Linux

2023-03-23T05:32:41Z

Suppose you’re not using a C runtime on Linux, and instead you’re programming against its system call API. It’s long-term and stable after all. Memory management and buffered I/O are easily solved, but a lot of software benefits from concurrency. It would be nice to also have thread spawning capability. This article will demonstrate a simple, practical, and robust approach to spawning and managing threads using only raw system calls. It only takes about a dozen lines of C, including a few inline assembly instructions.

The catch is that there’s no way to avoid using a bit of assembly. Neither the clone nor clone3 system calls have threading semantics compatible with C, so you’ll need to paper over it with a bit of inline assembly per architecture. This article will focus on x86-64, but the basic concept should work on all architectures supported by Linux. The glibc clone(2) wrapper fits a C-compatible interface on top of the raw system call, but we won’t be using it here.

Before diving in, the complete, working demo: stack_head.c

The clone system call

On Linux, threads are spawned using the clone system call with semantics like the classic unix fork(2). One process goes in, two processes come out in nearly the same state. For threads, those processes share almost everything and differ only by two registers: the return value — zero in the new thread — and stack pointer. Unlike typical thread spawning APIs, the application does not supply an entry point. It only provides a stack for the new thread. The simple form of the raw clone API looks something like this:

long clone(long flags, void *stack);

Sounds kind of elegant, but it has an annoying problem: The new thread begins life in the middle of a function without any established stack frame. Its stack is a blank slate. It’s not ready to do anything except jump to a function prologue that will set up a stack frame. So besides the assembly for the system call itself, it also needs more assembly to get the thread into a C-compatible state. In other words, a generic system call wrapper cannot reliably spawn threads.

void brokenclone(void (*threadentry)(void *), void *arg)
{
    // ...
    long r = syscall(SYS_clone, flags, stack);
    // DANGER: new thread may access nonexistent stack frame here
    if (!r) {
        threadentry(arg);
    }
}

For odd historical reasons, each architecture’s clone has a slightly different interface. The newer clone3 unifies these differences, but it suffers from the same thread spawning issue above, so it’s not helpful here.

The stack “header”

I figured out a neat trick eight years ago which I continue to use today. The parent and child threads are in nearly identical states when the new thread starts, but the immediate goal is to diverge. As noted, one difference is their stack pointers. To diverge their execution, we could make their execution depend on the stack. An obvious choice is to push different return pointers on their stacks, then let the ret instruction do the work.

Carefully preparing the new stack ahead of time is the key to everything, and there’s a straightforward technique that I like to call the stack_head, a structure placed at the high end of the new stack. Its first element must be the entry point pointer, and this entry point will receive a pointer to its own stack_head.

struct __attribute((aligned(16))) stack_head {
    void (*entry)(struct stack_head *);
    // ...
};

The structure must have 16-byte alignment on all architectures. I used an attribute to help keep this straight, and it can help when using sizeof to place the structure, as I’ll demonstrate later.

Now for the cool part: The ... can be anything you want! Use that area to seed the new stack with whatever thread-local data is necessary. It’s a neat feature you don’t get from standard thread spawning interfaces. If I plan to “join” a thread later — wait until it’s done with its work — I’ll put a join futex in this space:

struct __attribute((aligned(16))) stack_head {
    void (*entry)(struct stack_head *);
    int join_futex;
    // ...
};

More details on that futex shortly.

The clone wrapper

I call the clone wrapper newthread. It has the inline assembly for the system call, and since it includes a ret to diverge the threads, it’s a “naked” function just like with setjmp. The compiler will generate no prologue or epilogue, and the function body is limited to inline assembly without input/output operands. It cannot even reliably reference its parameters by name. Like clone, it doesn’t accept a thread entry point. Instead it accepts a stack_head seeded with the entry point. The whole wrapper is just six instructions:

__attribute((naked))
static long newthread(struct stack_head *stack)
{
    __asm volatile (
        "mov  %%rdi, %%rsi\n"     // arg2 = stack
        "mov  $0x50f00, %%edi\n"  // arg1 = clone flags
        "mov  $56, %%eax\n"       // SYS_clone
        "syscall\n"
        "mov  %%rsp, %%rdi\n"     // entry point argument
        "ret\n"
        : : : "rax", "rcx", "rsi", "rdi", "r11", "memory"
    );
}

On x86-64, both function calls and system calls use rdi and rsi for their first two parameters. Per the reference clone(2) prototype above: the first system call argument is flags and the second argument is the new stack, which will point directly at the stack_head. However, the stack argument arrives in rdi, the first function parameter register. So I copy stack into the second argument register, rsi, then load the flags (0x50f00) into the first argument register, rdi. The system call number goes in rax.

Where does that 0x50f00 come from? That’s the bare minimum thread spawn flag set in hexadecimal. If any flag is missing then threads will not spawn reliably — as discovered the hard way by trial and error across different system configurations, not from documentation. It’s computed normally like so:

    long flags = 0;
    flags |= CLONE_FILES;
    flags |= CLONE_FS;
    flags |= CLONE_SIGHAND;
    flags |= CLONE_SYSVSEM;
    flags |= CLONE_THREAD;
    flags |= CLONE_VM;
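
As a sanity check, the hardcoded constant can be reproduced from the kernel headers. This is a minimal sketch, assuming the Linux uapi header linux/sched.h is installed; clone_thread_flags is a hypothetical name and not part of the demo.

```c
#include <linux/sched.h>  // CLONE_* constants from the kernel uapi headers

// Hypothetical helper: builds the same flag set as the listing above
// from named constants instead of a magic number.
static long clone_thread_flags(void)
{
    long flags = 0;
    flags |= CLONE_FILES;
    flags |= CLONE_FS;
    flags |= CLONE_SIGHAND;
    flags |= CLONE_SYSVSEM;
    flags |= CLONE_THREAD;
    flags |= CLONE_VM;
    return flags;
}

// Fails the build if the headers ever disagree with the constant used
// in the assembly.
_Static_assert(
    (CLONE_FILES | CLONE_FS | CLONE_SIGHAND |
     CLONE_SYSVSEM | CLONE_THREAD | CLONE_VM) == 0x50f00,
    "clone flag set does not match 0x50f00"
);
```

A check like this keeps the magic number in the inline assembly honest without making the assembly itself depend on any header.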

When the system call returns, it copies the stack pointer into rdi, the first argument for the entry point. In the new thread the stack pointer will be the same value as stack, of course. In the old thread this is a harmless no-op because rdi is a volatile register in this ABI. Finally, ret pops the address at the top of the stack and jumps. In the old thread this returns to the caller with the system call result, either an error (negative errno) or the new thread ID. In the new thread it pops the first element of stack_head which, of course, is the entry point. That’s why it must be first!

The thread has nowhere to return from the entry point, so when it’s done it must either block indefinitely or use the exit (not exit_group) system call to terminate itself.

Caller point of view

The caller side looks something like this:

static void threadentry(struct stack_head *stack)
{
    // ... do work ...
    __atomic_store_n(&stack->join_futex, 1, __ATOMIC_SEQ_CST);
    futex_wake(&stack->join_futex);
    exit(0);
}

__attribute((force_align_arg_pointer))
void _start(void)
{
    struct stack_head *stack = newstack(1<<16);
    stack->entry = threadentry;
    // ... assign other thread data ...
    stack->join_futex = 0;
    newthread(stack);

    // ... do work ...

    futex_wait(&stack->join_futex, 0);
    exit_group(0);
}

Despite the minimalist, 6-instruction clone wrapper, this is taking the shape of a conventional threading API. It would only take a bit more to hide the futex, too. Speaking of which, what’s going on there? The same principle as a WaitGroup. The futex, an integer, is zero-initialized, indicating the thread is running (“not done”). The joiner tells the kernel to wait until the integer is non-zero, which it may already be since I don’t bother to check first. When the child thread is done, it atomically sets the futex to non-zero and wakes all waiters, which might be nobody.
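
The join in the listing waits exactly once. A wait can return before the value changes, for instance when a signal interrupts it, so a more defensive joiner rechecks the value in a loop. The following is a sketch under assumptions: it uses libc’s syscall() instead of the SYSCALL macros purely to stay self-contained, and thread_join is a hypothetical name.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical join helper: block until *futex becomes non-zero.
// Assumes libc's syscall() wrapper, unlike the libc-free demo.
static void thread_join(int *futex)
{
    while (!__atomic_load_n(futex, __ATOMIC_SEQ_CST)) {
        // The kernel only sleeps if *futex is still 0 when it checks.
        // The call may also return early, hence the enclosing loop.
        syscall(SYS_futex, futex, FUTEX_WAIT, 0, (void *)0, (void *)0, 0);
    }
}
```

If the thread already finished, the load sees a non-zero value and the function returns without making a system call at all.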

Caveat: It’s not safe to free/reuse the stack after a successful join. It only indicates the thread is done with its work, not that it exited. You’d need to wait for its SIGCHLD (or use CLONE_CHILD_CLEARTID). If this sounds like a problem, consider your context more carefully: Why do you feel the need to free the stack? It will be freed when the process exits. Worried about leaking stacks? Why are you starting and exiting an unbounded number of threads? In the worst case park the thread in a thread pool until you need it again. Only worry about this sort of thing if you’re building a general purpose threading API like pthreads. I know it’s tempting, but avoid doing that unless you absolutely must.

What’s with the force_align_arg_pointer? Linux doesn’t align the stack for the process entry point like a System V ABI function call. Processes begin life with an unaligned stack. This attribute tells GCC to fix up the stack alignment in the entry point prologue, just like on Windows. If you want to access argc, argv, and envp you’ll need more assembly. (I wish doing really basic things without libc on Linux didn’t require so much assembly.)

__asm (
    ".global _start\n"
    "_start:\n"
    "   movl  (%rsp), %edi\n"
    "   lea   8(%rsp), %rsi\n"
    "   lea   8(%rsi,%rdi,8), %rdx\n"
    "   call  main\n"
    "   movl  %eax, %edi\n"
    "   movl  $60, %eax\n"
    "   syscall\n"
);

int main(int argc, char **argv, char **envp)
{
    // ...
}

Getting back to the example usage, it has some regular-looking system call wrappers. Where do those come from? Start with this 6-argument generic system call wrapper.

long syscall6(long n, long a, long b, long c, long d, long e, long f)
{
    register long ret;
    register long r10 asm("r10") = d;
    register long r8  asm("r8")  = e;
    register long r9  asm("r9")  = f;
    __asm volatile (
        "syscall"
        : "=a"(ret)
        : "a"(n), "D"(a), "S"(b), "d"(c), "r"(r10), "r"(r8), "r"(r9)
        : "rcx", "r11", "memory"
    );
    return ret;
}

I could define syscall5, syscall4, etc. but instead I’ll just wrap it in macros. The former would be more efficient since the latter wastes instructions zeroing registers for no reason, but for now I’m focused on compacting the implementation source.

#define SYSCALL1(n, a) \
    syscall6(n,(long)(a),0,0,0,0,0)
#define SYSCALL2(n, a, b) \
    syscall6(n,(long)(a),(long)(b),0,0,0,0)
#define SYSCALL3(n, a, b, c) \
    syscall6(n,(long)(a),(long)(b),(long)(c),0,0,0)
#define SYSCALL4(n, a, b, c, d) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),0,0)
#define SYSCALL5(n, a, b, c, d, e) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),0)
#define SYSCALL6(n, a, b, c, d, e, f) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),(long)(f))

Now we can have some exits:

__attribute((noreturn))
static void exit(int status)
{
    SYSCALL1(SYS_exit, status);
    __builtin_unreachable();
}

__attribute((noreturn))
static void exit_group(int status)
{
    SYSCALL1(SYS_exit_group, status);
    __builtin_unreachable();
}

Simplified futex wrappers:

static void futex_wait(int *futex, int expect)
{
    SYSCALL4(SYS_futex, futex, FUTEX_WAIT, expect, 0);
}

static void futex_wake(int *futex)
{
    SYSCALL3(SYS_futex, futex, FUTEX_WAKE, 0x7fffffff);
}

And so on.
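
To illustrate how further wrappers follow the same pattern, here is a sketch of a write wrapper. It assumes x86-64 Linux, where system call number 1 is write, and it repeats the generic wrapper so that it stands alone; sys_write is a hypothetical name not found in the demo.

```c
// Generic wrapper, as in the listing above (x86-64 Linux only).
static long syscall6(long n, long a, long b, long c, long d, long e, long f)
{
    register long ret;
    register long r10 __asm("r10") = d;
    register long r8  __asm("r8")  = e;
    register long r9  __asm("r9")  = f;
    __asm volatile (
        "syscall"
        : "=a"(ret)
        : "a"(n), "D"(a), "S"(b), "d"(c), "r"(r10), "r"(r8), "r"(r9)
        : "rcx", "r11", "memory"
    );
    return ret;
}

#define SYSCALL3(n, a, b, c) \
    syscall6(n,(long)(a),(long)(b),(long)(c),0,0,0)

// Hypothetical wrapper: returns the number of bytes written, or a
// negative errno on failure, exactly as the raw system call does.
static long sys_write(int fd, const void *buf, long len)
{
    return SYSCALL3(1, fd, buf, len);  // 1 = write on x86-64
}
```

Note the error convention: there is no errno variable at this level, so a bad file descriptor comes back as a small negative return value.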

Finally I can talk about that newstack function. It’s just a wrapper around an anonymous memory map allocating pages from the kernel. I’ve hardcoded the constants for the standard mmap allocation since they’re nothing special or unusual. The return value check is a little tricky since a large portion of the negative range is valid, so I only want to check for a small range of negative errnos. (Allocating an arena looks basically the same.)

static struct stack_head *newstack(long size)
{
    unsigned long p = SYSCALL6(SYS_mmap, 0, size, 3, 0x22, -1, 0);
    if (p > -4096UL) {
        return 0;
    }
    long count = size / sizeof(struct stack_head);
    return (struct stack_head *)p + count - 1;
}

The aligned attribute comes into play here: I treat the result like an array of stack_head and return the last element. The attribute ensures each individual element is aligned.
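
The placement arithmetic can be tried in isolation, without mmap. This sketch assumes a region that is already 16-byte aligned and whose size is a multiple of the structure size; place_head is a hypothetical helper that mirrors the last two lines of newstack.

```c
#include <stdint.h>

struct __attribute((aligned(16))) stack_head {
    void (*entry)(struct stack_head *);
    int join_futex;
};

// Hypothetical helper: treat the region as an array of stack_head and
// return its last element, i.e. the slot at the high end.
static struct stack_head *place_head(void *base, long size)
{
    long count = size / sizeof(struct stack_head);
    return (struct stack_head *)base + count - 1;
}

// Because of the aligned attribute, sizeof is padded to a multiple of
// 16, so every element of such an array stays 16-byte aligned.
_Static_assert(sizeof(struct stack_head) % 16 == 0, "padding expected");
```

The returned pointer doubles as the initial stack pointer, which is why its alignment matters: the thread’s stack grows downward from just below the stack_head.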

That’s it! There’s not much to it other than a few thoughtful assembly instructions. It took doing this a few times in a few different programs before I noticed how simple it can be.

CRT-free in 2023: tips and tricks

2023-02-15T02:12:00Z

Seven years ago I wrote about “freestanding” Windows executables. After an additional seven years of practical experience both writing and distributing such programs, half using a custom-built toolchain, it’s time to revisit these cabalistic incantations and otherwise scant details. I’ve tweaked my older article over the years as I’ve learned, but this is a full replacement and does not assume you’ve read it. The “why” has been covered and the focus will be on the “how”. Both the GNU and MSVC toolchains will be considered.

I no longer call these “freestanding” programs since that term is, at best, inaccurate. In fact, we will be actively avoiding GCC features associated with that label. Instead I call these CRT-free programs, where CRT stands for the C runtime, the Windows-oriented term for libc. This term communicates both intent and scope.

Entry point

You should already know that main is not the program’s entry point, but a C application’s entry point. The CRT provides the entry point, where it initializes the CRT, including parsing command line options, then calls the application’s main. The real entry point doesn’t have a name. It’s just the address of the function to be called by the loader without arguments.

You might naively assume you could continue using the name main and tell the linker to use it as the entry point. You would be wrong. Avoid the name main! It has a special meaning in C and gets special treatment. Using it without a conventional CRT will confuse your tools and may cause build issues.

While you can use almost any other name you like, the conventional names are mainCRTStartup (console subsystem) and WinMainCRTStartup (windows subsystem). It’s easy to remember: Append CRTStartup to the name you’d use in a normal CRT-linking application. I strongly recommend using these names because it reduces friction. Your tools are already familiar with them, so you won’t need to do anything special.

int mainCRTStartup(void);     // console subsystem
int WinMainCRTStartup(void);  // windows subsystem

The MSVC linker documentation says the entry point uses the __stdcall calling convention. Ignore this and do not use __stdcall for your entry point! Since entry points take no arguments, there is no practical difference from the __cdecl calling convention, so it does not actually matter. Rather, the goal is to avoid __stdcall function decorations. In particular, the GNU linker --entry option does not understand them, nor can it find decorated entry points on its own. If you use __stdcall, then the 32-bit GNU linker will silently (!) choose the beginning of your .text section as the entry point.

If you’re using C++, then of course you will also need to use extern "C" so that it’s not name-mangled. Otherwise the results are similarly bad.

If using -fwhole-program, you will need to mark your entry point as externally visible for GCC so that it knows it’s an entry point. While linkers are familiar with conventional entry point names, GCC the compiler is not. Normally you do not need to worry about this.

__attribute((externally_visible))  // for -fwhole-program
int mainCRTStartup(void)
{
    return 0;
}

The entry point returns int. If there are no other threads then the process will exit with the returned value as its exit status. In practice this is only useful for console programs. Windows subsystem programs have threads started automatically, without warning, and it’s almost certain your main thread is not the last thread. You probably want to use ExitProcess or even TerminateProcess instead of returning. The latter exits more abruptly and can avoid issues with certain subsystems, like DirectSound, not shutting down gracefully: It doesn’t even let them try.

int WinMainCRTStartup(void)
{
    // ...
    TerminateProcess(GetCurrentProcess(), 0);
}

Compilation

Starting with the GNU toolchain, you have two ways to get into “CRT-free mode”: -nostartfiles and -nostdlib. The former is more dummy-proof, and it’s what I use in build documentation. The latter can be more complicated, but when it succeeds you get guarantees about the result. I use it in build scripts I intend to run myself, which I want to fail if they don’t do exactly what I expect. To illustrate, consider this trivial program:

#include <windows.h>

int mainCRTStartup(void)
{
    ExitProcess(0);
}

This program uses ExitProcess from kernel32.dll. Compiling is easy:

$ cc -nostartfiles example.c

The -nostartfiles prevents it from linking the CRT entry point, but it still implicitly passes other “standard” linker flags, including libraries -lmingw32 and -lkernel32. Programs can use kernel32.dll functions without explicitly linking that DLL. But, hey, isn’t -lmingw32 the CRT, the thing we’re avoiding? It is, but it wasn’t actually linked because the program didn’t reference it.

$ objdump -p a.exe | grep -Fi .dll
        DLL Name: KERNEL32.dll

However, -nostdlib does not pass any of these libraries, so you need to do so explicitly.

$ cc -nostdlib example.c -lkernel32

The MSVC toolchain behaves a little like -nostartfiles, not linking a CRT unless you need it, semi-automatically. However, you’ll need to list kernel32.dll and tell it which subsystem you’re using.

$ cl example.c /link /subsystem:console kernel32.lib

However, MSVC has a handy little feature to list these arguments in the source file.

#ifdef _MSC_VER
  #pragma comment(linker, "/subsystem:console")
  #pragma comment(lib, "kernel32.lib")
#endif

This information must go somewhere, and I prefer the source file rather than a build script. Then anyone can point MSVC at the source without worrying about options.

$ cl example.c

I try to make all my Windows programs so simply built.

Stack probes

On Windows, it’s expected that stacks will commit dynamically. That is, the stack is merely reserved address space, and it’s only committed when the stack actually grows into it. This made sense 30 years ago as a memory saving technique, but today it no longer makes sense. However, programs are still built to use this mechanism.

To function properly, programs must touch each stack page for the first time in order. Normally that’s not an issue, but if your stack frame exceeds the page size, there’s a chance it might step over a page. When a function has a large stack frame, GCC inserts a call to a “stack probe” in libgcc that touches its pages in the prologue. It’s not unlike stack clash protection.

For example, if I have a 4kiB local variable:

int mainCRTStartup(void)
{
    char buf[1<<12] = {0};
    return 0;
}

When I compile with -nostdlib:

$ cc -nostdlib example.c
ld: ... undefined reference to `___chkstk_ms'

It’s trying to link the CRT stack probe. You can disable this behavior with -mno-stack-arg-probe.

$ cc -mno-stack-arg-probe -nostdlib example.c

Or you can just link -lgcc to provide a definition:

$ cc -nostdlib example.c -lgcc

Had you used -nostartfiles, you wouldn’t have noticed because it passes -lgcc automatically. It’s “dummy-proof” because this sort of issue goes away before it comes up, though for the same reason it’s harder to tell exactly what went into a program.

If you disable the probe altogether — my preference — you’ve only solved the linker problem, but the underlying stack commit problem remains and your program may crash. You can solve that by telling the linker to ask the loader to commit a larger stack up front rather than grow it at run time. Say, 2MiB:

$ cc -mno-stack-arg-probe -Xlinker --stack=0x200000,0x200000 example.c

Of course, I wish that this was simply the default behavior because it’s far more sensible! Another option is to avoid large stack frames in the first place. Allocate locals larger than 4kiB in, say, a scratch arena instead of on the stack.

MSVC doesn’t have libgcc of course, but it still generates stack probes both for growing the stack and for security checks. The latter requires kernel32.dll, so if I compile the same program with MSVC, I get a bunch of linker failures:

$ cl example.c /link /subsystem:console
... unresolved external symbol __imp_RtlCaptureContext ...
... and 7 more ...

Using /Gs1000000000 turns off the stack probes, /GS- turns off the checks, /stack commits a larger stack:

$ cl /GS- /Gs1000000000 example.c /link
     /subsystem:console /stack:0x200000,0x200000

Though, as before, you could also avoid large stack frames in the first place.

Built-in functions… ugh

The three major C and C++ compilers — GCC, MSVC, Clang — share a common, evil weakness: “built-in” functions. No matter what, they each assume you will supply definitions for standard string functions at link time, particularly memset and memcpy. They do this no matter how many “seriously now, do not use standard C functions” options you pass. When you don’t link a CRT, you may need to define them yourself.

In case that sounds easy, there’s a catch-22: The compiler will transform your memset definition — that is, in a function named memset — into a call to itself. After all, it looks an awful lot like memset! This typically manifests as an infinite loop. This will even compile and appear to work — until your program hangs. It’s amazing that each of the major compilers has this crummy behavior.

No matter what you may have read, -fno-builtin is not a solution. It’s merely a sometimes-honored request, and both GCC and Clang will continue inserting calls to built-in functions you said do not exist. For example, making an especially large local variable (and using volatile to prevent it from being optimized out):

int mainCRTStartup(void)
{
    volatile char buf[1<<14] = {0};
    return 0;
}

As of this writing, the latest GCC and Clang will generate a memset call despite -fno-builtin:

$ cc -mno-stack-arg-probe -fno-builtin -nostdlib example.c
ld: ... undefined reference to `memset' ...

If you want to be absolutely pure, you will need to address this in just about any non-trivial program. On the other hand, -nostartfiles will grab a definition from msvcrt.dll for you:

$ cc -nostartfiles example.c
$ objdump -p a.exe | grep -Fi .dll
        DLL Name: msvcrt.dll

To be clear, this is a completely legitimate and pragmatic route! You get the benefits of both worlds: the CRT is still out of the way, but there’s also no hassle from misbehaving compilers. If this sounds like a good deal, then do it! (For on-lookers feeling smug: there is no such easy, general solution for this problem on Linux.)

But me, I want that CRT-free purity, damnit! There are a few options. Option 1, make it unoptimizable. Here I’ve added fake-out inline assembly:

void *memset(void *p, int c, size_t n)
{
    char *s = p;
    for (size_t i = 0; i < n; i++) {
        __asm("");
        s[i] = c;
    }
    return p;
}

Alternatively use volatile. The downside is your program may be slower since it prevents optimizations you do want. Option 2, disable the particular troublesome optimization.

__attribute((optimize("no-tree-loop-distribute-patterns")))
void *memset(void *p, int c, size_t n)
{
    // ...
}

Or for the whole program:

$ cc -fno-tree-loop-distribute-patterns ...

But will that work reliably in the future? Option 3, implement it with inline assembly since it’s opaque to optimization.

void *memset(void *d, int c, size_t n)
{
    void *r = d;
    __asm volatile (
        "rep stosb"
        : "=D"(d), "=a"(c), "=c"(n)
        : "0"(d), "1"(c), "2"(n)
        : "memory"
    );
    return r;
}

Normally this option could be severe since you’d need assembly for every target architecture, but Windows (currently) supports few architectures. You probably only care about x86 and x64, and the inline assembly above is a polyglot! Important: Be wary of copy-pasting such inline assembly from Stack Overflow because it’s often wrong.

Regardless, I suggest putting each definition in its own section so that they can be discarded via -Wl,--gc-sections when unused:

__attribute((section(".text.memset")))
void *memset(void *d, int c, size_t n)
{
    // ...
}

In the past I’ve needed to provide definitions for memcmp, memset, memcpy, memmove, and even strlen.
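
The same option 1 trick carries over to the other functions. As an illustration, here is a sketch of memmove with the empty inline assembly barrier, assuming GCC or Clang. It is named xmemmove, a hypothetical name, so that it can be tried in a hosted build without colliding with libc; in a CRT-free build it would be named memmove.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical name: rename to memmove in a CRT-free build.
void *xmemmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if ((uintptr_t)d < (uintptr_t)s) {
        // Destination is below the source: copy front to back.
        for (size_t i = 0; i < n; i++) {
            __asm("");  // barrier so the loop isn't turned into a call
            d[i] = s[i];
        }
    } else {
        // Destination is above the source: copy back to front so that
        // overlapping bytes are read before they are overwritten.
        for (size_t i = n; i > 0; i--) {
            __asm("");
            d[i-1] = s[i-1];
        }
    }
    return dst;
}
```

As with the memset above, the barrier costs some performance, so this is a starting point and not a tuned implementation.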

Unfortunately the MSVC situation is mostly worse. When it inserts such a CRT call it will not automatically pick up a CRT like -nostartfiles. There’s no inline assembly, and it’s harder to selectively disable the troublesome optimizations. Instead I’ve been using intrinsics like __stosb. MSVC has a larger variety of them, which makes up a bit for its lack of inline assembly.

#pragma function(memset)
void *memset(void *d, int c, size_t n)
{
    __stosb(d, c, n);
    return d;
}

I don’t quite understand the purpose of the #pragma, but this works.

Stack alignment on 32-bit x86

GCC expects a 16-byte aligned stack and generates code accordingly. Such is dictated by the x64 ABI, so that’s a given on 64-bit Windows. However, the x86 ABIs only guarantee 4-byte alignment. If no care is taken to deal with it, there will likely be unaligned loads. Some may not be valid (e.g. SIMD) leading to a crash. UBSan disapproves, too. Fortunately there’s a function attribute for this:

__attribute((force_align_arg_pointer))
int mainCRTStartup(void)
{
    // ...
}

GCC will now align the stack in this function’s prologue. Adjustment is only necessary at entry points, as GCC will maintain alignment through its own frames. This includes all entry points, not just the program entry point, particularly thread start functions. Rule of thumb for i686 GCC: If WINAPI or __stdcall appears in a definition, the stack likely requires alignment.

__attribute((force_align_arg_pointer))
DWORD WINAPI mythread(void *arg)
{
    // ...
}

It’s harmless to use this attribute on x64. The prologue will just be a smidge larger. If you’re worried about it, use #ifdef __i686__ to limit it to 32-bit builds.

Putting it all together

If I’ve written a graphical application with WinMainCRTStartup, used large stack frames, marked my entry point as externally visible, plan to support 32-bit builds, and defined a couple of needed string functions, my optimal entry point may look something like:

#ifdef __GNUC__
__attribute((externally_visible))
#endif
#ifdef __i686__
__attribute((force_align_arg_pointer))
#endif
int WinMainCRTStartup(void)
{
    // ...
}

Then my “optimize all the things” release build may look something like:

$ cc -mno-stack-arg-probe -Xlinker --stack=0x200000,0x200000
     -O3 -fwhole-program -Wl,--gc-sections -s -nostdlib -mwindows
     -fno-asynchronous-unwind-tables -o app.exe app.c -lkernel32

Or with MSVC:

$ cl /O2 /GS- /Gs1000000000 app.c /link kernel32.lib
     /subsystem:windows /stack:0x200000,0x200000

Or if I’m taking it easy maybe just:

$ cc -O3 -s -nostartfiles -mwindows -o app.exe app.c

Or with MSVC (linker flags in source):

$ cl /O2 app.c

Let's implement buffered, formatted output

2023-02-13T00:00:00Z

This article was discussed on reddit.

When not using the C standard library, how does one deal with formatted output? Re-implementing the entirety of printf from scratch seems like a lot of work, and indeed it would be. Fortunately it’s rarely necessary. With the right mindset, and considering your program’s actual formatting needs, it’s not as difficult as it might appear. Since it goes hand-in-hand with buffering, I’ll cover both topics at once, including sprintf-like capabilities, which is where we’ll start.

The print-is-append mindset

Buffering amortizes the costs of write (and read) system calls. Many small writes are queued via the buffer into a few large writes. This isn’t just an implementation detail. It’s key in the mindset to tackle formatted output: Printing is appending.

The mindset includes the reverse: Appending is like printing. Consider this next time you reach for strcat or similar. Is this the appropriate destination for this data, or am I just going to print it — i.e. append it to another, different buffer — afterward?

This concept may sound obvious, but consider that there are major, popular programming paradigms where the norm is otherwise. I’ll pick on Python to illustrate, but it’s not alone.

print(f"found {count} items")

This line of code allocates a buffer; formats the value of the variable count into it; allocates a second buffer; copies into it the prefix ("found "), the first buffer, and the suffix (" items"); copies the contents of this second buffer into the standard output buffer; then discards the two temporary buffers. To see for yourself, use the CPython bytecode disassembler on it. (It is pretty neat that string formatting is partially implemented in the compiler and partially parsed at compile time.)

With the print-is-append mindset, you know it’s ultimately being copied into the standard output buffer, and that you can skip the intermediate appending and copying. Avoiding that pessimization isn’t just about the computer’s time, it’s even more about saving your own time implementing formatted output.

In C that line looks like:

printf("found %d items\n", count);

The format string is a domain-specific language (DSL) that is (usually) parsed and evaluated at run time. In essence it’s a little program that says:

Append "found " to the output buffer
Format the given integer into the output buffer
Append " items\n" to the output buffer

For sprintf the output buffer is caller-supplied instead of a buffered stream.

In this implementation we’re going to skip the DSL and express such “format programs” in C itself. It’s more verbose at the call site, but it simplifies the implementation. As a bonus, it’s also faster since the format program is itself compiled by the C compiler. In your own formatted output implementation you could write a printf that, following the format string, calls the append primitives we’ll build below.

Buffer implementation

Let’s begin by defining an output buffer. An output buffer tracks the total capacity and how much has been written. I’ll include a sticky error flag to simplify error checks. For a first pass we’ll start with a sprintf rather than full-blown printf because there’s nowhere yet for the data to go.

#define MEMBUF(buf, cap) {buf, cap, 0, 0}
struct buf {
    unsigned char *buf;
    int cap;
    int len;
    _Bool error;
};

I’m using unsigned char since these are bytes, best understood as unsigned (0–255), particularly important when dealing with encodings. I also wrote a “constructor” macro, MEMBUF, to help with initialization. Next we need a function to append bytes — the core operation:

void append(struct buf *b, unsigned char *src, int len)
{
    int avail = b->cap - b->len;
    int amount = avail<len ? avail : len;
    for (int i = 0; i < amount; i++) {
        b->buf[b->len+i] = src[i];
    }
    b->len += amount;
    b->error |= amount < len;
}

If there wasn’t room, it copies as much as possible and sets the error flag to indicate truncation. It doesn’t return the error. Rather than check after each append, the caller will check after multiple appends, effectively batching the checks into one check. The typical, expected case is that there is no error, so make that path fast.

Since it’s an easy point to miss: append is the only place in the entire implementation where bounds checking comes into play. Everything else can confidently throw bytes at the buffer without worrying if it fits. If it doesn’t, the sticky error flag will indicate such at a more appropriate time.

I could have used memcpy for the loop, but the goal is not to use libc. Besides, not using memcpy means we can pass a null pointer without making it a special exception.

append(b, 0, 0);  // append nothing (no-op)

I expect that static strings are common sources for append, so I’ll add a helper macro which gets the length as a compile-time constant. The null terminator will not be used.

#define APPEND_STR(b, s) append(b, s, sizeof(s)-1)

If that’s not clear yet, it will be once you see an example. It’s also useful to append single bytes:

void append_byte(struct buf *b, unsigned char c)
{
    append(b, &c, 1);
}

With primitive appends done, we can build ever “higher-level” appends. For example, to append a formatted long to the buffer:

void append_long(struct buf *b, long x)
{
    unsigned char tmp[64];
    unsigned char *end = tmp + sizeof(tmp);
    unsigned char *beg = end;
    long t = x>0 ? -x : x;
    do {
        *--beg = '0' - t%10;
    } while (t /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    append(b, beg, end-beg);
}

By working from the negative end — recall that the negative range is larger than the positive — it supports the full range of signed long, whatever it happens to be on this host. With less than 50 lines of code we now have enough to format the example:

char message[256];
struct buf b = MEMBUF(message, sizeof(message));

APPEND_STR(&b, "found ");
append_long(&b, count);
APPEND_STR(&b, " items\n");
if (b.error) {
    // truncated
}

We can continue defining append functions for whatever types we need.

void append_ptr(struct buf *b, void *p)
{
    APPEND_STR(b, "0x");
    uintptr_t u = (uintptr_t)p;
    for (int i = 2*sizeof(u) - 1; i >= 0; i--) {
        append_byte(b, "0123456789abcdef"[(u>>(4*i))&15]);
    }
}

struct vec2 { int x, y; };

void append_vec2(struct buf *b, struct vec2 v)
{
    APPEND_STR(b, "vec2{");
    append_long(b, v.x);
    APPEND_STR(b, ", ");
    append_long(b, v.y);
    append_byte(b, '}');
}

Perhaps you want features like field width? Add a parameter for it… but only if you need it!
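
As one possible shape for that, here is a sketch of a width parameter that right-aligns with spaces, roughly like %5ld. The name append_long_width and the choice of space padding are assumptions made for illustration; the struct and append repeat the listings above.

```c
struct buf {
    unsigned char *buf;
    int cap;
    int len;
    _Bool error;
};

static void append(struct buf *b, const unsigned char *src, int len)
{
    int avail = b->cap - b->len;
    int amount = avail<len ? avail : len;
    for (int i = 0; i < amount; i++) {
        b->buf[b->len+i] = src[i];
    }
    b->len += amount;
    b->error |= amount < len;
}

// Hypothetical variant of append_long: pads on the left with spaces
// until the field is at least width bytes wide. Wider values are
// written in full, never cut short.
static void append_long_width(struct buf *b, long x, int width)
{
    unsigned char tmp[64];
    unsigned char *end = tmp + sizeof(tmp);
    unsigned char *beg = end;
    long t = x>0 ? -x : x;
    do {
        *--beg = '0' - t%10;
    } while (t /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    unsigned char space = ' ';
    for (int pad = width - (int)(end - beg); pad > 0; pad--) {
        append(b, &space, 1);
    }
    append(b, beg, end-beg);
}
```

The padding goes through append like everything else, so an oversized width simply trips the error flag without writing out of bounds.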

Float formatting

As mentioned before, precise float formatting is challenging because it’s full of edge cases. However, if you only need to output a simple format at reduced precision, it’s not difficult. To illustrate, this nearly matches %f, built atop append_long:

void append_double(struct buf *b, double x)
{
    long prec = 1000000;  // i.e. 6 decimals

    if (x < 0) {
        append_byte(b, '-');
        x = -x;
    }

    x += 0.5 / prec;  // round last decimal
    if (x >= (double)(-1UL>>1)) {  // out of long range?
        APPEND_STR(b, "inf");
    } else {
        long integral = x;
        long fractional = (x - integral)*prec;
        append_long(b, integral);
        append_byte(b, '.');
        for (long i = prec/10; i > 1; i /= 10) {
            if (i > fractional) {
                append_byte(b, '0');
            }
        }
        append_long(b, fractional);
    }
}

Output to a handle

So far this writes output to a buffer and truncates when it runs out of space. Usually we want this going to a sink, like a kernel object whether that be a file, pipe, socket, etc. to which we have a handle like a file descriptor. Instead of truncating, we flush the buffer to this sink, at which point there’s room for more output. The error flag is set if the flush fails, but this is essentially the same concept as before.

In these examples I will use a file descriptor int, but you can use whatever sort of handle is appropriate. I’ll add an fd field to the buffer and a new constructor macro:

#define MEMBUF(buf, cap) {buf, cap, 0, -1, 0}
#define FDBUF(fd, buf, cap) {buf, cap, 0, fd, 0}

struct buf {
    unsigned char *buf;
    int cap;
    int len;
    int fd;
    _Bool error;
};

The buffered stream will be polymorphic: Output can go to a memory buffer or to an operating system handle using the same append interface. This is a handy feature standard C doesn’t even have, though POSIX does in the form of fmemopen. Nothing else changes except append, which, if given a valid handle, will flush when full. Attempting to flush a memory buffer sets the error flag.

_Bool os_write(int fd, void *, int);

void flush(struct buf *b)
{
    b->error |= b->fd < 0;
    if (!b->error && b->len) {
        b->error |= !os_write(b->fd, b->buf, b->len);
        b->len = 0;
    }
}

I’ve arranged it so that output stops when there’s an error. Also, I’m using a hypothetical os_write in the platform layer as a full, unbuffered write. Note that unix write(2) may perform partial writes and so must be used in a loop. Win32 WriteFile doesn’t have partial writes, so on Windows an os_write could pass its arguments directly to the operating system.
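On a POSIX system, one possible os_write looks like the following. It is a sketch rather than a definitive platform layer: retrying on EINTR and treating a zero-byte write as failure are choices made here, not something the interface above dictates.

```c
#include <errno.h>
#include <unistd.h>

// Write the whole buffer to fd, looping over partial writes.
// Returns 1 on success, 0 on failure.
_Bool os_write(int fd, void *buf, int len)
{
    unsigned char *p = buf;
    while (len > 0) {
        ssize_t r = write(fd, p, len);
        if (r < 0 && errno == EINTR) {
            continue;  // interrupted before writing anything: retry
        }
        if (r <= 0) {
            return 0;  // real error, or no progress
        }
        p += r;
        len -= (int)r;
    }
    return 1;
}
```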

The program will need to call flush directly when it’s done writing output, or to display output early, e.g. line buffering. In append we’ll use a loop to continue appending and flushing until the input is consumed or an error occurs.

void append(struct buf *b, unsigned char *src, int len)
{
    unsigned char *end = src + len;
    while (!b->error && src<end) {
        int left = end - src;
        int avail = b->cap - b->len;
        int amount = avail<left ? avail : left;

        for (int i = 0; i < amount; i++) {
            b->buf[b->len+i] = src[i];
        }
        b->len += amount;
        src += amount;

        if (amount < left) {
            flush(b);
        }
    }
}

That completes formatted output! We can now do stuff like:

int main(void)
{
    unsigned char mem[1<<10];  // arbitrarily-chosen 1kB buffer
    struct buf stdout = FDBUF(1, mem, sizeof(mem));
    for (long i = 0; i < 1000000; i++) {
        APPEND_STR(&stdout, "iteration ");
        append_long(&stdout, i);
        append_byte(&stdout, '\n');
        // ...
    }
    flush(&stdout);
    return stdout.error;
}

Except for the lack of a format DSL, this should feel familiar.

Let's write a setjmp

2023-02-12T02:23:11Z

This article was discussed on Hacker News.

Yesterday I wrote that setjmp is handy and that it would be nice to have without linking the C standard library. It’s conceptually simple, after all. Today let’s explore some differently-portable implementation possibilities with distinct trade-offs. At the very least it should illuminate why setjmp sometimes requires the use of volatile.

First, a quick review: setjmp and longjmp are a form of non-local goto.

typedef void *jmp_buf[N];
int setjmp(jmp_buf);
void longjmp(jmp_buf, int);

Calling setjmp saves the execution context in a jmp_buf, and longjmp restores this context, returning the thread to this previous point of execution. This means setjmp returns twice: (1) after saving the context, and (2) from longjmp. To distinguish these cases, the first time it returns zero and the second time it returns the value passed to longjmp.
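A minimal illustration of the two returns, using the standard library version from setjmp.h. The names env, fail, and run are made up for this example.

```c
#include <setjmp.h>

static jmp_buf env;

static void fail(void)
{
    longjmp(env, 1);  // never returns here; control resumes inside setjmp
}

// Returns 0 when the work completes normally, 1 when it bailed out
// through longjmp.
int run(int should_fail)
{
    if (setjmp(env) == 0) {
        // first return: the context was just saved
        if (should_fail) {
            fail();
        }
        return 0;
    }
    // second return: arrived via longjmp
    return 1;
}
```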

jmp_buf is an array of some platform-specific type and length. I’ll be using void pointers in this article because it’s a register-sized type that isn’t behind a typedef. Plus they print nicely in GDB as hexadecimal addresses, which made working this out easier.

Using GCC intrinsics

Let’s start with the easiest option. GCC has two intrinsics doing all the hard work for us: __builtin_setjmp and __builtin_longjmp. Its worst case jmp_buf is length 5, but the most popular architectures only use the first 3 elements. Clang supports these intrinsics as well for GCC compatibility.

Be mindful that the semantics are slightly different from the standard C definition, namely that you cannot use longjmp from the same function as setjmp. It also doesn’t touch the signal mask. However, it’s easier to use and you don’t need to worry about volatile.

// NOTE to copy-pasters: semantics differ slightly from standard C
typedef void *jmp_buf[5];
#define setjmp __builtin_setjmp
#define longjmp __builtin_longjmp

If you only care about GCC and/or Clang, then that’s it! It works as-is on every supported target and nothing more is needed. As a bonus, it will be more efficient than the libc version, though I should hope that won’t matter in practice. These are so awesome and convenient that I’m already second-guessing myself: “Do I really need to support other compilers…?”
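Here is a small sketch of the intrinsics in use, assuming GCC or Clang. The jump is made from a separate function since the intrinsics do not allow jumping within the same function, and the noinline attribute is a precaution added here to keep it that way after optimization. The names env, bail, and attempt are invented for the example.

```c
// Five pointers covers the worst case for __builtin_setjmp.
static void *env[5];

__attribute__((noinline))
static void bail(void)
{
    __builtin_longjmp(env, 1);  // the second argument must be the constant 1
}

// Returns 0 when nothing went wrong, 1 when bail() jumped back here.
int attempt(int should_bail)
{
    if (__builtin_setjmp(env)) {
        return 1;  // arrived via __builtin_longjmp
    }
    if (should_bail) {
        bail();
    }
    return 0;
}
```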

Using assembly

If I want to support more compilers I’ll need to write it myself. It’s also an excuse to dig into the details. The execution context is no more than an array of saved registers, and longjmp is merely restoring those registers. One of the registers is the instruction pointer, and setting the instruction pointer is called a jump.

Since we’re talking about registers, that means assembly. We’ll also need to know the target’s calling convention, so this really narrows things down. This implementation will target x86-64 (a.k.a. x64) Windows, but it will support MSVC as an additional compiler. So it’s a different kind of portability. I’ll start with GCC via w64devkit then massage it into something MSVC can use.

I mentioned before that setjmp returns twice. So to return a second time we just need to simulate a normal function return. Obviously that includes restoring the stack pointer like the ret instruction, but it also means preserving all the non-volatile registers a callee is supposed to preserve. These will all go in the execution context.

The x64 calling convention specifies 9 non-volatile registers: rsp, rbp, rbx, rdi, rsi, r12, r13, r14, and r15. We’ll also need the instruction pointer, rip, making it 10 total.

typedef void *jmp_buf[10];

setjmp assembly

The tricky issue is that we need to save the registers immediately inside setjmp before the compiler has manipulated them in a function prologue. That will take more than mere inline assembly. We’ll start with a naked function, which means that GCC will not create a prologue or epilogue. However, that means no local variables, and the function body will be limited to inline assembly, including a ret instruction for the epilogue.

__attribute__((naked))
int setjmp(jmp_buf buf)
{
    __asm(
        // ...
    );
}

The x64 calling convention uses rcx for the first pointer argument, so that’s where we’ll find buf. I’ve arbitrarily decided to store rip first, then the other registers in order. However, the current value of rip isn’t the one we need. The rip we need was just pushed on top of the stack by the caller. I’ll read that off the stack into a scratch register, rax, and then store it in the first element of buf.

    mov (%rsp), %rax
    mov %rax,  0(%rcx)

The stack pointer, rsp, is also indirect since I want the pointer just before rip was pushed, as it would be just after a ret. I use a lea, load effective address, to add 8 bytes (recall: stack grows down), placing the result in a scratch register, then write it into the second element of buf (i.e. 8 bytes into %rcx).

    lea 8(%rsp), %rax
    mov %rax,  8(%rcx)

Everything else is a matter of elbow grease.

    mov %rbp, 16(%rcx)
    mov %rbx, 24(%rcx)
    mov %rdi, 32(%rcx)
    mov %rsi, 40(%rcx)
    mov %r12, 48(%rcx)
    mov %r13, 56(%rcx)
    mov %r14, 64(%rcx)
    mov %r15, 72(%rcx)

With all work complete, return zero to the caller.

    xor %eax, %eax
    ret

Putting it all together, and avoiding a -Wunused-variable:

__attribute__((naked,returns_twice))
int setjmp(jmp_buf buf)
{
    (void)buf;
    __asm(
        "mov (%rsp), %rax\n"
        "mov %rax,  0(%rcx)\n"
        "lea 8(%rsp), %rax\n"
        "mov %rax,  8(%rcx)\n"
        "mov %rbp, 16(%rcx)\n"
        "mov %rbx, 24(%rcx)\n"
        "mov %rdi, 32(%rcx)\n"
        "mov %rsi, 40(%rcx)\n"
        "mov %r12, 48(%rcx)\n"
        "mov %r13, 56(%rcx)\n"
        "mov %r14, 64(%rcx)\n"
        "mov %r15, 72(%rcx)\n"
        "xor %eax, %eax\n"
        "ret\n"
    );
}

Also take note of the returns_twice attribute. It informs GCC of this function’s unusual nature, saying the function doesn’t preserve most non-volatile registers, and induces -Wclobbered diagnostics. Technically this means we could get away with saving only rip, rsp, and rbp — exactly as __builtin_setjmp does — but we’ll need the others for MSVC anyway.

longjmp assembly

In longjmp we need to restore all those registers. For purely aesthetic reasons I’ve decided to do it in reverse order. Everything but rip is easy.

    mov 72(%rcx), %r15
    mov 64(%rcx), %r14
    mov 56(%rcx), %r13
    mov 48(%rcx), %r12
    mov 40(%rcx), %rsi
    mov 32(%rcx), %rdi
    mov 24(%rcx), %rbx
    mov 16(%rcx), %rbp
    mov  8(%rcx), %rsp

The instruction set doesn’t have direct access to rip. It will be a jmp instead of mov, but before jumping we’ll need to prepare the return value. The x64 calling convention says the second argument is passed in rdx, so move that to rax, then jmp to the caller. It’s only a 32-bit operand, C int, so edx instead of rdx.

    mov %edx, %eax
    jmp *0(%rcx)

Putting it all together, and adding the noreturn attribute:

__attribute__((naked,noreturn))
void longjmp(jmp_buf buf, int ret)
{
    (void)buf;
    (void)ret;
    __asm(
        "mov 72(%rcx), %r15\n"
        "mov 64(%rcx), %r14\n"
        "mov 56(%rcx), %r13\n"
        "mov 48(%rcx), %r12\n"
        "mov 40(%rcx), %rsi\n"
        "mov 32(%rcx), %rdi\n"
        "mov 24(%rcx), %rbx\n"
        "mov 16(%rcx), %rbp\n"
        "mov  8(%rcx), %rsp\n"
        "mov %edx, %eax\n"
        "jmp *0(%rcx)\n"
    );
}

The C standard says that if ret is zero then longjmp will return 1 from setjmp instead. I leave that detail as a reader exercise. Otherwise this is a complete, working setjmp. It works perfectly when I swap it in for setjmp.h in my u-config test suite.
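For that exercise, the adjustment itself is tiny. Expressed in C as a hypothetical helper (longjmp_value is not a name from this article), the value that should end up in eax before the final jmp is:

```c
// Hypothetical helper: the value setjmp produces on its second return,
// given the argument passed to longjmp. Zero is mapped to 1 so that the
// second return can always be told apart from the first.
int longjmp_value(int ret)
{
    return ret ? ret : 1;
}
```

In the assembly version, the equivalent logic would sit between the mov into eax and the jmp.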

Considering volatile

Now that you’ve seen the guts, let’s talk about volatile and why it’s necessary. Consider this function, example, which calls a work function that may return through setjmp (e.g. on failure).

void work(jmp_buf);

int example(void)
{
    int r = 0;
    jmp_buf buf;
    if (!setjmp(buf)) {
        // first return
        r = 1;
        work(buf);
    } else {
        // second return
    }
    return r;
}

It stores to r after the first setjmp return, then loads r after the second setjmp return. However, r may have been stored in the execution context. Since it’s used across function calls, it would be reasonable to store this variable in a non-volatile register like ebx. If so, it will be restored to its value at the moment of the first call to setjmp, in which case the old r would be read after restoration by longjmp. If it’s not stored in a register, but on the stack, then on the second return the function will read the latest value out of the stack. In practice, if work returns through longjmp, this function may return either 0 or 1, probably determined by the optimization level.

The solution is to qualify r with volatile, which forces the compiler to store the variable on the stack and never cache it in a register.

    volatile int r = 0;
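Here is a complete, runnable version of the fixed function against the standard setjmp.h, with a work function that always fails. That work body is an assumption made so the example is self-contained. With the qualifier in place the result is well-defined.

```c
#include <setjmp.h>

static void work(jmp_buf buf)
{
    longjmp(buf, 1);  // simulate failure: return through setjmp
}

int example(void)
{
    volatile int r = 0;  // modified between setjmp and longjmp
    jmp_buf buf;
    if (!setjmp(buf)) {
        // first return
        r = 1;
        work(buf);
    } else {
        // second return
    }
    return r;  // the store of 1 is visible because r is volatile
}
```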

Though since our setjmp is marked returns_twice, GCC will never store r in a register across setjmp calls. This potentially hides a bug in the program that would occur under some other compilers, but GCC will (usually) warn about it.

Pure assembly and MSVC

MSVC understands neither __attribute__ nor this inline assembly, so it cannot compile these functions. I could compile my setjmp with GCC and the rest of the program with MSVC, which means I need two compilers. Instead, I’ll move to pure assembly and assemble with GNU as (TODO: port to MASM?), so we’ll only need a tiny piece of the GNU toolchain.

	.global setjmp
setjmp:
	mov (%rsp), %rax
	mov %rax,  0(%rcx)
	lea 8(%rsp), %rax
	mov %rax,  8(%rcx)
	mov %rbp, 16(%rcx)
	mov %rbx, 24(%rcx)
	mov %rdi, 32(%rcx)
	mov %rsi, 40(%rcx)
	mov %r12, 48(%rcx)
	mov %r13, 56(%rcx)
	mov %r14, 64(%rcx)
	mov %r15, 72(%rcx)
	xor %eax, %eax
	ret

	.globl longjmp
longjmp:
	mov 72(%rcx), %r15
	mov 64(%rcx), %r14
	mov 56(%rcx), %r13
	mov 48(%rcx), %r12
	mov 40(%rcx), %rsi
	mov 32(%rcx), %rdi
	mov 24(%rcx), %rbx
	mov 16(%rcx), %rbp
	mov  8(%rcx), %rsp
	mov %edx, %eax
	jmp *0(%rcx)

Then some declarations in C:

typedef void *jmp_buf[10];
int setjmp(jmp_buf);
_Noreturn void longjmp(jmp_buf, int);

I’ll need to enable C11 for that _Noreturn in MSVC. Assemble, compile, and link:

$ as -o setjmp.obj setjmp.s
$ cl /std:c11 program.c setjmp.obj

That generally works! If I rename to xsetjmp and xlongjmp to avoid conflicting with the CRT definitions, drop them into the u-config test suite in place of setjmp.h, then compile with MSVC, it passes all tests using my alternate implementation in MSVC as well as GCC. Pretty cool!

Takeaway

I’m not sure if I’ll ever use the assembly, but writing this article led me to try the GCC intrinsics, and I’m so impressed I’m still thinking about ways I can use them. My main thought is out-of-memory situations in arena allocators, using a non-local exit to roll back to a savepoint, even if just to return an error. This is nicer than either terminating the program or handling OOM errors on every allocation. Very roughly:

typedef struct {
    size_t cap;
    size_t off;
    void *jmp_buf[5];
} Arena;

// Place an arena and savepoint an out-of-memory jump.
#define OOM(a, m, n) __builtin_setjmp((a = place(m, n))->jmp_buf)

// Place a new arena at the front of the buffer.
Arena *place(void *mem, size_t size)
{
    assert(size >= sizeof(Arena));
    Arena *a = mem;
    a->cap = size;
    a->off = sizeof(Arena);
    return a;
}

void *alloc(Arena *a, size_t size)
{
    size_t avail = a->cap - a->off;
    if (avail < size) {
        __builtin_longjmp(a->jmp_buf, 1);
    }
    void *p = (char *)a + a->off;
    a->off += size;
    return p;
}

Usage would look like:

int compute(void *workmem, size_t memsize)
{
    Arena *arena;
    if (OOM(arena, workmem, memsize)) {
        // jumps here when out of memory
        return COMPUTE_OOM;
    }

    Thing *t = PUSHSTRUCT(arena, Thing);
    // ...

    return COMPUTE_OK;
}
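To make the sketch above runnable with GCC or Clang, a few blanks need filling in. Everything marked as assumed below is invented for illustration: PUSHSTRUCT, Thing, the COMPUTE_ codes, the nthings parameter, and the noinline attribute that keeps the jump in a separate function from the setjmp. Alignment is ignored, as in the original.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t cap;
    size_t off;
    void *jmp_buf[5];
} Arena;

#define OOM(a, m, n) __builtin_setjmp((a = place(m, n))->jmp_buf)
#define PUSHSTRUCT(a, T) ((T *)alloc(a, sizeof(T)))  // assumed helper

enum { COMPUTE_OK, COMPUTE_OOM };    // assumed result codes
typedef struct { int x, y; } Thing;  // assumed placeholder type

// Place a new arena at the front of the buffer.
static Arena *place(void *mem, size_t size)
{
    assert(size >= sizeof(Arena));
    Arena *a = mem;
    a->cap = size;
    a->off = sizeof(Arena);
    return a;
}

__attribute__((noinline))
static void *alloc(Arena *a, size_t size)
{
    size_t avail = a->cap - a->off;
    if (avail < size) {
        __builtin_longjmp(a->jmp_buf, 1);  // roll back to the savepoint
    }
    void *p = (char *)a + a->off;
    a->off += size;
    return p;
}

int compute(void *workmem, size_t memsize, int nthings)
{
    Arena *arena;
    if (OOM(arena, workmem, memsize)) {
        return COMPUTE_OOM;  // jumps here when out of memory
    }
    for (int i = 0; i < nthings; i++) {
        Thing *t = PUSHSTRUCT(arena, Thing);
        t->x = t->y = i;
    }
    return COMPUTE_OK;
}
```

Note that none of the allocation sites in compute check for failure. The only out-of-memory handling is the single branch at the top.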

More granular snapshots can be made further down the stack by allocating subarenas out of the main arena. I have yet to try this out in a practical program.