jlm-blog
~jlm

14-Jun-2024

Ridiculous bug with std::random_shuffle

Filed under: programming — jlm @ 17:49

Right now I’m very irate at C++’s STL because its random_shuffle algorithm has a landmine associated with it. I was using it with a vector, which I’ll refer to as v. The call which shuffles v’s contents is random_shuffle(v.begin(), v.end(), rfunc), where rfunc is a function object which, in the case of int-indexable vectors, accepts a single int n and returns a random int in the range [0, n). What’s the obvious way to get a random int in some range? uniform_int_distribution, of course. And boom, there goes the dynamite because uniform_int_distribution(0, n) generates a random int in the range [0, n], not [0, n).

Besides the need to use uniform_int_distribution(0, n-1) being counter-intuitive, some other attributes make this bug frustrating. While random_shuffle requires rfunc’s return value to be in [0, n), it doesn’t check that it is, even when debugging is enabled — instead, it’s our old friend undefined behavior, no compiler diagnostic provided. vector bears some blame too, as it’s happy to silently munge data pointed to by an out-of-range iterator even with debugging enabled, optimization off, warnings enabled — and again, undefined behavior, no compiler diagnostic provided. If 0 is a valid value of a vector element (naturally, for me it was), then the effect of the bug when v.size() is less than v.capacity() is to get a shuffled vector in which one of its elements has been replaced by 0, and the downstream effects of that can easily be too subtle to notice (naturally, for me it was). So, you only get crashes when v.size() == v.capacity() and something in v.data corrupts and gets corrupted by whatever’s just past it in memory — which is a freakin’ rare occurrence!

How do you debug this kind of thing? Since it’s an intermittent bug which any given run of the program is highly unlikely to trigger, turn on core dumps so that when it eventually does occur, you have the dump of the program state available to debug right there. Turn on debugging so you can get the most out of your corefiles. Stick in code which lets you track what you need to track. Here, this shim function is instructive: int f() { int rand_val = uid(rand_eng); return rand_val; }. You need to remove -O to prevent calls to f() from being replaced by calls to uid(rand_eng) and use -O0 to prevent the transient variable rand_val from disappearing. (volatile probably would work also.)

Don’t forget that this is terrible API design, though. We have a library which has a generator of random ints and a consumer of random ints, and their conventions are different enough that the obvious way to combine them is buggy, yet similar enough that they’ll never trigger compiler warnings (much less errors). That’s not a good API choice, that’s a trap for API users! It’s good to have obvious ways to perform the operations API users will want to perform provided they work correctly. Since API design is hard, often you can’t do that. In those cases where you can’t make an obvious use that works, don’t introduce the obvious use in the first place — it’s far worse than having no obvious use at all, because if you make an obvious way to perform an operation, people will perform the operation that way, simply because it’s the obvious way. It should never be wrong. That there are so many ways to perform unsafe operations that are not obvious, not caught by C++ compilers, and not caught at runtime is the thing I dislike most about this programming language. Today I learned one more way to do it, and I should be well beyond that point.

26-Apr-2023

How is this glibc deadlock still unaddressed?

Filed under: programming — jlm @ 13:29

Why Programming is Hard, Volume CXX: the maintainers of critical pthread implementations will ignore multiple bugfixes from parallel programming experts for a longstanding deadlock bug for three years and counting.

24-Apr-2023

Do not meddle with the Application Binary Interface

Filed under: programming — jlm @ 15:18

Why Programming is Hard, Volume CXIX: a comprehensive explanation by JeanHeyd Meneide of how ABI stability stops us from having nice things despite enormous amounts of high-quality work on ABI-consistency and related issues. Naturally, it’s a dispiriting read. Fortunately, half a year later there’s this follow-up on how to fix it.

29-Nov-2022

twitcode #5: Extract HyperRogue achievements from log file

Filed under: programming — jlm @ 17:28

The game HyperRogue is very interesting. Its main gimmick is that it’s played on a hyperbolic plane (surface with negative curvature) instead of an Euclidean plane (surface with zero curvature) like 99.99% of 2-D games are. Playing in this kind of geometry is compellingly mind-bending. Other than the geometry involved, the main way the game differs from typical roguelikes is that the player character doesn’t gain levels or skills in the course of play — instead, the game opens up new “lands” with new environmental challenges and/or monsters with new attributes as the player demonstrates their skill by meeting the unlock requirements for each land.

The game supports “achievements”, where whenever your gameplay meets the condition of some challenge (some very easy, some nightmarishly hard, most of medium difficulty), it records this in a log file. Of course, met achievements aren’t all that the log file contains, but it’s nice enough to start each “met achievement conditions” record with “ACHIEVEMENT” and a marginally-descriptive achievement name, followed by other relevant information. Unfortunately, you can’t get a list of your fulfilled achievements merely by running  grep '^ACHIEVEMENT ' hyperrogue.log  because the game logs when you fulfill an achievement’s conditions separately for every run, not just the first run to fulfill it. So, to extract the records of each achievement’s first fulfillment, we ignore any record with an achievement name we’ve seen before and print out the line otherwise (ie, if we haven’t seen it before), which is nigh-trivial in awk [link]:

/^ACHIEVEMENT / {
    if (!($2 in ach)) {
        ach[$2] = 1;
        print;
    }
}

At 86 characters, it’s the fifth example from me of “twitcode” (programs under 140 bytes). And being in awk, it’s the fifth language I’ve twitcoded as well.

21-Mar-2020

Simple Scala puzzle

Filed under: programming — jlm @ 10:46

Who can guess what this tiny* Scala program does?

object O extends App {
    val x = 123456789L
    val y = 0F
    val z = x − (if (true) x else y)
    println(z)
}

(Apparently this mysterious snippet had been going around a while ago, but I only recently encountered a variation.)

Answer: Prints “−3.0”.

Why?

Short answer: The if/else causes x to get promoted to Float, which rounds it up to 123456792.

Long answer after the fold.

(more…)

17-Feb-2020

Unit tests are not enough

Filed under: programming — jlm @ 20:48

[I wrote this for another blog back in 2011, but it was lost to public view when that blog disappeared. It’s still valid today, though!]

In one of our datastructures, there’s an entry for a 64-bit unique resource identifier. This turned out to be a little too vague a description: Is it an 8-byte array, a 64-bit wide integer, or an instance of a 64-bit UUID class? One of our developers though one thing, and another thought another. No problem, right? The first compile by the one who chose wrong will show the type error. Except we’re using Python, so no type checking until the code runs. Still no problem, because our devs wrote unit tests, so the type error will show up on their first run, right?

Unfortunately, unit tests help not an iota in situations like this. The developer expecting UUID objects writes unit tests which consume UUID objects, and the dev expecting byte arrays writes unit tests which produce byte arrays.

So, integration tests, then? It turns out our devs did these too, verifying that their code integrated with the fancy datastructure class, but this still didn’t catch the mismatch! This is because the datastructure class doesn’t actually do anything with the resource identifier that would trigger a type check: it just accepts it as an entry in a slot when an item is inserted, and gives it out as one of the slot entries when items are looked up, and Python is happy to let an instance of any type pass through unchecked.

It’s only when we plug everything together that this trivial type mismatch shows up. So the moral: Make end-to-end tests early. You need them to turn up even problems as basic as this, and the earlier you find problems, the better.

2-Feb-2019

Less is more, CPU edition

Filed under: general, programming — jlm @ 17:11

I fixed an interesting bug recently. By way of background, I work on industrial robotics control software, which is a mix of time-critical software which has to do its job by a tight deadline (variously between 1 and 10 ms) or a safety watchdog will trigger, and other software that does tasks that aren’t time sensitive. We keep the timing-insensitive software from interfering with the real-time software’s performance by having the scheduler strictly prioritize the deadline-critical tasks over the “best effort” tasks, and further by massively overprovisioning the system with far more CPU and memory than the tasks could ever require.

The bug (as you may have guessed) is that the deadline was sometimes getting missed, triggering the watchdog. What makes the bug interesting is that it was caused by us giving the system too much excess CPU and memory!

The deadline overruns happen when the computer runs best-effort tasks. It only runs those tasks a small fraction of the time, but no matter, they shouldn’t be interfering with the deadline completion at all (and having our software fail its deadlines is unacceptable, of course). The real-time tasks’ peak usage is only about a quarter of the computer’s CPU capacity, and the system gives them that always: 100% of the time they want CPU, they get CPU. They are never delayed, and never have to wait on a resource held by a best-effort task. When they miss the deadlines, it’s always because they’ve gotten jobs that take the usual amount of work to complete, and had their usual amount of time to do it, yet the jobs somehow just take more time to finish. There’s no resource contention, nor are the caches an issue.

Yet when the best-effort tasks execute, the calculations done by the real-time tasks run slower on the same CPU for the same job with the same cache hit rate and with no memory or I/O contention. What’s going on? After hitting some false leads, I discovered that they’re going slower because the CPUs are running at a lower frequency. It turns out the CPUs’ clocks have been stepped down because they’re getting too hot. (Yes, the surrounding air is much warmer than it “should” be, but we don’t have a choice about that.)

The computer is hot because all the CPUs are going at full blast, because all of the best-effort tasks are executing because the computer can fit them all in memory. Half a minute later, the tasks are done, the computer cools off, the CPU frequency gets stepped back up, and the deadline overruns cease. An hour later, the hourly background tasks all go again, the fans spin up to full, but they’re not enough and the CPU frequency steps down, hence the watchdog alerts about missed deadlines.

Okay, maybe we should change the background tasks so they execute in a staggered fashion. But before thinking about how I might do that, I tried disabling ⅔ of the computer’s CPUs instead. The hourly processing now takes 200s instead of 30, but it’s still far below 3600, and that’s what matters. Now the computer stays nice and cool, so the CPUs stay nice and quick, and the deadlines all get met. We were using too many CPUs to get the calculations we needed to get done fast, done fast. Who’d’a thunk?

20-Oct-2018

Scheme got incrementally better

Filed under: programming — jlm @ 08:34

Almost a decade ago, I wrote about how the venerable and elegant but third-tier programming language Scheme had no way to access the program arguments that would work across different implementations, and pointed out that type of large stumbling block tends to hinder the adoption of a language. The last time I worked on a serious project in Scheme was only a little later. So not much news from the Schemeoverse has reached me since then. Which is why I just learned that four years after that post, five years previous to the present day, the Scheme community decided to fix that and standardized on option (c) from that post: (command-line) gets you your program’s arguments. Progress marches on! In the case of Scheme, that march is just veerrryyyy slow.

5-Oct-2017

Tutorial for building .so files

Filed under: linux, programming — jlm @ 18:34

Exactly how one goes about building .so files* isn’t widely understood, and the documentation on it is overcomplicated IMHO. So, I figure the web needs a tutorial demonstrating how to build an extremely simple .so file. Et voilà, here’s a step by step guide for building a .so file on Linux or a similarly flavored Unix using a GNU-compatible toolchain.

We’re going to make two .o files, “call_me.o” and “die.o”, and combine them into a single “call_me_and_die” shared library. Making a library is very much like making an executable, just with some tweaks to the build commands and some extra steps at the end. To make our “call_me_and_die” library, we start the way we would with a “call_me_and_die” executable: writing a .h file for the common function signatures. File so_tut.h:

extern void call_me(void);
extern void die_die_die(void);

(more…)

11-Sep-2017

EBUSY: Another decade, and still sucking

Filed under: general, linux, programming — jlm @ 12:40

Today’s xkcd riffs a common frustration about busy files: programs that can’t do their intended operations on them rarely (if ever) specify the file that’s busy or tell the user which programs are using the file — and without that knowledge, the user can’t fix the problem.

There are tools that can help you discover those things, as I describe using ten years ago, but I’d already been running into EBUSY problems for ten years by then, and the old guard before me for another ten years beyond that. Why haven’t things improved?

Sometimes file_operation(filename) fails — hey, this happens, busy files are a thing, ain’t nothing your program can do about that. But report to the user that file_operation on filename failed for reason strerror(errno)! Not reporting the filename is just sloppy programming.

It’s not like the tools have gotten any better. What, you think it’s acceptable to make the user trace your program, looking through a haystack of every system call to find the needle which is the program’s critical failure? Is that work you’re putting on the user’s shoulders less than sticking the filename in the error message? This has sucked for two generations now. Please help making it suck less.

Powered by WordPress