I read stories about the efforts to mitigate speculative-execution vulnerabilities (and resulting overhead), and I keep wondering if the industry as a whole will have a reckoning in its approach to designing computers.

lwn.net/Articles/901834/

I'm reminded of Kernighan and Plauger saying:

"Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?"

It feels like there's a similar issue with designing computational hardware, but instead of twice as hard, it's at least an order of magnitude harder.

Not to mention the software mitigations for hardware bugs are themselves examples of being as clever as you can be. :/

@cstanhope

It feels like there's a similar issue with designing computational hardware, but instead of twice as hard, it's at least an order of magnitude harder.

It's way, way, way more than an order of magnitude harder.

What we're seeing is the result of trying to debug a massively parallel system. This is true even for single-core, single-issue, in-order processors, much less systems of processors under load with a particular memory configuration.

The only actually useful tool in a hardware designer's toolbox to combat this is formal verification; however, you can only formally verify against properties you're aware of. This is the first concern with massively parallel systems: not only are they non-deterministic, they're also unpredictable.
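
To make the "properties you're aware of" point concrete, here is a minimal sketch of an exhaustive reachability check over a toy state machine. Everything in it is hypothetical and invented for illustration (the `step` model, the hidden coupling between a status poll and a DMA write, and both property names); it is not a model of any real core. The check only ever answers the question you asked it.

```python
from collections import deque

DMA_LENGTH = 4  # hypothetical transfer length for this toy model


def step(state, poll):
    """Advance the toy machine one cycle. State is (dma_count, corrupted)."""
    count, corrupted = state
    if count < DMA_LENGTH:
        # Assumed hidden coupling: a status poll in the same cycle as a
        # DMA write clobbers the data being written.
        corrupted = corrupted or poll
        count += 1
    return (count, corrupted)


def check(prop, init=(0, False)):
    """Breadth-first search of all reachable states.

    Returns the first reachable state that violates prop, or None.
    """
    seen = {init}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if not prop(state):
            return state
        for poll in (False, True):
            nxt = step(state, poll)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None


def count_in_range(state):
    """The property the designer thought to write down."""
    return 0 <= state[0] <= DMA_LENGTH


def data_intact(state):
    """The property nobody wrote down until the bug showed up."""
    return not state[1]


print(check(count_in_range))  # None: the specified property holds
print(check(data_intact))     # (1, True): a counterexample state
```

The first check passes, so the design counts as "verified" until someone thinks to state the second property.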

Also, formal verification is a terribly expensive process in terms of time. The problems that solvers have to decide for many kinds of properties are NP-hard or NP-complete, so the solvers will basically never terminate on human time scales. So, even here, all you can do is run the verification procedures for as long as you can afford to, and hope that the odds are in your favor and the run covered all the edge cases that could falsify a property.
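
A rough sketch of what "run it for as long as you can afford" looks like, using a toy explicit-state search rather than a real solver. The budget here is a count of states checked (an assumption made to keep the example deterministic; a real run would more likely be limited by wall-clock time), and the 16-bit machine and its property are made up for illustration.

```python
from collections import deque


def bounded_check(init, successors, prop, max_states):
    """Breadth-first search that gives up after checking max_states states.

    Returns "falsified", "proved", or "unknown" (budget exhausted).
    """
    seen = {init}
    queue = deque([init])
    checked = 0
    while queue:
        if checked >= max_states:
            return "unknown"
        state = queue.popleft()
        checked += 1
        if not prop(state):
            return "falsified"
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return "proved"


N_BITS = 16                   # state space has 2**16 states
ALL_ONES = (1 << N_BITS) - 1  # the single bad state, N_BITS steps from reset


def successors(state):
    """Hypothetical machine: any single bit may flip each cycle."""
    return [state ^ (1 << i) for i in range(N_BITS)]


def prop(state):
    """Property under test: the machine never reaches all-ones."""
    return state != ALL_ONES


print(bounded_check(0, successors, prop, max_states=1_000))    # unknown
print(bounded_check(0, successors, prop, max_states=100_000))  # falsified
```

With the small budget the answer is "unknown", which is easy to misread as "no bug found". Each extra state bit doubles the space, so the budget needed to reach the bad state grows exponentially.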

I used formal verification to prove the design of my VDC-II core "correct", for example, and I still have an inexplicable bug where if the CPU continually polls the status register, it will corrupt any in-flight DMA operations in video memory. These two circuits aren't even logically connected to each other, so I don't understand this interaction. Somewhere, there's a property I've not specified. At some (potentially distant) point in the future, I'll revisit the project and work to find out the missing (set of) property(-ies).

@cstanhope Ironically, the work-arounds for using the VDC-II core successfully all involve software tricks which, you guessed it, cut into runtime performance. 😏
