PALMED

PALMED is a tool that builds a throughput-based cost model of a CPU. Because the model is throughput-based, dependencies between instructions are not taken into account. It builds on the idea of port mapping to derive an abstract resource mapping of a processor from a wide variety of microbenchmarks, while remaining completely agnostic of the concrete CPU architecture, ISA, etc. Read about all the details here:

Research article (CGO'22)
Extended version on arXiv (no paywall)

Try it!

Our models are presented on this website in two ways: you can either browse the instruction tables, or experiment with the model by feeding it basic blocks of assembly and seeing what comes out.

See the tables
Try the demo

Should you want to try out the tool yourself, the source code is also available for you to look at or play with.

Source code (Git repository)

Models

On this demo website, a few different mappings are available, differing in architecture, CPU used, etc. Read more details about each mapping here: Mapping details

What it does, what it does not

PALMED aims to identify the throughput-based performance bottleneck of a loop kernel; as such, it cannot detect latency-based performance bottlenecks. In other words, a kernel is treated as if there were no dependency between any two instructions, so that only the throughput of the resources it uses is taken into account. The kernel is also treated as if it were the body of an infinite (read: huge) loop running in steady state.

In this abstraction, the CPU (not memory, nor cache) is modeled as a set of abstract resources. To be issued, an instruction must use some of those resources, each with a fixed ratio. This is what we call the abstract resource mapping, a dual (conjunctive form) representation of the better-known port mapping (disjunctive form) representation. Check the details here or in the research article.
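The difference between the two representations can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two instructions, the two ports, and the rule used to derive resources (one abstract resource per group of ports, used with ratio 1/size by every instruction whose ports all lie in that group). It is a toy construction for single-micro-op instructions, not PALMED's inference algorithm or one of its measured mappings. In the conjunctive form, the steady-state cost of a kernel is simply the load of its most heavily used resource.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical port mapping (disjunctive form): each instruction is a single
# micro-op that may execute on ANY ONE of the listed ports, and each port
# accepts one micro-op per cycle. Instruction and port names are invented.
PORT_MAPPING = {
    "ADD": {"p0", "p1"},
    "MUL": {"p1"},
}

def to_resource_mapping(port_mapping):
    """Toy conversion to a conjunctive form.

    Every non-empty group of ports becomes one abstract resource. An
    instruction whose allowed ports all lie inside the group uses that
    resource with the fixed ratio 1/len(group).
    """
    ports = sorted(set().union(*port_mapping.values()))
    resources = {instr: {} for instr in port_mapping}
    for size in range(1, len(ports) + 1):
        for subset in combinations(ports, size):
            group = frozenset(subset)
            for instr, allowed in port_mapping.items():
                if allowed <= group:
                    resources[instr][group] = Fraction(1, size)
    return resources

def cycles_per_iteration(kernel, resource_mapping):
    """Throughput-only cost: the load of the most used abstract resource.

    `kernel` maps an instruction name to its number of occurrences in the
    loop body. Dependencies and latencies are deliberately ignored.
    """
    load = {}
    for instr, count in kernel.items():
        for resource, ratio in resource_mapping[instr].items():
            load[resource] = load.get(resource, 0) + count * ratio
    return max(load.values())

RESOURCES = to_resource_mapping(PORT_MAPPING)
print(cycles_per_iteration({"ADD": 2, "MUL": 1}, RESOURCES))  # 3/2
print(cycles_per_iteration({"ADD": 1, "MUL": 3}, RESOURCES))  # 3
```

In the first kernel the bottleneck is the combined resource for both ports (three micro-ops shared between two ports); in the second it is the resource for `p1` alone, which the three `MUL` instructions saturate. No scheduling problem has to be solved: evaluating the conjunctive form is a sum per resource followed by a maximum.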

Who are we?

PALMED is developed by the CORSE Inria team. It is part of the PhD work of Nicolas Derumigny, under the supervision of Fabrice Rastello. Other contributors include Théophile Bastian (PhD student at CORSE), Fabian Gruber (former PhD student at CORSE), and Christophe Guillon (engineer at STMicroelectronics).

Related projects

PALMED is not the only tool in its category! Here are a few other related projects:

uops.info provides large data tables and port mappings for many x86 microarchitectures. These tables are extracted from a wide variety of sources, including the official documentation. The core contribution of uops.info is to retrieve the port mapping of various architectures using instrumented micro-kernels. It requires hardware counters that are currently available only on x86 microarchitectures (Intel and AMD).
The Intel Architecture Code Analyzer (IACA) is closely related to PALMED in features, as it performs throughput and latency analysis of given basic blocks. Unlike PALMED, IACA considers both throughput resource usage and latency between dependent instructions. However, the tool is closed-source and is directly based on microarchitectural information, strongly tying it to Intel CPUs.
Ithemal is a tool that provides throughput analysis of a basic block, using a deep learning approach. As opposed to PALMED, Ithemal does not ignore data dependencies. The machine-learning-based approach yields a costly, black-box performance model.
PMEvo is another similar tool for computing a port mapping. Like PALMED, it does not require any specific hardware counter. Its approach is based on evolutionary optimization.
Agner Fog maintains a list of instruction latencies and throughputs for Intel, AMD and VIA CPUs. As opposed to the previous projects, it does not provide the port mapping required to model the sharing of resources between instructions.