Code Size: Squeezing more with linker outlining

My previous blog post covered the SEGGER Linker for RISC-V and the benefits provided by enhanced relaxation. This article continues to explore what SEGGER is doing with its linker technology, advancing what is typically possible.

An appetizer…

The linker optimization described in this article, highlighted below, delivers the following benefits over and above those described in my previous blog post:

Architecture Library Linker R/O Code R/O Data Total R/O
RV32IMAC GNU GNU 43176 4608 47784
RV32IMAC GNU SEGGER 37772 3623 41395
RV32IMAC GNU SEGGER + outlining 36086 3623 39709

If you’re curious how we do this, read on!

A bit of background

Applications are typically divided into multiple source files that logically divide functionality into manageable pieces. During development, this has the benefit that when a single C source file is changed, the entire application does not require recompilation.

A side effect of separate compilation is that optimization cannot typically be applied across compilation unit boundaries. You can turn on link-time optimization (LTO), supported by the SEGGER toolset, but this only applies to source code modules. If your application requires a third-party object-code library, as typically provided by silicon vendors, standard link-time optimization will not apply to it!

There’s still an opportunity to reduce application size beyond LTO.

Outlining

One of the ways that an application’s size can be reduced is by searching for sequences of instructions that are common between functions, or even within a function. Once identified, the common sequences can be extracted into a subroutine, replacing each extraction with a call to the subroutine.

The extracted subroutine does not need to abide by any ABI-mandated calling convention because it is operating at the machine level, not the language level. This means the search can be amazingly broad, considering many strands of instructions as candidates, short to long, for replacement.

Because the SEGGER Linker’s outlining optimization works with any object code, including third-party object code libraries — and over the entire application — outlining opportunities are vastly expanded.

The SEGGER Linker carefully analyses the application, picking the best way to reduce code size: greedily taking the longest strand may not be the best strategy, taking shorter strands that are used more often is generally a better approach.

An example

Here is a real-world example from an emWin application. During the lifetime of the product, the original LCD panel became constrained in supply and then obsolete, and needed to be replaced with another, mostly-compatible panel. However, the firmware needs to run on boards populated with either panel for historical in-field updates and newly made boards. To do this, the firmware detects which panel is installed at runtime and initializes emWin with the appropriate panel driver. The final application has two drivers included, with the high-level code blind to the actual panel being used, and able to run on legacy hardware and new hardware equally well.

The panel drivers are fairly similar, but not identical. This leads to a certain amount of “almost duplicated” code between the two drivers. An example of this is the following, with the single difference highlighted in red:

ReadRectCust_16bpp:   ReadRectCust_18bpp:
  ADDI sp, sp, -32       ADDI sp, sp, -32
  SW   s2, 16(sp)        SW   s2, 16(sp)
  LW   s2, 8(a0)         LW   s2, 8(a0)
  SW   s0, 24(sp)        SW   s0, 24(sp)
  SW   s1, 20(sp)        SW   s1, 20(sp)
  LW   a6, 168(s2)       LW   a6, 168(s2)
  SW   s3, 12(sp)        SW   s3, 12(sp)
  SW   s4, 8(sp)         SW   s4, 8(sp)
  SW   s5, 4(sp)         SW   s5, 4(sp)
  SW   ra, 28(sp)        SW   ra, 28(sp)
  MV   a0, s2            MV   a0, s2
  MV   s5, a1            MV   s5, a1
  MV   s4, a2            MV   s4, a2
  MV   s1, a3            MV   s1, a3
  MV   s0, a4            MV   s0, a4
  MV   s3, a5            MV   s3, a5
  JALR a6                JALR a6
  SUB  a2, s1, s5        SUB  a2, s1, s5
  SUB  s0, s0, s4        SUB  s0, s0, s4
  ADDI s0, s0, 1         ADDI s0, s0, 1
  ADDI a2, a2, 1         ADDI a2, a2, 1
  MUL  a2, a2, s0        MUL  a2, a2, s0
  LW   a4, 64(s2)        LW   a4, 64(s2)
  LW   s0, 24(sp)        LW   s0, 24(sp)
  LW   t1, 228(s2)       LW   t1, 232(s2)
  LW   ra, 28(sp)        LW   ra, 28(sp)
  LW   s1, 20(sp)        LW   s1, 20(sp)
  LW   s2, 16(sp)        LW   s2, 16(sp)
  LW   s4, 8(sp)         LW   s4, 8(sp)
  LW   s5, 4(sp)         LW   s5, 4(sp)
  MV   a1, s3            MV   a1, s3
  LW   a0, 24(a4)        LW   a0, 24(a4)
  LW   s3, 12(sp)        LW   s3, 12(sp)
  ADDI sp, sp, 32        ADDI sp, sp, 32
  JR   t1                JR   t1

There seems to be an awful lot of duplication between these two functions. We’ll examine this in the following sections.

RISC-V, explained

Before we proceed, a little explanation of the above for those unfamiliar with RISC-V instructions:

  • JAL label is a subroutine call to label, placing the return address into the ra register. This is identical to the Arm instruction BL label, which places the return address into the lr register.
  • JALR reg is a subroutine call to the address held in the register reg, placing the return address into the ra register. This is identical to the Arm instruction BLX reg, which places the return address into the lr register.
  • JAL reg, label is a subroutine call to label, placing the return address into the register reg. There is no single-instruction Arm equivalent of this.
  • JR reg is a jump to the address held in the register reg. This is identical to the Arm instruction BX reg. This can be used to return from a function in RISC-V and Arm, jumping to the address held in the ra or lr register.

How it performs

The SEGGER Linker finds commonality between multiple functions and rewrites these using subroutines. The two functions above are translated to the following that has identical functionality:

ReadRectCust_18bpp:                  _ReadRectCust_16bpp:
  JAL        tp, <outline.62.38>       JAL        tp, <outline.62.38>
  JALR       a6                        JALR       a6
  JAL        tp, <outline.144.24>      JAL        tp, <outline.144.24>
  LW         t1, 232(s2)               LW         t1, 228(s2)
  J          <tail.52.20>              J          <tail.52.20>

This is quite a saving!

On entry, there’s a call to a subroutine with one common sequence, <outline.62.38>. The first number in this label is simply a serial number of the outlined sequence: in this case, it’s the 62nd outlined sequence in the final application. The second number indicates how many instruction bytes are contained in the outlined sequence: in this case, 38 bytes.

The call to the outlined sequence does not use the standard return register for RISC-V, but substitutes an alternate register, tp, which the outlined sequence uses to return:

<outline.62.38>:
  0x08009F38  1101         ADDI       sp, sp, -32
  0x08009F3A  C84A         SW         s2, 16(sp)
  ...
  0x08009F5A  89BE         MV         s3, a5
  0x08009F5C  8202         JR         tp

The linker chooses to use tp as the standard return address register (ra) could well hold a live value, and we don’t want to destroy that. The linker can dedicate the tp register as an alternate return register without any change in functionality.

The call through a6 is followed by a second outlined sequence, <outline.144.24>. After this, the single instruction that is different between the two remains, followed by a branch to the final outlined sequence that returns.

I can see a better possibility!

If you have a keen mind, you might wonder why the common instructions before the single difference were not collapsed to a single outlined subroutine, but are wrapped around that single function call through a6.

The answer is that the function invoked through a6 may itself be subject to outlining and may well destroy the value held in tp. If the JALR a6 instruction is moved into an outlined sequence, that call through a6 may destroy the outlined function’s return address held in tp, and so is an invalid transformation.

Conclusion

This is an example of just one outlined sequence. You can tune the linker’s outlining parameters when searching for sequences to outline, setting the minimum and maximum length for an outlined strand. The linker is capable of finding many sequences to outline, even in size-optimized code.

If code size is important to you, the SEGGER Linker is capable of reducing your application image in multiple ways.