SISA: A Scalable Instruction Set Architecture
by Brian Fidel Davila
1. Introduction
The never-ending pursuit of higher performance dictates the ongoing development of
processors that perform computations on progressively bigger blocks of data at a time.
Two methods of achieving this goal are increasing the size of the data word,
demonstrated most recently by the migration from 32 to 64-bit computing, and having
an instruction operate on more words at once, as seen in vector computers and the
"multimedia", "DSP" or "SIMD" extensions now common in modern processors.
While it is tempting to design a new architecture from a clean slate every time
technological advances make new operations possible, this is rarely feasible. Market
realities dictate that pre-existing software continue to be supported unmodified. As a
result, instructions are added to an existing architecture to provide this functionality.
The need to modify instruction sets for new data formats is a fundamental
limitation of
standard architectures. Ideally, an architecture would not be tied to specific data
sizes.
New processors could then support different operations without requiring
instruction set
modifications.
This paper presents a scalable instruction set architecture, SISA, which meets these
requirements. While any processor has limits on the data formats it supports, SISA
implementations are able to support operations on larger data words and arbitrarily
long
vectors using the same interface.
Section two describes memory aliasing registers and their use in a simple
implementation of SISA, the SISA-I. Next, section three defines vector operations and
support for larger data words and introduces a vector machine, SISA-II. Section four
continues with the SISA-III, which supports superscalar execution. Section five lists
additional work, section six covers related research and section seven concludes.
2. Registers
"Any problem in computer science can be solved by adding another level of indirection"
-Alan Kay
To minimize demands on the instruction data path, SISA instructions are 16 bits long.
This instruction length allows only eight architecture registers, r0 to r7. Each
register is associated with, or aliased to, a memory address with the rset instruction.
These addresses may subsequently be read back with the rget instruction. Addresses may
be read from or written to any number of contiguous registers with a single instruction,
though register storage bandwidth or other limitations may prevent operations on larger
subsets of the registers from executing in a single cycle. Aliased addresses may also be
modified with the radd and raddi instructions, which add the value at a memory address
or an immediate, respectively.
SISA registers also have a width field, which defaults to 32 bits. This may be changed
with the rsetw instruction. Register widths may be retrieved with rgetw. Both
instructions may operate on multiple registers, as with rset and rget. The maxwidth
instruction returns the maximum width an implementation supports.
In SISA-I, a register's width may be set to one, two, four, eight, 16 or 32 bits. The
architecture may support any width; SISA-I supports only these widths for simplicity and
performance. The widths of each operand and the result are passed to the execution
stage so that it may perform the correct operation. The radd and raddi instructions also
use register widths when incrementing alias addresses.
SISA-I uses an eight-entry register table for storage of addresses and widths. When an
rset or rsetw instruction is executed, the corresponding entries are updated in the
register table. This table then provides the values for rget and rgetw.
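The register table described above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the class and method names are illustrative, addresses are assumed to be 32 bits wide, and raddi adds its immediate unscaled because the exact way register widths scale the increment is not specified here.

```python
class RegisterTable:
    """Toy model of the eight-entry table of alias addresses and widths."""

    NUM_REGS = 8
    DEFAULT_WIDTH = 32                        # bits, the default stated in the text
    SUPPORTED_WIDTHS = (1, 2, 4, 8, 16, 32)   # widths supported by SISA-I
    ADDRESS_MASK = 0xFFFFFFFF                 # assumption: 32-bit addresses

    def __init__(self):
        self.addr = [0] * self.NUM_REGS
        self.width = [self.DEFAULT_WIDTH] * self.NUM_REGS

    def _check(self, first, count):
        if first < 0 or count < 1 or first + count > self.NUM_REGS:
            raise IndexError("register range out of bounds")

    def rset(self, first, addresses):
        """Alias contiguous registers, starting at `first`, to the given addresses."""
        self._check(first, len(addresses))
        for i, address in enumerate(addresses):
            self.addr[first + i] = address & self.ADDRESS_MASK

    def rget(self, first, count=1):
        """Return the alias addresses of `count` contiguous registers."""
        self._check(first, count)
        return self.addr[first:first + count]

    def rsetw(self, first, widths):
        """Set the widths of contiguous registers, starting at `first`."""
        self._check(first, len(widths))
        for w in widths:
            if w not in self.SUPPORTED_WIDTHS:
                raise ValueError(f"unsupported width: {w}")
        for i, w in enumerate(widths):
            self.width[first + i] = w

    def rgetw(self, first, count=1):
        """Return the widths of `count` contiguous registers."""
        self._check(first, count)
        return self.width[first:first + count]

    def raddi(self, first, count, immediate):
        """Add an immediate to the alias addresses (unscaled in this sketch)."""
        self._check(first, count)
        for r in range(first, first + count):
            self.addr[r] = (self.addr[r] + immediate) & self.ADDRESS_MASK
```

Switching r2 and r3 to a new pair of addresses is then a single call such as `table.rset(2, [0x1000, 0x2000])`.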
Data values are stored in four banks of 32-bit wide, 16-entry register files, each
with two
ports. This arrangement allows up to four address registers to be saved or restored
in
one cycle and all eight in two cycles. This storage and a 16-entry, 24-bit wide tag
file
function as a direct-mapped Level-0, or L0, cache. During instruction decode each
aliased register specified is used to retrieve the currently aliased address, which is
then
used to access data in this cache.
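One way to read these numbers is sketched below, assuming 32-bit byte addresses and a 16-byte cache line made of one 32-bit word from each of the four banks. Under those assumptions the fields account for the 24-bit tag exactly; the function and field names are hypothetical.

```python
def split_address(addr):
    """Split a 32-bit byte address into assumed L0 cache fields.

    Assumed layout: [ tag: 24 | index: 4 | bank: 2 | byte: 2 ]
    """
    addr &= 0xFFFFFFFF
    return {
        "tag": addr >> 8,            # 24 bits compared against the tag file
        "index": (addr >> 4) & 0xF,  # selects one of the 16 entries
        "bank": (addr >> 2) & 0x3,   # selects one of the four 32-bit banks
        "byte": addr & 0x3,          # byte within the 32-bit word
    }
```

With this layout, addresses 0x1000 and 0x2000 share index 0 but carry different tags, so they compete for the same line.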
It is not feasible to solely use a direct-mapped cache. It would fail in the case
where two
operands are aliased to different addresses that would occupy the same cache line.
While it would be possible to fetch one value directly from memory, this would
result in
substantial performance degradation.
A victim cache is used to avoid this penalty in most cases. When a block is evicted
from
the direct-mapped cache, it is sent to this unit which can then supply it to the data
path
as required. The requirement for 28-bit fully-associative lookups suggests the
victim
cache be kept small, so SISA-I uses eight entries. It is 128 bits wide in order to
hold an
entry from each of the four L0 cache banks in one line.
The victim cache also functions as a write buffer. Normally, dirty cache lines will be
written only when the bus to the next level in the memory hierarchy, the L1 data
cache,
is not otherwise busy. The exception is when the victim cache is full, in which case
a
write must take priority over a read.
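The interaction between the direct-mapped cache and the victim cache can be sketched as follows. The model tracks only which 16-byte lines are present, assumes 32-bit byte addresses, and uses first-in, first-out replacement in the victim cache; the replacement policy and all names are assumptions made for illustration.

```python
from collections import OrderedDict


class L0WithVictim:
    """Presence-only model of a 16-line direct-mapped cache plus victim cache."""

    LINES = 16
    VICTIM_ENTRIES = 8

    def __init__(self):
        self.direct = [None] * self.LINES   # line address held by each entry
        self.victim = OrderedDict()         # line address -> True, oldest first
        self.written_back = []              # lines pushed out to the L1 cache

    def access(self, addr):
        line = (addr & 0xFFFFFFFF) >> 4     # 28-bit line address
        index = line % self.LINES
        if self.direct[index] == line:
            return "l0"
        if line in self.victim:             # fully associative 28-bit match
            return "victim"
        # Miss: fetch the line and send the evicted line to the victim cache.
        evicted = self.direct[index]
        self.direct[index] = line
        if evicted is not None:
            if len(self.victim) >= self.VICTIM_ENTRIES:
                oldest, _ = self.victim.popitem(last=False)
                self.written_back.append(oldest)   # forced write when full
            self.victim[evicted] = True
        return "miss"
```

Two operands aliased to 0x1000 and 0x2000 conflict in the direct-mapped cache, but after the first pair of misses one is served from each structure.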
While a set-associative cache could also be used, there is a performance issue. The
associative lookup in the victim cache can be performed in parallel with the
direct-mapped cache access. Thus, it will be known if the value resides in the victim
cache at the start of the execution phase and it can be muxed into the data path with
little delay. When the value is not in the victim cache, the tag check for the
direct-mapped cache can be performed in parallel with the execution stage, which is not
possible with a set-associative cache. This is not an issue in an implementation where
performing a serial tag check in the execution stage is not in the critical path.
A simple implementation of SISA could directly access memory, but this would only
be
viable in very low-cost solutions or where memory requirements are so small that
it may
be implemented with performance comparable to the speed of a cache.
Operands pass from the decode stage to an ALU in the execution stage before results
are written back to the L0 cache. Each instruction in these stages of the pipeline will
keep an eight-bit destination field that specifies where the result will be written.
Each register cache line will have a two-bit counter of the number of instructions in
the pipeline that will write to it. The control logic ensures that cache blocks waiting
for results are not evicted and that instructions writing to a cache block that has not
loaded from memory are stalled.
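The per-line write counters can be illustrated with a small model. It assumes that an instruction stalls when the counter for its destination line is already saturated, which is one plausible policy and not something the text specifies; the names are illustrative.

```python
class LineWriteCounters:
    """Counts in-flight instructions that will write to each cache line."""

    def __init__(self, lines=16, bits=2):
        self.limit = (1 << bits) - 1      # a two-bit counter saturates at 3
        self.pending = [0] * lines

    def try_issue(self, line):
        """Reserve a pending write; return False if the instruction must stall."""
        if self.pending[line] == self.limit:
            return False
        self.pending[line] += 1
        return True

    def writeback(self, line):
        """Retire one pending write to the line."""
        if self.pending[line] == 0:
            raise ValueError("no pending write for this line")
        self.pending[line] -= 1

    def can_evict(self, line):
        """A line may be evicted only when no result is still due to it."""
        return self.pending[line] == 0
```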
Advantages of SISA-I over a conventional pipelined processor include the speed of
function calls and returns and context switches. The rset instruction may be used
to
quickly switch between many different sets of address registers as needed and
control
logic will move values in and out of register storage in parallel with other
operations.
Explicit loads and stores of data are removed from the instruction stream and are
performed in parallel with ALU operations. Loading longer immediate values will take
more cycles, but the 16-bit instruction format has been shown to suffer only a small
performance penalty relative to a 32-bit format in a load-store architecture [1].
3. Vector operations
Because operands are simply references to memory, it is easy to specify
instructions
that perform vector operations. While both scalar and vector instructions use a
three
operand format, scalar instructions have two source operands and a destination
while
the vector versions have one operand that is both source and destination, one
source
operand and an operand to specify vector length.
While the 16-bit instruction length does not allow a reasonable number of instructions
with two 3-bit operands and the 8-bit immediate used in scalar instructions, there are a
number of possible solutions. The immediate could be reduced to five bits to allow for a
vector length operand, the vector length operand could be defined to come from a
hardwired register, perhaps r0, or vector instructions with an immediate may simply not
be supported. The last option is chosen for the current specification.
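The bit budget behind this trade-off can be checked with simple arithmetic. The sketch below only counts bits; it assumes nothing about how opcodes are actually laid out, and the function name is hypothetical.

```python
INSTRUCTION_BITS = 16
REGISTER_BITS = 3   # eight architecture registers, r0 to r7


def opcode_bits(register_operands, immediate_bits):
    """Bits left for the opcode once the operands are encoded."""
    return INSTRUCTION_BITS - register_operands * REGISTER_BITS - immediate_bits


scalar_immediate = opcode_bits(2, 8)    # two registers plus an 8-bit immediate
vector_immediate = opcode_bits(3, 8)    # a length operand is added: does not fit
vector_short_imm = opcode_bits(3, 5)    # immediate reduced to five bits
```

Adding a third 3-bit operand to the scalar immediate format overflows the instruction word by one bit, while a five-bit immediate restores the two opcode bits of the scalar form.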
An implementation may support any number of lanes [2] for vector operations, including
one, and the maxveclen instruction, which takes a width as an argument, is provided to
return this value. SISA-II supports operations on 128 bits, which may be subdivided into
lanes whose width is any power of two bits. This restriction is not a property of the
architecture. This implementation supports only these formats for simplicity and
performance.
To minimize data path complexity, the restriction that only one of the operands may
come from the victim cache is enforced. When an instruction specifying two operands
that both reside in the victim cache is encountered, one of the operands is moved to the
direct-mapped cache before the instruction proceeds.
For the bit-wise logical instructions and the move instruction, there is no difference
between operating on, for example, a vector of four 32-bit values and a vector of two
64-bit values. For addition and subtraction, the difference is only which carry bits are
passed. Larger operations may be performed over multiple cycles. SISA-II supports
64-bit additions in one cycle and 128-bit additions in two.
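The role of the carry bits can be shown with a software model of a 128-bit adder whose carries are blocked at lane boundaries. This is a sketch of the idea only, not of the SISA-II hardware; the function name and the use of Python integers to hold 128-bit values are assumptions.

```python
def vector_add(a, b, lane_bits, total_bits=128):
    """Add two packed vectors, dropping the carry out of each lane."""
    if lane_bits <= 0 or total_bits % lane_bits != 0:
        raise ValueError("lane width must divide the vector width")
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, total_bits, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift   # carry does not cross the lane
    return result


a, b = 0xFFFFFFFF, 0x1
as_32_bit_lanes = vector_add(a, b, 32)   # carry out of lane 0 is discarded
as_64_bit_lanes = vector_add(a, b, 64)   # carry propagates into bit 32
```

The same bit patterns give different sums depending on lane width, while a bit-wise operation such as `a & b` is unaffected by it.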
Multiplication presents a bigger challenge. Multiplication of operands with widths
larger than those natively supported can be synthesized by combining smaller width
multiplications and additions, essentially decomposing the operands into "digits" of the
smaller bit width. While providing a more consistent interface, these operations will
introduce complexity and may be better handled in software. SISA-II supports
multiplications on operands of up to 32 bits and on vectors returning a maximum of 128
bits. Note that the width of the result register is passed to the execution stage and
that this determines the limit rather than the sum of the widths of the operands. The
maxmult instruction takes a result width and returns the maximum number of consecutive
multiplications supported.
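The decomposition into digits is ordinary schoolbook multiplication in base 2^32. The sketch below shows how a software routine might build a wide product from 32-bit multiplications and additions, truncating to the width of the result register; the function names and default widths are illustrative assumptions.

```python
def to_digits(value, digit_bits):
    """Split a non-negative integer into digits, least significant first."""
    mask = (1 << digit_bits) - 1
    digits = [value & mask]
    value >>= digit_bits
    while value:
        digits.append(value & mask)
        value >>= digit_bits
    return digits


def wide_multiply(a, b, digit_bits=32, result_bits=128):
    """Multiply using only digit-by-digit products and additions."""
    total = 0
    for i, x in enumerate(to_digits(a, digit_bits)):
        for j, y in enumerate(to_digits(b, digit_bits)):
            # x * y stands in for one natively supported multiplication.
            total += (x * y) << (digit_bits * (i + j))
    return total & ((1 << result_bits) - 1)   # result width sets the limit
```

Multiplying two 64-bit operands this way takes four 32-bit multiplications, which illustrates why such operations may be better handled in software.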
4. Superscalar execution
The multiple execution units of SISA-II are underutilized with shorter scalar
operations.
To take advantage of this idle hardware, SISA-III examines two instructions at a
time.
The short instruction word keeps the demands placed on the instruction cache low
and
the three-bit register operands simplify decode logic. Three bits are now used to
record
the number of in-flight instructions that will write to a cache line.
Instead of adding ports to register storage, SISA-III simply does not allow the issue of
two instructions that would require more than two operands be read from different lines
in the same bank. Similarly, one pipeline is stalled when two instructions attempt to
write to different lines in the same bank. While these restrictions keep the IPC below
what would otherwise be possible, dual issue still provides a significant increase in
IPC for a modest investment in hardware. Simultaneous multi-threading is also possible
by duplicating the address, width and PC registers and inserting logic to guard against
structural hazards.
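The issue restrictions can be expressed as two small checks. The sketch assumes each operand is identified by a (bank, line) pair and that each bank has two ports, as in SISA-I; the function names are hypothetical.

```python
def reads_fit(reads_a, reads_b, ports_per_bank=2):
    """True if both instructions can read their operands in the same cycle.

    Each read is a (bank, line) pair. Reads of the same line in the same
    bank are counted once.
    """
    lines_per_bank = {}
    for bank, line in set(reads_a) | set(reads_b):
        lines_per_bank.setdefault(bank, set()).add(line)
    return all(len(lines) <= ports_per_bank for lines in lines_per_bank.values())


def write_conflict(write_a, write_b):
    """True if the two results target different lines in the same bank."""
    bank_a, line_a = write_a
    bank_b, line_b = write_b
    return bank_a == bank_b and line_a != line_b
```

A pair of instructions is issued together only when `reads_fit` holds, and one pipeline stalls at write-back when `write_conflict` holds.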
5. Additional Work
An assembler and an implementation of this architecture for a Xilinx Spartan-3
FPGA
are currently undergoing testing. Results for assembly programs vs. equivalents
targeted for a MIPS processor will be available shortly. A more thorough
investigation
requires targeting a compiler for the SISA architecture and perhaps developing a
simulator as well.
There is no reason this architecture must be limited to 16-bit instructions. An
instruction set with a different instruction width could easily be defined to support,
for example, different immediate fields. An instruction length of 32 bits allows support
for directly addressing data values in addition to the indirect addressing specified. To
support a legacy instruction set alongside this new framework, architecture registers
could be defined to reside at a certain memory address and instructions could be
dynamically translated. Because data values in load-store architectures must be saved on
context switches, they are already associated with a memory address. Static translation
of legacy code for traditional architectures to SISA is also possible using this
technique.
The performance of a SISA implementation with an L0 cache is of course dependent
on
the performance of the cache. How the performance of various workloads varies
with
different L0 cache implementations warrants further investigation, but interesting
results
require a benchmark such as SPEC CINT2000 be ported to SISA.
To improve the performance of the L0 cache, operand address prediction could be used
to prefetch data. Such a mechanism could rely on the branch prediction mechanism and
record operations on the register addresses and widths in each path. Alternatively, a
method of predicting operand addresses independent of the branch prediction unit could
be developed.
Different compilation techniques and optimizations will be necessary for SISA. GCC [3]
4.0 supports auto-vectorization, from which SISA would benefit. Because the radd and
raddi instructions may operate on more than one consecutive register, one or more
registers may be used to refer to the top values of a stack or stacks. Multiple
registers may also refer to the same address but have different widths.
6. Related Research
Ditzel and McLellan proposed [4] using a cache instead of registers for operands, citing
disadvantages that have been addressed by advances in technology and architecture.
The Virtual Context Architecture [5] first studies a memory-memory architecture that
uses memory addresses as operands and then modifies the Alpha architecture to support
register windows that map logical registers to memory locations based on context.
Banked register files have been studied by Cruz, et al. [6]
7. Conclusion
A Scalable Instruction Set Architecture has been defined that allows microprocessor
implementations to support operations on arbitrarily long data elements, vector
lengths
and memory address sizes. It has the additional advantages of using only a 16-bit
instruction and allowing for fast function calls and returns and context switches.
The
adoption of such an architecture would greatly reduce the costs of introducing new
processors supporting these new operations.
[1] J. Bunda, D. Fussell, R. Jenevein, and W.C. Athas, "16-Bit vs. 32-Bit Instructions
for Pipelined Microprocessors", Proceedings of the 20th Annual International Symposium
on Computer Architecture, pp. 237-246, 1993.
[2] Hennessy, J. and Patterson, D. Computer Architecture: A Quantitative Approach.
Morgan Kaufmann Publishers, San Mateo, CA, 2003.
[3] Free Software Foundation. GNU Compiler Collection. http://gcc.gnu.org.
[4] Ditzel, D.R. and McLellan, H.R. Register Allocation for Free: The C Machine Stack
Cache. In Symposium on Architectural Support for Programming Languages and Operating
Systems. SIGPLAN Not. 17, 4 (Apr. 1982), 48-56.
[5] Oehmke, D., N. Binkert, S. Reinhardt and T. Mudge. Design and Applications of a
Virtual Context Architecture.
https://www.eecs.umich.edu/techreports/cse/2004/CSE-TR-497-04.pdf
[6] J.-L. Cruz, A. Gonzalez, M. Valero, and N. E. Topham. Multiple-Banked Register File
Architectures. In Proceedings of the ISCA-27, pages 316-325, 2000.