80 isn't nearly enough

By Justin Rattner | February 12, 2007, 11:27am PST

What an exciting week this has been. We unleashed the ‘Era of Tera’ by showcasing the world’s first programmable processor that can deliver Teraflops performance with remarkable energy efficiency.

It’s rather extraordinary that after decades of single-core processors, the high-volume processor industry has gone from single to dual to quad-core in just the last two years. Moore’s Law scaling should easily let us hit the 80-core mark in a mainstream processor within the next ten years and quite possibly even less. It is therefore reasonable to ask the question: what are we going to do with this sudden abundance of processors?

The answer is somewhat obvious on the server side of things. More cores and more threads means more transactions per unit time, assuming that all those cores are given the necessary appropriate memory and I/O bandwidth. Other computationally intensive applications in scientific and engineering computing are also likely beneficiaries. I’m talking about seismic analysis, crash simulation, molecular modeling, genetic research, and fluid dynamics.

On the client end of the wire, things aren’t as obvious or straightforward, but they are no less interesting. The abundance of cores is likely to lead to a very different approach to resource allocation. For decades operating systems have been optimized for managing very scarce processor resources by cleverly multiplexing many tasks or threads across one or now two or four cores. As quality of service has become more important to users, we’ve all come to realize the limitations of this approach as frames get dropped from video streams or productivity applications pause while the video goes full tilt. A different approach, and one that probably hasn’t received enough attention from the research community, is to dedicate cores to providing particular functions. The allocations become more static than what we see today, but they can certainly be changed over longer periods of time ranging from seconds to hours or even days.

As an example, we could conceive of a multi-function computing appliance that contains a processor with perhaps three dozen cores: we might allocate four of those cores to running the core productivity and collaboration applications. Another cluster of cores, on the order of a dozen, might provide very high quality graphics and visualization. Media processing, beyond encode/decode which would best be handled by dedicated hardware, would be the responsibility of yet another cluster of, say, six cores. Still other clusters might do real-time data mining on various streams of data flowing in from the Internet. Various bots operating within this cluster might be assembling news, shopping, or investing. The key idea here is to let the abundant hardware resources replace a lot of very complex OS code. It’s replaced by cluster or partition management code, which doles out the resources, but stays out of the way until there’s a major shift in the workload.
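To make the partitioning idea concrete, here is a minimal sketch of what such cluster management code might look like. It is only an illustration: the `PartitionManager` class, its method names, and the choice to give all 14 leftover cores to data mining are assumptions, not an existing OS interface.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionManager:
    """Hands out whole cores to named clusters (hypothetical, not a real OS API)."""
    total_cores: int
    clusters: dict = field(default_factory=dict)  # cluster name -> list of core ids

    def free_cores(self):
        used = {c for cores in self.clusters.values() for c in cores}
        return [c for c in range(self.total_cores) if c not in used]

    def allocate(self, name, count):
        free = self.free_cores()
        if count > len(free):
            raise ValueError(f"only {len(free)} cores free, {count} requested")
        self.clusters.setdefault(name, []).extend(free[:count])

    def resize(self, name, count):
        """Coarse-grained rebalancing, meant to run only on a major workload shift."""
        old = self.clusters.pop(name, [])
        try:
            self.allocate(name, count)
        except ValueError:
            self.clusters[name] = old  # keep the previous allocation on failure
            raise


# Core counts follow the example in the text; the remainder is an assumption.
manager = PartitionManager(total_cores=36)
manager.allocate("productivity", 4)
manager.allocate("graphics", 12)
manager.allocate("media", 6)
manager.allocate("data_mining", len(manager.free_cores()))
```

The point of the sketch is that the manager does nothing between calls: once cores are handed out, each cluster runs undisturbed until someone calls `resize`.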

TJGeezer suggested using Tera-Scale capability along with huge amounts of NAND in an iPod-sized container for AI applications. He may be right. One can easily imagine clusters of cores supporting an advanced human interface with real-time speech and vision or language translation. A lot of algorithmic development would have to take place to make this feasible, but there is no doubt in my mind that we’ll have the hardware resources needed to host them. The statistical algorithms that will form the heart of these future recognition systems are highly parallel and thus a great fit for a high core count architecture.

An abundance of cores also enables new ways to deal with challenges associated with system operation in the face of device failures and cosmic radiation. Think of the collection of cores as a redundant array of computing engines (RACE). Two or more cores could be used in tandem to detect and correct faults. If a core becomes unreliable, it can simply be removed from service without significantly affecting overall system performance.
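One way to picture the RACE idea is majority voting: two cores that disagree reveal a fault, and a third lets the system pick the correct answer and retire the odd one out. The sketch below assumes this voting scheme; the function names and the simulated faulty core are made up for illustration.

```python
from collections import Counter


def vote(results):
    """Return (majority_value, ids_of_cores_that_disagreed).

    `results` maps a core id to the value that core computed.
    """
    tally = Counter(results.values())
    winner, votes = tally.most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority; the fault is detected but not correctable")
    suspects = [core for core, value in results.items() if value != winner]
    return winner, suspects


def run_redundant(task, cores, in_service):
    """Run `task` on each in-service core in `cores`, vote, and retire suspects."""
    active = [c for c in cores if c in in_service]
    results = {core_id: task(core_id) for core_id in active}
    value, suspects = vote(results)
    for core_id in suspects:
        in_service.discard(core_id)  # remove the unreliable core from service
    return value


def flaky_task(core_id):
    """Simulated workload: core 7 is assumed to return a corrupted result."""
    return 43 if core_id == 7 else 42


in_service = set(range(80))
value = run_redundant(flaky_task, cores=[5, 6, 7], in_service=in_service)
```

After the call, `value` is the majority answer and core 7 has been dropped, leaving 79 cores in service.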

As we pack more and more computing resources into smaller areas, managing power and heat in a very fine-grained manner will be critical. If we have more cores than are needed to execute the desired set of workloads, we can swap threads between cores whenever one becomes too hot. It’s like the hot potato game – move the potato fast enough and you never get burned. We’ll need the ability to adjust supply voltages, operating frequencies, and sleep states of individual cores in a matter of microseconds.
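The hot potato game can be sketched as a toy simulation in which a single thread moves to the coolest core just before its current core would cross a temperature limit. Every number here (heating and cooling rates, ambient temperature, limit) is invented for illustration and does not describe real silicon.

```python
def simulate(num_cores=4, ticks=50, heat=5.0, cool=2.0, ambient=40.0, limit=70.0):
    """Run one thread for `ticks` steps; return (peak temperature, migrations)."""
    temps = [ambient] * num_cores
    running = 0          # index of the core currently holding the thread
    migrations = 0
    peak = ambient
    for _ in range(ticks):
        for core in range(num_cores):
            if core == running:
                temps[core] += heat                       # busy core heats up
            else:
                temps[core] = max(ambient, temps[core] - cool)  # idle core cools
        peak = max(peak, temps[running])
        # Pass the potato if one more tick here would exceed the limit.
        if temps[running] + heat > limit:
            coolest = min(range(num_cores), key=lambda c: temps[c])
            if coolest != running:
                running = coolest
                migrations += 1
    return peak, migrations


peak, migrations = simulate()
```

With spare cores available, the thread keeps moving and the peak never exceeds the limit; with `num_cores=1` there is nowhere to go and the temperature climbs unchecked.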

While the challenges are somewhat mind-boggling on both the hardware and software sides to develop and fully utilize these future Tera-Scale platforms, the benefits and opportunities from putting these computing capabilities into the hands of all users are equally incredible.

So how many cores could you use, and what would you use them for? ArsTechnica user dg65536 said it best in his post – “Now that I think about it…80 isn't nearly enough.”

Biography

Justin Rattner is an Intel Senior Fellow and director of Intel's Corporate Technology Group. He also serves as the corporation's chief technology officer (CTO). He is responsible for leading Intel's microprocessor, communications and systems technology labs and Intel Research. Rattner joined Intel in 1973. He was named its first Principal Engineer in 1979 and its fourth Intel Fellow in 1988. Prior to joining Intel, Rattner held positions with Hewlett-Packard Company and Xerox Corporation. He received bachelor's and master's degrees from Cornell University in Electrical Engineering and Computer Science in 1970 and 1972, respectively.

10 Comments

Software Architecture to Match Your Cores
fdavila 22nd Jun 2007
SISA: A Scalable Instruction Set Architecture
by Brian Fidel Davila
1. Introduction
The never-ending pursuit of higher performance dictates the ongoing development of processors that perform computations on progressively bigger blocks of data at a time. Two methods of achieving this goal are increasing the size of the data word, demonstrated most recently by the migration from 32 to 64-bit computing, and having an instruction operate on more words at once, as seen in vector computers and the "multimedia", "DSP" or "SIMD" extensions now common in modern processors.
While it is tempting to design a new architecture carte blanche every time technological advances make new operations possible, this is rarely feasible. Market realities dictate that pre-existing software continue to be supported unmodified. As a result, instructions are added to an existing architecture to provide this functionality.
The need to modify instruction sets for new data formats is a fundamental limitation of standard architectures. Ideally, an architecture would not be tied to specific data sizes. New processors could then support different operations without requiring instruction set modifications.
This paper presents a scalable instruction set architecture, SISA, which meets these requirements. While any processor has limits on the data formats it supports, SISA implementations are able to support operations on larger data words and arbitrarily long vectors using the same interface.
Section two describes memory aliasing registers and their use in a simple implementation of SISA, the SISA-I. Next, section three defines vector operations and support for larger data words and introduces a vector machine, SISA-II. Section four continues with the SISA-III, which supports superscalar execution. Section five lists additional work, section six related research and section seven concludes.
2. Registers
"Any problem in computer science can be solved by adding another level of indirection"
-Alan Kay
To minimize demands on the instruction data path, SISA instructions are 16 bits long. This instruction length allows only eight architecture registers, r0 to r7. Each register is associated with, or aliased to, a memory address with the rset instruction. These addresses may subsequently be accessed with the rget instruction. Addresses may be read or written from any number of contiguous registers with a single instruction, though register storage bandwidth or other limitations may prevent operations on larger subsets of the registers from executing in a single cycle. Memory addresses may also be modified with the radd and raddi instructions to add the value at a memory address or immediate, respectively.
SISA registers also have a width field, which defaults to 32 bits. This may be changed with the rsetw instruction. Register widths may be retrieved with rgetw. Both instructions may operate on multiple registers as with rset and rget. The maxwidth instruction returns the maximum width an implementation supports.
In SISA-I, a register's width may be set to one, two, four, eight, 16 or 32 bits. The architecture may support any width; SISA-I supports only these widths for simplicity and performance. The widths of each operand and the result are passed to the execution stage so that it may perform the correct operation. The radd and raddi instructions also use register widths when incrementing alias addresses.
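The register behavior described so far can be modelled in a few lines. This is a reader's sketch, not the author's implementation: the class name and method signatures are hypothetical, byte addressing is assumed, and the rule that raddi steps by whole elements of the register's width is one possible reading of "use register widths when incrementing alias addresses".

```python
class AliasRegisters:
    """Toy model of SISA's eight memory-aliasing registers (r0 to r7)."""

    SUPPORTED_WIDTHS = (1, 2, 4, 8, 16, 32)  # SISA-I widths, in bits

    def __init__(self):
        self.address = [0] * 8   # aliased memory address per register
        self.width = [32] * 8    # width field, defaults to 32 bits

    def rset(self, first, addresses):
        """Alias contiguous registers starting at `first` to `addresses`."""
        for offset, addr in enumerate(addresses):
            self.address[first + offset] = addr

    def rget(self, first, count=1):
        return self.address[first:first + count]

    def rsetw(self, first, widths):
        for offset, bits in enumerate(widths):
            if bits not in self.SUPPORTED_WIDTHS:
                raise ValueError(f"unsupported width: {bits}")
            self.width[first + offset] = bits

    def rgetw(self, first, count=1):
        return self.width[first:first + count]

    def maxwidth(self):
        return max(self.SUPPORTED_WIDTHS)

    def raddi(self, reg, immediate):
        """Advance an alias by `immediate` elements of the register's width.

        Assumption: byte addressing, so only widths of 8 bits or more are handled.
        """
        bits = self.width[reg]
        if bits < 8:
            raise NotImplementedError("sub-byte stepping is not modelled here")
        self.address[reg] += immediate * (bits // 8)


regs = AliasRegisters()
regs.rset(0, [0x1000, 0x2000])   # alias r0 and r1 with one instruction
regs.rsetw(1, [16])              # r1 now refers to 16-bit elements
regs.raddi(1, 3)                 # step r1 forward by three 16-bit elements
```

Here r1 ends up aliased to 0x2006, three 16-bit elements past its starting address, while r0 is untouched.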
SISA-I uses an eight-entry register table for storage of addresses and widths. When an rset or rsetw instruction is called, the corresponding entries are updated in the register table. This table then provides the values for rget and rgetw.
Data values are stored in four banks of 32-bit wide, 16-entry register files, each with two ports. This arrangement allows up to four address registers to be saved or restored in one cycle and all eight in two cycles. This storage and a 16-entry, 24-bit wide tag file function as a direct-mapped Level-0, or L0, cache. During instruction decode each aliased register specified is used to retrieve the currently aliased address, which is then used to access data in this cache.
It is not feasible to solely use a direct-mapped cache. It would fail in the case where two operands are aliased to different addresses that would occupy the same cache line. While it would be possible to fetch one value directly from memory, this would result in substantial performance degradation.
A victim cache is used to avoid this penalty in most cases. When a block is evicted from the direct-mapped cache, it is sent to this unit which can then supply it to the data path as required. The requirement for 28-bit fully-associative lookups suggests the victim cache be kept small, so SISA-I uses eight entries. It is 128 bits wide in order to hold an entry from each of the four L0 cache banks in one line.
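The 24-bit and 28-bit tag widths quoted above are consistent with one plausible address layout, sketched here. The layout itself is an assumption (32-bit byte addresses, with the bank chosen by the two bits just above the byte offset); the text gives only the sizes.

```python
def split_l0_address(addr):
    """Split a 32-bit byte address for the four-bank, 16-entry L0 cache."""
    byte_offset = addr & 0x3     # 2 bits: byte within a 32-bit word
    bank = (addr >> 2) & 0x3     # 2 bits: one of four banks (assumed placement)
    index = (addr >> 4) & 0xF    # 4 bits: one of 16 entries per bank
    tag = addr >> 8              # remaining 24 bits, matching the tag file width
    return tag, index, bank, byte_offset


def victim_tag(addr):
    """A 128-bit victim line holds 16 bytes, which leaves a 28-bit tag."""
    return addr >> 4


# Two addresses that collide in the direct-mapped cache: same bank and index,
# different tags. This is the case the victim cache is there to absorb.
first = split_l0_address(0x1010)
second = split_l0_address(0x2010)
```

Under this layout, 2 + 2 + 4 + 24 = 32 bits for the direct-mapped lookup and 4 + 28 = 32 bits for the victim cache, so both figures fall out of the stated sizes.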
The victim cache also functions as a write buffer. Normally, dirty cache lines will be written only when the bus to the next level in the memory hierarchy, the L1 data cache, is not otherwise busy. The exception is when the victim cache is full, in which case a write must take priority over a read.
While a set-associative cache could also be used, there is a performance issue. The associative lookup in the victim cache can be performed in parallel with the direct-mapped cache access. Thus, it will be known if the value resides in the victim cache at the start of the execution phase and it can be muxed into the data path with little delay. When the value is not in the victim cache, the tag check for the direct-mapped cache can be performed in parallel with the execution stage, which is not possible with a set-associative cache. This is not an issue in an implementation where performing a serial tag check in the execution stage is not in the critical path.
A simple implementation of SISA could directly access memory, but this would only be viable in very low-cost solutions or where memory requirements are so small that it may be implemented with performance comparable to the speed of a cache.
Operands pass from the decode stage to an ALU in the execution stage before results are written back to the L0 cache. Each instruction in these stages of the pipeline will keep an eight-bit destination field that specifies where the result will be written. Each register cache line will have a two-bit counter of the number of instructions in the pipeline that will write to it. The control logic ensures cache blocks waiting for results are not evicted and that instructions writing to a cache block that has not loaded from memory are stalled.
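The per-line counter and the two control rules can be sketched as follows. The class and method names are hypothetical, and stalling when the counter is saturated is an added assumption; the text only states the eviction and not-yet-loaded rules.

```python
class L0Line:
    """One L0 cache line with a pending-writer counter (two bits in SISA-I)."""

    def __init__(self, counter_bits=2):
        self.max_pending = (1 << counter_bits) - 1
        self.pending_writes = 0
        self.loaded = False   # True once the block has arrived from memory

    def can_issue_write(self):
        # Stall if the block has not loaded, or (assumed) the counter is full.
        return self.loaded and self.pending_writes < self.max_pending

    def issue_write(self):
        if not self.can_issue_write():
            raise RuntimeError("instruction must stall")
        self.pending_writes += 1

    def write_back(self):
        if self.pending_writes == 0:
            raise RuntimeError("no write in flight")
        self.pending_writes -= 1

    def can_evict(self):
        # A line still waiting for results must stay in the cache.
        return self.pending_writes == 0


line = L0Line()
line.loaded = True
for _ in range(3):
    line.issue_write()
stalled = not line.can_issue_write()   # counter saturated at three writers
pinned = not line.can_evict()          # results are still in flight
for _ in range(3):
    line.write_back()
```

Once all three results have been written back, the counter returns to zero and the line becomes evictable again.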
Advantages of SISA-I over a conventional pipelined processor include the speed of function calls and returns and context switches. The rset instruction may be used to quickly switch between many different sets of address registers as needed and control logic will move values in and out of register storage in parallel with other operations. Explicit loads and stores of data are removed from the instruction stream and are performed in parallel with ALU operations. Loading longer immediate values will take more cycles, but the 16-bit instruction format has been shown to suffer only a small performance penalty relative to a 32-bit format in a load-store architecture. [1]
3. Vector operations
Because operands are simply references to memory, it is easy to specify instructions that perform vector operations. While both scalar and vector instructions use a three operand format, scalar instructions have two source operands and a destination while the vector versions have one operand that is both source and destination, one source operand and an operand to specify vector length.
While the 16-bit instruction length does not allow a reasonable number of instructions with two 3-bit operands and the 8-bit immediate used in scalar instructions, there are a number of possible solutions. The immediate could be reduced to five bits to allow for a vector length operand, the vector length operand could be defined to come from a hardwired register, perhaps r0, or vector instructions with immediates may simply not be supported. The last option is chosen for the current specification.
An implementation may support any number of lanes [2] for vector operations, including one, and the maxveclen instruction, which takes a width as an argument, is provided to return this value. SISA-II supports operations on 128 bits, which may be subdivided by any power of two bits. This restriction is not a property of the architecture. This implementation supports only these formats for simplicity and performance.
To minimize data path complexity, the restriction that only one of the operands may come from the victim cache is enforced. When an instruction specifying two operands that reside in the victim cache is encountered, one of the operands is moved to the direct-mapped cache before the instruction proceeds.
For the bit-wise logical instructions and the move instruction, there is no difference between operating on, for example, a vector of four 32-bit values and a vector of two 64-bit values. For addition and subtraction, the difference is only which carry bits are passed. Larger operations may be performed over multiple cycles. SISA-II supports 64-bit additions in one cycle and 128-bit additions in two.
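The remark about carry bits can be shown with a small software model of a packed add, in which the lane width decides where carries are cut off. The function name and the use of a Python integer to stand in for a 128-bit vector are illustrative choices, not part of SISA.

```python
def packed_add(a, b, lane_bits, total_bits=128):
    """Add two packed vectors lane by lane, dropping carries at lane edges."""
    if total_bits % lane_bits:
        raise ValueError("lane width must divide the vector width")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, total_bits, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        result |= (lane_sum & lane_mask) << shift   # carry out of the lane is lost
    return result


a = 0xFFFFFFFF   # lowest 32-bit lane is all ones
b = 0x00000001
as_32bit_lanes = packed_add(a, b, lane_bits=32)   # carry stops at bit 32
as_64bit_lanes = packed_add(a, b, lane_bits=64)   # carry passes into bit 32
```

The same two bit patterns give 0 when treated as 32-bit lanes and 0x100000000 when treated as 64-bit lanes; the only difference is whether the carry crosses bit 32.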
Multiplication presents a bigger challenge. Multiplication of operands with widths larger than those natively supported can be synthesized by combining smaller width multiplications and additions, essentially decomposing the operands into "digits" of the smaller bit width. While providing a more consistent interface, these operations will introduce complexity and may be better handled in software. SISA-II supports multiplications on operands of up to 32 bits and on vectors returning a maximum of 128 bits. Note that the width of the result register is passed to the execution stage and that this determines the limit rather than the sum of the widths of the operands. The maxmult instruction takes a result width and returns the maximum number of consecutive multiplications supported.
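The "digits" decomposition is ordinary schoolbook multiplication with a large base. The sketch below assumes unsigned operands and 32-bit digits; the function name and the optional `result_bits` truncation, which mimics the result register width setting the limit, are illustrative.

```python
def wide_multiply(a, b, digit_bits=32, result_bits=None):
    """Multiply unsigned integers using only digit_bits x digit_bits products."""
    mask = (1 << digit_bits) - 1

    def digits(value):
        """Split `value` into base-2**digit_bits digits, least significant first."""
        out = []
        while value:
            out.append(value & mask)
            value >>= digit_bits
        return out or [0]

    result = 0
    for i, da in enumerate(digits(a)):
        for j, db in enumerate(digits(b)):
            # Each partial product fits in 2 * digit_bits bits; shift it into place.
            result += (da * db) << (digit_bits * (i + j))
    if result_bits is not None:
        result &= (1 << result_bits) - 1   # the result register width sets the limit
    return result


x = 0x123456789ABCDEF0
y = 0xFEDCBA9876543210
product = wide_multiply(x, y)
truncated = wide_multiply(x, y, result_bits=64)
```

A 64-bit by 64-bit multiply therefore costs four 32-bit multiplications plus the additions that combine them, which is the complexity the text suggests may be better left to software.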
4. Superscalar execution
The multiple execution units of SISA-II are underutilized with shorter scalar operations. To take advantage of this idle hardware, SISA-III examines two instructions at a time. The short instruction word keeps the demands placed on the instruction cache low and the three-bit register operands simplify decode logic. Three bits are now used to record the number of in-flight instructions that will write to a cache line.
Instead of adding ports to register storage, SISA-III simply does not allow the issue of two instructions that would require more than two operands be read from different lines in the same bank. Similarly, one pipeline is stalled when two instructions attempt to write to different lines in the same bank. While these decisions will decrease the IPC (instructions per cycle) below what would otherwise be possible, they provide a significant increase in the IPC for a modest investment in hardware. Simultaneous multi-threading is also possible by duplicating the address, width and PC registers and inserting logic to guard against structural hazards.
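The two bank rules can be expressed as a simple check on a pair of instructions. This is a sketch under assumptions: the instruction representation (dicts of (bank, line) pairs) is invented, both rules are folded into a single "can proceed together" answer, and data dependences between the two instructions are not checked.

```python
from collections import defaultdict

PORTS_PER_BANK = 2   # each register bank has two ports


def can_dual_issue(first, second):
    """Each instruction is a dict: 'reads' is a list of (bank, line), 'write' one pair."""
    lines_read = defaultdict(set)
    for instr in (first, second):
        for bank, line in instr["reads"]:
            lines_read[bank].add(line)
    if any(len(lines) > PORTS_PER_BANK for lines in lines_read.values()):
        return False   # more than two distinct lines read from one bank
    (bank_a, line_a), (bank_b, line_b) = first["write"], second["write"]
    if bank_a == bank_b and line_a != line_b:
        return False   # write conflict: one pipeline would have to stall
    return True


add_instr = {"reads": [(0, 1), (0, 2)], "write": (1, 0)}
sub_instr = {"reads": [(0, 3), (2, 0)], "write": (3, 0)}
mul_instr = {"reads": [(0, 1), (2, 5)], "write": (3, 1)}

conflict = can_dual_issue(add_instr, sub_instr)   # three lines needed from bank 0
paired = can_dual_issue(add_instr, mul_instr)     # bank 0 needs only lines 1 and 2
```

The check is cheap precisely because it replaces extra ports with a refusal to pair instructions, which is the trade the paragraph describes.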
5. Additional Work
An assembler and an implementation of this architecture for a Xilinx Spartan-3 FPGA are currently undergoing testing. Results for assembly programs vs. equivalents targeted for a MIPS processor will be available shortly. A more thorough investigation requires targeting a compiler for the SISA architecture and perhaps developing a simulator as well.
There is no reason this architecture must be limited to 16 bits. An instruction set with a different instruction width could easily be defined to support, for example, different immediate fields. An instruction length of 32 bits allows support for directly addressing data values in addition to the indirect addressing specified. To support a legacy instruction set alongside this new framework, architecture registers could be defined to reside at a certain memory address and instructions could be dynamically translated. Because data values in load-store architectures must be saved on context switches, they are already associated with a memory address. Static translation of legacy code for traditional architectures to SISA is also possible using this technique.
The performance of a SISA implementation with an L0 cache is of course dependent on the performance of the cache. How the performance of various workloads varies with different L0 cache implementations warrants further investigation, but interesting results require a benchmark such as SPEC CINT2000 be ported to SISA.
To improve the performance of the L0 cache, operand address prediction could be used to prefetch data. Such a mechanism could rely on the branch prediction mechanism and record operations on the register addresses and widths in each path, or a method of predicting operand addresses independent of the branch prediction unit could be developed.
Different compilation techniques and optimizations will be necessary for SISA. GCC [3] 4.0 supports auto-vectorization from which SISA would benefit. Because the radd and raddi instructions may operate on more than one consecutive register, one or more registers may be used to refer to the top values of a stack or stacks. Multiple registers may also refer to the same address but have different widths.
6. Related Research
Ditzel and McLellan proposed [4] using a cache instead of registers for operands, citing disadvantages that have been addressed by advances in technology and architecture. The Virtual Context Architecture [5] first studies a memory-memory architecture that uses memory addresses as operands and then modifies the Alpha architecture to support register windows that map logical registers to memory locations based on context. Banked register files have been studied by Cruz, et al. [6]
7. Conclusion
A Scalable Instruction Set Architecture has been defined that allows microprocessor implementations to support operations on arbitrarily long data elements, vector lengths and memory address sizes. It has the additional advantages of using only a 16-bit instruction and allowing for fast function calls and returns and context switches. The adoption of such an architecture would greatly reduce the costs of introducing new processors supporting these new operations.
[1] J. Bunda, D. Fussell, R. Jenevein, and W.C. Athas, "16-Bit vs. 32-Bit Instructions for Pipelined Microprocessors", Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 237-246, 1993.
[2] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA, 2003.
[3] Free Software Foundation. GNU Compiler Collection. http://gcc.gnu.org.
[4] D.R. Ditzel and H.R. McLellan, "Register Allocation for Free: The C Machine Stack Cache". In Symposium on Architectural Support for Programming Languages and Operating Systems. SIGPLAN Not. 17, 4 (Apr. 1982), 48-56.
[5] D. Oehmke, N. Binkert, S. Reinhardt and T. Mudge, "Design and Applications of a Virtual Context Architecture". https://www.eecs.umich.edu/techreports/cse/2004/CSE-TR-497-04.pdf
[6] J.-L. Cruz, A. Gonzalez, M. Valero, and N. E. Topham, "Multiple-Banked Register File Architectures". In Proceedings of the ISCA-27, pages 316-325, 2000.
From the article:

"...assuming that all those cores are given the necessary appropriate memory and I/O bandwidth."


Don't brush this off with just a passing mention. Keeping data supplied to 80 cores is going to be more difficult than it will be deciding what to use all these processing units for or how to program them. Processors that don't have anything to work on will sit idle. As clock speeds have increased, memory and bus speeds have not kept up the same pace.

Even if the speed of cores is kept constant, a processor with 80 cores is a leap of more than 6 doublings. In terms of applying Moore's Law, this would be an advance of 9 to 12 years of processing power increases from single-core processors. If Intel can produce 80-core chips in, say, five years, will memory technology be able to keep up, let alone practically double the speed curve that has been followed for the last decade or two? I'm skeptical.

- silent E
--
Turning dams into dames since 1974.
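The "more than 6 doublings" and "9 to 12 years" figures in the comment above can be checked with a short calculation. The 18-month and 24-month doubling periods are the commonly quoted forms of Moore's Law and are assumed here; the comment does not say which it used.

```python
import math

doublings = math.log2(80)        # going from a single core to 80 cores
years_fast = doublings * 1.5     # assuming one doubling every 18 months
years_slow = doublings * 2.0     # assuming one doubling every 24 months
print(f"{doublings:.2f} doublings, {years_fast:.1f} to {years_slow:.1f} years")
```

This gives roughly 6.3 doublings, or about 9.5 to 12.6 years, in line with the comment's estimate.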
80 is nonsense for the next decade
stephen.oh33@... 13th Feb 2007
It reminds me of the days when MHz or GHz was important, until no one needed more. The same goes for the number of cores. And of course, I am referring to everyday life and the normal population, not the business community.

One post on Ars Technica said 80 cores isn't enough and pointed out multiple ways they can be utilized. My question: what is the point of running that process faster?

Example: instead of taking 10 seconds to process a job/task, the computer now takes 1 nanosecond.

Impact 1: You get your answer "right-away".

Impact 2: Your computer or server will be left alone, idle, for (at least) the next 9 seconds. You can't process information faster than that, as human interaction is part of the way we process/validate/execute a decision/action.

Next example: I can use 80 cores for
- Running the OS faster, from a 15-second boot time to 5 seconds (it can't be faster, as BIOS/HW/SW initialization takes time)
- Opening a web browser, from 2 seconds to 1 nanosecond (in user experience, it means right away)
- Typing in a web address or clicking a link - remains the same - 2 seconds (we just can't click or type faster)
- Waiting for a web page to load - remains the same (it is the broadband speed that matters, not the cores)
- Browsing a webpage/email - slightly faster due to image processing - but faster by only a fraction of a second (as download speed is still key for those activities)
- Typing a document - almost the same (your typing skill is more important than cores)
- Loading/saving/deleting a document, from ~3 seconds to instant
- Copying a document from your computer to the network - from 5-10 seconds to 5 seconds (slightly faster, but limited by network utilization, speed and bandwidth)
.... the list goes on

Other good impacts: MP3 encoding/decoding, photo/image/video editing, computer games, CAD, animation design/development... all of those improve tremendously, as long as the software is capable of using the cores.

How to use all 80 cores?
- How does the enterprise customer fill up the idle time created by faster processing, considering that there is a limit to how fast a human can work?
- How does our everyday life - instant messaging, blogging, surfing the Internet, email... - improve with 80 cores? (The fact is, not that much.)
- What is the real need for everyone? There is infinite electric power available in everyday life, but we are limited by the devices we use; most of the time we buy a device that keeps drawing electricity even though we never really use it much. That is acceptable because electricity is cheap for everyone.

Until someone is able to fix/improve/innovate the ecosystem and the human interaction model (data and/or information usage), getting an 80-core system remains an inefficient way of utilizing resources. We should only buy what we need and what we are capable of using.
simple multi core trend
poeta nascitur 14th Feb 2007
Currently one of Intel's differentiators is the complexity of designing a Pentium, for example. It requires so many people to tune the processor to the technology and squeeze out every bit of performance. Only a few companies out there can afford such an investment. On the other hand, this trend towards multicore chips, composed of replicated simple cores and therefore scalable, can potentially require fewer people to design them (I've read that Polaris was the effort of 30 engineers, HW and SW, in a bit more than 1 year). The difficult part in the future will be, as mentioned, to program those cores. I wonder if Intel is not shooting itself in the foot in the long term, as more companies will be capable of providing the underlying hardware substrate composed of a similar multicore chip. Is it crazy to think that?
That's a HUGE Ass umption...
Sxooter_z 14th Feb 2007
QUOTE:
More cores and more threads means more transactions per unit time, assuming that all those cores are given the necessary appropriate memory and I/O bandwidth.
ENDQUOTE:

Ummm. How can you just assume that 80+ cores will get enough memory and I/O bandwidth? It's already stretching things with 2/4 cores with each CPU with their own memory banks.

No way will memory architectures suddenly be able to supply 20-80 times the current memory bandwidth without using something insane like 8000 pin CPU packages.

That's almost as bad as the statement from someone a while back that once we find a really fast easy way to factor large primes, all encryption will be easily broken. Ummmmm, yeah.
Terrific article.
Prognosticator 26th Feb 2007
Clearly multi-core is the next stage of utilizing all those Moore's Law transistors. Very interesting perspective and now the 80 Core proof of concept research chip that Intel submitted to ISSCC makes sense. That is, research the IO and Memory bottlenecks area.

I think the "AOL researchers" that commented here have a good point. Four cores is all the cores anyone would ever need. Yeah.
Oh Please
Inflection 6th Mar 2007
Justin, Justin, Justin. Please spend your transistor budget more wisely. This rope does not deserve pushing. Let it lie and maybe the feeling will pass.
On the personal scale
Hrothgar - PCLinuxOS User 14th May 2007
I see computers turning more toward the efficiency scale to conserve energy, run cooler and shrink, of course. A unit the size of a deck or two of playing cards will serve as the PC and have wireless interfaces for all other input/output devices except for maybe video. At least I see that as logical, as you could simply stack upgrades like Lego blocks. Ah, I love to dream.
Right but for the wrong reasons
moloned@... 21st May 2007
80 cores is not enough because the existing prototype only achieves 16GFLOPS/W resulting in a whopping 62W power-dissipation @ 0.9V.

While obviously better than the current crop of multicore processors, the types of application envisaged mean that a much higher percentage of that 62W peak will be sustained compared to current multicore processors, which are throttled by external memory bandwidth for HPC applications such as FEM and CFD (assuming Polaris can be fed with data).

In my experience, designing for topline performance rather than for power efficiency results in a higher cost of ownership and operation than designing for the optimum power/performance for a given technology (in this case 65nm).

In order to maximise power efficiency for cost of operation, many more than the proposed 80 cores should be integrated into the same die size but operating at much lower power.

Secondly, in terms of keeping the cores fed, some of them can be dedicated to compressing and decompressing data from the external I/O, increasing the effective bandwidth "seen" by the other cores.
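The two figures quoted in this comment, 16 GFLOPS/W and 62 W, can be multiplied out as a quick consistency check against the teraflops headline. The calculation assumes both numbers refer to the same operating point.

```python
gflops_per_watt = 16   # efficiency quoted in the comment
watts = 62             # power dissipation quoted at 0.9 V
teraflops = gflops_per_watt * watts / 1000
print(f"{teraflops:.3f} TFLOPS")
```

The product is 0.992 TFLOPS, so the quoted efficiency and power are consistent with a chip delivering roughly one teraflop.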
Marketing Spin
moloned@... 21st May 2007
http://forums.techgage.com/attachment.php?attachmentid=155&d=1171236277

The ISSCC paper on Polaris is posted above, and my reading of it is that this is purely an exercise in technology development. Borkar's group published a paper at ISSCC a few years ago on a fast FP MAC, out of which Polaris has been developed.

Despite Intel's claims Polaris in this form is not usable for HPC applications as it only supports single-precision arithmetic.

The FP unit uses deferred normalisation so any loop of code which stores results back to memory will have to perform an additional normalisation step which will degrade the top-line performance.

Furthermore the on-chip data-memory per node is only 2kB (512x 32-bit words) deep, and the 3kB instruction memory will only hold 256x 96-bit instructions.

The small memory and reliance on the NoC to supply data will mean that performance and power will be very high when compared with the IBM Cell which has 256kB/node local data/program storage.

Most tellingly of all the instruction-set only supports FP MACs, no divides, square-roots etc. so it is only really of use for a headline-grabbing marketing exercise.
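The memory sizes quoted in this comment can be cross-checked from the word counts it gives. The comparison with Cell treats 256 kB as 256 × 1024 bytes and sets it against Polaris's combined data and instruction memory; both choices are assumptions made for this check.

```python
data_bytes = 512 * 32 // 8      # 512 words of 32 bits
instr_bytes = 256 * 96 // 8     # 256 instructions of 96 bits
cell_bytes = 256 * 1024         # Cell local store per node, as quoted
ratio = cell_bytes / (data_bytes + instr_bytes)
print(data_bytes, instr_bytes, ratio)
```

The word counts do give 2 kB of data memory and 3 kB of instruction memory per node, about one fiftieth of the per-node storage quoted for Cell.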
0 Votes
+ -
SISA: A Scalable Instruction Set Architecture
by Brian Fidel Davila
1. Introduction
The never-ending pursuit of higher performance dictates the ongoing development
of
processors that perform computations on progressively bigger blocks of data at a
time.
Two methods of achieving this goal are increasing the size of the data word,
demonstrated most recently by the migration from 32 to 64-bit computing, and
having
an instruction operate on more words at once, as seen in vector computers and the
?multimedia?, ?DSP? or ?SIMD? extensions now common in modern processors.
While it is tempting to design a new architecture carte blanche every time
technological
advances make new operations possible, this is rarely feasible. Market realities
dictate
that pre-existing software continue to be supported unmodified. As a result,
instructions
are added to an existing architecture to provide this functionality.
The need to modify instruction sets for new data formats is a fundamental
limitation of
standard architectures. Ideally, an architecture would not be tied to specific data
sizes.
New processors could then support different operations without requiring
instruction set
modifications.
This paper presents a scalable instruction set architecture, SISA, which meets these
requirements. While any processor has limits on the data formats it supports, SISA
implementations are able to support operations on larger data words and arbitrarily
long
vectors using the same interface.
Section two describes memory aliasing registers and their use in a simple implementation of SISA, the SISA-I. Next, section three defines vector operations and support for larger data words and introduces a vector machine, SISA-II. Section four continues with the SISA-III, which supports superscalar execution. Section five lists additional work, section six covers related research and section seven concludes.
2. Registers
"Any problem in computer science can be solved by adding another level of indirection"
- Alan Kay
To minimize demands on the instruction data path, SISA instructions are 16 bits long. This instruction length allows only eight architecture registers, r0 to r7. Each register is associated with, or aliased to, a memory address with the rset instruction. These addresses may subsequently be accessed with the rget instruction. Addresses may be read or written from any number of contiguous registers with a single instruction, though register storage bandwidth or other limitations may prevent operations on larger subsets of the registers from executing in a single cycle. Memory addresses may also be modified with the radd and raddi instructions, which add the value at a memory address or an immediate, respectively.
SISA registers also have a width field, which defaults to 32 bits. This may be changed with the rsetw instruction. Register widths may be retrieved with rgetw. Both instructions may modify multiple registers as with rset and rget. The maxwidth instruction returns the maximum width an implementation supports.
In SISA-I, a register's width may be set to one, two, four, eight, 16 or 32 bits. The architecture may support any width; SISA-I supports only these widths for simplicity and performance. The widths of each operand and the result are passed to the execution stage so that it may perform the correct operation. The radd and raddi instructions also use register widths when incrementing alias addresses.
SISA-I uses an eight-entry register table for storage of addresses and widths. When an rset or rsetw instruction is called, the corresponding entries are updated in the register table. This table then provides the values for rget and rgetw.
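
The register table described above can be pictured with a small software model. This is only a sketch under stated assumptions, not the SISA-I hardware: the class and method names are hypothetical, addresses are modeled as bit offsets, and raddi is assumed to advance an alias by a whole number of elements of the register's current width, which is one reading of "use register widths when incrementing alias addresses".

```python
from dataclasses import dataclass

ALLOWED_WIDTHS = (1, 2, 4, 8, 16, 32)  # widths SISA-I accepts, in bits


@dataclass
class Alias:
    address: int = 0  # aliased memory address (modeled as a bit offset: an assumption)
    width: int = 32   # register width in bits; 32 is the stated default


class RegisterTable:
    """Eight-entry table of (address, width) pairs, one per register r0..r7."""

    def __init__(self):
        self.entries = [Alias() for _ in range(8)]

    def rset(self, first, addresses):
        # Alias one or more contiguous registers, starting at register `first`.
        for offset, address in enumerate(addresses):
            self.entries[first + offset].address = address

    def rget(self, first, count=1):
        return [self.entries[first + i].address for i in range(count)]

    def rsetw(self, first, widths):
        for offset, width in enumerate(widths):
            if width not in ALLOWED_WIDTHS:
                raise ValueError(f"SISA-I does not support width {width}")
            self.entries[first + offset].width = width

    def rgetw(self, first, count=1):
        return [self.entries[first + i].width for i in range(count)]

    def raddi(self, reg, imm):
        # Assumed semantics: move the alias forward by `imm` elements of the
        # register's current width.
        entry = self.entries[reg]
        entry.address += imm * entry.width
```

With this model, narrowing r1 to 8 bits and calling `raddi(1, 3)` moves its alias forward by 24 bits, while the same call on a default 32-bit register moves it by 96.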
Data values are stored in four banks of 32-bit wide, 16-entry register files, each with two ports. This arrangement allows up to four address registers to be saved or restored in one cycle and all eight in two cycles. This storage and a 16-entry, 24-bit wide tag file function as a direct-mapped Level-0, or L0, cache. During instruction decode each aliased register specified is used to retrieve the currently aliased address, which is then used to access data in this cache.
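
One way to read these numbers is as a split of a 32-bit byte address into four fields. The sketch below assumes 32-bit byte addresses and places the bank bits below the index bits; neither detail is stated in the text, and the names are hypothetical. The field sizes do add up to the 24-bit tag mentioned above: 24 (tag) + 4 (one of 16 entries) + 2 (one of four banks) + 2 (byte within a 32-bit word) = 32.

```python
from typing import NamedTuple


class L0Address(NamedTuple):
    tag: int    # upper 24 bits, compared against the tag file
    index: int  # 4 bits, selects one of the 16 entries
    bank: int   # 2 bits, selects one of the four banks
    byte: int   # 2 bits, byte within the 32-bit word


def split_address(address: int) -> L0Address:
    """Split a 32-bit byte address into L0 cache fields (assumed layout)."""
    if not 0 <= address < (1 << 32):
        raise ValueError("expected a 32-bit address")
    return L0Address(
        tag=address >> 8,
        index=(address >> 4) & 0xF,
        bank=(address >> 2) & 0x3,
        byte=address & 0x3,
    )
```

Under this layout, addresses 0x00000010 and 0x00000110 share index 1 but carry different tags, so they compete for the same cache line.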
It is not feasible to solely use a direct-mapped cache. It would fail in the case where two operands are aliased to different addresses that would occupy the same cache line. While it would be possible to fetch one value directly from memory, this would result in substantial performance degradation.
A victim cache is used to avoid this penalty in most cases. When a block is evicted from the direct-mapped cache, it is sent to this unit which can then supply it to the data path as required. The requirement for 28-bit fully-associative lookups suggests the victim cache be kept small, so SISA-I uses eight entries. It is 128 bits wide in order to hold an entry from each of the four L0 cache banks in one line.
The victim cache also functions as a write buffer. Normally, dirty cache lines will be written only when the bus to the next level in the memory hierarchy, the L1 data cache, is not otherwise busy. The exception is when the victim cache is full, in which case a write must take priority over a read.
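
The interaction between the direct-mapped L0 cache and the victim cache can be sketched as follows. This is a behavioral toy rather than the SISA-I design: it tracks only which 128-bit line lives where, assumes 16-byte lines addressed by `address >> 4`, and assumes the oldest victim entry is the one written back when the victim cache is full (the text does not specify a replacement policy). All names are hypothetical.

```python
from collections import OrderedDict


class L0WithVictim:
    def __init__(self, lines=16, victim_entries=8):
        self.lines = lines
        self.direct = [None] * lines   # line address held by each direct-mapped slot
        self.victim = OrderedDict()    # evicted line addresses, oldest first
        self.victim_entries = victim_entries
        self.writebacks = []           # lines pushed out toward the L1 data cache

    def access(self, address):
        line = address >> 4            # 16-byte lines: an assumption
        slot = line % self.lines
        if self.direct[slot] == line:
            return "l0"
        if line in self.victim:
            return "victim"            # supplied to the data path by the victim cache
        evicted = self.direct[slot]
        self.direct[slot] = line
        if evicted is not None:
            if len(self.victim) == self.victim_entries:
                # Victim cache full: the oldest line must be written out first.
                oldest, _ = self.victim.popitem(last=False)
                self.writebacks.append(oldest)
            self.victim[evicted] = True
        return "miss"
```

Two operands at 0x0010 and 0x0110 map to the same slot. After one miss each, the first is served from the victim cache and the second from L0, so the pair no longer thrashes.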
While a set-associative cache could also be used, there is a performance issue. The associative lookup in the victim cache can be performed in parallel with the direct-mapped cache access. Thus, it will be known if the value resides in the victim cache at the start of the execution phase and it can be muxed into the data path with little delay. When the value is not in the victim cache, the tag check for the direct-mapped cache can be performed in parallel with the execution stage, which is not possible with a set-associative cache. This is not an issue in an implementation where performing a serial tag check in the execution stage is not in the critical path.
A simple implementation of SISA could directly access memory, but this would only be viable in very low-cost solutions or where memory requirements are so small that it may be implemented with performance comparable to the speed of a cache.
Operands pass from the decode stage to an ALU in the execution stage before results are written back to the L0 cache. Each instruction in these stages of the pipeline will keep an eight-bit destination field that specifies where the result will be written. Each register cache line will have a two-bit counter of the number of instructions in the pipeline that will write to it. The control logic ensures that cache blocks waiting for results are not evicted and that instructions writing to a cache block that has not loaded from memory are stalled.
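
The per-line counters can be modeled in a few lines. This sketch covers only the eviction guard; stalling an instruction when a counter is already saturated is an assumption added here to keep a two-bit counter from overflowing, and the names are hypothetical.

```python
class PendingWrites:
    """Per-line count of in-flight instructions that will write the line."""

    def __init__(self, lines=16, counter_bits=2):
        self.limit = (1 << counter_bits) - 1
        self.counts = [0] * lines

    def issue(self, line):
        # Returns False (stall) if the counter cannot record another writer.
        if self.counts[line] == self.limit:
            return False
        self.counts[line] += 1
        return True

    def writeback(self, line):
        if self.counts[line] == 0:
            raise RuntimeError("write-back without a matching issue")
        self.counts[line] -= 1

    def may_evict(self, line):
        # A line with results still in flight must stay in the cache.
        return self.counts[line] == 0
```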
Advantages of SISA-I over a conventional pipelined processor include the speed of function calls and returns and context switches. The rset instruction may be used to quickly switch between many different sets of address registers as needed and control logic will move values in and out of register storage in parallel with other operations. Explicit loads and stores of data are removed from the instruction stream and are performed in parallel with ALU operations. Loading longer immediate values will take more cycles, but the 16-bit instruction format has been shown to suffer only a small performance penalty relative to a 32-bit format in a load-store architecture. [1]
3. Vector operations
Because operands are simply references to memory, it is easy to specify instructions that perform vector operations. While both scalar and vector instructions use a three operand format, scalar instructions have two source operands and a destination while the vector versions have one operand that is both source and destination, one source operand and an operand to specify vector length.
Vector instructions that take an immediate pose a problem: the 16-bit instruction length does not allow a reasonable number of instructions with two 3-bit operands and the 8-bit immediate used in scalar instructions. There are a number of possible solutions. The immediate could be reduced to five bits to allow for a vector length operand, the vector length operand could be defined to come from a hardwired register, perhaps r0, or vector instructions with immediates may simply not be supported. The last option is chosen for the current specification.
An implementation may support any number of lanes [2] for vector operations, including one, and the maxveclen instruction, which takes a width as an argument, is provided to return this value. SISA-II supports operations on 128 bits, which may be subdivided into elements of any power-of-two width. This restriction is not a property of the architecture; this implementation supports only these formats for simplicity and performance.
To minimize data path complexity, the restriction that only one of the operands may come from the victim cache is enforced. When an instruction specifying two operands that reside in the victim cache is encountered, one of the operands is moved to the direct-mapped cache before the instruction proceeds.
For the bit-wise logical instructions and the move instruction, there is no difference between operating on, for example, a vector of four 32-bit values and a vector of two 64-bit values. For addition and subtraction, the difference is only which carry bits are passed. Larger operations may be performed over multiple cycles. SISA-II supports 64-bit additions in one cycle and 128-bit additions in two.
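
The point about carry bits can be made concrete with a small model of a packed add. This is an illustrative sketch, not SISA-II's adder: the function name is hypothetical, and it simply discards the carry out of each lane instead of passing it to the next one.

```python
def vector_add(a: int, b: int, lane_bits: int, total_bits: int = 128) -> int:
    """Add two packed vectors, dropping the carry out of every lane."""
    if lane_bits <= 0 or total_bits % lane_bits:
        raise ValueError("lane width must divide the vector width")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, total_bits, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        # Keeping only lane_bits of the sum is what blocks the carry.
        result |= (lane_sum & lane_mask) << shift
    return result
```

Adding 1 to 0xFFFFFFFF gives 0 when the data is treated as 32-bit lanes, because the carry out of bit 31 is blocked, and 0x100000000 when it is treated as 64-bit lanes, because the same carry is passed along. A bit-wise operation such as XOR would give the same answer either way.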
Multiplication presents a bigger challenge. Multiplication of operands with widths larger than those natively supported can be synthesized by combining smaller width multiplications and additions, essentially decomposing the operands into "digits" of the smaller bit width. While providing a more consistent interface, these operations will introduce complexity and may be better handled in software. SISA-II supports multiplications on operands of up to 32 bits and on vectors returning a maximum of 128 bits. Note that the width of the result register is passed to the execution stage and that this determines the limit rather than the sum of the widths of the operands. The maxmult instruction takes a result width and returns the maximum number of consecutive multiplications supported.
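
The "digit" decomposition is ordinary schoolbook multiplication in base 2^32. The sketch below shows how software (or microcode) might synthesize a wider unsigned multiply from 32-bit ones; the function name and the optional `result_bits` truncation are assumptions for illustration, the latter mirroring the remark that the result register's width sets the limit.

```python
def wide_multiply(a, b, width, digit_bits=32, result_bits=None):
    """Multiply unsigned `width`-bit operands using only digit-sized multiplies."""
    if width % digit_bits:
        raise ValueError("operand width must be a whole number of digits")
    mask = (1 << digit_bits) - 1
    digits = width // digit_bits
    product = 0
    for i in range(digits):
        a_digit = (a >> (i * digit_bits)) & mask
        for j in range(digits):
            b_digit = (b >> (j * digit_bits)) & mask
            # One native multiply, shifted into position and accumulated.
            product += (a_digit * b_digit) << ((i + j) * digit_bits)
    if result_bits is not None:
        product &= (1 << result_bits) - 1  # truncate to the result register width
    return product
```

A 64-bit multiply built this way needs four 32-bit multiplications plus the additions that combine them, which illustrates the complexity the text mentions.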
4. Superscalar execution
The multiple execution units of SISA-II are underutilized with shorter scalar operations. To take advantage of this idle hardware, SISA-III examines two instructions at a time. The short instruction word keeps the demands placed on the instruction cache low and the three-bit register operands simplify decode logic. Three bits are now used to record the number of in-flight instructions that will write to a cache line.
Instead of adding ports to register storage, SISA-III simply does not allow the issue of two instructions that would require more than two operands be read from different lines in the same bank. Similarly, one pipeline is stalled when two instructions attempt to write to different lines in the same bank. While these restrictions hold the IPC below what would otherwise be possible, dual issue still provides a significant increase in IPC for a modest investment in hardware. Simultaneous multi-threading is also possible by duplicating the address, width and PC registers and inserting logic to guard against structural hazards.
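
The issue restriction amounts to a port-counting check per bank. The following is a minimal sketch with hypothetical names; it assumes each operand is identified by a (bank, line) pair and that two reads of the same line can share a port, neither of which is spelled out in the text.

```python
from collections import defaultdict

PORTS_PER_BANK = 2  # each register file bank has two ports


def can_dual_issue(reads_a, reads_b):
    """True if both instructions' source operands fit the bank read ports.

    reads_a and reads_b are iterables of (bank, line) pairs.
    """
    lines_per_bank = defaultdict(set)
    for bank, line in list(reads_a) + list(reads_b):
        lines_per_bank[bank].add(line)
    return all(len(lines) <= PORTS_PER_BANK for lines in lines_per_bank.values())


def write_conflict(write_a, write_b):
    """True if the two results target different lines in the same bank."""
    return write_a[0] == write_b[0] and write_a[1] != write_b[1]
```

For example, two instructions that together read lines 1, 2 and 3 of bank 0 cannot issue together, while a pair that reads lines 1 and 2 of bank 0 plus lines in other banks can.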
5. Additional Work
An assembler and an implementation of this architecture for a Xilinx Spartan-3 FPGA are currently undergoing testing. Results for assembly programs vs. equivalents targeted for a MIPS processor will be available shortly. A more thorough investigation requires targeting a compiler for the SISA architecture and perhaps developing a simulator as well.
There is no reason this architecture must be limited to 16-bit instructions. An instruction set with a different instruction width could easily be defined to support, for example, different immediate fields. An instruction length of 32 bits allows support for directly addressing data values in addition to the indirect addressing specified. To support a legacy instruction set alongside this new framework, architecture registers could be defined to reside at a certain memory address and instructions could be dynamically translated. Because data values in load-store architectures must be saved on context switches, they are already associated with a memory address. Static translation of legacy code for traditional architectures to SISA is also possible using this technique.
The performance of a SISA implementation with an L0 cache is of course dependent on the performance of the cache. How the performance of various workloads varies with different L0 cache implementations warrants further investigation, but interesting results require a benchmark such as SPEC CINT2000 be ported to SISA.
To improve the performance of the L0 cache, operand address prediction could be used to prefetch data. Such a mechanism could either rely on the branch prediction mechanism and record operations on the register addresses and widths in each path, or use a method of predicting operand addresses developed independently of the branch prediction unit.
Different compilation techniques and optimizations will be necessary for SISA. GCC [3] 4.0 supports auto-vectorization, from which SISA would benefit. Because the radd and raddi instructions may operate on more than one consecutive register, one or more registers may be used to refer to the top values of a stack or stacks. Multiple registers may also refer to the same address but have different widths.
6. Related Research
Ditzel and McLellan proposed [4] using a cache instead of registers for operands, citing disadvantages that have been addressed by advances in technology and architecture. The Virtual Context Architecture [5] first studies a memory-memory architecture that uses memory addresses as operands and then modifies the Alpha architecture to support register windows that map logical registers to memory locations based on context. Banked register files have been studied by Cruz, et al. [6]
7. Conclusion
A Scalable Instruction Set Architecture has been defined that allows microprocessor implementations to support operations on arbitrarily long data elements, vector lengths and memory address sizes. It has the additional advantages of using only a 16-bit instruction and allowing for fast function calls and returns and context switches. The adoption of such an architecture would greatly reduce the costs of introducing new processors supporting these new operations.
[1] J. Bunda, D. Fussell, R. Jenevein, and W.C. Athas, "16-Bit vs. 32-Bit Instructions for Pipelined Microprocessors", Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 237-246, 1993.
[2] Hennessy, J. and Patterson, D. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA, 2003.
[3] Free Software Foundation. GNU Compiler Collection. http://gcc.gnu.org.
[4] Ditzel, D.R. and McLellan, H.R. Register Allocation for Free: The C Machine Stack Cache. In Symposium on Architectural Support for Programming Languages and Operating Systems. SIGPLAN Not. 17, 4 (Apr. 1982), 48-56.
[5] Oehmke, D., N. Binkert, S. Reinhardt and T. Mudge. Design and Applications of a Virtual Context Architecture. https://www.eecs.umich.edu/techreports/cse/2004/CSE-TR-497-04.pdf
[6] J.-L. Cruz, A. Gonzalez, M. Valero, and N. E. Topham. Multiple-Banked Register File Architectures. In Proceedings of the ISCA-27, pages 316-325, 2000.
