Complexity-Effective Superscalar Processors
by Subbarao Palacharla, Norman P. Jouppi, J. E. Smith
Details
type: misc
booktitle: Proceedings of the 24th Annual International Symposium on Computer Architecture
year: 1997
month: June 2--4
series: Computer Architecture News
annote: Subbarao Palacharla (Computer Sciences Department; University of Wisconsin-Madison; Madison, WI 53706
volume: 25,2
publisher: ACM Press
pages: 206--218
organization: ACM SIGARCH and IEEE Computer Society TCCA
abstract: The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8 µm, 0.35 µm, and 0.18 µm. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore
note: Published as 24th Annual International Symposium on Computer Architecture (24th ISCA'97), Computer Architecture News, volume 25, number 5
editor: Andrew R. Pleszkun and Trevor N. Mudge
address: Denver, Colorado
The paper presents an interesting analysis of what might become the bottleneck in an out-of-order processor as clock rates rise and feature sizes shrink. It identifies three structures that contribute the most significant delay in complex architectures: the register rename logic, which maps architectural registers to physical registers; the instruction wakeup and selection logic, which marks an instruction ready once all its operands are available and selects ready instructions for their functional units; and the bypass logic, which forwards results to waiting dependent instructions as early as possible. The paper shows that the delay of these circuits grows as the issue width of the processor increases. As a remedy it proposes restructuring the window that holds the in-flight instructions. The structures considered are FIFO queues and windows that are distributed across clusters rather than centralized.
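To make the scaling argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that each window entry holds two source operand tags and that every result tag produced in a cycle is broadcast to every entry, so each entry needs one comparator per source tag per broadcast tag. The function name and the sample configurations are hypothetical, chosen only for illustration.

```python
def wakeup_comparators(issue_width: int, window_size: int,
                       operands_per_entry: int = 2) -> int:
    """Count tag comparators in a fully broadcast wakeup array.

    Assumption of this sketch: every entry compares each of its source
    tags against every result tag broadcast in the same cycle.
    """
    return issue_width * window_size * operands_per_entry


# Hypothetical configurations: (issue width, window size).
for issue_width, window_size in [(2, 16), (4, 32), (8, 64)]:
    count = wakeup_comparators(issue_width, window_size)
    print(f"issue width {issue_width}, window {window_size}: {count} comparators")
```

Under these assumptions, doubling both the issue width and the window size quadruples the number of comparators, which gives a feel for why the wakeup logic is a concern for wide machines.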
As stated in the previous paragraph, the paper presents an interesting analysis of what may become bottlenecks when high complexity and a high clock rate are wanted together. The analysis is very informative, especially because it starts at the circuit level and shows how increasing the issue width and the window size requires additional circuitry to sustain the larger number of instructions. This extra circuitry in turn adds delay in delivering results. The paper makes the clear point that as feature sizes shrink, things might become even worse if complexity keeps increasing. Another interesting aspect of the paper is the solution it proposes, which is essentially the idea the Alpha 21264 was based upon: instead of a monolithic register file and instruction queue, distribute them across clusters. It is an interesting idea; however, as stated in the paper, inter-cluster communication would degrade performance. Another idea presented in the paper is to store instructions in FIFO queues instead of a window, so that results are broadcast only to the heads of the queues rather than to all instructions. This has a significant side effect: because only the head of a queue can issue, a stream of dependent instructions in a queue blocks any later instructions behind it, even those that are completely independent of the stream. Another open concern is how well the separation of instructions works in practice, because the scheme needs as little communication between clusters as possible.
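The head-of-queue blocking described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the paper's simulator: all instructions have unit latency, the flexible window has unlimited issue width, and the steering rule is simplified (follow a producer at a FIFO tail, else take an empty FIFO, else fall back to the shortest FIFO). The fallback, the `Instr` type, and the sample program are all hypothetical.

```python
from collections import deque
from typing import NamedTuple


class Instr(NamedTuple):
    name: str
    dest: str
    srcs: tuple


def steer(program, num_fifos):
    """Simplified steering; the shortest-FIFO fallback is an assumption."""
    fifos = [deque() for _ in range(num_fifos)]
    for ins in program:
        # Prefer the FIFO whose tail produces one of our sources.
        target = next((q for q in fifos if q and q[-1].dest in ins.srcs), None)
        if target is None:
            target = next((q for q in fifos if not q), None)  # empty FIFO
        if target is None:
            target = min(fifos, key=len)  # fallback: shortest FIFO
        target.append(ins)
    return fifos


def cycles_fifo(fifos, ready):
    """Each cycle, issue only FIFO heads whose sources are ready."""
    fifos = [deque(q) for q in fifos]
    ready, cycles = set(ready), 0
    while any(fifos):
        issued = [q.popleft() for q in fifos if q and set(q[0].srcs) <= ready]
        if not issued:
            raise RuntimeError("no FIFO head can issue")
        ready.update(i.dest for i in issued)  # results visible next cycle
        cycles += 1
    return cycles


def cycles_window(program, ready):
    """Each cycle, issue every ready instruction from one flexible window."""
    pending, ready, cycles = list(program), set(ready), 0
    while pending:
        issued = [i for i in pending if set(i.srcs) <= ready]
        if not issued:
            raise RuntimeError("no instruction can issue")
        pending = [i for i in pending if i not in issued]
        ready.update(i.dest for i in issued)
        cycles += 1
    return cycles


# Two short chains (a, b) followed by a longer independent chain (c).
program = [
    Instr("a1", "r1", ("r0",)), Instr("a2", "r2", ("r1",)),
    Instr("b1", "r3", ("r0",)), Instr("b2", "r4", ("r3",)),
    Instr("c1", "r5", ("r0",)), Instr("c2", "r6", ("r5",)),
    Instr("c3", "r7", ("r6",)), Instr("c4", "r8", ("r7",)),
]

print(cycles_fifo(steer(program, 2), {"r0"}))  # 6
print(cycles_window(program, {"r0"}))          # 4
```

In this toy program, `c1` is independent of chain `a` but lands behind it in the same FIFO, so the `c` chain cannot start until `a1` and `a2` have issued. The flexible window starts all three chains in the first cycle, which accounts for the two-cycle difference.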
All in all, the paper presents some interesting points about the complexity of out-of-order processors and high clock rates. One conclusion is that complexity and clock speed do not go hand in hand.