
Complexity-Effective Superscalar Processors

by Subbarao Palacharla, Norman P. Jouppi, J. E. Smith


Public comments

#1 posted on Nov 27 2013, 20:33 in collection CMU 18-740: Computer Architecture -- Fall 13
The paper presents an interesting analysis of what might become the bottleneck in an out-of-order processor when the goal is to increase the clock rate while also shrinking the feature size of the design. It identifies three components that cause the most significant delay in complex architectures: the register renaming logic, which maps architectural registers to physical registers; the instruction wakeup and selection logic, which issues an instruction to its designated functional unit once all of its operands are available; and the bypass logic, which forwards results to waiting instructions as soon as possible. The paper shows that as the issue width of the processor increases, the delay of the underlying circuits also increases. It proposes different solutions to this problem, starting with clustering the window that holds the in-flight instructions. The structures considered are FIFO queues and windows that are not centralized but distributed across different clusters.
As stated in the previous paragraph, the paper presents an interesting analysis of what may become bottlenecks when complexity and high clock rate are pursued together. The analysis is very informative, especially because it starts at the circuit level and shows how increasing the issue width and the window size requires additional circuitry to sustain the higher number of instructions. This additional circuitry also adds delay in delivering the results. The paper makes the clear point that as feature sizes shrink, things might become even worse if complexity increases. Another interesting aspect of the paper is the solution it proposes, which is essentially the idea on which the Alpha 21264 was based. Instead of having a monolithic register file and instruction queue, why not cluster them and make them distributed? It is an interesting idea; however, as stated in the paper, inter-cluster communication would degrade performance. Another idea presented in the paper is to use FIFO queues instead of a window for storing the instructions and, instead of broadcasting results to all of the instructions, to broadcast them only to the heads of the queues. This has a very big side effect: if a stream of dependent instructions is fed into a queue, those instructions block later instructions in the same queue that are completely independent of them, because only the head of the queue can issue. Another concern is how well the separation of instructions across clusters works, because you really need as little communication between clusters as possible.
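To make the blocking effect concrete, here is a minimal sketch of my own, not something taken from the paper. It assumes a toy four-instruction program, a one-cycle latency for every instruction, and hypothetical function names (issue_from_fifo_heads, issue_from_window). It compares issuing only from FIFO heads against issuing from a window that can pick any ready instruction.

```python
from collections import deque

# Each instruction is (name, destination register, list of source registers).
# Hypothetical toy program: i1 -> i2 -> i3 form a dependence chain,
# while i4 is independent of all of them.
PROGRAM = [
    ("i1", "r1", []),
    ("i2", "r2", ["r1"]),
    ("i3", "r3", ["r2"]),
    ("i4", "r4", []),
]


def issue_from_fifo_heads(fifos):
    """Issue only from the head of each FIFO; return {name: issue cycle}.

    Assumes every instruction has a one-cycle latency, so a result
    becomes visible to consumers in the following cycle.
    """
    queues = [deque(q) for q in fifos]
    ready = set()      # registers whose values have been produced
    issued = {}
    cycle = 0
    while any(queues):
        cycle += 1
        produced = []
        for q in queues:
            # Only the head of each queue is examined for issue.
            if q and all(src in ready for src in q[0][2]):
                name, dest, _ = q.popleft()
                issued[name] = cycle
                produced.append(dest)
        if not produced:
            raise RuntimeError("no head can issue: unsatisfied dependence")
        ready.update(produced)  # results visible from the next cycle on
    return issued


def issue_from_window(window, width):
    """Issue up to `width` ready instructions per cycle, oldest first."""
    pending = list(window)
    ready = set()
    issued = {}
    cycle = 0
    while pending:
        cycle += 1
        picked = [ins for ins in pending
                  if all(src in ready for src in ins[2])][:width]
        if not picked:
            raise RuntimeError("nothing can issue: unsatisfied dependence")
        for ins in picked:
            pending.remove(ins)
            issued[ins[0]] = cycle
        ready.update(ins[1] for ins in picked)
    return issued


if __name__ == "__main__":
    # One FIFO: i4 sits behind the chain and waits for it to drain.
    print(issue_from_fifo_heads([PROGRAM]))
    # Flexible window of the same program, two issues per cycle.
    print(issue_from_window(PROGRAM, width=2))
    # Two FIFOs with i4 steered to its own queue.
    print(issue_from_fifo_heads([PROGRAM[:3], PROGRAM[3:]]))
```

Under these assumptions, with a single FIFO the independent instruction i4 issues in cycle 4, after the whole chain has drained, while the window issues it in cycle 1 alongside i1. Steering i4 into its own FIFO removes the stall in this toy case, so how much blocking occurs depends on how the instructions are separated across queues.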
All in all, the paper presents some interesting aspects of the relationship between the complexity of out-of-order processors and high clock rates. One conclusion is that complexity and clock speed do not quite go hand in hand.