
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window
by Huiyang Zhou
Tags
dual core execution (1), single thread execution (1), buffer queue (1)
Public comments
#1 posted on Sep 26 2013, 17:58
Although multi-core processors deliver significantly improved system throughput, they do little for single-thread performance. In this paper, the author proposes a new execution paradigm that uses the multiple cores on a single chip collaboratively to achieve high performance for single-thread, memory-intensive workloads while retaining the flexibility to support multithreaded applications.
The proposed execution paradigm, dual-core execution (DCE), couples two superscalar cores (a front and a back processor) with a result queue. The front processor executes instructions as usual except for cache-missing loads, which produce an invalid value instead of blocking the pipeline; since it is never stalled by long-latency cache misses (it sees a virtually ideal L2 cache), it runs far ahead of the back processor. The front processor benefits the back processor in two major ways: (1) a highly accurate and continuous instruction stream, as the front processor resolves most branch mispredictions during its preprocessing, and (2) warmed-up data caches, as the cache misses initiated by the front processor become prefetches for the back processor. Experimental results show the remarkable latency-hiding capability of the proposed architecture.
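To make the invalid-value mechanism concrete, here is a minimal Python sketch of how a front processor might propagate an INVALID token instead of stalling on a cache-missing load. All class, register, and method names are illustrative assumptions for exposition, not the paper's implementation.

```python
INVALID = object()  # poison token standing in for an unavailable load value

class FrontProcessor:
    def __init__(self, cache):
        self.cache = cache   # dict: address -> value; a miss is an absent key
        self.regs = {}       # register file: name -> value or INVALID

    def execute_load(self, dst, addr):
        if addr in self.cache:
            self.regs[dst] = self.cache[addr]
        else:
            # Cache miss: do not block the pipeline. Mark the destination
            # invalid; the outstanding miss effectively becomes a prefetch
            # that warms the cache for the back processor.
            self.regs[dst] = INVALID

    def execute_add(self, dst, src1, src2):
        a = self.regs.get(src1, INVALID)
        b = self.regs.get(src2, INVALID)
        # Invalidity propagates: any result depending on an invalid
        # source is itself invalid, so the front core never stalls.
        self.regs[dst] = INVALID if INVALID in (a, b) else a + b

# Illustrative usage:
fp = FrontProcessor(cache={0x10: 7})
fp.execute_load("r1", 0x10)       # hit  -> r1 = 7
fp.execute_load("r2", 0x20)       # miss -> r2 = INVALID, no stall
fp.execute_add("r3", "r1", "r2")  # r3 = INVALID (propagated)
```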
A multiplexer (MUX) is added in front of the fetch unit of the back processor, and its control signal (mp) selects whether instructions are fetched from the result queue (single-thread mode) or from the back processor's own instruction cache (multithread mode); in this way, DCE has the flexibility to serve both single-thread and multithreaded workloads. In single-thread mode, long-latency cache misses initiated long ago at the front processor become prefetches for the back processor; in effect, the front processor, the result queue, and the back processor together form a very large instruction window that hides memory access latencies. For computation-intensive workloads, however, DCE is less efficient, as few cache misses are initiated at the front processor, and the multithread mode should be used instead. To further speed up the back processor, the result queue can also carry the execution results from the front processor and provide them as value predictions [21] to the back processor. Compared to a single superscalar processor with a very large centralized instruction window, DCE offers higher scalability, much less complexity, and potentially higher clock speed.
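The fetch-side selection is easy to picture in code. The sketch below, in the same illustrative Python style, models the MUX as a mode check on the back processor's fetch path; the queue-entry layout, class names, and the pairing of each instruction with the front core's result as a value prediction are assumptions for exposition, not the paper's interface.

```python
from collections import deque
from enum import Enum

class Mode(Enum):
    SINGLE_THREAD = 0   # mp steers fetch to the result queue
    MULTI_THREAD = 1    # mp steers fetch to this core's own I-cache

class BackProcessorFetch:
    def __init__(self, result_queue: deque, icache):
        self.result_queue = result_queue  # (instruction, front_result) pairs
        self.icache = icache              # callable: pc -> instruction
        self.pc = 0

    def fetch(self, mode: Mode):
        if mode is Mode.SINGLE_THREAD and self.result_queue:
            # Preprocessed instruction from the front core; its result can
            # seed execution here as a value prediction.
            instruction, predicted = self.result_queue.popleft()
            return instruction, predicted
        # Multithread mode: this core runs an independent thread, fetching
        # from its own instruction cache like a conventional superscalar.
        instruction = self.icache(self.pc)
        self.pc += 1
        return instruction, None
```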
One possible weakness: the handling of mispredicted branches, since branches that depend on invalid values cannot be resolved by the front processor and recovering from them at the back processor is costly.