12.7-times Energy Efficiency Increase of 16-bit Integer Unit by Power Supply Voltage (V_{DD}) Scaling from 1.2V to 310mV Enabled by Contention-less Flip-Flops (CLFF) and Separated V_{DD} between Flip-Flops and Combinational Logics

Hiroshi Fuketa¹, Koji Hirairi², Tadashi Yasufuku¹, Makoto Takamiya¹, Masahiro Nomura², Hirofumi Shinohara², Takayasu Sakurai³

¹ Institute of Industrial Science, University of Tokyo, Japan
² Semiconductor Technology Academic Research Center (STARC), Japan
E-mail: {fuketa, tdsh, mtaka, tsakurai}@iis.u-tokyo.ac.jp, {hirairi.koji, nomura.masahiro, shinohara.hirofumi}@starc.or.jp

Abstract—Contention-less flip-flops (CLFF’s) and separated power supply voltages (V_{DD}) between flip-flops (FF’s) and combinational logics are proposed to achieve a maximum energy efficiency operation. The proposed technologies were applied to a 16-bit integer unit (IU) for media processing in a 65-nm CMOS process. Measurement results of fabricated chips show that the proposed CLFF reduces the minimum operating voltage of IU’s by 64mV on average. By scaling V_{DD} from 1.2V to 310mV with the proposed CLFF, the maximum energy efficiency of 1835GOPS/W and the highest energy efficiency increase of 12.7 times are achieved.

Keywords: flip-flop, subthreshold circuit, variations

I. INTRODUCTION

Energy-efficient LSI’s, including media processors, are in strong demand as the market for mobile devices such as smart phones grows. Many sub/near-threshold logic circuits have been reported [1-5], because reducing the power supply voltage (V_{DD}) increases the energy efficiency of logic circuits. Low power (LP) CMOS processes with low leakage current are used for LSI’s in battery-powered mobile devices. LP CMOS with its high threshold voltage (V_{TH}), however, poses a new design challenge for energy-efficient sub/near-threshold logic circuits.

Fig. 1 shows the simulated V_{DD} dependence of the energy efficiency of 31-stage fan-out-4 2NAND ring oscillators in two types of 65-nm CMOS whose V_{TH} differs by 0.2V. Maximum energy efficiency operation (MEEO) is achieved at V_{DD} between 300mV and 400mV [1-4]. In this paper, the energy efficiency improvement factor (EEIF) is defined as the maximum energy efficiency normalized by the energy efficiency at the nominal V_{DD} of 1.2V. EEIF is 9.1 in the high performance (HP) CMOS process with low V_{TH}, which is consistent with [1]. In contrast, EEIF of LP CMOS is 11.4, which is higher than that of HP CMOS. The minimum operating voltage (V_{DDmin}) of LP CMOS, however, is higher than that of HP CMOS, because V_{DDmin} increases with increasing V_{TH}. In LP CMOS, therefore, MEEO cannot be achieved, because V_{DDmin} is higher than the MEEO V_{DD} of 310mV.
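
The EEIF definition and the V_{DDmin} condition above can be illustrated with a minimal Python sketch. The function names `eeif` and `meeo_reachable` are hypothetical helpers introduced here. The sweep values are placeholders rather than data read from Fig. 1: only the normalized peak of 11.4 at 0.31V echoes the LP CMOS numbers quoted in the text.

```python
def eeif(efficiency_by_vdd, nominal_vdd=1.2):
    """Return (MEEO V_DD [V], EEIF) from a mapping V_DD [V] -> energy efficiency."""
    nominal = efficiency_by_vdd[nominal_vdd]
    # MEEO is the supply voltage at which the energy efficiency peaks.
    meeo_vdd = max(efficiency_by_vdd, key=efficiency_by_vdd.get)
    return meeo_vdd, efficiency_by_vdd[meeo_vdd] / nominal


def meeo_reachable(meeo_vdd, vddmin):
    """MEEO is reachable only if the circuit still operates at the MEEO voltage."""
    return vddmin <= meeo_vdd


# Placeholder sweep, normalized to the 1.2-V point (NOT measured or simulated data).
sweep = {1.2: 1.0, 0.8: 2.4, 0.5: 6.0, 0.31: 11.4, 0.25: 9.0}
meeo_vdd, factor = eeif(sweep)
print(meeo_vdd, factor)
print(meeo_reachable(meeo_vdd, vddmin=0.45))  # V_DDmin above the MEEO voltage
```

With a V_{DDmin} above the MEEO voltage, `meeo_reachable` returns `False`, which is the situation described for LP CMOS.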

In this paper, contention-less flip-flops (CLFF) and separated V_{DD} between flip-flops (FF’s) and combinational logics are proposed in order to realize MEEO in LP CMOS. The proposed technologies are applied to a 16-bit integer unit (IU) for media processing, and the highest EEIF of 12.7 is achieved by reducing V_{DD} from 1.2V to 310mV.

The remainder of this paper is organized as follows. Section II describes the structure of the developed IU and the proposed CLFF. Measurement results are shown in Section III. Finally, Section IV concludes this paper.

II. INTEGER UNIT (IU) AND CONTENTION-LESS FLIP-FLOP (CLFF)

A. Integer Unit (IU)

Fig. 2 shows a block diagram of the developed 16-bit IU, which implements the popular media processing commands listed in the figure. In this paper, (1) the proposed CLFF and (2) separated V_{DD} between the combinational logic (V_{DD(LOGIC)}) and the FF’s (V_{DD(FF)}) are implemented in the IU. Two types of IU’s, one with the conventional FF’s and one with the proposed CLFF’s, are developed for comparison in a 65-nm CMOS process. The chip micrograph is shown in Fig. 3. The area penalty of the separated V_{DD} is 6%.

978-1-61284-660-6/11/$26.00 © 2011 IEEE
Fig. 4 (a) shows the measured shmoo plot of the IU with the conventional FF’s. $V_{DD(FF)}$ is equal to $V_{DD(LOGIC)}$ in Fig. 4 (a), while $V_{DD(FF)}$ and $V_{DD(LOGIC)}$ are separated in Fig. 4 (b). In Fig. 4 (a), the IU does not operate correctly below 450mV even if the clock frequency is reduced from 35MHz to 10kHz, which indicates that $V_{DDmin}$ of the IU is 450mV. In this case, MEEO cannot be achieved, because the MEEO $V_{DD}$ of 310mV is lower than the 450-mV $V_{DDmin}$. $V_{DDmin}$ is determined by either the combinational logic or the FF’s. To identify which of the two determines $V_{DDmin}$, in Fig. 4 (b) $V_{DD(FF)}$ is held at 450mV and only $V_{DD(LOGIC)}$ is reduced below 450mV. In this case, $V_{DDmin}$ is reduced to 350mV, which indicates that $V_{DDmin}$ of the IU is determined by the FF’s. Therefore, reducing $V_{DDmin}$ of the FF’s is required to achieve MEEO.
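
The way $V_{DDmin}$ is read from a shmoo plot, and the way the two measurements in Fig. 4 are compared, can be sketched as follows. This is only an illustration: the function names and the pass/fail grid are hypothetical, and the grid is not the measured shmoo data.

```python
def vddmin_from_shmoo(shmoo):
    """Return the lowest V_DD [mV] at which at least one clock frequency passes.

    shmoo maps V_DD [mV] -> {clock frequency [Hz]: True if the IU passed}.
    """
    passing = [vdd for vdd, row in shmoo.items() if any(row.values())]
    return min(passing) if passing else None


def limiting_block(vddmin_shared, vddmin_logic_only):
    """Decide which block sets V_DDmin when both blocks share one supply.

    If operation extends to a lower voltage once only the logic supply is
    scaled (FF supply held fixed), the FF's were the limit.
    """
    return "FF" if vddmin_logic_only < vddmin_shared else "logic or both"


# Hypothetical pass/fail grid (placeholder values, not the measured shmoo).
shared_supply = {
    500: {10e3: True, 35e6: True},
    450: {10e3: True, 35e6: False},
    400: {10e3: False, 35e6: False},
}
print(vddmin_from_shmoo(shared_supply))
print(limiting_block(vddmin_shared=450, vddmin_logic_only=350))
```

Calling `limiting_block` with the 450mV and 350mV values reported above returns `"FF"`, matching the conclusion drawn from Fig. 4 (b).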

B. Contention-Less Flip-Flop (CLFF)

Fig. 5 (a) shows a schematic of the conventional tri-state buffer based FF (TBFF) used in Fig. 4. $V_{DDmin}$ of TBFF is high, because the outputs of the two tri-state buffers are wired-ORed and the contention between the tri-state buffers induces functional errors [5]. Eliminating the wired-OR in TBFF is required to reduce $V_{DDmin}$, and a classical NAND latch based flip-flop (NLFF) shown in Fig. 5 (b) is one of the candidates. The number of transistors in NLFF, however, is 67% larger than in TBFF, and the corresponding area penalty is not acceptable.

To solve these problems, this paper proposes an area-efficient CLFF, shown in Fig. 5 (c), which achieves low $V_{DDmin}$ by eliminating the contention. CLFF has master-slave latches, which are implemented with NOR-type and NAND-type 2:1 multiplexers. When $CK=0$, the master latch accepts the input data ($D$) and the slave latch retains the data of the previous cycle. In contrast, when $CK=1$, the master latch retains the latest data and the slave latch accepts it. CLFF has a smaller area than NLFF, because the eight 2NAND gates in NLFF are replaced with six 2NOR/2NAND gates in CLFF.

However, reducing the number of gates creates a racing problem, which is explained by the timing chart of the master latch of CLFF in Fig. 5 (c). After the rising edge of $CK$, the inverted data on DB is written to the feedback node (FB). Therefore, $t_{DB}$ should be larger than $t_{FB}$, because a write-error occurs if $t_{DB} < t_{FB}$. In an ideal case without delay variations, $t_{DB}$ is larger than $t_{FB}$, because $t_{DB} - t_{FB}$ equals the delay of an inverter. In reality, however, with the large transistor delay variations at low $V_{DD}$, $t_{DB}$ may become smaller than $t_{FB}$, and a write-error occurs. To increase $t_{DB} - t_{FB}$, either $t_{CK2}$ or $t_{DELAY}$ should be increased. Increasing $t_{CK2}$ is not a good choice, because it decreases the hold margin ($t_{H}$). Therefore, $t_{DELAY}$ is increased to alleviate the racing problem by adding an nMOS (N1), shown in gray in Fig. 5 (c), to the 2NOR, which increases the propagation delay from rising $CK2$ to falling $DB$.
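
The effect of a larger $t_{DELAY}$ on the write-error condition $t_{DB} < t_{FB}$ can be illustrated with a toy Monte Carlo sketch. This is not the SPICE setup used in this paper: the time units, the Gaussian variation, and the parameters `extra_delay` and `sigma` are all assumptions chosen for brevity.

```python
import random


def write_error_rate(extra_delay, sigma=0.5, trials=20000, seed=1):
    """Estimate P(t_DB < t_FB) for a toy delay model (arbitrary time units).

    Nominally t_DB - t_FB equals one inverter delay (1.0 here) plus
    extra_delay, the additional t_DELAY from a stacked transistor such as N1.
    Variation is modelled as zero-mean Gaussian noise on that margin (assumed).
    """
    rng = random.Random(seed)  # fixed seed: identical noise samples on every call
    errors = 0
    for _ in range(trials):
        margin = 1.0 + extra_delay + rng.gauss(0.0, sigma)
        if margin < 0.0:  # t_DB < t_FB, so the latch captures the wrong value
            errors += 1
    return errors / trials


without_n1 = write_error_rate(extra_delay=0.0)
with_n1 = write_error_rate(extra_delay=0.5)
print(without_n1, with_n1)
```

Because both calls reuse the same noise samples, the estimate with the added delay can never exceed the one without it, which mirrors the qualitative argument for adding N1.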

In the same way, a pMOS (P1) is added in the slave latch. N1 and P1 do not increase the switching power of CLFF, because they are normally-ON transistors. As a result, compared with TBFF, the number of transistors, the total gate width, and the area of the proposed CLFF are 1.4x, 3.9x, and 2.8x, respectively. Compared with NLFF, the number of transistors in CLFF is reduced by 15%.

To compare $V_{\text{DDmin}}$ between TBFF and CLFF, Fig. 6 shows the simulated $V_{\text{DD(FF)}}$ dependence of the error probability of a single FF, obtained from 3000 Monte Carlo SPICE simulations with random $V_{\text{TH}}$ variations.

Compared with TBFF, $V_{\text{DDmin}}$ of CLFF is reduced by 100mV and 210mV at the error probabilities of 0.02 (corresponding to the 50 FF’s included in the IU in Fig. 2) and $10^{-4}$ (corresponding to 10k FF’s), respectively, which indicates that the $V_{\text{DDmin}}$ reduction is more effective in larger-scale digital circuits. At the same total gate width, $V_{\text{DDmin}}$ of CLFF is lower than that of TBFF, which shows the superiority of the circuit topology of CLFF over TBFF. In addition, $V_{\text{DDmin}}$ of the proposed CLFF (with the added transistors N1 and P1) is 40mV lower at the error probability of $10^{-4}$ than $V_{\text{DDmin}}$ of the CLFF without N1 and P1, which implies that N1 and P1 mitigate the racing problem and improve the robustness to transistor variations.
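
The correspondence between the per-FF error probability and the number of FF’s used above can be checked with a short sketch. It assumes that the quoted probabilities mean one expected failing FF among $N$ FF’s (that is, $p = 1/N$) and that FF failures are independent. Both are assumptions of this illustration, and the function names are hypothetical.

```python
def per_ff_error_target(num_ffs):
    """Per-FF error probability at which one failing FF is expected among num_ffs."""
    return 1.0 / num_ffs


def prob_any_ff_fails(p_ff, num_ffs):
    """Probability that at least one of num_ffs independent FF's fails."""
    return 1.0 - (1.0 - p_ff) ** num_ffs


print(per_ff_error_target(50))       # the 50 FF's of the IU
print(per_ff_error_target(10_000))   # a 10k-FF design
print(prob_any_ff_fails(0.02, 50))
```

Under these assumptions, a design with more FF’s must be evaluated at a lower per-FF error probability, which is where the simulated $V_{\text{DDmin}}$ gap between TBFF and CLFF is reported to be larger.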

III. MEASUREMENT RESULTS

Fig. 7 shows the measured die-to-die $V_{\text{DDmin}}$ distributions of IU’s with TBFF and CLFF, derived as shown in Fig. 4 (a). The average $V_{\text{DDmin}}$ of the IU’s with TBFF and CLFF is 444mV and 380mV, respectively. Replacing TBFF with CLFF in the IU therefore reduces the average $V_{\text{DDmin}}$ by 64mV.

Fig. 8 illustrates the measured $V_{\text{DD}}$ dependence of maximum clock frequency of IU’s with TBFF and CLFF, and measured die-to-die maximum clock frequency distributions of IU’s with TBFF and CLFF at 1.0V and 0.5V. Fig. 8 indicates that CLFF has no speed penalty over TBFF.

Figure 6. Simulated $V_{\text{DD(FF)}}$ dependence of error probability of single FF. In order to compare $V_{\text{DDmin}}$ among various flip-flops, 3000 times Monte Carlo SPICE simulations were performed with random $V_{\text{TH}}$ variations. $V_{\text{DDmin}}$ of the proposed CLFF is improved compared with $V_{\text{DDmin}}$ of the CLFF without N1 and P1, since adding N1 and P1 alleviates the racing problem.

Figure 7. Measured die-to-die $V_{\text{DDmin}}$ distributions of IU’s with TBFF and CLFF derived as shown in Fig. 4 (a).

Figure 8. Measured $V_{\text{DD}}$ dependence of maximum clock frequency of IU’s with TBFF and CLFF, and measured die-to-die maximum clock frequency distributions of IU’s with TBFF and CLFF at 1.0V and 0.5V.

Figure 9. Measured $V_{\text{DDLOGIC}}$ dependence of maximum clock frequency, total power, and leakage power of IU with CLFF. $V_{\text{DDmin}}$ of CLFF is 340mV.

Fig. 9 shows the measured $V_{DD\text{LOGIC}}$ dependence of the maximum clock frequency, the total power, and the leakage power of the IU with CLFF. The total power dissipation was obtained while the add operation was performed with random input patterns.

Fig. 10 shows the measured $V_{DD\text{LOGIC}}$ dependence of the breakdown of the total power and the leakage power of the IU with CLFF. The IU consists of the combinational logic and the FF’s, as shown in Fig. 2, and $V_{DDmin}$ of CLFF is 340mV. Fig. 10 indicates that the power dissipation of the combinational logic is larger than that of the FF’s, and that the leakage power of the FF’s increases due to the voltage difference between the combinational logic and the FF’s when $V_{DD\text{LOGIC}}$ is less than 340mV. The total leakage power shown in Fig. 9, however, decreases as $V_{DD\text{LOGIC}}$ is lowered. Therefore, the leakage power of the FF’s due to the separated $V_{DD}$ is not a serious problem.

TABLE I. COMPARISON WITH THE PUBLISHED SUB/NEAR-THRESHOLD LOGIC CIRCUITS.

<table>
<thead>
<tr>
<th>Reference</th>
<th>CMOS technology</th>
<th>$V_{DD}$ Nominal (V)</th>
<th>$V_{DD}$ Min. energy (mV)</th>
<th>$V_{DD}$ Min. functional (mV)</th>
<th>Energy efficiency improvement factor (EEIF)*</th>
<th>Circuit type</th>
</tr>
</thead>
<tbody>
<tr>
<td>[1]</td>
<td>65nm</td>
<td>1.2</td>
<td>320</td>
<td>230</td>
<td>9.6x</td>
<td>Motion estimation accelerator</td>
</tr>
<tr>
<td>[2]</td>
<td>65nm</td>
<td>1.2</td>
<td>400</td>
<td>N.A.</td>
<td>8.3x</td>
<td>DCT &amp; quantization</td>
</tr>
<tr>
<td>[3]</td>
<td>45nm</td>
<td>1.1</td>
<td>300</td>
<td>230</td>
<td>8x</td>
<td>SIMD vector processing accelerator</td>
</tr>
<tr>
<td>[4]</td>
<td>32nm</td>
<td>1.0</td>
<td>340</td>
<td>260</td>
<td>5.7x</td>
<td>Re-configurable arrays</td>
</tr>
<tr>
<td>This work</td>
<td>65nm</td>
<td>1.2</td>
<td>310</td>
<td>220</td>
<td>12.7x (**)</td>
<td>Integer unit</td>
</tr>
</tbody>
</table>

(*) Energy efficiency improvement factor (EEIF) = (Energy efficiency @ Min. Energy $V_{DD}$) / (Energy efficiency @ Nominal $V_{DD}$)

(**) EEIF based on 1.2-V IU with proposed CLFF

(***) EEIF based on 1.2-V IU with conventional TBFF

IV. CONCLUSION

In this paper, contention-less flip-flops (CLFF’s) and separated $V_{DD}$ between FF’s and combinational logics were proposed. The proposed technologies were applied to a 16-bit integer unit (IU) for media processing. Two types of IU’s, one with the conventional FF’s and one with the proposed CLFF’s, were fabricated in a 65-nm CMOS process. Measurement results revealed that the proposed CLFF reduces the average $V_{DDmin}$ of IU’s by 64mV compared with the conventional FF’s, and that a maximum energy efficiency of 1835 GOPS/W (1.7µW and 3.2MHz) was achieved at $V_{DDLOGIC}=310$mV and $V_{DDFF}=340$mV by combining the proposed CLFF and the separated $V_{DD}$. Consequently, the IU with CLFF accomplished the highest energy efficiency increase of 12.7 times.
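
As a rough consistency check on the figures quoted above, the following sketch recomputes the energy efficiency from the rounded power and frequency values. It assumes one operation per clock cycle, which is not stated explicitly in the text, and the function name is hypothetical. The nominal-voltage efficiency is only the value implied by dividing 1835 GOPS/W by the EEIF of 12.7, not a separately reported measurement.

```python
def gops_per_watt(clock_hz, power_w, ops_per_cycle=1):
    """Energy efficiency in GOPS/W, assuming ops_per_cycle operations per clock."""
    return clock_hz * ops_per_cycle / power_w / 1e9


# Rounded operating point quoted in the conclusion: 3.2 MHz at 1.7 uW.
rounded = gops_per_watt(3.2e6, 1.7e-6)

# Efficiency at the nominal 1.2 V implied by the reported EEIF of 12.7.
implied_nominal = 1835 / 12.7

print(round(rounded), round(implied_nominal, 1))
```

The rounded inputs give a value within a few percent of the reported 1835 GOPS/W, a gap that is consistent with the rounding of 1.7µW and 3.2MHz.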

ACKNOWLEDGMENTS

This work was carried out as a part of the Extremely Low Power (ELP) project supported by the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO).

REFERENCES


