### Power Reduction Techniques in Clock Distribution Networks with Emphasis on LC Resonant Clocking

Seyed Ebrahim Esmaeili

A Thesis
in the Department of Electrical and Computer Engineering

Presented in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy at Concordia University, Montreal, Quebec, Canada

July 2011

© Seyed Ebrahim Esmaeili, 2011

#### CONCORDIA UNIVERSITY SCHOOL OF GRADUATE STUDIES

This is to certify that the thesis prepared

By: Seyed Ebrahim Esmaeili

Entitled: Power Reduction Techniques in Clock Distribution Networks with Emphasis on LC Resonant Clocking

and submitted in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY (Electrical & Computer Engineering)

complies with the regulations of the University and meets the accepted standards with respect to originality and quality.

Signed by the final examining committee:

| Role                                                           | Name                                      |
|----------------------------------------------------------------|-------------------------------------------|
| Chair                                                          | Dr. G. Gouw                               |
| External Examiner                                              | Dr. E.G. Friedman                         |
| External to Program                                            | Dr. A. Youssef                            |
| Examiner                                                       | Dr. R. Raut                               |
| Examiner                                                       | Dr. C. Wang                               |
| Thesis Co-Supervisor                                           | Dr. A. Al-Khalili                         |
| Thesis Co-Supervisor                                           | Dr. G. Cowan                              |
| Approved by (Chair of Department or Graduate Program Director) | Dr. M. Kahrizi, Graduate Program Director |
| Date                                                           | 1, 2011                                   |
| Dean, Faculty of Engineering & Computer Science                | Dr. Robin A.L. Drew                       |

#### ABSTRACT

# Power Reduction Techniques in Clock Distribution Networks with Emphasis on LC Resonant Clocking

Seyed Ebrahim Esmaeili, Ph.D. Concordia University, 2011

In this thesis, we propose a set of independent techniques within the overall concept of LC resonant clocking, each of which reduces power consumption and improves system performance.

Low-power design is becoming a crucial design objective due to the growing demand for portable applications and the increasing difficulty of cooling and heat removal. The clock distribution network delivers the clock signal, which acts as a timing reference for all sequential elements in a synchronous system, and it consumes a considerable share of the power in synchronous digital systems. Resonant clocking is an emerging and promising technique for reducing the power of the clock network. The inductor used in resonant clocking allows the electric energy stored on the clock capacitance to be converted to magnetic energy in the inductor and vice versa, so that part of the clock energy is recycled each cycle instead of being dissipated.

In this thesis, the concept of slack in the clock skew is extended to an LC fully-resonant clock distribution network. The extra slack, relative to standard clock distribution networks, can be used to reduce routing complexity, wire elongation, total wire length, and power consumption. Simulation results show that the proposed approach achieves an average reduction of 53% in the number of wire elongations and 11% in total wire length.

A dual-edge clocking scheme introduced in the literature to enable flip-flop operation at both the rising and falling edges of the clock has been modified. The interval during which the charging elements in the flip-flop are switched on was shortened, which reduces power consumption. Simulation of the flip-flop in STMicroelectronics 90-nm technology shows correct functionality of the Sense Amplifier flip-flop with a resonant clock signal of 500 MHz and a throughput of 1 GHz under process, voltage, and temperature (PVT) variations. Modeling the resonant system with the proposed flip-flop shows that dual-edge triggering, compared to single-edge triggering, can achieve up to 58% reduction in power consumption when the clock capacitance is the dominating factor.

The application of low-swing clocking to LC resonant clock distribution networks has been investigated on-chip. The proposed low-swing resonant clocking scheme operates from a single supply voltage and does not require an additional one. The Differential Conditional Capturing flip-flop introduced in the literature was modified to operate with a low-swing sinusoidal clock. Low-swing resonant clocking achieved around 5.8% reduction in total power with 5.7% area overhead. Modeling the clock network with the proposed flip-flop shows that low-swing clocking can achieve up to 58% reduction in the power consumption of the resonant clock.

An analytical approach was introduced to estimate the required driver strength in the clock generator. Using the proposed approach early in the design stage reduces area and power overhead by eliminating the need for programmable switches in the driving circuit.

### Acknowledgments

I would like to express my sincere and profound gratitude to my supervisors, Prof. Asim Al-Khalili and Dr. Glenn Cowan, for their guidance and generous support. I was truly blessed to work under their supervision. This work would never have been done without their valuable advice, criticism, experience, and insight.

I want to thank my thesis defense committee members: Dr. Eby Friedman, Dr. Amr Youssef, Dr. Rabin Raut, and Dr. Chunyan Wang for their helpful comments and suggestions. Special thanks go to my external examiner, Dr. Eby Friedman, for the time he took to read my dissertation. His comments and insightful suggestions helped me enhance this work significantly.

I am grateful to Prof. Yousef Shayan of the ECE Department at Concordia University for his permission to use the equipment in the Wireless Lab for chip testing. Special thanks to Prof. Mojtaba Kahrizi and his Ph.D. student Svetlana Spitsina for taking chip images using the microscope in the Nano Lab. Also, thanks to Tadeusz Obuchowicz for his prompt feedback and help in solving some of the problems faced with Cadence, even during weekends, and especially for his kind help in running the TSMC 90-nm kit before the tape-out deadline. I would also like to thank Dave Chu from the ECE workshop for lending me the "T" fitting for BNC connectors. Thanks to the ECE staff members, Pamela, Diane, Connie, Lyne, Tatyana, and Kimberly, for all their kindness and help. I am also grateful to CMC, especially Hsu L. Ho, Jim J. Quinn, Sarah J. Neville, Kathryn K. Campbell, and Mariusz Jarosz, for their help in facilitating chip fabrication and for loaning us the Agilent 81130 pattern generator. I thank my friend Ali Farhangi for his cooperation with me on the skew compensation scheme, specifically in writing the C++ code for the MDME Algorithm. I am also grateful to all of my friends in the office, in the VLSI group, and my roommates.

To my mother, I cannot really express what I feel. I know that these years were harder on you than they were on me. Thank you for your kindness, support, and for your prayers. I am grateful to my parents, sister, and two brothers for their support during my long academic journey.

I am indebted to my wife Hanan Al-Hashemi, who had the patience to wait for me during these long years and the courage to take care of our children Hashem and Yasmin in my absence, and who made me laugh even at the saddest of moments. I thank Hashem for being the man of the house and I apologize to Yasmin for missing her first word and her first step. I promise that I will do my best to make it up to all of you, especially Yasmina.

I would like to thank Canada, this friendly and beautiful country who welcomed me with open arms. Thank you for restoring my belief in myself, humanity, and equality, basic principles that I had lost faith in before coming here.

Last but not least, I would like to use this last paragraph in remembering my uncle Is'haq who passed away in Bahrain during my Ph.D. and I was not able to attend his funeral. I still cannot believe that you left us and my eyes always fill up with tears whenever I think of you.

To everybody who knew me, helped me, or wished me luck, thank you.

## Dedication

To my beloved parents, wife, son, daughter.

### Contents

- List of Figures
- List of Tables
- List of Acronyms
- Chapter 1 Introduction
  - 1.1 Motivation
  - 1.2 Contributions
  - 1.3 Dissertation Overview
- Chapter 2 Background
  - 2.1 Clock Distribution Network Design Objectives
    - 2.1.1 Clock Skew
    - 2.1.2 Clock Jitter
    - 2.1.3 Clock Power
  - 2.2 Clock Distribution Network Structure
  - 2.3 Resonant Clocking Techniques
    - 2.3.1 Standing-Wave Resonant Clocking
    - 2.3.2 Rotary Traveling-Wave Resonant Clocking
    - 2.3.3 LC Resonant Clocking
      - 2.3.3.1 LC Globally-Resonant Locally-Square Clock Distribution Networks
      - 2.3.3.2 LC Fully-Resonant Clock Distribution Networks
  - 2.4 Challenges Associated with LC Resonant Clocking
    - 2.4.1 Dependency of the Sinusoidal Clock Rise Time on Its Frequency
    - 2.4.2 Area Occupied by the Inductor
    - 2.4.3 Clock Gating
  - 2.5 Sinusoidally Clocked Flip-Flops
    - 2.5.1 Differential Conditional Capturing Flip-Flop (DCCFF)
    - 2.5.2 Single-Ended Conditional Capturing Flip-Flop (SCCFF)
  - 2.6 Conclusion
- Chapter 3 Skew Compensation in LC Resonant Clock Distribution Networks
  - 3.1 Lower Skew Bounds for the Proposed Technique
  - 3.2 Skew Compensation in Short and Long Delay Paths
  - 3.3 New Modified Deferred Merge Embedding (DME) Algorithm
  - 3.4 Simulation Results
    - 3.4.1 Matched Delay Values for the SCCFF
    - 3.4.2 Comparing Data, Clock, and Flip-Flop Power Consumption for the Fast, Standard, and Slow Versions of the SCCFF
    - 3.4.3 Effects of Process, Supply Voltage, and Temperature Variation on Flip-Flop Speed
    - 3.4.4 Clock Tree Construction Using the New Compensation Technique
  - 3.5 Conclusion
- Chapter 4 Dual-Edge Triggered Sense Amplifier Flip-Flop for LC Resonant Clock Distribution Networks
  - 4.1 Introduction
  - 4.2 Dual-Edge Triggered Dynamic Logic
  - 4.3 Dual-Edge Sense Amplifier Flip-Flop (DE-SAFF)
  - 4.4 Timing Characterization of Dual-Edge Triggering
  - 4.5 Simulation Results
    - 4.5.1 Dual-Edge Flip-Flop Response at Positive and Negative Clock Edges, Schematic vs. Post Layout Simulation
    - 4.5.2 Effects of Process, Supply Voltage, and Temperature (PVT) Variations on the Generated Precharge and Evaluation Intervals
      - 4.5.2.1 Corner Analysis
      - 4.5.2.2 Supply Voltage
      - 4.5.2.3 Temperature Variation
      - 4.5.2.4 Extreme Case
    - 4.5.3 Sharing the Inverter Chain
    - 4.5.4 Comparing the DE-SAFF to Other Flip-Flops
    - 4.5.5 Potential Power Savings Achievable Through Dual-Edge Clocking
  - 4.6 Conclusion
- Chapter 5 Application of Low-Swing Clocking to LC Resonant Clock Distribution Networks
  - 5.1 Introduction
  - 5.2 Low-Swing LC Resonant Clocking
    - 5.2.1 Low-Swing Differential Conditional Capturing Flip-Flop (LS-DCCFF)
    - 5.2.2 Delay Associated with Low-Swing LC Resonant Clocking
    - 5.2.3 Power
  - 5.3 Test Chip
  - 5.4 Test Chip Extracted Simulation and Measurements
  - 5.5 Conclusion
- Chapter 6 Estimating Required Driver Strength in the LC Resonant Clock Generator
  - 6.1 Introduction
  - 6.2 Estimating Required Driver Strength
  - 6.3 Simulation Results
  - 6.4 Conclusion
- Chapter 7 Conclusion
  - 7.1 Summary and Contributions
  - 7.2 Future Work
- References
- Appendix A Multiply-Accumulate (MAC) Unit Design
  - A.1 Multiply-Accumulate (MAC) Unit Design
    - A.1.1 Serial-In Parallel-Out Shift Register
    - A.1.2 Parallel-In Serial-Out Shift Register
  - A.2 Test Chip
    - A.2.1 Pad Description
    - A.2.2 Chip Packaging and Test Fixture

# **List of Figures**

- Figure 2.1: Schematic of a 3-D clock tree [12]
- Figure 2.2: Common structures of clock distribution networks [8]
- Figure 2.3: Tree-driven grid global clock distribution [13]
- Figure 2.4: Standing-wave clock distribution network [16]
- Figure 2.5: Clock buffer simulated performance [16]
- Figure 2.6: Basic rotary clock architecture. The "=" signs denote points with equal phase [19]
- Figure 2.7: Custom rotary clock architecture [19]
- Figure 2.8: Simplified square-wave, globally-, and fully-resonant CDNs
- Figure 2.9: Globally-resonant locally-square clock distribution networks [23]
- Figure 2.10: Distributed differential oscillator (DDO) global clock network [25]
- Figure 2.11: Rise time of resonant and square-wave clock signal with rise time of 33 ps
- Figure 2.12: Spiral inductor with magnetic ring structure [30]
- Figure 2.13: Differential Conditional Capturing Flip-Flop (DCCFF)
- Figure 2.14: Single-Ended Conditional Capturing Flip-Flop (SCCFF)
- Figure 3.1: Effect of long rising time of the sinusoidal clock signal on the operating speed of the flip-flop; (b) Flip-flop $T_{DQ}$ vs. $T_{DCLK}$
- Figure 3.2: Effect of short rising time of the square clock signal on the operating speed of the flip-flop
- Figure 3.3: Using one version of the flip-flop
- Figure 3.4: Using three versions of the flip-flop; (b) Flip-flop output with respect to skewed clock
- Figure 3.5: Matched delay for short and long delay paths
- Figure 3.6: Sequentially adjacent flip-flops: (a) Positive skew, (b) Negative skew
- Figure 3.7: Tuning a merging segment by changing the flip-flop type in left or right tree
- Figure 3.8: Modified DME: (a) Clock tree topology, (b) Determining flip-flop type based on minimum wire length merging
- Figure 3.9: Pseudo code for modified DME algorithm
- Figure 3.10: $T_{DQ}$ vs. $T_{setup}$ ($T_{DCLK}$)
- Figure 3.11: Slow, standard and fast flip-flop response with respect to a square clock signal
- Figure 3.12: Slow, standard and fast flip-flop response with respect to a sinusoidal signal
- Figure 3.13: Pseudorandom Sequence
- Figure 3.14: Power measurement circuit
- Figure 3.15: Monte-Carlo simulation: (a) $T_{DQ}$ with a square clock, (b) $T_{DQ}$ with a sinusoidal clock
- Figure 3.16: Temperature effect on the $T_{DQ}$ delay
- Figure 4.1: Circuits used to enable precharging and evaluation at both clock transitions
- Figure 4.2: Clocking scheme used to enable dual-edge triggering in CMOS dynamic logic: (a) Precharging circuit, (b) Evaluation circuit
- Figure 4.3: Single and dual-edge triggered dynamic CMOS logic: (a) Single-edge, (b) Dual-edge
- Figure 4.4: Single-Edge Sense Amplifier Flip-Flop (SE-SAFF)
- Figure 4.5: Dual-Edge Sense Amplifier Flip-Flop (DE-SAFF)
- Figure 4.6: Dual-edge triggering timing diagram
- Figure 4.7: $T_{DQ}$ delay vs. $T_{setup}$ ($T_{DCLK}$), schematic simulation
- Figure 4.8: $T_{DQ}$ delay vs. $T_{setup\_time}$ ($T_{DCLK}$), post-layout simulation
- Figure 4.9: Effect of supply voltage variation on TP2
- Figure 4.10: Temperature variation effect on precharging and evaluation intervals
- Figure 4.11: Dual-edge triggered flip-flop output
- Figure 4.12: Dual-edge clocking percentage reduction in power
- Figure 5.1: Low-Swing Differential Conditional Capturing Flip-Flop (LS-DCCFF)
- Figure 5.2: Delay between the low- and full-swing resonant clock signals to reach $V_{pull\_down}$
- Figure 5.3: Modification to enable full- and low-swing flip-flop clocking
- Figure 5.4: Simplified floorplan of the test chip
- Figure 5.5: Die photograph of the test chip
- Figure 5.6: Measurement waveforms of the LS-DCCFF at 100 MHz
- Figure 5.7: $T_{DQ}$ delay versus setup time for the full- and low-swing flip-flops
- Figure 5.8: $T_{DQ}$ for the full- and low-swing flip-flops at the same setup time, extracted simulation
- Figure 5.9: $T_{DQ}$ for the full- and low-swing flip-flops at a negative setup time of -1,10, extracted simulation
- Figure 5.10: Percentage reduction in power for the resonant clock network achievable through low-swing clocking
- Figure 6.1: Relative power savings as a function of driver transistor width (w) and reference signal pulse (d) [54]
- Figure 6.2: Clock generator with programmable delay [55]
- Figure 6.3: LC resonant clock generator
- Figure 6.4: Generated sinusoidal clock signal
- Figure 6.5: PMOS drain current during the application of Vref_P
- Figure 6.6: $I_D$ vs. $V_{DS}$ for PMOS with $W_p$ = 120 nm, $L$ = 100 nm
- Figure 7.1: Globally integrated power and clock (GIPAC) distribution network [57]
- Figure A.1: MAC unit
- Figure A.2: Serial-in parallel-out shift register
- Figure A.3: Parallel-in serial-out shift register
- Figure A.4: Bonding diagram for the CFP80 package
- Figure A.5: RF CFP80TF test fixture [58]

## **List of Tables**

- Table 3.1 Matched delay (ps) for a clock period of 2 ns
- Table 3.2 Power consumption (µW)
- Table 3.3 Supply voltage effect on matched delay values
- Table 3.4 Comparison of MMM-DME and the new modified DME using the proposed skew compensation technique
- Table 4.1 Timing characteristics of the DE-SAFF, post layout simulation
- Table 4.2 DE-SAFF precharge and evaluation intervals obtained for different corners
- Table 4.3 Combinational logic delay obtained for each corner at positive and negative clock edges
- Table 4.4 Supply voltage effect on precharging and evaluation intervals
- Table 4.5 Timing characteristics of the DE-SDFF, DE-DCCFF, and DE-SAFF at a clock frequency of 250 MHz
- Table 4.6 Timing characteristics of the CD-SAFF, AC-SAFF, and DE-SAFF at a clock frequency of 500 MHz
- Table 5.1 Area and power comparison between full- and low-swing clocking
- Table 6.1 Estimated driver strength at different pulse widths (*PW*) for $C_{clk}$ = 30 pF, $R_{clk}$ = 0.5 $\Omega$, $f$ = 1 GHz, $V_{DD}$ = 1 V, $V_{OH}$ = 0.95 V, and $V_{OL}$ = 0.05 V
- Table A.1 Pad name, type, and description

# List of Acronyms

| Acronym  | Definition                                                |
|----------|-----------------------------------------------------------|
| AC-SAFF  | Adaptive Clocking Dual-edge Sense Amplifier Flip-Flop     |
| ASIC     | Application Specific Integrated Circuit                   |
| CDN      | Clock Distribution Network                                |
| CD-SAFF  | Conditional Capturing Dual-edge Sense Amplifier Flip-Flop |
| CMC      | Canadian Microelectronics Corporation                     |
| DCCFF    | Differential Conditional Capturing Flip-Flop              |
| DDO      | Distributed Differential Oscillator                       |
| DE-DCCFF | Dual-Edge Differential Conditional Capturing Flip-Flop    |
| DE-SAFF  | Dual-Edge Sense Amplifier Flip-Flop                       |
| DE-SDFF  | Dual-Edge Static Differential Flip-Flop                   |
| DME      | Deferred Merge Embedding                                  |
| FF       | Fast-Fast                                                 |
| FO4      | Fan Out of Four                                           |
| FPGA     | Field Programmable Gate Array                             |
| FS       | Fast-Slow                                                 |
| FS-DCCFF | Full-Swing Differential Conditional Capturing Flip-Flop   |
| GIPAC    | Globally Integrated Power and Clock                       |
| HVT      | High Threshold Voltage                                    |
| LS-DCCFF | Low-Swing Differential Conditional Capturing Flip-Flop    |
| LSDFF    | Low-Swing Clock Double-Edge Flip-Flop                     |
| LVT      | Low Threshold Voltage                                     |
| MAC      | Multiply and Accumulate                                   |
| MMM      | Method of Means and Medians                               |
| PDN      | Pull-Down Network                                         |
| PLL      | Phase-Locked Loop                                         |
| PUN      | Pull-Up Network                                           |
| PVT      | Process, Voltage, and Temperature                         |
| PW       | Pulse Width                                               |
| SAFF     | Sense Amplifier Flip-Flop                                 |
| SCCFF    | Single-Ended Conditional Capturing Flip-Flop              |
| SE-SAFF  | Single-Edge Sense Amplifier Flip-Flop                     |
| SF       | Slow-Fast                                                 |
| SS       | Slow-Slow                                                 |
| SVT      | Standard Threshold Voltage                                |
| SWO      | Standing-Wave Oscillators                                 |
| TT       | Typical-Typical                                           |
| ZST      | Zero Skew Clock Tree                                      |

## Chapter 1 Introduction

#### **1.1 Motivation**

Microprocessor power consumption is increasing by approximately 20% per year [1]. In deep sub-micron technology, the substantial increase in power leads to additional difficulties in cooling and heat removal [2]. Furthermore, low-power design is becoming a crucial objective due to the increasing demand for portable applications [3]. Approximately 30-50% of microprocessor power consumption is dissipated in the clock distribution network (CDN), which has the highest capacitance in the system and operates at high frequencies [4].

An attractive approach to reducing power is to scale down the supply voltage, which has a quadratic effect on power consumption. However, scaling down the supply voltage would require decreasing the transistor threshold voltage in order to maintain transistor driving capability. This leads to a substantial increase in leakage power. In addition, decreasing the supply voltage would increase system susceptibility to variations [3]. As a result, there is an increasing demand for power reduction schemes that do not require a reduction in the supply voltage.

An emerging technique to reduce the power of the CDN is resonant clocking, where low energy dissipation is achieved by recycling the energy stored on the clock capacitance [5]. Of the three resonant clocking techniques offered to date (standing-wave, rotary traveling-wave, and LC resonant clocking), LC resonant clocking has proven to be the most convenient, since it requires minimum change from conventional square-wave design and its practicality has been verified on functional chips.

Clock skew and jitter in buffered clock distribution networks are proportional to clock latency, which is increasing relative to the clock cycle in recent microprocessors. In addition to their low power consumption, resonant clocking techniques provide phase stability and low jitter due to resonance [6].

The traditional approach for LC resonant CDNs is to use the LC tank to drive the global clock distribution while the local clock is being delivered through conventional clock buffers. However, around 66% of clock power is being dissipated in the last buffer stage driving the flip-flops [7] leading to minor power savings in LC globally-resonant locally-square CDNs. In order to achieve maximum power savings, the LC tank should drive the entire clock network (both global and local) without using intermediate buffers (see Figure 2.8 in Chapter 2).

Power reduction techniques for LC resonant CDNs in which the entire network including the flip-flops is being driven with a resonant (sinusoidal) clock signal will be the focus of this dissertation. The schemes and techniques developed in this thesis are applicable to both square and resonant CDNs.

#### **1.2 Contributions**

Given that the bulk of the CDN capacitance is in its leaves, the largest power advantage will come by extending the LC resonance down to the flip-flops. This would require understanding flip-flop performance with the sinusoidal characteristic of the clock signal generated in LC resonant networks. We have followed a similar approach to the one proposed in [7] in which the clock buffers are removed to allow the clock energy to resonate between the inductor and the clock capacitance enabling maximum power savings.

Our goal is to further reduce the power of an LC fully-resonant CDN through manipulating and modifying the characteristics of flip-flops under a sinusoidal clock signal.

• We have introduced a new type of slack in the skew that can be compensated for to reduce the CDN routing complexity. As a byproduct or substitute, reductions in wire elongation, total wire length, and power consumption can be achieved. The slack in the skew can also be used for incremental routing adjustments. The concept itself is applicable to both sine-wave resonant and conventional square-wave clocking if flip-flops of different delays are used. However, in our demonstration of the proposed technique, we took advantage of the slow rise time of the sinusoidal resonant clock signal and the different transistor threshold voltage levels available in the STMicroelectronics 90-nm technology to generate different flip-flop delays by separate means. In order to further illustrate the concept, five clock tree benchmarks with nominal zero skew have been constructed using a Modified Deferred Merge Embedding Algorithm that takes advantage of the skew slack introduced by the new technique to highlight its benefits and practicality.

• We have also introduced a new Dual-Edge Sense Amplifier Flip-Flop (DE-SAFF) for LC fully-resonant CDNs using a modified clocking scheme that can be extended to enable dual-edge clocking in any dynamic CMOS logic circuit. In this work, the PMOS transistors used for precharging the nodes in the flip-flop are only switched on for a portion of the clock cycle in order to reduce short-circuit power. Correct operation of the proposed flip-flop was verified on the extracted circuit layout in STMicroelectronics 90-nm technology under a sinusoidal clock at a frequency of 500 MHz.

• The application of low-swing clocking on LC fully-resonant CDNs is investigated. The Differential Conditional Capturing Flip-Flop (DCCFF) was modified to operate with a low-swing sinusoidal clock. The proposed low-swing resonant clocking scheme operates with one voltage supply and does not require additional supply voltage. The feasibility of low-swing resonant clocking and the power advantages were investigated on-chip in TSMC 90-nm technology.

• Though the main concentration of the dissertation is at the flip-flop level, estimating the power savings achievable through dual-edge and low-swing resonant clocking required estimating the power of the clock driver. In doing so, an analytical approach was introduced to estimate the required driver strength in the resonant clock generator. Using the proposed approach early in the design stage eliminates the need for programmable switches in the driving circuit, thus reducing area and power overhead.

Although all proposed schemes have been simulated and tested under a sinusoidal clock signal assuming a fully-resonant LC CDN, the proposed techniques are equally applicable to conventional square-wave CDNs.

#### **1.3 Dissertation Overview**

This dissertation is organized as follows. Chapter 2 introduces the main objectives and metrics of CDN design. Different resonant clocking techniques introduced in the literature are reviewed and their advantages in reducing clock skew, jitter, and power are shown. The remainder of Chapter 2 is devoted to discussing LC resonant CDNs, with a concentration on the LC fully-resonant CDN. Sinusoidally clocked flip-flops are also presented at the end of that chapter. The skew compensation technique for LC resonant CDNs is discussed in Chapter 3. Lower skew bounds for the proposed technique, the Modified Deferred Merge Embedding Algorithm, and the results obtained on five benchmark CDNs are then presented. In Chapter 4, the new dual-edge clocking scheme, the Dual-Edge Sense Amplifier Flip-Flop (DE-SAFF), timing characterization for the proposed scheme, and potential power savings achievable through dual-edge clocking are described. Chapter 5 presents the Low-Swing Differential Conditional Capturing Flip-Flop (LS-DCCFF), modified to operate with a low-swing sinusoidal clock. Measured results from the test chip are also reviewed. The analytical approach used to estimate the required driver strength in the clock generator is introduced in Chapter 6. Conclusions drawn from this work and ideas for future extension of this thesis are presented in Chapter 7.

## Chapter 2 Background

This chapter introduces the main design objectives for the clock distribution network. Different clock architectures are presented. Different resonant clocking schemes that have been proposed in the literature are reviewed and the promise in reducing clock skew, jitter, and power is shown. LC-based resonant clock distribution networks are examined in more detail. Challenges associated with LC resonant clocking are identified and sinusoidally clocked flip-flops are then discussed.

#### 2.1 Clock Distribution Network Design Objectives

CDNs in synchronous digital integrated circuits deliver the clock signal that controls the flow of data within the system. The input at each clock sink, i.e., flip-flop, is captured at the rising or falling clock edge (single-edge triggered flip-flops), or on both edges of the clock (dual-edge triggered flip-flops), or based on the voltage level of the clock (latches). The main objectives in the design of CDNs are to minimize skew, jitter, and power.

#### 2.1.1 Clock Skew

Clock skew is defined as the difference in the arrival time of the clock edges at different locations in the CDN. Skew is mainly caused by variations between clock buffers, interconnect widths, and loading at different clock paths. The main cause of skew in a balanced, well-designed CDN is the clock buffers [8]. It should be noted that skew is only relevant between sequentially adjacent flip-flops. Since it is highly unlikely for a signal to propagate across the entire chip in one clock cycle, the skew between different parts of the chip is not important. However, due to the complexity of controlling skew in complicated and condensed clock paths, any skew is undesirable.

#### 2.1.2 Clock Jitter

Clock jitter is defined as the variation, from one cycle to another, in the arrival time of the clock edge at the same location in the CDN. Jitter can make the clock period shorter or longer than the nominal period. Jitter is mainly caused by temperature variation, power supply noise, and the phase-locked loop (PLL). Since the design of PLLs has improved, the main source of jitter in today's microprocessors is the CDN [9].
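The skew and jitter definitions above can be illustrated with a minimal numerical sketch. The arrival times and edge timestamps are made-up values chosen only for illustration, and the function names `clock_skew` and `period_jitter` are hypothetical. Jitter is taken here as the largest deviation of a measured period from the nominal one, which is only one of several possible measures.

```python
def clock_skew(arrival_times_ps):
    """Spread of the arrival times of the same clock edge across different sinks."""
    return max(arrival_times_ps) - min(arrival_times_ps)


def period_jitter(edge_times_ps, nominal_period_ps):
    """Largest deviation of any measured period from the nominal period at one sink."""
    periods = [later - earlier
               for earlier, later in zip(edge_times_ps, edge_times_ps[1:])]
    return max(abs(p - nominal_period_ps) for p in periods)


# Same clock edge observed at four sinks (hypothetical values, in ps).
arrivals = [102.0, 98.5, 101.0, 99.5]
skew = clock_skew(arrivals)

# Successive rising edges observed at a single sink (hypothetical values, in ps).
edges = [0.0, 1000.0, 2004.0, 2999.0, 4000.0]
jitter = period_jitter(edges, nominal_period_ps=1000.0)

print(f"skew = {skew} ps, period jitter = {jitter} ps")
```

In a full analysis only the skew between sequentially adjacent flip-flops would be evaluated, as noted in Section 2.1.1; the sketch takes the spread over all listed sinks for simplicity.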

#### 2.1.3 Clock Power

Low-power design is becoming a crucial design objective due to the increasing demand for portable applications and the increased cost of cooling. In the 200 MHz Alpha 21064 microprocessor, 40% of the power is dissipated in the CDN [10]. The CDN and latches dissipate around 70% of the power of the 1.3 GHz IBM POWER4 microprocessor [11].

The latest developments in integrated circuit design, specifically in 3-D integration where multi-plane synchronization is required, lead us to believe that the power consumption of the CDN will remain at these high levels. Figure 2.1 shows the schematic of a 3-D clock tree. The clock driver, as illustrated, is on the second plane [12].



Figure 2.1: Schematic of a 3-D clock tree [12]

#### 2.2 Clock Distribution Network Structure



Figure 2.2: Common structures of clock distribution networks [8]

Various clock distribution structures have been developed, given that the routing area and complexity, speed, and power dissipation of the system are all affected by the clock network design. Figure 2.2 illustrates common CDN structures. An asymmetric buffered tree structure is shown in Figure 2.2(a). In this structure, the wire delay as well as the buffer delay is balanced in each path in order to achieve zero skew at the clock leaves. When the clock sinks are uniformly distributed, a symmetrical tree structure is used, such as the H- and X-tree structures shown in Figure 2.2(b), (c). Although the balanced trees shown in the figure are not buffered, buffers are usually inserted to drive different sections of the tree. Under ideal conditions, properly matched buffers, interconnect delays, and loading capacitances in clock trees can achieve zero skew. In reality, however, some skew will certainly be present due to variations in interconnect parameters as well as mismatches in clock buffers.

A clock grid or mesh (Figure 2.2(d)) is another alternative for distributing the clock signal. The mesh actively reduces skew by connecting path resistances in parallel [8]. In the mesh structure, the skew is independent of unbalanced distribution of loading. However, unlike clock trees, the mesh structure uses more wiring resources and consumes more power.

The clock signal in modern microprocessors is distributed using a hierarchical approach in which a global distribution delivers the clock signal across the chip and a local distribution carries it to sequential elements.

In IBM microprocessors [13], the global clock distribution consists of a tree-driven grid (Figure 2.3). The clock signal is distributed across the chip using a balanced clock tree while a global grid is used to short the clock tree ends together.


Figure 2.3: Tree-driven grid global clock distribution [13]

This scheme combines the low wiring resources of the balanced clock tree with the load-independent minimum skew of the grid. The local clock is distributed by additional levels of buffers, which deliver the clock signal to the circuits.

#### 2.3 Resonant Clocking Techniques

Resonant clocking reduces power dissipation in CDNs while enabling the generation of high frequency clock signals. There are three resonant clocking techniques offered to date [14]. The first one is the standing wave oscillation which generates a clock signal with varying amplitude and constant phase [15]. The second technique is the travelling rotary-wave oscillation which generates a clock signal with constant amplitude and varying phase making it suitable for non-zero clock skew systems. The third technique is the LC oscillation which generates a clock signal with constant amplitude and phase and requires minimum change from conventional clock design [14].

#### 2.3.1 Standing-Wave Resonant Clocking

The interaction of two identical waves of equal magnitude and frequency propagating in opposite directions forms a standing-wave [15]:

$$V_f(x,t) = V_A \sin(\omega t - \beta x) \tag{2.1}$$

$$V_r(x,t) = V_B \sin(\omega t + \beta x) \tag{2.2}$$

where $V_f$ and $V_r$ are the forward and reverse travelling waves, $V_A$ and $V_B$ represent the amplitudes, $\omega$ is the angular frequency, $t$ is the time, $\beta$ represents the phase constant, and $x$ represents the position. Setting $V_A = V_B$ and adding the two waves at location $x$ and time $t$ results in a standing wave [15]:

$$V(x,t) = V_A(\sin(\omega t - \beta x) + \sin(\omega t + \beta x)) = 2V_A \sin(\omega t) \cos(\beta x) \tag{2.3}$$

Equation (2.3) illustrates that the phase of an ideal standing wave is independent of position, while its amplitude varies sinusoidally with position [15]. Within any region in which the sign of $\cos(\beta x)$ or $\sin(\omega t)$ does not change, the phase of the standing wave remains constant. This is not the case in traditional travelling-wave clocking, where the phase changes linearly with position. With travelling-wave clock signals, the delay of the propagating signal from the clock source to the sinks must therefore be balanced in order to lower the skew.
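As a quick numerical check of Equation (2.3), the sketch below sums the two travelling waves of Equations (2.1) and (2.2) and compares the result with the closed-form standing wave. The 10 GHz frequency matches the example cited from [16]. The amplitude and the 12 mm wavelength are assumptions chosen only to give round numbers, and the function names are hypothetical.

```python
import math

V_A = 1.0                     # amplitude of both waves (assumed, in volts)
OMEGA = 2 * math.pi * 10e9    # angular frequency of a 10 GHz clock
BETA = 2 * math.pi / 0.012    # phase constant for an assumed 12 mm wavelength


def travelling_sum(x, t):
    """Sum of the forward and reverse waves, Equations (2.1) and (2.2)."""
    forward = V_A * math.sin(OMEGA * t - BETA * x)
    reverse = V_A * math.sin(OMEGA * t + BETA * x)
    return forward + reverse


def closed_form(x, t):
    """Right-hand side of Equation (2.3)."""
    return 2 * V_A * math.sin(OMEGA * t) * math.cos(BETA * x)


def envelope(x):
    """Position-dependent peak amplitude of the standing wave."""
    return 2 * V_A * abs(math.cos(BETA * x))


positions_m = [0.0, 0.001, 0.002, 0.004]
times_s = [0.0, 10e-12, 25e-12, 60e-12]
max_error = max(abs(travelling_sum(x, t) - closed_form(x, t))
                for x in positions_m for t in times_s)

print(f"largest mismatch between both forms: {max_error:.2e} V")
for x in positions_m:
    print(f"x = {x * 1e3:.0f} mm: peak amplitude = {envelope(x):.3f} V")
```

At `t = 0` the sum is zero at every position, which reflects the position-independent phase, while the printed envelope shows the amplitude changing along the line.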

In [16], on-chip design and operation of a 10-GHz global standing-wave clock distribution network using coupled oscillators is described. As illustrated in Figure 2.4, standing-wave oscillators (SWO) are coupled together to create a grid of standing waves.



Figure 2.4: Standing-wave clock distribution network [16]



Figure 2.5: Clock buffer simulated performance [16]

Due to the large losses of on-chip interconnects, a clock buffer is used to convert the low-swing standing-wave clock signal to digital levels to enable the generation of a conventional digital clock at the clock sinks (Figure 2.5).

#### 2.3.2 Rotary Traveling-Wave Resonant Clocking



Figure 2.6: Basic rotary clock architecture. The "=" signs denote points with equal phase [19]



Figure 2.7: Custom rotary clock architecture [19]

Another resonant clocking technique is the rotary traveling-wave CDN. In [17], average power savings of 63% were reported for a rotary traveling-wave clock network compared to a conventional clock tree in a microprocessor design.

Transmission line rings are used to distribute the clock signal. A rotating multiphase (360°) square wave within a closed-loop differential transmission line is driven by distributed anti-parallel CMOS inverter pairs, as shown in Figure 2.6. Line losses are overcome and phase lock is achieved by the anti-parallel inverters [18]. The traveling wave is generated by turning ON the power supply. Afterwards, at least one set of anti-parallel inverters is needed to overcome interconnect loss and maintain resonance. The clock frequency is determined by the length of the transmission line ring:

$$f_{osc} = \frac{1}{2\sqrt{L_T C_T}} \tag{2.4}$$

where  $L_T$  and  $C_T$  are the total inductance and capacitance along the rotary signal path in the ring [19].
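Equation (2.4) can be evaluated directly, as in the short sketch below. The inductance and capacitance values are hypothetical and chosen only to give a round result; they are not taken from [19]. The function name `rotary_frequency_hz` is likewise an assumption.

```python
import math


def rotary_frequency_hz(total_inductance_h, total_capacitance_f):
    """Oscillation frequency of a rotary ring according to Equation (2.4)."""
    return 1.0 / (2.0 * math.sqrt(total_inductance_h * total_capacitance_f))


# Hypothetical ring totals: 2 nH and 20 pF.
f_osc = rotary_frequency_hz(2e-9, 20e-12)
print(f"f_osc = {f_osc / 1e9:.2f} GHz")

# A longer ring has larger L_T and C_T; doubling both halves the frequency.
f_long = rotary_frequency_hz(4e-9, 40e-12)
print(f"f_osc for the doubled ring = {f_long / 1e9:.2f} GHz")
```

Since both totals grow with ring length, the sketch also shows why the ring length sets the clock frequency.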

Unlike the case in standing-wave resonant clocking where the clock signal has a sinusoidally varying amplitude, the amplitude of the rotary traveling-wave clock is constant. However, the phase of the rotary traveling clock changes with position.

As illustrated in Figure 2.6, the waveforms in the two signal lines at any point of the loop are 180° out of phase. To implement a zero skew synchronized circuit using rotary clocking, all synchronous elements need to be connected to the same location in the loop [14]. However, in [20], the feasibility of using rotary clocking as a zero clock skew synchronizing technology was proven.

In [19], a custom rotary clock router was introduced. Rotary oscillations can be sustained for non-regular custom structures like the one shown in Figure 2.7. Though regular rectangular rings have advantages in terms of manufacturability, non-regular custom rings can easily be used in multi-core implementations where an independent core is synchronized by each ring. Custom ring topologies also reduce the tapping (connecting) wire lengths between the oscillatory ring and the registers.

In [19, 20], rotary traveling-wave CDNs with regular and custom structures were used to implement zero skew circuits by taking advantage of the delay associated with the tapping wire used to connect the register to the ring at a particular tapping point with known phase. For phase information, a point along the ring is chosen as a reference point with clock delay $t = 0$ and phase $\theta = 0$. The clock delay $t$ and phase $\theta$ at any point on the ring can be obtained using $\frac{\theta}{360} = \frac{t}{T}$, where $T$ is the clock period. All synchronous elements can connect to the ring at specific nodes with known phase, called tapping points. Each tapping point has two locations, one on the inner line and the other on the outer line of the ring, separated by 180°. The delay between registers connected to different tapping points is equalized by manipulating the tapping wire length and hence the delay associated with it.
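The phase-to-delay relation and the idea of equalizing arrival times through the tapping wire can be sketched as follows. This is a simplified illustration and not the routing method of [19, 20]: the clock period, the tap phases, and the helper names (`phase_to_delay`, `complementary_tap_delay`, `tapping_wire_delay`) are all assumptions.

```python
def phase_to_delay(theta_deg, period):
    """Clock delay at a ring point with phase theta, from theta/360 = t/T."""
    return (theta_deg % 360.0) / 360.0 * period


def complementary_tap_delay(theta_deg, period):
    """Delay at the same tapping location on the other line, 180 degrees away."""
    return phase_to_delay(theta_deg + 180.0, period)


def tapping_wire_delay(target_delay, theta_deg, period):
    """Wire delay needed so the clock reaches a register at target_delay (mod T)."""
    return (target_delay - phase_to_delay(theta_deg, period)) % period


T_PS = 1000.0                      # hypothetical 1 GHz clock, period in ps
tap_phases_deg = [90.0, 180.0]     # hypothetical tapping points
target_ps = 500.0                  # common arrival time chosen for both registers

for theta in tap_phases_deg:
    wire_ps = tapping_wire_delay(target_ps, theta, T_PS)
    print(f"tap at {theta:.0f} deg: ring delay {phase_to_delay(theta, T_PS):.0f} ps, "
          f"wire delay needed {wire_ps:.0f} ps")
```

In practice the wire delay is set through the tapping wire length, so a delay budget like the one printed here would still have to be translated into routed wire lengths.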

#### 2.3.3 LC Resonant Clocking

Figure 2.8(a) illustrates a simplified balanced clock tree structure for a conventional square-wave based CDN. Buffers are used to drive different sections of the tree. As shown in the figure, the clock signal propagating in the global tree to the lower branches, which in turn feed the local clock and flip-flops, is a square-wave clock signal. A globally-resonant locally-square CDN is shown in Figure 2.8(b). An inductor is connected at the center of the H-tree in order to generate a resonating clock signal at the fundamental frequency of the clock node. Lower branches of the H-tree are driven by buffers, which in turn convert the sinusoidal clock to a square-wave signal feeding the flip-flops. Design guidelines and methodology for the globally-resonant H-tree are presented in [21]. The buffers of the globally-resonant locally-square CDN are removed in Figure 2.8(c), where the resonant clock signal is distributed all the way down to the flip-flop level.

Figure 2.8: Simplified Square-wave, globally-, and fully-resonant CDNs

In addition to their low power consumption, LC resonant CDNs have the advantage of generating a clock signal with uniform phase and amplitude. They also require minimum change from conventional square-wave design. In the following, globally-resonant locally-square CDNs and fully-resonant CDNs are discussed.

#### 2.3.3.1 LC Globally-Resonant Locally-Square Clock Distribution Networks

In [22], the design of a resonant global CDN was introduced. In this approach, the traditional tree-driven grids are augmented with on-chip inductors to resonate the clock capacitance at the fundamental frequency of the clock node, as shown in Figure 2.9. The energy resonates between electrical form in the clock capacitance and magnetic form in the inductor. As shown in the figure, the clock driver, which consists of a buffer chain, is at the center of the H-tree. One end of each spiral inductor is connected to the clock tree and the other end is connected to a large decoupling capacitance. This capacitance provides the dc voltage around which the clock oscillates.
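To give a feel for the inductance needed to resonate a clock load, the sketch below applies the standard ideal LC tank relation $f_0 = 1/(2\pi\sqrt{LC})$. This is a first-order assumption rather than the design procedure of [22]: it neglects the decoupling capacitance, wire resistance, and other parasitics. The 1.1 GHz frequency is the one quoted for the simulations in [22], while the 100 pF clock capacitance and the function names are hypothetical.

```python
import math


def lc_resonant_frequency_hz(inductance_h, capacitance_f):
    """Resonant frequency of an ideal LC tank."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def inductance_for_resonance_h(target_hz, capacitance_f):
    """Inductance that resonates a given capacitance at target_hz (ideal tank)."""
    return 1.0 / ((2.0 * math.pi * target_hz) ** 2 * capacitance_f)


clock_capacitance_f = 100e-12   # hypothetical clock load
target_hz = 1.1e9               # clock frequency used in the simulations of [22]

inductance_h = inductance_for_resonance_h(target_hz, clock_capacitance_f)
check_hz = lc_resonant_frequency_hz(inductance_h, clock_capacitance_f)

print(f"required inductance = {inductance_h * 1e9:.3f} nH")
print(f"resonance check     = {check_hz / 1e9:.3f} GHz")
```

With several inductors distributed across the grid, each one would resonate only its share of the total capacitance, so the per-inductor value would be correspondingly larger than this single-tank estimate.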

Resonance results in a reduced effective capacitance of the clock grid which in turn reduces clock latency. It also allows a reduction in the driving strength as well as the number of buffer stages required to drive the grid. This reduction in number of buffers leads to improvements in clock skew and jitter since the effect of power-supply noise on these buffers is reduced.

Simulation results using model extraction at a frequency of 1.1 GHz have shown a power reduction of over 80% as well as improved clock latency [22].

The resonant global clock distribution proposed in [22] was fabricated on chip using 90 nm 1 V ten-level Cu CMOS technology [23], [24], and 0.18  $\mu$ m 1.8 V six-level Al mixed-signal CMOS technology, where on-chip measurements showed approximately 20% of the energy being recovered and reused in each cycle. In addition, the ability to significantly scale down the required buffers in the global clock distribution allows total power savings of about 80%.

The natural band-pass characteristics of the resonant network, along with buffer reduction, result in over 60% improvement in jitter [24]. A drawback of the resonant global clock distribution proposed in [22]-[24] is the requirement for a large on-chip decoupling capacitance to serve as a charge reservoir.




(a) Global clock distribution with a resonant load – eight clock sectors



(b) Components and topology of a resonant clock sector

Figure 2.9: Globally-resonant locally-square clock distribution networks [24]



Figure 2.10: Distributed differential oscillator (DDO) global clock network [26]

Another resonant clock design with a distributed differential oscillator (DDO) global clock network is presented in Figure 2.10 [25], [26]. Here, the distribution is differential where spiral inductors and negative differential transconductors are placed between two clock phases. The negative differential transconductor acts as a gain element to maintain oscillation and overcome losses. Clock amplitude is controlled by the bias current in the gain element. The distribution network is injection-locked to an external reference. In this approach the need for large decoupling capacitors is eliminated. In addition, jitter and skew caused by process variation, power-supply noise, and common-mode noise sources are reduced due to differential detection at local clock buffers.

The practicality of globally resonant clocking has been proven on an LC resonant clock in a fully-functional Cell Broadband Engine processor [27]. Hardware measurements show full functionality at 3.2 GHz and power savings of 25% in the global clock and 5% in total chip power at 4 GHz. It should be noted that in [27], only the global clock tree was modified to enable resonant clocking where an additional metal layer was added on top of the conventional tree to attach the inductors and decoupling capacitors. The local clock sectors were buffered; hence the clock signal feeding the registers is a square signal and not a sinusoidal one.

#### 2.3.3.2 LC Fully-Resonant Clock Distribution Networks

In all of the resonant clocking techniques presented so far, the local clock signal feeding the flip-flops was a square-wave signal. In resonant clocking, the largest power advantage is achieved by extending the resonance all the way down to the flip-flop level given that the bulk of the capacitance is in the leaves of the clock tree [24]. In this approach, the clock buffers are removed to allow the clock energy to resonate between the inductor and the clock capacitance (Figure 2.8(c)). However, this would require more understanding of flip-flop performance with the sinusoidal clock characteristics of the LC fully-resonant CDNs.

In [7], a fully-resonant CDN was fabricated in an IBM 0.13 µm process. Though the target design frequency was in the gigahertz range using integrated inductors, external inductors were used instead due to startup difficulties and the chip was operational at the megahertz range. Test results show approximately 35% power savings compared to a conventional buffered CDN.

Two 64×64 pipelined multipliers were fabricated on-chip in TSMC 0.25-µm CMOS process with LC fully-resonant CDN [28]. One was designed with resonantly clocked flip-flops and the other with conventional square-wave clocked flip-flops. Overall power savings of 25%-69% in the resonantly clocked multiplier were measured depending on data switching activity.

A two-phase fully-resonant LC CDN was used in ultra-low-power hearing aid applications. An experimental test chip with more than 2,500 resonantly clocked latches was fabricated and tested in a 0.25 µm process. Results show that, compared to a single-edge-triggered one-phase benchmark, resonant clocking dissipates 7.5% less energy [29].

#### 2.4 Challenges Associated with LC Resonant Clocking

Despite the promising power savings achieved in resonant CDNs, resonant clocking presents several design challenges: the dependency of the clock rise time on its frequency, the susceptibility to process variation due to the long rise time of the clock, the need for different inductor values to generate different frequencies, the additional chip area occupied by the inductor, and the difficulty of clock gating without affecting energy recovery.

#### 2.4.1 Dependency of the Sinusoidal Clock Rise Time on Its Frequency

Figure 2.11: Rise time of resonant and square-wave clock signal with rise time of 33ps

The rise time of the conventional square-wave clock signal does not depend on clock frequency and is restricted to less than 10-15% of the clock period [30]. However, this is not the case in resonant clocking, where the rise time of the sinusoidal clock signal depends on its frequency. Let the generated resonant clock signal be given by the following equation:

$$v(t) = \frac{1}{2} V_{DD} \sin(2\pi f t) + \frac{1}{2} V_{DD} \tag{2.5}$$

Taking the rise time of the clock signal as the time difference between the 10% and 90% points of the clock swing, the rise time of the sinusoidal clock signal is given by the following equation:

$$T_{rise} = 0.29T \tag{2.6}$$

where T is the clock period.

Equation (2.6) illustrates the dependency of the rise time of the sinusoidal clock signal on its frequency. It also shows that the clock rise time does not depend on its amplitude. Figure 2.11 illustrates the difference in rise time for the resonant clock signal as compared to a square wave clock with a constant rise time of 33.33 ps (10% of clock period at 3 GHz).
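The coefficient in Equation (2.6) can be reproduced numerically from Equation (2.5), as sketched below. Normalizing $v(t)$ by $V_{DD}$, the 10% and 90% levels correspond to $\sin(2\pi f t) = -0.8$ and $+0.8$, so the amplitude cancels out. The 3 GHz comparison uses the figures quoted in the text; the function name `sine_rise_time_fraction` is hypothetical.

```python
import math


def sine_rise_time_fraction(low=0.1, high=0.9):
    """Rise time of v(t) = 0.5*VDD*sin(2*pi*f*t) + 0.5*VDD as a fraction of T.

    A normalized level v/VDD is reached when sin(phase) = 2*level - 1, so the
    result does not depend on VDD.
    """
    phase_low = math.asin(2.0 * low - 1.0)
    phase_high = math.asin(2.0 * high - 1.0)
    return (phase_high - phase_low) / (2.0 * math.pi)


fraction = sine_rise_time_fraction()

clock_hz = 3e9
period_ps = 1e12 / clock_hz
sine_rise_ps = fraction * period_ps
square_rise_ps = 0.10 * period_ps   # square-wave rise time fixed at 10% of T

print(f"sinusoidal rise time = {fraction:.3f} T = {sine_rise_ps:.1f} ps at 3 GHz")
print(f"square-wave rise time = {square_rise_ps:.2f} ps at 3 GHz")
print(f"ratio = {sine_rise_ps / square_rise_ps:.2f}")
```

The computed fraction is about 0.295, in line with the 0.29T of Equation (2.6), and the ratio to a square-wave clock with a 0.1T rise time is just under 3.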

#### 2.4.2 Area Occupied by the Inductor



Figure 2.12: Spiral inductor with magnetic ring structure [31]

As discussed in Section 2.3.3.1, LC resonant CDNs require several inductors to be integrated and distributed across the chip. The difficulty of on-chip inductor integration and the large area occupied by the inductors complicate the design of LC resonant CDNs and limit their applications.

A promising technique to solve this problem is the use of magnetic inductors (Figure 2.12) in LC resonant clocking. Magnetic inductors are compatible with CMOS process and occupy nearly 100× less area compared to conventional inductors. They can achieve for example inductance values of up to 4 nH and a quality-factor of 3 at 1 GHz [31]. Using magnetic inductors in LC resonant CDN reduces area overhead associated with distributed inductors.

#### 2.4.3 Clock Gating

Clock gating in conventional square-wave CDNs is achieved by using logic gates (NAND/NOR) with an ENABLE signal that controls the clock feeding a specific sector. However, this approach is not desirable in LC resonant CDNs since it would reduce the energy being recovered from the remaining capacitance of that sector. In [28], a clock gating scheme was proposed for LC resonant CDNs by adding a NOR gate with an ENABLE signal at the clock input of every resonantly clocked flip-flop. Simulation results show that clock gating would reduce the power consumption of the flip-flop by more than 1000× in the idle mode compared to the power consumed without clock gating for 50% data switching activity [28]. However, the extra routing resources and complexity of connecting the ENABLE signal to the input of each flip-flop, as well as the associated power overhead, were neglected.

#### 2.5 Sinusoidally Clocked Flip-Flops

LC resonant clocking needs the least modification from traditional CDN design due to the constant phase and amplitude of the generated clock signal. In addition, LC resonant clocking is, to date, the most developed and practical resonant clocking technique.

The clocking scheme adopted and assumed in this dissertation from here on is the fully-resonant LC scheme. The clock signal feeding the flip-flops is assumed to be purely sinusoidal, since extending the resonance down to the flip-flop level results in the greatest power savings, as discussed previously.

The long rise time of the sinusoidal clock signal, compared to that of the square clock where the rise time is restricted to around 10-20% of the clock period, affects the flip-flop speed, power, and susceptibility to variations. In [32], the performance and power of six flip-flops in a 130-nm process were analyzed under square and sinusoidal clocking at an operating frequency of 1 GHz. Simulation results show that the dominating effect is an increased flip-flop delay of 20-30% for a sinusoidal clock with a rise time $10\times$ slower than that of the conventional clock. It is also illustrated that as the frequency increases from 1 to 3 GHz, the difference in rise time between the sinusoidal and square clocks is reduced, improving flip-flop performance.

In [33], a study was conducted on the effect of clock slope on the energy and performance of fifteen flip-flops covering the pulsed, differential, and dual-edge triggered flip-flop classes in 65-nm CMOS technology. A smoother clock slope, i.e., a longer rise time, increases flip-flop power due to the increase in short-circuit current between the pull-up and pull-down networks. Furthermore, as occurs in any CMOS logic circuit, a longer rise time results in increased delay. Post-layout simulation illustrates that the increase in flip-flop delay as the clock rise time becomes $6\times$ longer is modest, at less than 5.5%. Results also show that the flip-flop setup and hold times have low sensitivity to clock slope. Furthermore, flip-flops with negative setup time experience a more negative setup time with a smoother clock slope, since the transparency window expands with the longer fall time of the smoother falling clock edge. The flip-flop power increases by no more than 70% as the clock rise time becomes $6\times$ longer. It should be noted that the increase in flip-flop power due to the long rise time of the sinusoidal clock compared to a square-wave clock with a 0.1T rise time would be much less than 70%, since the ratio between the rise times is 2.9.

Any flip-flop can operate with either a square or a sinusoidal clock, since a sine wave can be considered a square wave with longer rise and fall times. In the following, two flip-flops that were proposed in the literature as energy-recovery flip-flops operating with a sinusoidal clock are briefly described: the Differential Conditional Capturing Flip-Flop (DCCFF) and the Single-Ended Conditional Capturing Flip-Flop (SCCFF) [5]. The Sense Amplifier Flip-Flop (SAFF) will be presented in Chapter 4.

# 2.5.1 Differential Conditional Capturing Flip-Flop (DCCFF)



Figure 2.13: Differential Conditional Capturing Flip-Flop (DCCFF)

The DCCFF is shown in Figure 2.13. Conditional capturing is used to minimize flip-flop power at low data switching activities by eliminating redundant internal transitions. The DCCFF operates in a precharge and evaluate fashion. Pull-up PMOS transistors are used for charging nodes *SET* and *RESET*. The effect of charge sharing can be reduced by ensuring a constant path to $V_{DD}$. This is done by properly sizing the PMOS transistors. A short evaluation interval occurs after the rising edge of the clock when both the clock and inverted clock signals applied to transistors MN1/MN2 are above the threshold voltage level of the NMOS transistor. The DCCFF uses a NAND latch for storage. Using feedback from the output to control transistors MN3 and MN4 in the evaluation paths ensures conditional capturing. Therefore, if the state of the input data does not change, *SET* and *RESET* are not discharged.





Figure 2.14: Single-Ended Conditional Capturing Flip-Flop (SCCFF)

Figure 2.14 presents the SCCFF. The SCCFF is a single-ended version of the DCCFF. Transistor MN3, controlled by the output *QB*, provides conditional capturing. The right-hand-side evaluation path is static and does not require conditional capturing. If input *D* was low and then goes high, node *QB* would still be high from the previous state, thus pulling the gate of the PMOS transistor to ground and turning it on. The input node of the cross-coupled inverters will be pulled up to $V_{DD}$, their output *QB* becomes low, and *Q* becomes high. If the state of the input *D* remains the same, *QB* remains low, transistor MN3 is turned off, and no discharging occurs since there is no path to ground.

### **2.6 Conclusion**

Reducing clock skew, jitter, and power are the main design objectives in CDNs. Although the main objective of resonant clocking techniques is to reduce clock power, they also enable a reduction in clock skew and jitter. LC resonant clocking is still the most suitable low-power clocking scheme: it generates a clock signal with constant phase and amplitude and requires minimal change from conventional square-wave clock design. LC resonant clocking does, however, still present several design challenges associated with the long rise time of the sinusoidal clock, the area occupied by the inductor, and clock gating.

# Chapter 3 Skew Compensation in LC Resonant Clock Distribution Networks

In this chapter a new approach for skew compensation in LC fully-resonant CDNs is introduced by manipulating the operating speed of the flip-flops. The STMicroelectronics 90-nm technology allows the use of devices with different threshold voltages, namely: HVT (High threshold voltage), SVT (Standard threshold voltage), and LVT (Low threshold voltage). Three types of flip-flops of equal input load: "fast", "standard", and "slow" are used. Timing parameters of the flip-flops are adjusted by manipulating the switching threshold of the clock port of the flip-flops. A fast/slow flip-flop has a shorter/longer  $T_{DQ}$  delay, compared to a standard flip-flop for the same setup time  $(T_{DCLK})$ . Distributing flip-flops according to their delay requirements would reduce the effect of the clock skew on the outputs of sequentially adjacent flip-flops. Due to the slow rise time of the sinusoidal clock signal generated in LC resonant CDNs compared to the conventional square-wave clock, the skew that can be compensated for in LC resonant CDNs using this approach would be much higher than in square-wave CDNs. This approach increases the skew bounds required by algorithms to balance the skew in the clock tree leading to reduced design complexity.

Theoretical analysis and simulation results using STMicroelectronics 90-nm technology at a clock frequency of 500 MHz show that this approach is feasible and effective: a skew of up to 6.2% of the clock period can be compensated for in the example used. In addition, constructing clock trees on five benchmarks with a new modified Deferred Merge Embedding (DME) algorithm that uses the skew slack provided by the proposed technique has shown that the technique enables an average reduction of 11.5% in total wire length and a 53.2% reduction in the number of wire elongations. To illustrate the proposed methodology, we have used the Elmore delay model with a selected sinusoidally clocked flip-flop to verify the practicality of the proposed scheme. The method can generally be applied to resonant or square-wave clocking if different flip-flops of various speeds are used.

### **3.1 Lower Skew Bounds for the Proposed Technique**

Figures 3.1 and 3.2 illustrate the effect of the slow rise time of the sinusoidal clock signal generated in an LC fully-resonant CDN on the speed of the flip-flops. The data-to-output delay ($T_{DQ}$) versus the data-to-clock delay ($T_{DCLK}$) is plotted for the slow, standard, and fast flip-flops and compared with the case of a square-wave clock signal with a short rise time. As shown in these figures, the longer time required for the sinusoidal signal to go from one threshold voltage level to the next is reflected in a larger difference between the operating speeds of the flip-flops. The opposite is true for the square wave: the short time required for the square signal to go from one threshold voltage level to the next causes only small differences in the flip-flops' operating speeds. Here $V_{th}$ refers to the threshold voltage of the NMOS transistors triggered by the clock. The evaluation phase in each flip-flop begins when the clock voltage exceeds the threshold voltage. The fast, standard, and slow flip-flops are defined as follows:



Figure 3.1: Effect of long rise time of the sinusoidal clock signal on the operating speed of the flip-flop



(a) Clock short rise time (b) Flip-flop $T_{DQ}$ vs. $T_{DCLK}$

Figure 3.2: Effect of short rise time of the square clock signal on the operating speed of the flip-flop

- 1- Standard flip-flop: the threshold voltage of the NMOS devices triggered by the clock signal in the standard flip-flop is the standard threshold voltage for the technology ($V_{th\_standard}$ = 0.24 V).
- 2- Fast flip-flop: for the same setup time ($T_{DCLK}$), a fast flip-flop will have a shorter $T_{DQ}$ delay than the standard and slow versions of the flip-flop. This is because the threshold voltage of the NMOS devices triggered by the clock signal in the fast flip-flop is lower ($V_{th\_low}$ = 0.18 V) than the threshold voltages of the same devices in the standard and slow flip-flops.
- 3- Slow flip-flop: for the same setup time ($T_{DCLK}$), a slow flip-flop will have a longer $T_{DQ}$ delay than the standard and fast versions of the flip-flop. This is because the threshold voltage of the NMOS devices triggered by the clock signal in the slow flip-flop is higher ($V_{th\_high}$ = 0.32 V) than the threshold voltages of the same devices in the standard and fast flip-flops.

Note that all the transistors in the three versions of the flip-flop have the same size and present equal load to the CDN. Only the threshold voltage of the NMOS transistors connected to the clock signal changes from one flip-flop version to the other.

Because the speed of the flip-flops is highly affected by the threshold voltages of the NMOS devices connected to the clock signal, and in order to ensure that no two versions can have the same speed of operation, the following two constraints must be fulfilled:

$$V_{th\_standard\_max} < V_{th\_high\_min} \tag{3.1}$$

$$V_{th\_low\_max} < V_{th\_standard\_min} \tag{3.2}$$

where  $V_{th\_standard\_max}$  refers to the maximum value of the standard threshold voltage,  $V_{th\_high\_min}$  refers to the minimum value of the high threshold voltage,  $V_{th\_low\_max}$  refers to the maximum value of the low threshold voltage, and  $V_{th\_standard\_min}$  refers to the minimum value of the standard threshold voltage of the NMOS devices used in each version of the flip-flop due to process and environmental effects.
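As an illustration of constraints 3.1 and 3.2, the following sketch checks whether the three threshold-voltage ranges stay separated. The nominal values are those given in the text; the symmetric relative variation applied to every threshold is a hypothetical model of process and environmental spread, not a figure from the thesis.

```python
# Nominal NMOS threshold voltages in volts, as stated in the text.
NOMINAL_VTH = {"low": 0.18, "standard": 0.24, "high": 0.32}


def vth_range(nominal, rel_variation):
    """Return (min, max) threshold voltage for a symmetric relative variation.

    The symmetric +/- model is an assumption made for this sketch only.
    """
    return nominal * (1 - rel_variation), nominal * (1 + rel_variation)


def constraints_hold(rel_variation):
    """Check Equations 3.1 and 3.2 for the given relative variation."""
    _, low_max = vth_range(NOMINAL_VTH["low"], rel_variation)
    std_min, std_max = vth_range(NOMINAL_VTH["standard"], rel_variation)
    high_min, _ = vth_range(NOMINAL_VTH["high"], rel_variation)
    eq_3_1 = std_max < high_min   # V_th_standard_max < V_th_high_min
    eq_3_2 = low_max < std_min    # V_th_low_max < V_th_standard_min
    return eq_3_1 and eq_3_2


for variation in (0.10, 0.20):
    print(f"+/-{variation:.0%} variation: constraints hold = {constraints_hold(variation)}")
```

With this simple model the ranges remain disjoint at a 10% spread but overlap at 20%, in which case two flip-flop versions could no longer be guaranteed to differ in speed.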

In the following, timing definitions, flip-flop equations, and lower skew bounds for implementing this technique are presented. Let $T_{DCLK}$ be given by:

$$T_{DCLK} = T_D - T_{CLK} \tag{3.3}$$

The time difference between the arrival of data and the edge of the clock due to the skew ($T_{sk}$) affects the $T_{DCLK}$ time for each flip-flop. For a clock signal lagging by $T_{sk}$, the $\overline{T_{DCLK}}$ of the flip-flop would be:

$$\overline{T_{DCLK}} = T_{DCLK} + T_{sk} = T_D - T_{CLK} + T_{sk} + \Delta skew \tag{3.4}$$

and for a clock signal leading by $T_{sk}$:

$$\overline{T_{DCLK}} = T_{DCLK} - T_{sk} = T_D - T_{CLK} - T_{sk} + \Delta skew \tag{3.5}$$

The bar on the $T_{DCLK}$ shown in Equations 3.4 and 3.5 is used to distinguish between the nominal value of $T_{DCLK}$ with a zero skewed clock signal and its value with a lagging or a leading clock signal. It should be noted that the deviation from the skew value $T_{sk}$ due to process and environmental variations ($\Delta skew$) is assumed to be small compared to $T_{sk}$ and is neglected to simplify the analysis.



(a) $T_{DQ}$ vs. $T_{DCLK}$

(b) Flip-flop output with respect to skewed clock

Figure 3.3: Using one version of the flip-flop

Figures 3.3(a) and (b) illustrate the effect of different $T_{DCLK}$ on the difference between $T_{DQ}$ delays for a single flip-flop. Figures 3.4(a) and (b) illustrate the effect of different $T_{DCLK}$ on the difference between $T_{DQ}$ delays for flip-flops with different operating speeds. Note that Figures 3.3 and 3.4 have the same time scale. Assume that the equations for the $T_{DQ}$ lines for the slow, standard, and fast flip-flops are given by:

$$T_{DQ\_SLOW}(T_{DCLK}) = B_{SL} + m \times T_{DCLK} \tag{3.6}$$

$$T_{DQ\_STANDARD}(T_{DCLK}) = B_{ST} + m \times T_{DCLK} \tag{3.7}$$

$$T_{DQ\_FAST}(T_{DCLK}) = B_{FA} + m \times T_{DCLK} \tag{3.8}$$

where *m* is the slope of the $T_{DQ}(T_{DCLK})$ line, and $B_{SL}$, $B_{ST}$, and $B_{FA}$ denote the intercepts on the $T_{DQ}$ axis for the slow, standard, and fast flip-flops, respectively. It should be noted that all the lines in Equations 3.6 to 3.8 are assumed to be parallel, with the same slope *m*. Note also that the equation for $T_{DQ\_STANDARD}$ describes the line in Figure 3.3(a). Equations 3.6 to 3.8 present the relationship between $T_{DQ}$ and $T_{DCLK}$ in the linear operating region of the flip-flop.

As shown in Figure 3.3(a) and (b), the change in the $T_{DQ}$ delay of a single standard flip-flop due to a leading clock signal is denoted by $\Delta 1$, where:

$$\begin{aligned}
\Delta 1 &= T_{DQ} - T_{DQ1} \\
&= (B_{ST} + m \times T_{DCLK}) - (B_{ST} + m \times (T_{DCLK} - T_{sk})) \\
&= m \times T_{sk}
\end{aligned} \tag{3.9}$$



(a)  $T_{DQ}$  vs.  $T_{DCLK}$ 



(b) Flip-flop output with respect to skewed clock

Figure 3.4: Using three versions of the flip-flop

Figure 3.4(a) shows the case in which the leading clock signal is fed to a slow flip-flop instead of the standard flip-flop. The difference between the $T_{DQ}$ delays of the standard and slow flip-flops is given by $\Delta S$:

$$\begin{aligned}
\Delta S &= T_{DQ\_STANDARD}(T_{DCLK}) - T_{DQ\_SLOW}(T_{DCLK} - T_{sk}) \\
&= (B_{ST} + m \times T_{DCLK}) - (B_{SL} + m \times (T_{DCLK} - T_{sk})) \\
&= B_{ST} - B_{SL} + m \times T_{sk}
\end{aligned} \tag{3.10}$$

In order to reduce the effects of the different arrival times of the clock signal on the  $T_{DQ}$  delay of sequentially adjacent flip-flops by using flip-flops with different operating speeds as shown in Figure 3.4, we require that  $|\Delta S| < |\Delta 1|$ . This implies that:

$$T_{sk} > \frac{B_{SL} - B_{ST}}{2m} \tag{3.11}$$

Equation 3.11 gives a lower bound for the clock skew when  $|\Delta S| < |\Delta 1|$ .
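The step from $|\Delta S| < |\Delta 1|$ to Equation 3.11 is not written out in the text. One way to fill it in, assuming $m > 0$, $T_{sk} > 0$, and $B_{SL} > B_{ST}$ (sign assumptions suggested by the figures rather than stated explicitly), is:

$$\begin{aligned}
|\Delta S| < |\Delta 1| &\iff \left| m \times T_{sk} - (B_{SL} - B_{ST}) \right| < m \times T_{sk} \\
&\iff -m \times T_{sk} < m \times T_{sk} - (B_{SL} - B_{ST}) < m \times T_{sk}
\end{aligned}$$

The right-hand inequality holds automatically because $B_{SL} - B_{ST} > 0$. The left-hand inequality gives $2m \times T_{sk} > B_{SL} - B_{ST}$, which is the bound of Equation 3.11.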

Using the same approach for  $\Delta 2$  and  $\Delta F$  and in order for  $|\Delta F| < |\Delta 2|$ , the following condition has to be satisfied:

$$T_{sk} > \frac{B_{ST} - B_{FA}}{2m} \tag{3.12}$$

Equation 3.12 gives a lower bound for the clock skew when  $|\Delta F| < |\Delta 2|$ .
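A small numeric check of these lower bounds is sketched below. It assumes a slope of 1 and takes the intercept gaps from the matched delay values reported for the SCCFF later in this chapter (45 ps and 77 ps); the function names and the three sample skew values are illustrative only.

```python
def lower_skew_bound(intercept_gap_ps, slope):
    """Lower bound on T_sk from Equations 3.11 / 3.12, in ps."""
    return intercept_gap_ps / (2 * slope)


def delta_single(slope, t_sk):
    """Equation 3.9: T_DQ change when one flip-flop type is used."""
    return slope * t_sk


def delta_slow(intercept_gap_ps, slope, t_sk):
    """Equation 3.10 with intercept_gap_ps = B_SL - B_ST."""
    return -intercept_gap_ps + slope * t_sk


SLOPE = 1.0            # assumed, the text treats the slope as close to 1
GAP_SLOW_STD = 45.0    # B_SL - B_ST in ps for m = 1 (T_delay_E in Table 3.1)
GAP_STD_FAST = 77.0    # B_ST - B_FA in ps for m = 1 (T_delay_L in Table 3.1)

bound_slow = lower_skew_bound(GAP_SLOW_STD, SLOPE)
bound_fast = lower_skew_bound(GAP_STD_FAST, SLOPE)
print(f"Eq. 3.11 bound: {bound_slow} ps, Eq. 3.12 bound: {bound_fast} ps")

for t_sk in (10.0, 30.0, 45.0):  # hypothetical skew values in ps
    helps = abs(delta_slow(GAP_SLOW_STD, SLOPE, t_sk)) < abs(delta_single(SLOPE, t_sk))
    print(f"T_sk = {t_sk} ps: slow flip-flop reduces the T_DQ difference = {helps}")
```

For these numbers the slow flip-flop only helps once the skew exceeds the Equation 3.11 bound, and the $T_{DQ}$ difference vanishes when the skew equals the full intercept gap.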

### 3.2 Skew Compensation in Short and Long Delay Paths

The main problem with a CDN usually occurs on the critical paths. Figure 3.5 illustrates how the effect of clock skew on the generated output of the flip-flops can be eliminated. In this work, the CDN is balanced not to achieve zero skew between two sequentially adjacent flip-flops but rather to achieve a matched delay for short and long delay paths, referred to as $T_{delay\_E}$ and $T_{delay\_L}$, respectively. As illustrated in this figure, when leading clock signals in short delay paths are skewed by $T_{delay\_E}$ and fed to a slow flip-flop, while lagging clock signals in long delay paths are skewed by $T_{delay\_L}$ and fed to a fast flip-flop, the $T_{DQ}$ delays of the flip-flops balance out and the effect of the clock skew is absorbed by the flip-flops.

(b) Flip-flop output with respect to skewed clock

Figure 3.5: Matched delay for short and long delay paths

In the following, we illustrate mathematically the proposed skew compensation technique and the skew slack that it introduces, which can be used without affecting the minimum clock period. Figure 3.6 shows three sequentially adjacent flip-flops. In Figure 3.6(a), CLK2 and CLK3, feeding flip-flops R2 and R3, lag CLK1, which feeds flip-flop R1. Figure 3.6(b) shows the opposite scenario, where CLK2 and CLK3 lead CLK1. In order to find the matched delay for long delay paths with respect to the reference clock as shown in Figure 3.6(a), we require that T1 = T2, where T1 is the minimum clock period for the data path starting from the input of R1 and ending at the input of R2, and T2 is the minimum clock period for the data path starting from the input of R2 and ending at the input of R3. The equation of the minimum clock period is given by [34]:

$$T = T_{CLKQ\_Ri} + T_{CLi} + T_{SU\_R(i+1)} - T_{sk\_CLK(i+1)} \tag{3.13}$$

where:

$T$ = minimum clock period

$T_{CLKQ\_Ri}$ = clock to output delay of the first flip-flop

$T_{CLi}$ = the time necessary to propagate through the logic and interconnect

$T_{SU\_R(i+1)}$ = set up time for the final register in the data path

$T_{sk\_CLK(i+1)}$ = the skew affecting the clock signal feeding the last flip-flop in the path

(b) Negative skew

Figure 3.6: Sequentially adjacent flip-flops

Equation 3.13 illustrates that skew can improve circuit performance by allowing a reduction in the clock period. However, increasing skew increases circuit susceptibility to race conditions [34]. Applying Equation 3.13 to the data paths shown in Figure 3.6(a) and noting that:

$$T_{CLKQ} = T_{DQ} - T_{DCLK} \tag{3.14}$$

$$T_{DCLK\_R2} = T_{DCLK\_R1} + T_{sk\_CLK2} \tag{3.15}$$

Using Equations 3.7 and 3.14 in 3.13 we get:

$$\begin{aligned}
T1 &= T_{CLKQ\_R1} + T_{CL1} + T_{SU\_R2} - T_{sk\_CLK2} \\
&= T_{DQ\_R1} - T_{DCLK\_R1} + T_{CL1} + T_{SU\_R2} - T_{sk\_CLK2} \\
&= (B_{R1} + m \times T_{DCLK\_R1}) - T_{DCLK\_R1} + T_{CL1} + T_{SU\_R2} - T_{sk\_CLK2} \\
&= B_{R1} + T_{DCLK\_R1}(m-1) + T_{CL1} + T_{SU\_R2} - T_{sk\_CLK2}
\end{aligned} \tag{3.16}$$

and for T2 we get:

$$\begin{aligned}
T2 &= T_{CLKQ\_R2} + T_{CL2} + T_{SU\_R3} \\
&= T_{DQ\_R2} - T_{DCLK\_R2} + T_{CL2} + T_{SU\_R3} \\
&= T_{DQ\_R2} - (T_{DCLK\_R1} + T_{sk\_CLK2}) + T_{CL2} + T_{SU\_R3} \\
&= (B_{R2} + m \times (T_{DCLK\_R1} + T_{sk\_CLK2})) - (T_{DCLK\_R1} + T_{sk\_CLK2}) + T_{CL2} + T_{SU\_R3} \\
&= B_{R2} + T_{DCLK\_R1}(m - 1) + T_{sk\_CLK2}(m - 1) + T_{CL2} + T_{SU\_R3}
\end{aligned} \tag{3.17}$$

Generally, $T_{CL1}$ does not equal $T_{CL2}$, which provides a data path slack that can be utilized for skew scheduling in the clock network design. However, in the following we will assume that $T_{CL1} = T_{CL2}$ in order to highlight the methodology used for introducing a new type of slack. Using T1 = T2, and assuming that $T_{SU\_R2} = T_{SU\_R3}$, we obtain:

$$T_{sk\_CLK2} = \frac{B_{R1} - B_{R2}}{m} \tag{3.18}$$
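The intermediate algebra is not shown in the text. As a sketch, setting the last lines of Equations 3.16 and 3.17 equal and cancelling the common terms $T_{DCLK\_R1}(m-1)$, $T_{CL1} = T_{CL2}$, and $T_{SU\_R2} = T_{SU\_R3}$ leaves:

$$\begin{aligned}
B_{R1} - T_{sk\_CLK2} &= B_{R2} + T_{sk\_CLK2}(m - 1) \\
B_{R1} - B_{R2} &= m \times T_{sk\_CLK2}
\end{aligned}$$

Dividing by $m$ gives Equation 3.18.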

Equation 3.18 illustrates that when using the same type of flip-flops, where $B_{R1}=B_{R2}$, the optimum skew that does not affect the clock period is $T_{sk\_CLK2}=0$. At the same time, Equation 3.18 illustrates that when using flip-flops with different operating speeds, i.e., $B_{R1}>B_{R2}$, a certain skew can be present in the clock path without affecting the minimum clock period, since the different operating speeds of the flip-flops within the path absorb this skew, as illustrated in Figure 3.5(a) and (b). Thus the matched delay for a lagging clock signal in long delay paths with respect to the reference clock is given by:

$$T_{delay\_L} = \frac{B_{ST} - B_{FA}}{m} \tag{3.19}$$

where  $B_{ST}$  and  $B_{FA}$  refer to the  $T_{DQ}$  intercepts for the standard and fast flip-flops shown in Equations 3.7 and 3.8, respectively.

Using the same approach, the skew for a leading clock signal in short delay paths with respect to the reference clock can be compensated as shown:

$$T_{delay\_E} = \frac{B_{SL} - B_{ST}}{m} \tag{3.20}$$

where  $B_{SL}$  and  $B_{ST}$  refer to the  $T_{DQ}$  intercepts for the slow and standard flip-flops shown in Equations 3.6 and 3.7, respectively.

Also, the same approach can be used to balance the  $T_{DQ}$  delay of the flip-flops for leading clock signals in short delay paths with respect to lagging clock signals in long delay paths by using slow and fast flip-flops in the data path, where the required matched delay in this case would be:

$$T_{delay\_E\_L} = T_{delay\_E} + T_{delay\_L} \tag{3.21}$$

Equations 3.19, 3.20, and 3.21 illustrate that this approach would reduce the need for designing a zero skew CDN to designing a bounded skew CDN where the skew bounds are restricted to the matched delay values. This increase in the skew bounds would lead to reduced complexity in designing CDNs.
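The claim that a skew equal to the matched delay leaves the minimum clock period unchanged can be checked numerically. The sketch below evaluates the final lines of Equations 3.16 and 3.17 for a standard flip-flop at R1 and a fast flip-flop at R2; all numeric values (intercepts, slope, logic delay, setup time) are hypothetical and chosen only for illustration.

```python
import math


def min_period_t1(b_r1, m, t_dclk_r1, t_cl, t_su, t_sk):
    """Final line of Equation 3.16 (path R1 -> R2)."""
    return b_r1 + t_dclk_r1 * (m - 1) + t_cl + t_su - t_sk


def min_period_t2(b_r2, m, t_dclk_r1, t_cl, t_su, t_sk):
    """Final line of Equation 3.17 (path R2 -> R3)."""
    return b_r2 + t_dclk_r1 * (m - 1) + t_sk * (m - 1) + t_cl + t_su


def matched_delay(b_slower, b_faster, m):
    """Equations 3.18 to 3.20: intercept gap divided by the slope."""
    return (b_slower - b_faster) / m


# Hypothetical values in ps; none of these come from the thesis.
B_ST, B_FA, M = 400.0, 320.0, 0.95
T_DCLK_R1, T_CL, T_SU = 250.0, 900.0, 200.0

t_delay_l = matched_delay(B_ST, B_FA, M)
t1 = min_period_t1(B_ST, M, T_DCLK_R1, T_CL, T_SU, t_delay_l)
t2 = min_period_t2(B_FA, M, T_DCLK_R1, T_CL, T_SU, t_delay_l)
print(f"T_delay_L = {t_delay_l:.2f} ps, T1 = {t1:.2f} ps, T2 = {t2:.2f} ps")

# With zero skew the two periods differ by the intercept gap B_ST - B_FA.
gap_at_zero_skew = (min_period_t1(B_ST, M, T_DCLK_R1, T_CL, T_SU, 0.0)
                    - min_period_t2(B_FA, M, T_DCLK_R1, T_CL, T_SU, 0.0))
print(f"T1 - T2 at zero skew = {gap_at_zero_skew:.2f} ps")
```

With the skew set to the matched delay the two periods coincide, whereas at zero skew they differ by the intercept gap, which is the slack the proposed technique exploits.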

It should be noted that, because the rise time of the sinusoidal clock signal generated in LC resonant CDNs changes with its frequency, the matched delay values presented in Equations 3.19, 3.20, and 3.21 are frequency dependent. As the clock frequency decreases, the rise time increases and the differences between the $T_{DQ}$ intercepts of the different versions of the flip-flop increase, causing an increase in the matched delay values.

# 3.3 New Modified Deferred Merge Embedding (DME) Algorithm<sup>1</sup>

In the previous sections we introduced a new skew compensation technique using flip-flops with different operating speeds. The new technique provides timing slacks that could be used in a clock distribution algorithm in order to reduce the total wire length, routing complexity, and power. Traditionally, to manage the clock skew in a clock network, clock distribution algorithms attempt to balance the delay from the source to all sinks. This is accomplished mainly through wire length adjustment, wire width sizing, and buffer insertion. However, buffer insertion is not considered in LC fully-resonant CDNs because inserting a buffer in the clock path eliminates the energy recovery property. The clock distribution algorithm could take advantage of the new proposed skew compensation technique along with other traditional balancing approaches to get the desired skew in a clock network with less total wire length. Consequently, clock network power consumption will be decreased. Additional benefits of the proposed compensation technique are the reduction in the number of wire elongations and the added flexibility in the distribution network layout.

<sup>1</sup> The DME implementation was done by Mr. Ali Mohammadi Farhangi.

The new compensation technique was incorporated into a Zero Skew Clock Tree router (ZST). A ZST is able to construct a clock tree that delivers the clock edges to all sinks with equal delay (nominal zero skew). The Deferred Merge Embedding (DME) algorithm [35] was modified to accommodate the proposed skew compensation technique.

In order to use the new technique in any ZST, two major issues should be considered:

- 1- Selecting which type of flip-flop to be used in every single location.
- 2- Taking advantage of timing slacks provided by the new technique during bottom up tree construction in order to reduce total clock tree wire length.

Usually, a typical clock tree router is not aware of the underlying data path and the data flow dependency between the clock sinks. This means that, at first, there is no preference among the clock sinks to guide the algorithm in selecting between different types of flip-flops. In the proposed approach, the flip-flops have three operating speeds: standard, slow, and fast. Initially, all sinks are chosen from the standard type. The best choice of flip-flop type at the sinks is identified while the clock tree is being constructed. This new algorithm is based on the observation that a zero skew merging segment obtained by the traditional DME can be shifted towards one of its children by changing the flip-flop type in its left and right sub-trees. The tuning of a merging segment by changing the flip-flop type is illustrated in Figure 3.7.



Figure 3.7: Tuning a merging segment by changing the flip-flop type in left or right subtree

In Figure 3.7, U and V are two sub-trees whose roots are embedded at locations u and v, respectively. U and V are to be merged such that the new sub-tree W has zero skew and minimum wire length. The rectangle with u and v as opposite vertices encloses all minimum-distance Manhattan connections between u and v. ms(w) is the merging segment, i.e., the locus of the points that can merge u and v with minimum wire length and zero skew. In Figure 3.7, $ms_1(w)$ is the merging segment that merges v and u when sub-trees V and U both contain standard flip-flops. As illustrated in the figure, by changing the flip-flops' operating speed in either U or V, the merging segment shifts towards either u or v.

For a pair of nodes (u,v), the algorithm considers up to seven different combinations of flip-flop operating speeds in u and v;  $(u_{\text{standard}}, v_{\text{standard}})$ ,  $(u_{\text{standard}}, v_{\text{fast}})$ ,  $(u_{\text{standard}}, v_{\text{slow}})$ ,  $(u_{\text{fast}}, v_{\text{standard}})$ ,  $(u_{\text{fast}}, v_{\text{slow}})$ ,  $(u_{\text{slow}}, v_{\text{standard}})$ , and  $(u_{\text{slow}}, v_{\text{fast}})$ . There are two redundant combinations,  $(u_{\text{slow}}, v_{\text{slow}})$  and  $(u_{\text{fast}}, v_{\text{fast}})$ . Both of these combinations result in the same merging segment as in  $(u_{\text{standard}}, v_{\text{standard}})$ . The algorithm does not consider the two redundant cases to compute the merging segment.
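The enumeration described above can be sketched in a few lines. The speed labels and the function name are illustrative, and the code only reproduces the counting argument from the text (nine pairs minus the two redundant ones).

```python
from itertools import product

SPEEDS = ("standard", "fast", "slow")


def candidate_combinations():
    """Return the (u, v) flip-flop speed pairs the modified DME would consider."""
    combos = []
    for u_speed, v_speed in product(SPEEDS, repeat=2):
        # (slow, slow) and (fast, fast) are described in the text as giving
        # the same merging segment as (standard, standard), so skip them.
        if u_speed == v_speed and u_speed != "standard":
            continue
        combos.append((u_speed, v_speed))
    return combos


combos = candidate_combinations()
print(len(combos))
for pair in combos:
    print(pair)
```

Of the nine possible pairs, seven remain, matching the list given in the text.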

During the bottom-up phase, the algorithm computes the locus of the merging points (merging segment) where two sub-trees can join such that the new sub-tree has a zero skew. The new merging segment is computed for different combination of flip-flops in both sub-trees. Unlike the traditional DME, in the modified DME algorithm there is a set of merging segments corresponding to each node. Each merging segment is computed similarly to the DME, but the algorithm considers the proper matched delay for either left or right sub-tree. The three types of flip-flops enable the algorithm to use the matched delay values in order to compensate for the skew.

A greedy strategy was used to choose the types of flip-flops. This means that once the types of flip-flops in a set of leaves in a sub-tree have been determined, the algorithm will not change them at a later stage. For example, in Figure 3.7, if the algorithm specifies the slow flip-flop for the leaves in the sub-tree rooted at v and the fast flip-flop for the leaves in the sub-tree rooted at u, the decision for the types of flip-flops in the sub-tree rooted at w is already made. To achieve better results, one can defer the decision making to the upper levels, but this will increase the time complexity of the algorithm.

Let $s_1$, $s_2$, $s_3$ and $s_4$ be four leaves or internal nodes in a clock tree. The nodes are to be merged according to the topology shown in Figure 3.8(a), where $s_1$ and $s_2$ are children of node *v* and node *u* is the parent of $s_3$ and $s_4$. In the upper level of the tree, *u* and *v* are to be merged into node *w*. Assume the flip-flop types in the sub-trees rooted at $s_1$, $s_2$, $s_3$ and $s_4$ have not been specified by the algorithm. The algorithm enumerates all seven different choices for the flip-flops in $(s_1, s_2)$ and $(s_3, s_4)$. The merging segments for nodes v and u are calculated for all combinations. ms(u) and ms(v) refer to the sets of merging segments for all different combinations for nodes u and v, respectively. The two newly determined sub-trees rooted at v and u need to be merged into w. To compute the merging point for node w, the algorithm selects the pair of merging segments, one from ms(u) and one from ms(v), that results in minimum wire length.

(a) Clock tree topology

(b) Determining flip-flop type based on minimum wire length merging

To reduce total wire length, a sub-tree needs to be merged with another sub-tree that is not only nearby but also minimizes wire elongation. Therefore, a merging cost function that includes distance and wire elongation in a unified form is proposed. This merging cost is the same as the Manhattan distance between the roots of two sub-trees if there is no elongation; otherwise, the extra wire due to wire snaking is included in the merging cost. The algorithm uses the unified wire length cost function to determine which merging segments should be selected from each one of its children. The best possible choice indicates the types of the flip-flops in $s_1$, $s_2$, $s_3$ and $s_4$ as shown in Figure 3.8(b). The pseudo code for the modified merging segment construction using the new type of slack provided by the proposed technique and the greedy merging selection procedure are presented in Figure 3.9(a) and (b), respectively.

| <b>Procedure</b> <i>Modified_BottomUpTree_Construction</i> (<i>A</i>, <i>B</i>) |
|-----------------------------------------------------------------------------------------|
| Input: Two sets of merging segments A and B to be merged |
| Output: A set of merging segments V |
| 1- a, b $\leftarrow$ Greedy_merging_segment_selection (A, B) |
| 2- If all leaves in the subtree rooted at a are normal FFs then |
| &nbsp;&nbsp;&nbsp;&nbsp;Compute $t_a^{fast}$, $t_a^{slow}$, $t_a^{normal}$ |
| 3- If all leaves in the subtree rooted at b are normal FFs then |
| &nbsp;&nbsp;&nbsp;&nbsp;Compute $t_b^{fast}$, $t_b^{slow}$, $t_b^{normal}$ |
| 4- For all FFs operating speed in subtree a $(t_a^{fast}, t_a^{slow}, t_a^{normal})$ do |
| &nbsp;&nbsp;&nbsp;&nbsp;For all FFs operating speed in subtree b $(t_b^{fast}, t_b^{slow}, t_b^{normal})$ do |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$v_i \leftarrow$ DME zero skew merging of a and b |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Insert $v_i$ into set V |
| 5- Return V |

(a) Modified merging segment construction using the new type of slack provided by our new compensation technique

| <b>Procedure</b> <i>Greedy_merging_segment_selection</i> (<i>A</i>, <i>B</i>) |
|---------------------------------------------------------------------------------------------------------|
| Input: Two sets of merging segments A and B to be merged |
| Output: One merging segment corresponding to A and one merging segment corresponding to B |
| For each merging segment $a_i \in A$ and $b_j \in B$ do |
| &nbsp;&nbsp;&nbsp;&nbsp;Wiring($a_i, b_j$) $\leftarrow$ distance between $a_i$ and $b_j$ + wire snaking needed to merge $a_i$ and $b_j$ |
| Select $a_i$ and $b_j$ such that Wiring($a_i, b_j$) is minimal |
| Return $a_i$ and $b_j$ |

(b) Greedy merging selection procedure

Figure 3.9: Pseudo code for the modified DME algorithm
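A minimal Python sketch of the greedy selection step in Figure 3.9(b) follows. It is not the thesis implementation: each merging segment is reduced to a single point with an associated sub-tree delay, and a linear delay model (delay proportional to wire length) is assumed in place of the Elmore model so that the snaking length has a simple closed form. The class `Candidate`, its fields, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Candidate:
    """One merging-segment candidate, simplified to a point (hypothetical model)."""
    x: float
    y: float
    delay: float    # sub-tree delay expressed in wire-length units
    ff_type: str    # flip-flop choice that produced this candidate


def wiring(a: Candidate, b: Candidate) -> float:
    """Wire length needed to merge a and b with zero skew (linear delay model)."""
    distance = abs(a.x - b.x) + abs(a.y - b.y)   # Manhattan distance
    imbalance = abs(a.delay - b.delay)
    # If the delay imbalance exceeds the distance, the wire has to be snaked,
    # and the total merging wire length equals the imbalance.
    return max(distance, imbalance)


def greedy_merging_segment_selection(set_a, set_b):
    """Return the pair (a_i, b_j) with minimal Wiring(a_i, b_j)."""
    return min(product(set_a, set_b), key=lambda pair: wiring(*pair))


# Made-up candidates: same locations, different effective delays per flip-flop type.
A = [Candidate(0, 0, 10, "standard"), Candidate(0, 0, 13, "fast")]
B = [Candidate(6, 0, 20, "standard"), Candidate(6, 0, 17, "slow")]

best_a, best_b = greedy_merging_segment_selection(A, B)
print(best_a.ff_type, best_b.ff_type, wiring(best_a, best_b))
```

In this toy case the (standard, standard) pair would need 10 units of wire because of snaking, while the selected pair merges with the bare Manhattan distance of 6, which is the kind of saving the unified cost function is meant to capture.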

It should be noted that the Elmore delay used to model the delay in square-wave-based CDN algorithms is valid for signals other than step signals and that the actual delay approaches the Elmore delay as the input signal rise time increases [36]. This illustrates that the algorithms used to construct square-wave-based CDNs can be extended and applied to construct LC CDNs with a sinusoidal clock signal.

# **3.4 Simulation Results**

### 3.4.1 Matched Delay Values for the SCCFF

The Single-Ended Conditional Capturing Flip-Flop (SCCFF), introduced in Chapter 2 in Figure 2.14, was modified in order to create slow, standard, and fast versions of it. In this flip-flop, transistors MN1 and MN2 were replaced with high threshold voltage devices (HVT), standard threshold voltage devices (SVT), and low threshold voltage devices (LVT) in order to generate slow, standard, and fast flip-flops, respectively.

The three versions of the SCCFF have been simulated in STMicroelectronics 90-nm technology with a sinusoidal clock frequency of 500 MHz. These three flip-flops have different operating speeds, i.e., different $T_{DQ}$ for a given $T_{DCLK}$. The $T_{DQ}$ delays for the fast, standard, and slow flip-flops have been plotted for different setup times ($T_{DCLK}$) and are shown in Figure 3.10. Note that in this figure a line equation for the $T_{DQ}$ delay is stated in the legend and plotted against the actual values obtained from simulation. The line equation was obtained by simply taking two points along the $T_{DQ}$ line and then computing the slope *m* and the $T_{DQ}$ intercept *B*. From this figure, the effect of the relatively slow rise time of the sinusoidal clock signal on the $T_{DQ}$ delay for the different versions of the flip-flop appears as the gap between the $T_{DQ}$ lines, i.e., the difference between their points of intersection with the $T_{DQ}$ axis.

Figure 3.10: $T_{DQ}$ vs. $T_{setup}$ ($T_{DCLK}$)
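The two-point fit described above amounts to the following calculation. The sample points are hypothetical (they are not read from Figure 3.10) and are only meant to show how *m* and *B* would be obtained from two simulated ($T_{DCLK}$, $T_{DQ}$) pairs in the linear region.

```python
def line_through(p1, p2):
    """Return (slope m, intercept B) of the line through two (T_DCLK, T_DQ) points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("the two points must have different T_DCLK values")
    m = (y2 - y1) / (x2 - x1)
    b = y1 - m * x1
    return m, b


# Hypothetical simulation points in ps, both inside the linear region.
m, B = line_through((200.0, 520.0), (300.0, 620.0))
print(f"T_DQ = {B:.1f} + {m:.2f} * T_DCLK")
```

Repeating the fit for the slow, standard, and fast versions yields the three intercepts whose differences enter Equations 3.19 to 3.21.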

The optimum setup time for the flip-flop is the  $T_{DCLK}$  delay at which the minimum  $T_{DQ}$  delay of the flip-flop occurs.  $T_{DCLK}$  delays that are less than the optimum setup time would not lead to any further reduction in the  $T_{DQ}$  delay of the flip-flop thus causing

Table 3.1 Matched delay (ps) for a clock period of 2 ns

| Flip-Flop | $T_{delay\_E}$ | $T_{delay\_L}$ | $T_{delay\_E\_L}$ |
|-----------|----------------|----------------|-------------------|
| SCCER     | 45             | 77             | 123               |



Figure 3.11: Slow, standard and fast flip-flop response with respect to a square clock signal



Figure 3.12: Slow, standard and fast flip-flop response with respect to a sinusoidal clock signal

a deviation from the straight lines obtained from the equations in Figure 3.10. The flip-flops are considered to be operating in the linear region, i.e.,  $T_{setup} \geq 180$  ps.

From the obtained line equations in Figure 3.10, equations 3.19, 3.20, and 3.21 were used to calculate the matched delay values for skew compensation, which are shown in Table 3.1. The values of  $T_{delay\_E}$ ,  $T_{delay\_L}$ , and  $T_{delay\_E\_L}$  presented in this table illustrate that a skew of 2.3%, 3.85%, and 6.2% of the clock period (2 ns) can be compensated for, respectively. Note that in obtaining these values, the slope was considered to be equal to 1, which is a close approximation. Balancing a CDN using the values in Table 3.1 should be done with careful consideration. That is, for the  $T_{delay\_E}$  path, a slow and a standard flip-flop should be inserted; for the  $T_{delay\_L}$  path, a standard and a fast flip-flop should be inserted; and for the  $T_{delay\_E\_L}$  path, a slow and a fast flip-flop should be inserted in the data path to eliminate any skew effects on the clock period.
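The percentages quoted above follow directly from Table 3.1 and the 2 ns clock period. The short sketch below reproduces that arithmetic; the variable and function names are illustrative.

```python
CLOCK_PERIOD_PS = 2000.0  # 2 ns clock period (500 MHz)

# Matched delay values from Table 3.1, in ps.
matched_delay_ps = {"T_delay_E": 45.0, "T_delay_L": 77.0, "T_delay_E_L": 123.0}


def percent_of_period(delay_ps, period_ps=CLOCK_PERIOD_PS):
    """Express a delay as a percentage of the clock period."""
    return 100.0 * delay_ps / period_ps


skew_percent = {name: percent_of_period(d) for name, d in matched_delay_ps.items()}
for name, pct in skew_percent.items():
    print(f"{name}: {pct:.2f}% of the clock period")
```

The unrounded values are 2.25%, 3.85%, and 6.15%, which the text reports as 2.3%, 3.85%, and 6.2%.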

Figure 3.11 presents the simulation results showing the difference in the flip-flop output  $T_{DQ}$  for the standard, slow, and fast versions in response to a square-wave clock signal with a frequency of 500 MHz and a rise time of 10% of the clock period. Figure 3.12 shows the same response but with a 500 MHz sinusoidal clock signal. The large difference in the response time ( $T_{DQ}$ ) of the three flip-flop versions with a sinusoidal clock wave as compared to their response with a square-wave is clearly shown in these figures.

# **3.4.2** Comparing Data, Clock, and Flip-Flop Power Consumption for the Fast, Standard, and Slow Versions of the SCCFF

In this section, the difference in the data, clock, and flip-flop power consumption between the three versions of the SCCFF is investigated to show that there is no significant power increase in adopting this technique.

Power consumption of a circuit depends strongly on its structure and the statistics of the applied data. Thus power measurements should be conducted for the range of different data patterns comprising worst and best cases [37]. In general, a pseudorandom sequence with equal probability of all transitions (with data activity rate  $\alpha$ =0.5) shown in Figure 3.13 will result in the average internal power consumption under typical operation [37].
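To make the notion of data activity concrete, the sketch below counts the four possible transitions in a bit sequence and derives the activity rate α as the fraction of cycles in which the data toggles. The 16-cycle pattern is a hypothetical example with balanced transitions (treated as cyclic for counting); it is not the sequence of Figure 3.13.

```python
from collections import Counter


def transition_counts(bits):
    """Count 0->0, 0->1, 1->0 and 1->1 transitions, wrapping around at the end."""
    pairs = zip(bits, bits[1:] + bits[:1])
    return Counter(f"{a}->{b}" for a, b in pairs)


def activity_rate(bits):
    """Fraction of cycles in which the data value changes."""
    counts = transition_counts(bits)
    return (counts["0->1"] + counts["1->0"]) / len(bits)


# Hypothetical 16-cycle pattern in which every transition type occurs 4 times.
sequence = [0, 0, 1, 1] * 4
counts = transition_counts(sequence)
alpha = activity_rate(sequence)
print(dict(counts), alpha)
```

A sequence in which all four transition types are equally likely toggles in half of the cycles, which is the α = 0.5 condition described above.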



Figure 3.13: Pseudorandom Sequence

The circuit used to measure the data, clock, and flip-flop power based on square-wave clocking signal is shown in Figure 3.14. The role of this circuit is to provide measurement of power dissipated on switching of the clock and data inputs, the realistic data and clock signals, and the fan-out signal degradation from the previous stage to the succeeding one [37]. The local clock power in this circuit is calculated as the difference in power dissipation of the gray inverter when loaded with the flip-flop and when unloaded. The local data power dissipation is calculated as the difference in power

dissipation of the black inverter when loaded with the flip-flop and C1 and when loaded only with C1.

All power measurements were conducted for the 16 clock cycle data sequence presented in Figure 3.13 for the SCCFF operating at 500 MHz. In the circuit shown in Figure 3.14, the load capacitance for the data buffer (the black inverter) as well as the load capacitance at both outputs of the flip-flop was chosen to be 30 fF. The black inverter, the gray inverter, and each flip-flop were identified as sub-circuits and a .MEASURE average power statement in HSPICE was used to measure the power dissipation of interest in each of these sub-circuits.

Table 3.2 presents the power consumption in  $\mu$W of data, clock, and flip-flop for the standard version of the SCCFF. The percentage increase in the power consumption of data, clock, and flip-flop for the slow and fast flip-flops is also presented in the table.

As shown in Table 3.2, the percentage change in data power for the fast and slow flip-flops as compared to the standard one is less than 1% in magnitude. As for the clock power, there is an increase for the fast flip-flop and a decrease for the slow flip-flop as compared to the standard flip-flop. However, the increase in clock power is less than 5%. As for the flip-flop power, we notice an insignificant increase in the power of the fast flip-flop which is less than 1%.

The percentage increase in power consumption shown in Table 3.2 illustrates that using this new approach for skew compensation would not affect the dynamic power consumption of the system since the standard, fast, and slow versions of the flip-flop would be placed with somewhat equal probabilities in the CDN and thus the dynamic




Figure 3.14: Power measurement circuit

| Flip-Flop | Version           | Data   | Clock  | FF     |
|-----------|-------------------|--------|--------|--------|
| SCCFF     | Standard (µW)     | 0.388  | 5.034  | 40.482 |
|           | Fast (% change)   | -0.72% | 4.45%  | 0.00%  |
|           | Slow (% change)   | 0.80%  | -1.51% | 0.63%  |

Table 3.2 Power consumption (µW)

power increase introduced by one version of the flip-flop will be eliminated by the power decrease introduced by the other. An approximation of the leakage power neglecting the stacking effect [38] and the fact that the transistors are switched on and off at every clock cycle, shows that lowering the threshold voltage in the fast flip-flop case would result in 18% increase in leakage power whereas increasing the threshold voltage in the slow flip-flop would result in 20% decrease in leakage power.

# **3.4.3 Effects of Process, Supply Voltage, and Temperature Variation on Flip-Flop Speed**

Since manipulating the flip-flop speed in this chapter was done mainly by changing the threshold voltages of the NMOS devices being fed by the clock signal, the effect of variations in process, supply voltage, and temperature on the flip-flop speed was investigated to make sure that the generated slack remains valid and that the means of the  $T_{DQ}$  delays are separated and within a tolerable range of the design slack. The three versions of the SCCFF were simulated using 200 runs of Monte-Carlo simulation under mismatch only (intra-die variation) with a square clock and a sinusoidal clock signal feeding the three versions of the flip-flop at a frequency of 500 MHz. The  $T_{DQ}$  delays for each version of the flip-flop under both clock signals were obtained for the 200 runs.

Figure 3.15(a) and (b) present the  $T_{DQ}$  histograms for the three versions of the SCCFF for square and sinusoidal clock signals, respectively.

Figure 3.15 shows that although the spread of the  $T_{DQ}$  delay for the fast, standard, and slow flip-flops under square-wave clocking is less than the spread of the  $T_{DQ}$  delay for the same flip-flops under a sinusoidal clock, the  $T_{DQ}$  delays of the three versions of the SCCFF under a sinusoidal clock do not overlap with each other and remain distinct. This illustrates that though the slow rise time of the sinusoidal clock can lead to a larger spread of the  $T_{DQ}$  delays under process variation, it also increases the gap between the  $T_{DQ}$  delays for each version of the flip-flop, thus eliminating the overlap between them. Figure 3.15(b) also shows a mismatch-induced skew of approximately 20 ps from the mean of the  $T_{DQ}$  delay for each flip-flop. This would lead to a worst-case mismatch-induced skew for this data set of around 40 ps if the  $T_{DQ}$  delays of the flip-flops deviate to the opposite extremes.







(b)  $T_{DQ}$  with a sinusoidal clock

Figure 3.15: Monte-Carlo simulation

| Supply voltage | Percentage change in $T_{delay\_E}$ | Percentage change in $T_{delay\_L}$ | Percentage change in $T_{delay\_E\_L}$ |
|----------------|-------------------------------------|-------------------------------------|----------------------------------------|
| VDD+5%         | 4%                                  | -4%                                 | 0%                                     |
| VDD-5%         | -4%                                 | 2%                                  | -1%                                    |

Table 3.3 Supply voltage effect on matched delay values



Figure 3.16: Temperature effect on the  $T_{DQ}$  delay

Table 3.3 illustrates the effect of static voltage increase or decrease in the power supply on the matched delay values obtained from the  $T_{DQ}$  of the flip-flops. It shows that a 5% increase in the supply voltage feeding the flip-flops would cause less than 5% change in the matched delay values. This observation also holds with a 5% decrease in the supply voltage.

To investigate the temperature effect on the  $T_{DQ}$  delays of the three versions of the flip-flop, the spatial temperature gradient used in [39] was adopted here, in which the temperature changes from 25°C to 125°C in steps of 25°C. Figure 3.16 shows the temperature effect on flip-flop delays. As illustrated in the figure, the  $T_{DQ}$  delay of the flip-flops decreases as temperature increases. This is mainly due to the decrease in the threshold voltage of the transistors as the temperature increases [40]. An increase in temperature from 25°C to 125°C causes a reduction in flip-flop delays by approximately 5%. It is also observed in Figure 3.16 that although the  $T_{DQ}$  delays of the flip-flop decrease with temperature, the  $T_{DQ}$  lines of the three versions of the flip-flop remain parallel to one another. This means that the matched delay values between sequentially adjacent flip-flops would suffer only minor changes, assuming that these flip-flops are placed within close proximity to one another and experience the same variation in temperature.

#### 3.4.4 Clock Tree Construction Using the New Compensation Technique

The traditional DME and the new modified DME algorithms were implemented in C++ to construct the clock tree. The initial clock tree topology in both cases was obtained by the Method of Means and Medians (MMM) [41]. Both algorithms were run on a set of benchmarks (r1-r5) that contain from 267 up to 3,101 clock sinks. The clock sink distribution in the benchmarks is the same as the one in [42].

By applying the proposed skew compensation technique, the total clock tree wire length has been reduced by an average of 11.5%. Reducing the total wire length leads to a reduction in the routing complexity as well as a reduction in the clock tree power consumption which is a major concern in CDN design. One of the drawbacks associated with the DME based clock routers is the fact that they introduce many wire elongations to

| #          |       | MMM-DME      |                   | New Modified DME |                   | Improvement |                          |
|------------|-------|--------------|-------------------|------------------|-------------------|-------------|--------------------------|
| Benchmarks | Sinks | Cost<br>(µm) | # wire elongation | Cost<br>(µm)     | # wire elongation | Cost<br>(%) | # wire<br>elongation (%) |
| r1         | 267   | 1738292      | 10                | 1568484          | 7                 | 9.7         | 30                       |
| r2         | 598   | 3635828      | 38                | 3169582          | 18                | 12.8        | 52                       |
| r3         | 862   | 4716032      | 71                | 4095191          | 24                | 13.1        | 66                       |
| r4         | 1903  | 8906613      | 136               | 7995124          | 59                | 10.2        | 56                       |
| r5         | 3101  | 13123125     | 324               | 11581785         | 123               | 11.7        | 62                       |

Table 3.4 Comparison of MMM-DME and the new modified DME using the proposed skew compensation technique

achieve a zero skew clock network. The elongation problem is exacerbated usually when the clock routers only consider the spatial proximity to find the best matching pairs.

The results obtained in Table 3.4 show an average reduction of 53.2% in the number of wire elongations. Wire elongations are a real burden in the detailed phase of routing because they introduce unnecessary bends and vias.
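The improvement figures can be recomputed from the raw cost and elongation counts in Table 3.4. The sketch below does so; the data are copied from the table, while the helper name `reduction_percent` is illustrative.

```python
# (benchmark, MMM-DME cost, MMM-DME elongations,
#  modified-DME cost, modified-DME elongations); costs in micrometres.
table_3_4 = [
    ("r1", 1738292, 10, 1568484, 7),
    ("r2", 3635828, 38, 3169582, 18),
    ("r3", 4716032, 71, 4095191, 24),
    ("r4", 8906613, 136, 7995124, 59),
    ("r5", 13123125, 324, 11581785, 123),
]


def reduction_percent(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


cost_gain = [reduction_percent(c0, c1) for _, c0, _, c1, _ in table_3_4]
elong_gain = [reduction_percent(e0, e1) for _, _, e0, _, e1 in table_3_4]

avg_cost_gain = sum(cost_gain) / len(cost_gain)
avg_elong_gain = sum(elong_gain) / len(elong_gain)
print(f"average wire length reduction: {avg_cost_gain:.1f}%")
print(f"average elongation reduction:  {avg_elong_gain:.1f}%")
```

Recomputed this way, the averages land within a fraction of a percentage point of the quoted 11.5% and 53.2%. The small differences appear to come from the per-benchmark percentages in the table being truncated before averaging.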

The new algorithm is only a simple greedy heuristic that was developed to verify the advantages of using the new skew compensation technique. It is therefore not guaranteed to produce optimal results. Nevertheless, the results are encouraging.

#### **3.5 Conclusion**

In this chapter a new approach for skew compensation in LC resonant CDNs is proposed. The method uses three different operating speeds of a flip-flop to achieve its goal. Three types of flip-flops: "fast", "standard", and "slow" were simulated using devices with low, standard, and high threshold voltages, respectively, readily available as part of the technology. Distributing the flip-flops according to their delay requirements reduces the effect of the clock skew on the outputs of sequentially adjacent flip-flops. The proposed approach adds a certain burden to the design of the CDN with respect to determining the appropriate placement of each type of flip-flop in the circuit. In addition, the dependency of the matched delay values on clock frequency should be taken into account. On the other hand, this approach also increases the skew bounds required by algorithms to balance the skew in CDNs leading to reduced design complexity and enhanced performance. Constructing five benchmark CDNs using a new modified DME algorithm tailored to accommodate the matched delay values introduced by the new technique has shown that the new approach would reduce total wire length by an average of 11.5% as well as achieve an average reduction of 53% in the number of wire elongations for the given example.  $^2$ 

It should be noted that the approach adopted in this chapter to manipulate the operating speed of the flip-flops is not the only one. Changing the driving capabilities of different transistors in the input and output stages of the flip-flop can also serve as a means of manipulating the flip-flop speed. In addition, different flip-flop types with different speeds can also be used. However, changing the flip-flop speed by manipulating the threshold voltages of the transistors fed by the clock signal has the advantage of enabling the use of regular flip-flop blocks of the same area and transistor size at different locations on the chip. Increasing the difference in the operating speeds of the three versions of the flip-flop by changing the driving capabilities can maximize the skew that can be absorbed by the flip-flops as well as increase the feasibility of expanding this approach to square-wave based CDNs.

<sup>2</sup> The DME implementation was done by Mr. Ali Mohammadi Farhangi.

## Chapter 4 Dual-Edge Triggered Sense Amplifier Flip-Flop for LC Resonant Clock Distribution Networks

In the previous chapter we took advantage of the long rise time of the sinusoidal clock signal in LC resonant CDNs and the availability of different threshold voltage levels in the technology to generate a new slack in the skew. In this chapter, we propose a Dual-Edge Sense Amplifier Flip-Flop (DE-SAFF) for LC resonant CDNs. The clocking scheme used to enable dual-edge triggering in the proposed SAFF reduces short circuit power by allowing the precharging transistors to be switched on only for a portion of the clock period. The extracted circuit layout of the proposed DE-SAFF has been simulated in STMicroelectronics 90-nm technology with a sinusoidal clock signal at a frequency of 500 MHz. Simulation results show correct functionality of the flip-flop under process, voltage, and temperature (PVT) variations. Two low-power clocking techniques, the dual-edge triggering method and the emerging LC resonant (sinusoidal) clocking technique, have been combined to enable further power reduction in the CDN. Modeling the resonant clock distribution system with the proposed flip-flop illustrates that dual-edge triggering can achieve up to 58% reduction in the power consumption of LC resonant clock networks.

### 4.1 Introduction

The Sense Amplifier Flip-Flop (SAFF) has been proposed in [5] as an energy recovery flip-flop that operates with a single phase sinusoidal clock and has lower power consumption, delay, and area than the pass-gate energy recovery flip-flop introduced previously. The pass-gate energy recovery flip-flop requires four phase sinusoidal clocks, which increases the complexity of clock signal generation as well as the routing overhead. In addition, it suffers from a long delay that uses a large portion of the total cycle time, thus significantly reducing the allowable time for combinational logic [5]. In [43], the Conditional Capturing Dual-edge Sense Amplifier Flip-Flop (CD-SAFF) and the Adaptive Clocking Dual-edge Sense Amplifier Flip-Flop (AC-SAFF) are presented. In these flip-flops, the PMOS transistors used to precharge nodes SET and RESET are always switched on since their gates are connected to ground, causing short circuit power dissipation during evaluation.

The main contribution here is presenting a DE-SAFF for LC fully-resonant CDNs using a modified clocking scheme that can be extended to enable dual-edge clocking in any dynamic CMOS logic circuit. In this scheme, the PMOS transistors used for precharging nodes SET/RESET are only switched on for a portion of the clock cycle in order to reduce short circuit power.

#### 4.2 Dual-Edge Triggered Dynamic Logic

The clocking scheme used to enable dual-edge triggering in the Low-Swing Clock Double-Edge Flip-Flop (LSDFF) operating with a square-wave clock introduced in [44] was further modified to enable dual-edge triggering in CMOS dynamic logic circuits with precharge and evaluation phases.



Figure 4.1: Circuits used to enable precharging and evaluation at both clock transitions



Figure 4.2: Clocking scheme used to enable dual-edge triggering in CMOS dynamic logic

The main circuits used in the pull-up and pull-down networks to enable precharge and evaluation at both edges of the clock are shown in Figure 4.1(a) and (b), respectively. The approach used to enable dual-edge triggering in CMOS dynamic logic circuits is presented in Figure 4.2. In this figure, an inverter chain is used to generate the clock signals CLK2, CLK3, CLK4, and CLK5.

In Figure 4.2, TP1/TP2 and TE2/TE1 intervals indicate the generated time intervals for precharging and evaluation, respectively, when the clock is low/high. As shown in Figure 4.1(a) and 4.2, the first precharge interval TP1 is defined and bounded by CLK5 and CLK1. The falling edge of CLK5 determines the starting point of the precharging interval while the rising edge of CLK1 defines the end of precharging. These two signals are fed to the series PMOS transistors shown in Figure 4.1(a); namely MP1 and MP2. The second precharging interval TP2 is bounded by CLK4 and CLK2 as is shown in Figures 4.1(a) and 4.2. These signals are fed to the series transistors MP3 and MP4 shown in Figure 4.1(a). The first evaluation interval TE1 is bounded by CLK1 and CLK4. The rising edge of CLK1 determines the start of evaluation while the falling edge of CLK4 defines its end. These two signals are fed to the series NMOS transistors shown in Figure 4.1(b), MN1 and MN2. The second evaluation interval TE2 is bounded by CLK2 and CLK5 which are fed to the series transistors MN3 and MN4 shown in Figure 4.1(b).

Figure 4.3(a) and (b) present single- and dual-edge triggered dynamic logic circuits, respectively. In the single-edge triggered dynamic logic circuit shown in Figure 4.3(a), the output is precharged to  $V_{DD}$  when the clock is low and either remains at  $V_{DD}$  or is



(b) Dual-edge

Figure 4.3: Single and dual-edge triggered dynamic CMOS logic

discharged to ground depending on the state of the inputs feeding the Pull-Down Network (PDN) during the evaluation phase. The same is true for the circuit in Figure 4.3(b). However, in this circuit, two precharge and two evaluation intervals occur during one clock cycle. Though Figure 4.3 only shows dynamic logic with PDN, the proposed clocking scheme is equally applicable to CMOS dynamic logic circuits with Pull-Up-Networks (PUNs).

### 4.3 Dual-Edge Sense Amplifier Flip-Flop (DE-SAFF)

In this section, a brief description of the SAFF which is considered a representative high performance flip-flop is given [5], [37]. The schematic of the single-edge Sense



Figure 4.4: Single-Edge Sense Amplifier Flip-Flop (SE-SAFF)

Amplifier Flip-Flop (SE-SAFF) is presented in Figure 4.4. This flip-flop has precharge and evaluation phases of operation. Evaluation occurs when the clock voltage exceeds the threshold voltage of the clock transistor (MN1). The difference between the differential data inputs (*D* and *DB*) is amplified during the evaluation phase and either *SET* or *RESET* is switched to low and is captured by the *SET* and *RESET* latch. The *SET* and *RESET* nodes are precharged high when the clock voltage falls below  $V_{DD} - |V_{tp}|$ , where  $V_{tp}$  is the threshold voltage of the precharging transistors (MP1 and MP2). An overlap can occur between the evaluation and precharge phases, caused by the slow rising and falling transitions of the sinusoidal clock. This overlap results in short-circuit current.



Figure 4.5: Dual-Edge Sense Amplifier Flip-Flop (DE-SAFF)

In order to reduce short-circuit current, we require that  $V_{tn} > V_{DD} - |V_{tp}|$ . To minimize the right hand side, the magnitude of the threshold voltages of the precharging transistors should be increased [5]. In our implementation, we improve this design by using high threshold voltage devices (HVT) for MP1 and MP2 available in STMicroelectronics 90-nm technology. One could also increase  $V_{tn}$ . However, this would decrease the speed of

operation of the flip-flop.
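The overlap condition can be illustrated numerically. The sketch below estimates, per clock transition, how long the clock voltage sits between $V_{tn}$ and $V_{DD} - |V_{tp}|$, i.e., the window in which both the evaluation and precharge devices conduct. It assumes a rail-to-rail sinusoidal clock, which is a modeling assumption made here, and the threshold values are hypothetical rather than taken from the 90-nm technology.

```python
import math


def overlap_time_per_edge(vdd, vtn, vtp_abs, freq_hz):
    """Time per clock transition during which precharge and evaluation are both on.

    Assumes (not from the source) a rail-to-rail sinusoidal clock:
    clk(t) = vdd/2 * (1 + sin(2*pi*freq_hz*t)).
    """
    eval_on_above = vtn                 # evaluation active when clk > vtn
    precharge_on_below = vdd - vtp_abs  # precharge active when clk < vdd - |vtp|
    if eval_on_above >= precharge_on_below:
        return 0.0                      # no voltage window in which both are on

    def phase(v):
        # Phase (rad) at which the rising sinusoid crosses voltage v.
        x = max(-1.0, min(1.0, 2.0 * v / vdd - 1.0))
        return math.asin(x)

    window = phase(precharge_on_below) - phase(eval_on_above)
    return window / (2.0 * math.pi * freq_hz)


# Hypothetical threshold values, for illustration only (1 V supply, 500 MHz).
low_vtp = overlap_time_per_edge(1.0, 0.3, 0.3, 500e6)
high_vtp = overlap_time_per_edge(1.0, 0.45, 0.6, 500e6)
print(f"overlap with low |Vtp|:  {low_vtp * 1e12:.0f} ps")
print(f"overlap with high |Vtp|: {high_vtp * 1e12:.0f} ps")
```

Raising $|V_{tp}|$ lowers $V_{DD} - |V_{tp}|$ and shrinks the window, which is the motivation for using HVT precharging devices.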

The modified dual-edge clocking scheme is applied to the SAFF with the same operating principles as that of dynamic CMOS logic circuits presented in Section 4.2. Figure 4.5 presents the dual-edge triggered version of the SAFF. The highlighted transistors in the figure are the extra transistors added to the single-edge version of the flip-flop to enable dual-edge triggering. Note that these transistors are the same as the transistors used in Figure 4.1 to enable dual-edge triggering in dynamic CMOS logic circuits. In order to reduce short circuit power, MP1 to MP8 were implemented with high threshold voltage devices.

#### 4.4 Timing Characterization of Dual-Edge Triggering

From Figure 4.2, the first evaluation interval TE1 is equal to the delay between the rising edge of CLK1 and the falling edge of CLK4 which equals the sum of the delays of the first three inverters in the inverter chain. The second precharging interval TP2 is equal to half the clock period minus the delay between the rising edge of CLK2 and CLK4 which equals the delay of the two inverters in the middle of the inverter chain. The second evaluation interval TE2 is equal to the delay between the falling edge of CLK5 and the rising edge of CLK2 which equals the delay of the last three inverters in the inverter chain. Finally, the first precharging interval TP1 is equal to half the clock period minus the delay between the rising edge of CLK1 and CLK5 which is equal to the delay of the four inverters in the inverter chain. Using these observations, the following equations for the generated precharge and evaluation time windows, TE and TP using the dual-edge triggering scheme are obtained:

$$TE1 = t_{invA} + t_{invB} + t_{invC} \tag{4.1}$$

$$TP2 = \frac{T}{2} - (t_{invB} + t_{invC}) \tag{4.2}$$

$$TE2 = t_{invB} + t_{invC} + t_{invD} \tag{4.3}$$

$$TP1 = \frac{T}{2} - (t_{invA} + t_{invB} + t_{invC} + t_{invD}) \tag{4.4}$$

where  $t_{invA}$ ,  $t_{invB}$ ,  $t_{invC}$ , and  $t_{invD}$ , are the delays of the first, second, third, and fourth inverters in the inverter chain, respectively, and T is the clock period.

As illustrated by these equations, for the same inverter delays, the two evaluation intervals TE1 and TE2 are equal to one another. However, the first precharging interval TP1 is shorter than the second precharging interval TP2 by two inverter delays.
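Equations 4.1 to 4.4 can be evaluated directly, as in the sketch below. The function name and the 60 ps inverter delay are hypothetical; only the formulas come from the text.

```python
def dual_edge_intervals(period, t_inv_a, t_inv_b, t_inv_c, t_inv_d):
    """Evaluate Equations 4.1-4.4; all arguments share one time unit."""
    te1 = t_inv_a + t_inv_b + t_inv_c                             # Eq. 4.1
    tp2 = period / 2 - (t_inv_b + t_inv_c)                        # Eq. 4.2
    te2 = t_inv_b + t_inv_c + t_inv_d                             # Eq. 4.3
    tp1 = period / 2 - (t_inv_a + t_inv_b + t_inv_c + t_inv_d)    # Eq. 4.4
    return {"TP1": tp1, "TE1": te1, "TP2": tp2, "TE2": te2}


# Hypothetical case: four identical 60 ps inverters, 2000 ps clock period.
w = dual_edge_intervals(2000.0, 60.0, 60.0, 60.0, 60.0)
print(w)
print("sum of intervals:", sum(w.values()))
print("TP2 - TP1:", w["TP2"] - w["TP1"])
```

With equal inverter delays the two evaluation windows coincide, TP2 exceeds TP1 by two inverter delays, and the four intervals add up to one clock period.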

Figure 4.6 presents the timing diagram of two sequentially adjacent flip-flops under dual-edge triggering. The rising clock edge of CLK1 which defines the start of the first evaluation interval TE1 will be referred to as the positive clock edge and the rising edge of CLK2 which defines the start of the second evaluation interval TE2 will be referred to as the negative clock edge as illustrated in Figure 4.6. Neglecting clock uncertainties [45], the following relationships can be written:

$$T = TP1 + TE1 + TP2 + TE2 \tag{4.5}$$

$$T_{CLKQ_{+VE}} + T_{CL} + T_{DCLK_{-VE}} \le TE1 + TP2 \tag{4.6}$$

$$T_{CLKQ_{-VE}} + T_{CL} + T_{DCLK_{+VE}} \le TE2 + TP1 \tag{4.7}$$

where  $T_{CL}$  is the combinational logic delay,  $T_{CLKQ_{+VE}}$ ,  $T_{DCLK_{+VE}}$ , are the flip-flop clock to output delay and data to clock delay ( $T_{setup}$ ) at positive clock edge, respectively, and  $T_{CLKQ_{-VE}}$ ,  $T_{DCLK_{-VE}}$  are the clock to output delay and data to clock delay at the negative clock edge, respectively.



Figure 4.6: Dual-edge triggering timing diagram

### 4.5 Simulation Results

The sine-wave single- and dual-edge triggered flip-flops were designed using STMicroelectronics 90-nm process technology with a fan out of four (FO4) loading and a supply voltage of 1 V at a resonant clock frequency of 1 GHz and 500 MHz, respectively. At first, the transistor widths of the SE-SAFF were chosen to ensure correct operation at the specified frequency. Then the widths of the transistors in the DE-SAFF were chosen to be exactly the same as those in the single-edge flip-flop except in paths where two transistors are added in series. In this case, the transistor widths are doubled in order to maintain the same driving capability. The transistor sizes of the single- and dual-edge flip-flops are shown in Figures 4.4 and 4.5, respectively.

# 4.5.1 Dual-Edge Flip-Flop Response at Positive and Negative Clock Edges, Schematic vs. Post Layout Simulation

In this section we will illustrate the differences in the results for the  $T_{DQ}$  response of the dual-edge triggered flip-flop at the positive and negative clock edges obtained from schematic and post-layout simulations.

The resonant sinusoidal clock signal becomes a square-wave clock when inverted using an inverter. The effect of the long rise time of the positive edge of the sinusoidal clock signal CLK1, which defines the start of the first evaluation interval TE1, is compared with the effect of the short rise time of the inverted square signal CLK2, which defines the start of the second evaluation interval TE2, in terms of the  $T_{DQ}$  delay versus  $T_{DCLK}$  delay ( $T_{setup}$ ). As was illustrated in Chapter 2, the rise time of the sinusoidal clock signal CLK1 using equation (2.6) is 580 ps and the rise time of the square clock signal CLK2 generated from the inverter chain in schematic simulation is 175 ps. This means that for this design, there is a 405 ps difference between the rise times of the positive and negative clock edges. The  $T_{DQ}$  delay versus  $T_{setup}$  is presented in Figure 4.7. As shown in Figure 4.7, the flip-flop experiences a longer  $T_{DQ}$  delay at the positive edge of the clock than at the negative edge. This is mainly related to the sharp negative clock edge, which has a rise time that is 70% shorter than that of the positive clock edge. To investigate this point further, the sinusoidal clock signal CLK1 was replaced with a square clock that has a rise time equal to that of CLK2. The response of the flip-flop in this case was symmetrical at positive and negative edges of the clock. When performing the same simulation on the extracted circuit, we found that the rise time of the square



Figure 4.7:  $T_{DQ}$  delay vs.  $T_{setup}$  ( $T_{DCLK}$ ), schematic simulation



Figure 4.8:  $T_{DQ}$  delay vs.  $T_{setup\_time}$  ( $T_{DCLK}$ ), post-layout simulation

|          | $T_{setup}$ (ps)\* | $T_{DQ}$ (ps) | $T_{hold}$ (ps) |
|----------|--------------------|---------------|-----------------|
| +ve edge | 10                 | 245           | 60              |
| -ve edge | 10                 | 210           | 70              |

Table 4.1 Timing characteristics of the DE-SAFF, post-layout simulation

\* The setup time is the  $T_{DCLK}$  that results in minimum  $T_{DQ}$  [5].

clock signal CLK2 has increased to 300 ps, approximately 71% longer than its rise time obtained from schematic simulation. This is due to the extra parasitic capacitances and resistances obtained from layout extraction. The  $T_{DQ}$  delay versus  $T_{setup}$  for post-layout simulation is presented in Figure 4.8, which illustrates that for this design, dual-edge triggering in an LC resonant CDN would result in a symmetrical behavior of the flip-flop at both negative and positive clock edges. Table 4.1 presents a summary of the timing characteristics of the proposed DE-SAFF.
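The percentages quoted for the clock rise times follow from the stated values (580 ps for CLK1, 175 ps for CLK2 in schematic simulation, and 300 ps for CLK2 after extraction):

$$\frac{580 - 175}{580} \approx 0.70, \qquad \frac{300 - 175}{175} \approx 0.71$$

Assuming the rise time of CLK1 is unchanged after extraction, the difference between the two edges shrinks from 405 ps to $580 - 300 = 280$ ps, which may help explain the more symmetrical response observed in post-layout simulation.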

# 4.5.2 Effects of Process, Supply Voltage, and Temperature (PVT) Variations on the Generated Precharge and Evaluation Intervals

From here on, all simulation results are conducted on extracted circuit layout. For the DE-SAFF, the sizes of the transistors in the four inverters in the chain are  $W_p=W_n=2W_{min}$  ( $W_{min}=0.12 \ \mu$ m). In order to increase the width of the evaluation intervals, TE1 and TE2, a delay was added between the second and third inverters of the inverter chain. This delay is the delay of two minimum sized inverters, i.e.,  $W_p=W_n=W_{min}=0.12 \ \mu$ m. The generated precharge and evaluation intervals are equal to the following: TP1=554 ps, TE1=276 ps, TP2=946 ps, and TE2=225 ps. Note that the evaluation and precharging intervals were

measured at the  $V_{DD}/2$  voltage level, i.e., at 0.5 V. Transistor sizes in the flip-flop as well as the inverter chain were chosen to ensure correct precharging and evaluation.

#### 4.5.2.1 Corner Analysis

| Corner | TP1 (ps) | TE1 (ps) | TP2 (ps) | TE2 (ps) |
|--------|----------|----------|----------|----------|
| TT     | 554      | 276      | 946      | 225      |
| FF     | 655      | 179      | 998      | 168      |
| FS     | 586      | 185      | 1055     | 174      |
| SF     | 648      | 248      | 898      | 203      |
| SS     | 407      | 390      | 876      | 331      |

Table 4.2 DE-SAFF precharge and evaluation intervals obtained for different corners

The extracted circuit was simulated for five corners, namely *Typical-Typical* (TT), *Fast-Fast* (FF), *Fast-Slow* (FS), *Slow-Fast* (SF), and *Slow-Slow* (SS), with a sinusoidal clock signal of 500 MHz. The precharge and evaluation intervals were obtained for each corner as shown in Table 4.2. The table illustrates that in the FF corner, minimum evaluation intervals (TE1, TE2) occur since the inverters in the inverter chain experience minimum delays. The opposite is true for the SS corner where the inverters experience maximum delays, thus resulting in minimum precharging intervals (TP1, TP2). The flip-flop was simulated under each corner to make sure that minimum precharge intervals are long enough to guarantee correct functionality and that minimum evaluation intervals are long enough to ensure correct evaluation of the output.

| Corner | $T_{CL}$ (+ve) (ps) | $T_{CL}$ (-ve) (ps) |
|--------|---------------------|---------------------|
| TT     | 977                 | 534                 |
| FF     | 932                 | 713                 |
| FS     | 995                 | 550                 |
| SF     | 901                 | 641                 |
| SS     | 1021                | 528                 |

 Table 4.3

 Combinational logic delay obtained for each corner at positive and negative clock edges

Since variations in the precharge and evaluation intervals can affect the data path shown in Figure 4.6, Equations 4.6 and 4.7 were used to estimate the minimum combinational logic delay $T_{CL}$ at both positive and negative edges of the clock for each corner with the $T_{DCLK+VE,-VE}$ and $T_{DQ+VE,-VE}$ previously introduced in Table 4.1. Note that $T_{CLKQ}=T_{DQ}-T_{setup}$. Table 4.3 presents the combinational logic delay obtained for this design at the positive and negative clock edges. As shown in Table 4.3, the combinational logic delay obtained for each corner is greater than 500 ps. In order to ensure that the timing constraints of Equations 4.6 and 4.7 are not violated under process variation and for the chosen values of $T_{setup}$ and $T_{DQ}$, the maximum combinational logic delay $T_{CL}$ for this design should be restricted to approximately 500 ps or less.
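The corner observations above can be checked mechanically. The following is a minimal sketch that transcribes Tables 4.2 and 4.3 and looks up the limiting corner for each interval. The dictionary and function names are illustrative assumptions, not part of the original design flow.

```python
# Intervals transcribed from Table 4.2 (all values in ps).
INTERVALS_PS = {
    "TT": {"TP1": 554, "TE1": 276, "TP2": 946, "TE2": 225},
    "FF": {"TP1": 655, "TE1": 179, "TP2": 998, "TE2": 168},
    "FS": {"TP1": 586, "TE1": 185, "TP2": 1055, "TE2": 174},
    "SF": {"TP1": 648, "TE1": 248, "TP2": 898, "TE2": 203},
    "SS": {"TP1": 407, "TE1": 390, "TP2": 876, "TE2": 331},
}

# Combinational logic delays transcribed from Table 4.3 (ps):
# (positive clock edge, negative clock edge).
TCL_PS = {
    "TT": (977, 534),
    "FF": (932, 713),
    "FS": (995, 550),
    "SF": (901, 641),
    "SS": (1021, 528),
}


def limiting_corner(interval):
    """Return the corner in which the given interval is shortest."""
    return min(INTERVALS_PS, key=lambda corner: INTERVALS_PS[corner][interval])


def smallest_logic_delay():
    """Return the smallest T_CL entry over all corners and both clock edges."""
    return min(min(pair) for pair in TCL_PS.values())


if __name__ == "__main__":
    for name in ("TE1", "TE2", "TP1", "TP2"):
        print(name, "is shortest in the", limiting_corner(name), "corner")
    print("smallest T_CL entry:", smallest_logic_delay(), "ps")
```

With the tabulated data, the shortest evaluation intervals fall in the FF corner, the shortest precharge intervals fall in the SS corner, and every $T_{CL}$ entry stays above 500 ps.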

#### 4.5.2.2 Supply Voltage

| Percentage change in supply voltage | TP1<br>(ps) | TE1<br>(ps) | TP2<br>(ps) | TE2<br>(ps) |
|-------------------------------------|-------------|-------------|-------------|-------------|
| VDD+5%                              | 586         | 274         | 910         | 230         |
| VDD-5%                              | 496         | 283         | 978         | 243         |

 Table 4.4

 Supply voltage effect on precharging and evaluation intervals

Table 4.4 illustrates the effect of a static increase or decrease in the power supply on the generated precharge and evaluation intervals. It shows that as the supply voltage increases, the delay of the inverters in the inverter chain decreases, causing TE1 and TE2 to decrease and TP1 to increase. However, we notice that TP2 increases as the supply voltage decreases.

In order to explain why TP2 increases with decreased supply voltage, the generated clock signals of the inverter chain were plotted for supply voltages of 1.2 V and 0.85 V as shown in Figure 4.9. These values for the supply voltage were chosen to better illustrate the behavior of the generated clock signals. The falling edge of CLK4 determines the start of TP2 and the rising edge of CLK2 defines its end. As shown in the figure, lowering the supply voltage has a greater impact on the rising edge of the generated signal compared to the falling edge. This is due to the fact that the inverters in the chain are not matched; instead $W_p$ was chosen to be equal to $W_n$ in order to reduce the area of the inverter. Since the PMOS transistor is slower than the NMOS transistor, a longer delay is observed between the rising edges of CLK2 and CLK4 for different values of supply voltage while the falling edges of these signals coincide and experience virtually no delay. Hence, TP2 starts at approximately the same time with both values of the supply voltage (falling edge of CLK4) and ends with a delay in the lower supply voltage case (rising edge of CLK2), which causes the increase in TP2 when lowering the supply voltage.

Figure 4.9: Effect of supply voltage variation on TP2

#### 4.5.2.3 Temperature Variation

The spatial temperature gradient used in [39], in which the temperature changes from 25 to 125°C in steps of 25°C, was adopted here to investigate the temperature effect on the generated precharging and evaluation intervals as shown in Figure 4.10. As illustrated in the figure, the precharging and evaluation intervals are largely independent of temperature variation: less than 40 ps difference in all the precharging and evaluation intervals was observed as the temperature increases from 25°C to 125°C.

Figure 4.10: Temperature variation effect on precharging and evaluation intervals

#### 4.5.2.4 Extreme Case

The proposed DE-SAFF performance was also simulated under a worst case scenario with a chosen high temperature of 125°C, a varying supply voltage of $\pm 10\%$, and the FF and SS corners. Simulation results show that the DE-SAFF functions correctly at this temperature with an increase in the supply voltage by 10% for the two corners. However, when the supply voltage decreases by 10%, only the FF corner functions correctly while the SS corner fails. This is because lowering $V_{DD}$ decreases the voltage swing of the generated clock signals. Hence the gate voltage of the NMOS transistors being fed by CLK2, CLK4, and CLK5 is reduced, causing the current flowing in these transistors to decrease so that the SET/RESET nodes cannot be pulled fully to ground. In addition, the precharge intervals are lowest at this corner. In order to ensure correct functionality of the DE-SAFF with the SS corner at 10% lower supply voltage, the gate width of the precharging PMOS transistors needs to be increased by $2.5W_{min}$ and that of the NMOS transistors (MN5, 6, 7, and 8 in Figure 4.5) by $0.5W_{min}$.

More accurate variability modeling can be achieved by accurately considering correlations through statistical treatment of variability. In fact, corner analysis increases design difficulty and results in overly pessimistic simulations since all parameters are assumed to be independent of each other [46].

#### 4.5.3 Sharing the Inverter Chain

The SAFF chosen to illustrate the modified dual-edge triggering scheme has two precharging paths, one for the SET node and the other for the RESET node. This leads to the addition of six extra transistors in the PUN instead of only three as is the case for the PDN (Figure 4.5), thus increasing DE-SAFF area compared to SE-SAFF. The same is true for power. Inverter chain sharing between several dual-edge triggered flip-flops has been proposed in the literature as a means of reducing area and power overhead [47], [48]. We have investigated the pros and cons of inverter chain sharing where the inverter chain is shared between two sequentially adjacent flip-flops. Simulation results show that the standard deviation of the precharge and evaluation intervals under process variation decreases with an increased number of flip-flops sharing the inverter chain. This is due to the larger transistor sizes used in the chain, which reduce the effects of process variations [49], [50]. Sharing the inverter chain also reduces area overhead from double the area in the case of one DE-SAFF to 70% and 66% in the cases of sharing the inverter chain between four and eight flip-flops, respectively. However, sharing the inverter chain between four and eight flip-flops causes an increase in power by 44% and 48% compared to the DE-SAFF with no sharing. The reason for this increase in power is the additional three wires carrying the CLK2, CLK4, and CLK5 signals to the remaining flip-flops in the array. The inverter chain used in sharing consisted of only four inverters with $W_p=W_n=14W_{min}$ and $W_p=W_n=30W_{min}$ in the four and eight flip-flop array cases, respectively.

#### 4.5.4 Comparing the DE-SAFF to Other Flip-Flops

Table 4.5 Timing characteristics of the DE-SDFF, DE-DCCFF, and DE-SAFF at a clock frequency of 250MHz

|                  | Tsetup (ps) | TDQ (ps) | T_hold (ps) |
|------------------|-------------|----------|-------------|
| DE-SDFF          | -50         | 182      | 190         |
| DE-DCCFF         | 75          | 188      | 275         |
| Proposed DE-SAFF | 225         | 351      | 205         |

The proposed DE-SAFF was compared to the Dual-Edge Static Differential Flip-Flop (DE-SDFF) and the Differential Conditional Capturing Flip-Flop (DE-DCCFF) presented in [5] operating with a sinusoidal clock signal. The flip-flops were simulated at a clock frequency of 250 MHz with throughput of 500 MHz at 50% data switching activity. The total transistor widths of the proposed DE-SAFF are 41% and 42% less than those of the DE-SDFF and DE-DCCFF flip-flops, respectively. In addition, the DE-SAFF consumes 32% and 29% less power than the DE-SDFF and DE-DCCFF, respectively. Table 4.5 presents a summary of the timing characteristics of the flip-flops.

|                  | Tsetup (ps) | TDQ (ps) | T_hold (ps) |
|------------------|-------------|----------|-------------|
| CD-SAFF          | -20         | 166      | 260         |
| AC-SAFF          | 0           | 166      | 360         |
| Proposed DE-SAFF | 10          | 210      | 70          |

Table 4.6 Timing characteristics of the CD-SAFF, AC-SAFF, and DE-SAFF at a clock frequency of 500MHz

In addition, the proposed DE-SAFF was compared to two square-wave driven dual-edge triggered flip-flops with similar structure, namely the Conditional Capturing Dual-edge Sense Amplifier flip-flop (CD-SAFF) and the Adaptive Clocking Dual-edge Sense Amplifier flip-flop (AC-SAFF) presented in [43], at a clock frequency of 500 MHz and throughput of 1 GHz. The proposed DE-SAFF has total transistor widths that are equal to those of the CD-SAFF and 5% less than the total transistor widths of the AC-SAFF. It also consumes 1% and 5% less power compared to the CD-SAFF and AC-SAFF, respectively. The timing characteristics of the flip-flops are given in Table 4.6.





Figure 4.11: Dual-edge triggered flip-flop output

Several factors affect the power savings achieved through dual-edge clocking. The power savings are a complex function of the type of the CDN, the resistances and capacitances of the CDN, the number of flip-flops, and the design of the flip-flop. It is not possible to reach an exact estimate of the power savings achieved through dual-edge triggering without knowledge of the entire system, especially since modeling of the clock system is difficult as the number of flip-flops and the size of the CDN are dependent on each system [51].

All flip-flop power measurements were conducted for the pseudorandom sequence with equal probability of all transitions, comprising worst and best cases, presented in [37], using 16 clock cycles for the single-edge and 8 clock cycles (16 positive and negative clock transitions) for the dual-edge triggered flip-flop. Figure 4.11 illustrates the dual-edge triggered flip-flop output for the data sequence for 8 clock cycles. The 16 evaluation intervals at positive and negative clock edges are also illustrated in the figure. The power consumption of the dual-edge triggered flip-flop is 37 $\mu$W, a 106% increase in power compared to the single-edge triggered flip-flop.

In order to estimate the percentage reduction in power achieved through dual-edge clocking, the power consumption of the resonant generator (presented in Chapter 6) and the clock tree is divided into two parts. The first part is the power of the progressively sized inverters driving the gates of transistors MP and MN and is given by:

$$P_{inverters} = C_{gMP} f V_{DD}^2 \left(\frac{n}{n-1}\right) + C_{gMN} f V_{DD}^2 \left(\frac{n}{n-1}\right)$$
(4.8)

where  $C_{gMP}$  and  $C_{gMN}$  are the gate capacitances of transistors MP and MN, respectively, f is the frequency of operation,  $V_{DD}$  is the supply voltage, and n is the stage gain.

The second part is the power consumed in the resonant clock network being driven by transistors MP and MN which is derived as a first order estimation:

$$P_{resonant} = \frac{R_{clk}}{2} \left( \pi f V_{DD} (C_{clk} + \alpha N C_{FF}) \right)^2$$
(4.9)

where  $R_{clk}$  is the resistance of the clock wires,  $C_{clk}$  is the loading capacitance of the clock tree,  $C_{FF}$  is the loading capacitance of the flip-flop, N is the number of flip-flops loading the clock tree, and  $\alpha$  is the factor by which the loading capacitance of the flip-flops at the clock leaves is reflected to the driver side.

Noting that the DE-SAFF consumes twice as much power compared to the SE-SAFF with approximately the same loading capacitance, the anticipated percentage reduction in power for the entire system through dual-edge clocking can be written as:

$$\text{Percentage Reduction in Power} = \frac{\frac{1}{2}(C_{gMP}+C_{gMN})fV_{DD}^{2}\left(\frac{n}{n-1}\right)+\frac{3}{8}R_{clk}(\pi fV_{DD}(C_{clk}+\alpha NC_{FF}))^{2}-N\times P_{SE}}{(C_{gMP}+C_{gMN})fV_{DD}^{2}\left(\frac{n}{n-1}\right)+\frac{1}{2}R_{clk}(\pi fV_{DD}(C_{clk}+\alpha NC_{FF}))^{2}+N\times P_{SE}}\times100\%$$
(4.10)

where $C_{gMP}$ and $C_{gMN}$ are the gate capacitances of transistors MP and MN, $R_{clk}$ is the resistance of the clock wires, $C_{clk}$ is the loading capacitance of the clock tree, $C_{FF}$ is the loading capacitance of the flip-flop, N is the number of flip-flops, $\alpha$ is the factor by which flip-flop loading capacitance is reflected to the driver side, f is the frequency of operation in the single-edge case, $V_{DD}$ is the supply voltage, n is the stage gain, and $P_{SE}$ is the power of the SE-SAFF.
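As a sanity check on Equation (4.10), the sketch below evaluates Equations (4.8) and (4.9) for the single-edge system at frequency f and for the dual-edge system at f/2 with twice the flip-flop power, and compares the resulting saving with the closed form. All numerical values in `EXAMPLE` are hypothetical placeholders chosen only for illustration; they are not taken from the design.

```python
import math


def inverter_power(c_gmp, c_gmn, f, vdd, stage_gain):
    """Equation (4.8): power of the progressively sized inverters driving MP and MN."""
    return (c_gmp + c_gmn) * f * vdd**2 * stage_gain / (stage_gain - 1)


def resonant_power(r_clk, c_clk, c_ff, n_ff, alpha, f, vdd):
    """Equation (4.9): first-order power of the resonant clock network."""
    c_total = c_clk + alpha * n_ff * c_ff
    return 0.5 * r_clk * (math.pi * f * vdd * c_total) ** 2


def reduction_direct(c_gmp, c_gmn, stage_gain, r_clk, c_clk, c_ff,
                     n_ff, alpha, f, vdd, p_se):
    """Percentage saving: single-edge at f versus dual-edge at f/2 with 2x flip-flop power."""
    single = (inverter_power(c_gmp, c_gmn, f, vdd, stage_gain)
              + resonant_power(r_clk, c_clk, c_ff, n_ff, alpha, f, vdd)
              + n_ff * p_se)
    dual = (inverter_power(c_gmp, c_gmn, f / 2, vdd, stage_gain)
            + resonant_power(r_clk, c_clk, c_ff, n_ff, alpha, f / 2, vdd)
            + n_ff * 2 * p_se)
    return 100.0 * (single - dual) / single


def reduction_closed_form(c_gmp, c_gmn, stage_gain, r_clk, c_clk, c_ff,
                          n_ff, alpha, f, vdd, p_se):
    """Equation (4.10) evaluated term by term."""
    p_inv = inverter_power(c_gmp, c_gmn, f, vdd, stage_gain)
    p_res = resonant_power(r_clk, c_clk, c_ff, n_ff, alpha, f, vdd)
    # (3/8) R (...)^2 equals 0.75 times the (1/2) R (...)^2 term of Equation (4.9).
    numerator = 0.5 * p_inv + 0.75 * p_res - n_ff * p_se
    denominator = p_inv + p_res + n_ff * p_se
    return 100.0 * numerator / denominator


# Hypothetical parameter set (SI units), for illustration only.
EXAMPLE = dict(c_gmp=50e-15, c_gmn=25e-15, stage_gain=3.0, r_clk=10.0,
               c_clk=10e-12, c_ff=2e-15, n_ff=100, alpha=1.0,
               f=500e6, vdd=1.0, p_se=18e-6)

if __name__ == "__main__":
    print(reduction_direct(**EXAMPLE), reduction_closed_form(**EXAMPLE))
```

The two functions agree for any parameter set, which confirms that the factors 1/2 and 3/8 in the numerator follow from halving f in Equations (4.8) and (4.9). The result becomes negative when the flip-flop term $N \times P_{SE}$ dominates, since each DE-SAFF is taken to consume twice the power of an SE-SAFF.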

The number of flip-flops and the size of the CDN are dependent on each system [5], [29], [51]. Figure 4.12 is a three dimensional plot of the percentage reduction in power achieved through dual-edge clocking as a function of the clock tree capacitance ($C_{clk}$) and the number of flip-flops (N). As shown in the figure, when the clock capacitance is the dominating factor, dual-edge clocking can achieve up to 58% reduction in power. It should be noted that in plotting Figure 4.12, the size of transistors MP and MN of the clock driver (refer to Chapter 6 for more details) was kept constant as the capacitance of the clock tree increases. Simulation results have also shown that dual-edge triggering allows up to $6\times$ reduction in the width of transistors MP and MN in the clock generator. This corresponds to a reduction in the clock generator area of approximately 83%. Though dual-edge triggering in resonant CDNs would require an inductor that is $4\times$ bigger than the inductor in the single-edge case, the increase in area due to the larger inductor was neglected since active circuits can be placed in the area under the inductor.
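The two scaling figures quoted above can be reproduced with simple arithmetic. The sketch below assumes that the clock generator area scales linearly with the width of MP and MN, and that the inductor follows the ideal resonance relation $L = 1/((2\pi f)^2 C)$ at fixed capacitance. Both are simplifying assumptions, and the function names are illustrative.

```python
def area_reduction_percent(width_reduction_factor):
    """Percentage area saved when transistor width shrinks by the given factor,
    assuming area is proportional to width."""
    return 100.0 * (1.0 - 1.0 / width_reduction_factor)


def inductor_scale(frequency_ratio):
    """Factor by which L changes when the resonant frequency is scaled by
    frequency_ratio at fixed capacitance, using L = 1 / ((2*pi*f)**2 * C)."""
    return 1.0 / frequency_ratio**2


if __name__ == "__main__":
    print(area_reduction_percent(6))  # width reduced by 6x
    print(inductor_scale(0.5))        # clock frequency halved
```

Under these assumptions, a $6\times$ narrower driver saves about 83% of the area, and halving the clock frequency calls for an inductor four times larger.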



Figure 4.12: Dual-edge clocking percentage reduction in power

### 4.6 Conclusion

In this chapter we have applied a modified clocking scheme to enable dual-edge clocking in the SAFF with a resonant clock. This scheme reduces short circuit power by allowing the precharging transistors to be switched on only for a portion of the clock period. The precharging and evaluation intervals generated using this scheme have been characterized. The transistor sizes in the inverter chain must be carefully chosen in order to ensure that minimum precharging and evaluation intervals are long enough to guarantee correct evaluation of the output. In addition, the effects of variations in process, supply voltage, and temperature (PVT) on the precharging and evaluation intervals and consequently the operation of the flip-flop were investigated. Sharing the inverter chain between several flip-flops reduces area overhead as well as susceptibility to variations but causes an increase in power. Modeling the entire system of the CDN with the proposed flip-flop illustrates that dual-edge resonant clocking has the potential of achieving up to 58% reduction in power when the clock capacitance is the dominating factor. The proposed flip-flop has lower total transistor width and power consumption compared to other dual-edge triggered flip-flops presented in the literature.

## Chapter 5 Application of Low-Swing Clocking to LC Resonant Clock Distribution Networks

In the previous chapter, reducing clock frequency by half through dual-edge triggering was used to save power. In this chapter we reduce power through a reduction in clock swing by introducing a new flip-flop for use in a low-swing LC resonant clocking scheme. The proposed Low-Swing Differential Conditional Capturing Flip-Flop (LS-DCCFF) operates with a low-swing sinusoidal clock through the utilization of reduced swing inverters at the clock port. The functionality of the proposed flip-flop was verified at extreme corners through simulations with parasitics extracted from layout. The LS-DCCFF enables 6.5% reduction in power compared to the full-swing flip-flop with 19% area overhead. In addition, a frequency dependent delay associated with driving pulsed flip-flops with a low-swing sinusoidal clock has been characterized. The LS-DCCFF has 870 ps longer data to output delay compared to the full-swing flip-flop at the same setup time for a 100 MHz sinusoidal clock. The functionality of the proposed flip-flop was tested and verified by using the LS-DCCFF in a dual-mode MAC unit fabricated in TSMC 90-nm CMOS technology. Low-swing resonant clocking achieved around 5.8% reduction in total power with 5.7% area overhead for the MAC. Modeling the clock network with the proposed flip-flop illustrates that low-swing clocking can achieve up to 58% reduction in the power consumption of the resonant clock.

### 5.1 Introduction

C. Kim *et al.* [44] demonstrated that a low-swing square-wave clock double-edge triggered flip-flop enabled 78% power savings in the CDN. Low-swing clocking would normally require two voltage levels, $V_{DD}$ and $V_{DD-Low}$. These voltage levels can be generated using one of two schemes: (i) dual-supply voltages, and (ii) regular power supply. The first scheme adds circuit complexity and extra area to the overall chip design and layout. However, it leads to a reduction in the number of clock network transistors, which improves power savings [52]. The second scheme uses circuit methods to achieve low-swing. However, the design of low-swing buffers becomes challenging in the absence of a second power supply [52].

We have followed a similar approach to the one proposed in [7] in which the clock buffers are removed to allow the global and local clock energy to resonate between the inductor and entire clock capacitance enabling maximum power savings. In addition, removing the clock buffers simplifies LC low-swing clocking since only reduced swing buffers are used at the flip-flop gate and not in intermediate levels within the clock tree [53].

### 5.2 Low-Swing LC Resonant Clocking

#### 5.2.1 Low-Swing Differential Conditional Capturing Flip-Flop (LS-DCCFF)

Figure 5.1 shows the proposed LS-DCCFF. Conditional capturing is used to minimize flip-flop power at low data switching activities by eliminating redundant internal transitions [28]. As shown in Figure 5.1, reduced swing inverters similar to the one presented in [53] are used at the node fed by the low-swing sinusoidal clock signal. This is done to reduce short circuit power by minimizing the interval during which both the PMOS and NMOS of the inverter turn on simultaneously. The load PMOS transistor in the reduced swing inverters is always in saturation since $V_{gs} = V_{ds}$. It lowers the voltage at the source of the second PMOS in each inverter to approximately $V_{DD} - |V_{TP}|$, thus turning it off when the low-swing sinusoidal clock signal reaches its peak voltage. The peak voltage for the low-swing clock was chosen to be equal to $0.65V_{DD}$ since the threshold voltage of the PMOS transistor is approximately -0.34 V.

Figure 5.1: Low-Swing Differential Conditional Capturing Flip-Flop (LS-DCCFF)

From here on and for simplicity, the terms LS and FS refer to low-swing and full-swing, respectively.



#### 5.2.2 Delay Associated with Low-Swing LC Resonant Clocking

Figure 5.2: Delay between the low- and full-swing resonant clock signals to reach  $V_{pull\_down}$ 

 $V_{pull\_down}$  presented in Figure 5.2 is the voltage level at which transistor *MN1* with the resonant clock signal applied to its gate (Figure 5.1) is able to pull down node *SET/RESET* to the low voltage level required to trigger the NAND latch. Due to the time difference between the low- and full-swing sinusoidal clock signals to reach  $V_{pull\_down}$ , the low-swing flip-flop experiences longer data to output delay ( $T_{DQ}$ ) compared to the full-swing flip-flop for the same setup time ( $T_{DCLK}$ ).

In the following, an analysis is conducted to estimate the delay in reaching  $V_{pull\_down}$  for the low-swing resonant clock signal. Let the full- and low-swing clock signals be given by the following equations:

$$v(t)_{full\_swing} = \frac{1}{2} V_{DD} \sin(2\pi f t - \frac{\pi}{2}) + \frac{1}{2} V_{DD}$$
(5.1)

$$v(t)_{low\_swing} = \frac{0.65}{2} V_{DD} \sin(2\pi f t - \frac{\pi}{2}) + \frac{0.65}{2} V_{DD}$$
(5.2)

where *f* is the clock frequency, and $V_{DD}$ and $0.65V_{DD}$ are the peak voltages of the full- and low-swing sinusoidal clock signals, respectively.

Depending on the input state, either node *SET* or *RESET* is pulled down to trigger the NAND latch when the clock signal reaches  $V_{pull\_down}$ . Substituting this value in equation (5.1) and referring to Figure 5.2:

$$V_{pull\_down} = \frac{1}{2} V_{DD} \sin(2\pi f T_1 - \frac{\pi}{2}) + \frac{1}{2} V_{DD}$$
(5.3)

from which:

$$T_{1} = \frac{1}{2\pi f} \left( \sin^{-1} \left( \frac{2V_{pull\_down}}{V_{DD}} - 1 \right) + \frac{\pi}{2} \right)$$
(5.4)

Using the same approach for the low-swing clock signal:

$$T_2 = \frac{1}{2\pi f} \left( \sin^{-1} \left( \frac{2V_{pull\_down}}{0.65V_{DD}} - 1 \right) + \frac{\pi}{2} \right)$$
(5.5)

The time difference between the two clock signals to reach  $V_{pull\_down}$  which defines the  $T_{DQ}$  delay between the low- and full-swing flip-flops is given by:

$$T_{DQ\_delay} = T_2 - T_1 = \frac{1}{2\pi f} \left( \sin^{-1} \left( \frac{2V_{pull\_down}}{0.65V_{DD}} - 1 \right) - \sin^{-1} \left( \frac{2V_{pull\_down}}{V_{DD}} - 1 \right) \right)$$
(5.6)

Equation (5.6) gives the delay between the full- and low-swing flip-flops. It illustrates that this delay is inversely proportional to clock frequency, i.e., at higher frequencies, the delay decreases.
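Equations (5.4) to (5.6) can be evaluated numerically. The following sketch uses $V_{DD}$ = 1 V, $V_{pull\_down}$ = 500 mV, and f = 100 MHz, the values reported for the test chip later in this chapter. The function names are illustrative and not part of the original work.

```python
import math


def time_to_reach(v_target, v_peak, f):
    """Equations (5.4)/(5.5): time at which a raised sinusoid with the given
    peak voltage and frequency first reaches v_target on its rising half."""
    return (math.asin(2.0 * v_target / v_peak - 1.0) + math.pi / 2.0) / (2.0 * math.pi * f)


def tdq_delay(v_pull_down, vdd, f, swing=0.65):
    """Equation (5.6): extra delay of the low-swing clock (peak = swing * vdd)
    relative to the full-swing clock in reaching v_pull_down."""
    t1 = time_to_reach(v_pull_down, vdd, f)          # full-swing, Equation (5.4)
    t2 = time_to_reach(v_pull_down, swing * vdd, f)  # low-swing, Equation (5.5)
    return t2 - t1


if __name__ == "__main__":
    delay = tdq_delay(v_pull_down=0.5, vdd=1.0, f=100e6)
    print(f"T_DQ delay at 100 MHz: {delay * 1e12:.0f} ps")
    print(f"T_DQ delay at 200 MHz: {tdq_delay(0.5, 1.0, 200e6) * 1e12:.0f} ps")
```

For these values the estimate is roughly 905 ps at 100 MHz, and doubling the clock frequency halves it, in line with the inverse proportionality noted above.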

#### 5.2.3 Power

Following the approach proposed in Chapter 4, the power dissipation of the resonant clock network is given by the following equation:

$$P_{resonant\_clock} = \frac{R_{clk}}{2} (\pi f V_{peak} (C_{clk} + \alpha N C_{FF}))^2$$
(5.7)

where  $R_{clk}$ ,  $C_{clk}$  are the clock capacitance and resistance as seen by the driver, f and  $V_{peak}$ are the frequency and peak voltage of the generated clock signal,  $C_{FF}$  is the loading capacitance of the flip-flop, N is the number of flip-flops, and  $\alpha$  is the factor by which the loading capacitance of the flip-flop connected at clock leaves is reflected to the driver side. Equation (5.7) illustrates that generating a low-swing clock signal with  $V_{peak} =$  $0.65V_{DD}$  results in around 58% power reduction in the clock network.

### 5.3 Test Chip

To demonstrate the correct operation of the proposed LS-DCCFF and to highlight potential power savings enabled through low-swing clocking, a test chip with a MAC unit designed using the proposed flip-flop under low-swing sinusoidal clocking was fabricated in TSMC 90-nm CMOS technology.



Figure 5.3: Modification to enable full- and low-swing flip-flop clocking

The MAC unit in the test chip consists of a 16×16-bit multiplier, a 32-bit serial-in parallel-out shift register to load the multiplier and multiplicand, 32-bit full-adder array, two 32-bit parallel-in parallel-out registers at the adders input and output, and a 33-bit parallel-in serial-out shift register at the output stage. Since the multiplier itself was not pipelined, a clock frequency of 100 MHz was chosen for the test chip.

Due to the large inductor needed for clock generation and the limited area available, the clock generator was not implemented on-chip. The sinusoidal clock signal is fed by an external source through an analog pad. Furthermore, the DCCFF was modified to enable a dual mode of operation for the MAC unit under full- and low-swing clocking without significant area overhead. As illustrated in Figure 5.3, the LS-DCCFF presented in Figure 5.1 was modified at node X to allow operation under full- and low-swing clocking. When signal *FULL\_SWING* is high, full-swing clocking is enabled and the inverted clock output of the normal inverters *CLKD\_FS* feeds transistor *MN1*. Low-swing clocking is enabled when signal *FULL\_SWING* is low, in which case the output of the reduced voltage swing inverters *CLKD\_LS* feeds transistor *MN1*.

Figure 5.4: Simplified floorplan of the test chip

Figure 5.5: Die photograph of the test chip

A simplified floorplan and a die photo of our chip are shown in Figures 5.4 and 5.5, respectively. The chip covers an area of 1 mm × 1 mm. Two separate instances of the full- and low-swing DCCFF were implemented at the lower portion of the chip for testing. Due to the large capacitance associated with the pads, all outputs were connected to the pads through a buffer stage consisting of four progressively sized inverters (Figure 5.4).

### 5.4 Test Chip Extracted Simulation and Measurements

Figure 5.6 demonstrates the correct operation of the LS-DCCFF at a supply voltage of 1 V with an operating frequency of 100 MHz and a low-swing sinusoidal clock. This figure shows the low-swing sinusoidal clock signal (channel 1, first signal from top), the inverted clock signal (channel 2, second from top), the input D (channel 3, third from top), and the output Q (channel 4).

HSPICE post-layout simulation on extracted circuits verifies correct functionality of both flip-flops under best conditions of the *Fast-Fast (FF)* corner at a low temperature of -25°C, normal conditions of the *Typical-Typical (TT)* corner at room temperature, and worst conditions of the *Slow-Slow (SS)* corner at a high temperature of 125°C. An average reduction in the $T_{DQ}$ delay of 130 ps was observed in the *FF* corner whereas the *SS* corner resulted in a 76 ps increase in delay compared to the *TT* corner. Furthermore, correct functionality of both flip-flops under low- and full-swing sinusoidal clocking with ±10% variation in the supply voltage was verified through measurements.

Figure 5.6: Measurement waveforms of the LS-DCCFF at 100MHz

Post-layout simulation results presented in Figure 5.7 illustrate that for the same setup time, the difference between the $T_{DQ}$ delays for the full- and low-swing flip-flops is approximately 870 ps. This confirms the accuracy of equation (5.6) with an error of 4% compared to simulation results for $V_{pull\_down}$ = 500 mV. The measurement results presented in the figure (limited by our experimental setup)<sup>3</sup> are in close proximity to post-layout simulation. The extra delay associated with measurements can be related to the extra capacitance of the pads, package, wires, and test fixture. The response presented in Figure 5.7 was obtained at the $\frac{1}{2}V_{DD}$ voltage level for the data *D* and output *Q* waveforms and at half of the clock peak for the sinusoidal clock signals, i.e., at 0.5 V and 0.325 V for the full- and low-swing clock signals, respectively.

<sup>3</sup>The function generator used for testing is AFG3101 which can generate sinusoidal and square signals with a maximum frequency of 100 MHz and 50 MHz, respectively. Since the generator has only one channel, the Reference Out of one generator was connected to the Reference In of the second generator for synchronization.

(b) LS-DCCFF

Figure 5.7: $T_{DQ}$ delay versus setup time for the full- and low-swing flip-flops

The behavior of the current flowing in node X, i.e., $I\_FS$ for the full-swing flip-flop and $I\_LS$ for the low-swing flip-flop in Figure 5.1, as well as the voltage level of nodes SET, Q, and QB for the two flip-flops at the same setup time of 950 ps is illustrated in Figure 5.8. As shown in the figure, the maximum current flowing from node X to ground in the full- and low-swing flip-flops occurs when the full-swing clock $CLK\_FS$ and low-swing clock $CLK\_LS$ at the gate of transistor MN1 reach $V_{pull\_down}$ = 500 mV. At this point, node SET is pulled down and the output Q of the NAND latch is pulled up to $V_{DD}$. When QB is grounded, transistor MN3 turns off, thus cutting the flow of the current.

As illustrated in Figure 5.7, the data to output delay $T_{DQ}$ becomes independent of $T_{DCLK}$ when data is applied on or after the point where the clock signal reaches or exceeds $V_{pull\_down}$, since at this point transistors MN1/MN2 are completely switched on and are able to directly sink node *SET* or *RESET*. This occurs in the full-swing case for a setup time less than or equal to 0 ps, i.e., when input *D* is applied at or after point $T_1$ in equation (5.4). In the low-swing case, $T_{DQ}$ becomes independent of $T_{DCLK}$ when input *D* is applied at or after point $T_2$ in equation (5.5), i.e., at a setup time less than or equal to -905 ps, which is the time difference between the points at which the low-swing sinusoidal clock signal reaches $\frac{0.65}{2}V_{DD}$ and $V_{pull\_down}$.

Figure 5.7 also shows that the low-swing flip-flop can operate at a negative setup time of approximately -2000 ps whereas the full-swing flip-flop can only operate at a negative setup time of approximately -950 ps. This is because the reduced swing inverters in the low-swing flip-flop experience more delay than the normal inverters used in the full-swing flip-flop. Figure 5.9 presents the behavior of the full- and low-swing flip-flops with a negative setup time of -1,100 ps. As shown in the figure, at this setup time, the inverted clock signal in the full-swing flip-flop $CLKD\_FS$ has already reached ground, thus turning off transistor MN2 and cutting the flow of current $I\_FS$. Node SET is not pulled down and the full-swing flip-flop does not capture data. However, due to the long delay of the reduced swing inverters, the inverted low-swing clock signal $CLKD\_LS$ is still at $V_{DD}$, enabling the current $I\_LS$ to flow and pull down node SET to latch the output Q to $V_{DD}$.

Figure 5.8: $T_{DQ}$ for the full- and low-swing flip-flops at the same setup time – extracted simulation



Figure 5.9:  $T_{DQ}$  for the full- and low-swing flip-flops at a negative setup time of -1,100ps - extracted simulation

Table 5.1 gives the area and power overhead of low-swing compared to full-swing resonant clocking. The area was estimated at gate level. As shown in the table, the LS-DCCFF experiences a 6.5% reduction in power compared to the full-swing flip-flop with an area overhead of 19%. Static power consumption in the full- and low-swing flip-flops was assumed to be equal since the flip-flops have exactly the same transistor sizes except for the load PMOS in the reduced swing inverters. The table also illustrates that the application of low-swing clocking with the LS-DCCFFs causes a 5.7% increase in total area and a 5.8% reduction in total power consumption for the MAC unit.

|                                               | LS-DCCFF | MAC unit <sup>*</sup> |
|-----------------------------------------------|----------|-----------------------|
| Area (µm <sup>2</sup> )                       | 43       | 16,669                |
| % Increase in area compared to full-swing     | 19       | 5.7                   |
| Power (µW)                                    | 5.59     | 1,506                 |
| % Decrease in power compared to full-swing    | 6.5      | 5.8                   |

Table 5.1 Area and power comparison between full- and low-swing clocking

\* Including clock generator power

The clock distribution network capacitance was estimated by extracting the layout with Mentor Graphics' Calibre PEX extractor and then simulating the extracted netlist in Synopsys' HSPICE simulator. Simulation results on the extracted network show that the clock net has a total capacitance of 8.48 pF. The inductor needed to resonate the clock network at 100 MHz is approximately 0.32  $\mu$H. Such a large inductor would normally be connected off-chip. Using the approach proposed in Chapter 6 to estimate the required driver strength shows that, to resonate the clock tree at 100 MHz with a full-swing sinusoidal clock, the width of the PMOS transistor in the driver would be approximately 3.97  $\mu$m. To generate the low-swing clock signal with a reduced peak voltage of  $0.65V_{DD}$, the size of the transistor in the clock generator can be reduced by 66%. The scheme proposed in [54] to control the amplitude of the clock signal can be used to ensure the integrity of the generated clock with the desired peak voltage.
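The order of magnitude of the required inductor can be checked with the ideal LC resonance relation $L = 1/((2\pi f)^2 C)$. The sketch below is a minimal illustration under that idealised assumption (a lossless tank with the clock net as the only capacitance); the function name is hypothetical.

```python
import math


def resonant_inductance(capacitance_f, frequency_hz):
    """Inductance (H) that resonates with capacitance_f at frequency_hz: L = 1 / ((2*pi*f)^2 * C)."""
    omega = 2.0 * math.pi * frequency_hz
    return 1.0 / (omega ** 2 * capacitance_f)


# Values quoted in the text: 8.48 pF clock net, 100 MHz target frequency.
L_ideal = resonant_inductance(8.48e-12, 100e6)
print(f"ideal resonant inductance: {L_ideal * 1e6:.2f} uH")
```

The ideal relation gives roughly 0.3 µH, close to the approximately 0.32 µH quoted above. This simple formula does not account for the small difference.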



Figure 5.10: Percentage reduction in power for the resonant clock network achievable through low-swing clocking

The percentage reduction in power achievable through low-swing clocking reported in Table 5.1 is based on the CDN and the number of flip-flops of the MAC unit. However, the number of flip-flops and the size of the CDN, and hence its capacitance, depend on the system. Figure 5.10 is a three-dimensional plot of the percentage reduction in power of the CDN, including the flip-flops, achieved through low-swing clocking as a function of the clock capacitance and the number of flip-flops. As shown in the figure, low-swing clocking enables around 7% reduction in the power of the clock network for the MAC unit with 8.48 pF capacitance and 129 flip-flops. The figure also illustrates that when the clock capacitance is the dominating factor, low-swing clocking can achieve up to 58% reduction in clock power.
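The 58% upper bound is consistent with a simple scaling argument. The sketch below assumes that the power dissipated in the clock wiring scales with the square of the clock swing (as in Equation 6.3 of Chapter 6) and that the low level of the clock is unchanged, so a peak voltage of $0.65V_{DD}$ corresponds to a swing ratio of 0.65. The weighted average and the `cdn_share` parameter are hypothetical simplifications, not the model used to produce Figure 5.10.

```python
def swing_power_reduction(swing_ratio):
    """Fractional power saving when dissipation scales with swing squared (assumption)."""
    return 1.0 - swing_ratio ** 2


def blended_reduction(cdn_share, cdn_reduction, ff_reduction):
    """Weighted saving for a clock network whose full-swing power splits into
    cdn_share (wiring) and 1 - cdn_share (flip-flops). cdn_share is hypothetical."""
    return cdn_share * cdn_reduction + (1.0 - cdn_share) * ff_reduction


cdn_saving = swing_power_reduction(0.65)      # peak voltage of 0.65 VDD from the text
for share in (0.0, 0.5, 1.0):                 # illustrative wiring-power shares
    total = blended_reduction(share, cdn_saving, 0.065)  # 6.5% flip-flop saving (Table 5.1)
    print(f"CDN share {share:.1f}: {100 * total:.1f}% reduction")
```

Under these assumptions the saving approaches $1 - 0.65^2 \approx 0.58$ when the wiring dominates and falls towards the flip-flop saving of 6.5% when the flip-flops dominate.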

### 5.5 Conclusion

We have proposed a low-swing sinusoidally clocked flip-flop to obtain further power reduction in LC resonant CDNs. Low-swing resonant clocking in pulsed flip-flops results in a delayed flip-flop response. Theoretical analysis has been performed and the delay associated with low-swing sinusoidal clocking was characterized.

The functionality of the proposed flip-flop has been investigated through HSPICE simulation on the extracted circuit layout at extreme corners and tested through on-chip measurements. A MAC unit designed using the proposed flip-flop was tested on-chip; low-swing resonant clocking achieved a reduction of around 5.8% in total power with a 5.7% area overhead.

# Chapter 6 Estimating Required Driver Strength in the LC Resonant Clock Generator

A detailed analytical approach is proposed to determine the required driver strength in the LC resonant clock generator. The proposed approach reduces area and power overhead by eliminating the need to have switches with programmable widths and reference pulses with programmable duty cycles. Simulation results show accurate estimation of the required driver strength at short pulse widths. However, as the pulse width increases, accuracy is reduced due to overestimation of the transistor driving capability.

### 6.1 Introduction

The resonant clock generators presented in [55], [56] use different combinations of programmable switches and programmable duty cycles of the reference pulses to generate a resonant clock signal with minimum power dissipation (Figures 6.1 and 6.2). The approach used to determine the optimum combination of required driving strength and duty cycle leads to overhead in complexity and area needed to implement the programming circuitry.

While other authors have presented different LC resonant clock generators with programmable drivers and reference pulses, none of them has addressed the need to estimate the required driver strength at an early stage of the design. In this chapter, an analytical approach is proposed to estimate the required driving capability of the driver in the LC resonant clock generator.

Figure 6.1: Relative power savings as a function of driver transistor width (w) and reference signal pulse (d) [55]

Figure 6.2: Clock generator with programmable delay [56]

### 6.2 Estimating Required Driver Strength

The LC resonant clock generator used in [5] and shown in Figure 6.3 is adopted here. The  $V_{DD}/2$  voltage source was replaced by a decoupling capacitance ( $C_{decap}$ ) following the approach used in [22]. The clock tree is modeled as an ideal RC network where  $C_{clk}$  and  $R_{clk}$  are the clock capacitance and resistance as seen by the driver. The generated sinusoidal clock signal is shown in Figure 6.4. The reference pulse ( $Vref_N$ ) switches on the NMOS transistor and pulls the clock signal down to  $V_{OL}$. The PMOS transistor receives a reference pulse ( $Vref_P$ ) to pull the clock signal up to  $V_{OH}$. The reference signals are inverted and are out of phase by 180 degrees.

The generated resonant clock signal shown in Figure 6.4 is given by the following equation:

$$vout(t) = \left(\frac{V_{OH} - V_{OL}}{2}\right)\cos(w_0 t) + \left(\frac{V_{OH} + V_{OL}}{2}\right) \tag{6.1}$$

 $V_{OH}$  and  $V_{OL}$  are the highest and lowest voltage levels of the generated sinusoidal clock signal and  $w_0 = 2\pi f$  is the angular resonant frequency.

The current flowing in  $R_{clk}$  is equal to:

$$i(t) = C_{clk} \frac{dv_{out}}{dt} = -\left(\frac{V_{OH} - V_{OL}}{2}\right) w_0 C_{clk} \sin(w_0 t) \tag{6.2}$$







Figure 6.4: Generated sinusoidal clock signal

The average power dissipated is given by:

$$P = \frac{1}{T} \int_0^T i(t)^2 R_{clk}\, dt = \frac{R_{clk}}{2} \left( \pi f C_{clk} (V_{OH} - V_{OL}) \right)^2 \tag{6.3}$$

Using:

$$i = \frac{P}{V_{DD}} = \frac{Q}{T} \tag{6.4}$$

where *i* is the average current and Q is the charge per cycle.

The charge that needs to be supplied in each clock cycle in order to sustain oscillation is equal to:

$$Q = \frac{R_{clk}T}{2V_{DD}} \left( \pi f C_{clk} (V_{OH} - V_{OL}) \right)^2 \tag{6.5}$$
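Equations 6.2 to 6.5 can be checked numerically. The sketch below evaluates the closed form of Equation 6.3, compares it with a direct numerical average of $i(t)^2 R_{clk}$ over one period, and then computes the charge per cycle from Equation 6.5. The operating point is the one listed in the caption of Table 6.1; the function names and the midpoint-rule integration are illustrative choices, not part of the thesis.

```python
import math


def average_power(r_clk, c_clk, f, v_oh, v_ol):
    """Closed form of Eq. 6.3: P = (R_clk / 2) * (pi * f * C_clk * (V_OH - V_OL))**2."""
    return 0.5 * r_clk * (math.pi * f * c_clk * (v_oh - v_ol)) ** 2


def average_power_numeric(r_clk, c_clk, f, v_oh, v_ol, steps=10_000):
    """Midpoint-rule average of i(t)**2 * R_clk over one period, with i(t) from Eq. 6.2."""
    period = 1.0 / f
    w0 = 2.0 * math.pi * f
    amplitude = 0.5 * (v_oh - v_ol) * w0 * c_clk
    dt = period / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        i_t = -amplitude * math.sin(w0 * t)
        total += i_t * i_t * r_clk * dt
    return total / period


def charge_per_cycle(power, f, v_dd):
    """Eq. 6.5 via Eq. 6.4: Q = P * T / V_DD."""
    return power / (f * v_dd)


# Operating point listed in the caption of Table 6.1.
p = average_power(0.5, 30e-12, 1e9, 0.95, 0.05)
p_num = average_power_numeric(0.5, 30e-12, 1e9, 0.95, 0.05)
q = charge_per_cycle(p, 1e9, 1.0)
print(f"P = {p * 1e3:.2f} mW (numeric {p_num * 1e3:.2f} mW), Q = {q * 1e12:.2f} pC per cycle")
```

For this operating point the dissipated power is about 1.8 mW, so roughly 1.8 pC must be supplied each cycle to sustain oscillation.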

The short-channel transistor model given in [57] is used to estimate the drain current  $(I_D)$  of the PMOS transistor in the clock generator. Let  $W_P$  be the gate width of the PMOS transistor, L the gate length,  $\mu_p$  the hole mobility,  $C_{ox}$  the gate oxide capacitance,  $V_{gs}$  and  $V_{ds}$  the gate and drain voltages with respect to the source,  $V_{th}$  the threshold voltage,  $\varepsilon_c$  the critical value of the electric field at which carrier velocity saturates, PW the pulse width of the reference signal ( $Vref_P$ ),  $PW_{edge}$  the pulse width of  $Vref_P$  at which  $V_{ds}=V_{ds\_sat}$  (i.e., at the edge of saturation), T the clock period, and  $\beta(t)=I_D(t)/W_P$. The following equations can then be written:

$$V_{ds\_sat} = \frac{1}{1 + \frac{(V_{gs} - V_{th})}{\varepsilon_c L}} (V_{gs} - V_{th}) \tag{6.6}$$

$$PW_{edge} = \frac{1}{w_0} \cos^{-1} \left( \frac{2(V_{ds\_sat} - V_{OL} + V_{DD})}{V_{OH} - V_{OL}} - 1 \right) \tag{6.7}$$

$$I_D(t) = W_P \beta(t) \tag{6.8}$$

For *PW*<*PW*<sub>edge</sub>:

$$\beta(t) = \frac{\mu_p C_{ox}}{L} \left( \left( V_{gs} - V_{th} \right) V_{ds}(t) - \frac{V_{ds}(t)^2}{2} \right) \left( \frac{1}{1 + \frac{V_{ds}(t)}{\varepsilon_c L}} \right) \tag{6.9}$$



Figure 6.5: PMOS drain current during the application of Vref\_P

For *PW*>*PW*<sub>edge</sub>:

$$\beta(t) = \frac{\mu_p C_{ox}}{L} \left( \left( V_{gs} - V_{th} \right) V_{ds\_sat} - \frac{V_{ds\_sat}^2}{2} \right) \left( \frac{1}{1 + \frac{V_{ds\_sat}}{\varepsilon_c L}} \right) \tag{6.10}$$

In order to estimate the charge being supplied by the voltage source during the pulse width (*PW*) of signal *Vref\_P* when the PMOS transistor is switched on, the voltage drop across  $R_{clk}$  is neglected and the drain voltage of the PMOS (node *Vx* in Figure 6.3) is assumed to be equal to *vout(t)*. Hence  $V_{ds}$  at the beginning of *Vref\_P* is equal to:

$$\begin{aligned} V_{ds}(T - PW) &= vout(T - PW) - V_{DD} \\ &= \left(\frac{V_{OH} - V_{OL}}{2}\right) \cos\left(w_0(T - PW)\right) + \left(\frac{V_{OH} + V_{OL}}{2}\right) - V_{DD} \end{aligned} \tag{6.11}$$

and at the end of  $Vref_P$ ,  $V_{ds}$  is equal to:

$$V_{ds}(T) = vout(T) - V_{DD} = V_{OH} - V_{DD} \tag{6.12}$$

The drain current of the PMOS transistor in the driver at the start and end of Vref\_P is shown in Figure 6.5. The charge supplied during that interval is equal to the area under the curve of Figure 6.5 given by:

$$\begin{aligned} Q &= I_D(T) \times PW + \frac{1}{2} PW \left(I_D(T - PW) - I_D(T)\right) \\ &= \frac{1}{2} PW \left(W_P \beta(T) + W_P \beta(T - PW)\right) \end{aligned} \tag{6.13}$$

Equating (6.5) and (6.13) and solving for  $W_P$, we obtain:

$$W_{P} = \frac{R_{clk}T}{V_{DD}PW} \times \frac{\left(\pi f C_{clk}(V_{OH} - V_{OL})\right)^{2}}{\beta(T) + \beta(T - PW)} \tag{6.14}$$

Equation 6.14 gives the required width of the PMOS transistor in the driver needed to generate a resonant clock signal with a desired voltage swing given the frequency of operation, clock capacitance, resistance, as well as the pulse width (*PW*) of the applied reference signals.
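The derivation can be sketched in code as follows. The operating point is taken from the caption of Table 6.1, but the device parameters (overdrive voltage, $\mu_p C_{ox}/L$, and $\varepsilon_c L$) are hypothetical placeholders chosen only for illustration, so the printed widths are not expected to reproduce the values in Table 6.1. Voltage magnitudes are used for the PMOS quantities, and all function and parameter names are assumptions of this sketch.

```python
import math


def beta(vds_mag, v_ov, k_over_l, ec_l):
    """Drain current per unit width, using magnitudes for the PMOS voltages.

    Eq. 6.9 applies below V_ds_sat (Eq. 6.6); beyond it, V_ds is clamped to
    V_ds_sat, which reproduces Eq. 6.10.
    """
    vds_sat = v_ov / (1.0 + v_ov / ec_l)
    v = min(vds_mag, vds_sat)
    return k_over_l * (v_ov * v - 0.5 * v * v) / (1.0 + v / ec_l)


def vout(t, f, v_oh, v_ol):
    """Eq. 6.1: resonant clock waveform."""
    return 0.5 * (v_oh - v_ol) * math.cos(2.0 * math.pi * f * t) + 0.5 * (v_oh + v_ol)


def required_width(pw, r_clk, c_clk, f, v_dd, v_oh, v_ol, v_ov, k_over_l, ec_l):
    """Eq. 6.14: PMOS width (m) for a reference pulse of width pw (s)."""
    period = 1.0 / f
    beta_end = beta(v_dd - vout(period, f, v_oh, v_ol), v_ov, k_over_l, ec_l)
    beta_start = beta(v_dd - vout(period - pw, f, v_oh, v_ol), v_ov, k_over_l, ec_l)
    swing_term = (math.pi * f * c_clk * (v_oh - v_ol)) ** 2
    return r_clk * period / (v_dd * pw) * swing_term / (beta_end + beta_start)


# Hypothetical device parameters (NOT from the thesis): overdrive |Vgs| - |Vth| in V,
# mu_p * C_ox / L in A/(V^2 m), and eps_c * L in V.
DEVICE = dict(v_ov=0.75, k_over_l=500.0, ec_l=2.4)
# Operating point from the caption of Table 6.1.
CLOCK = dict(r_clk=0.5, c_clk=30e-12, f=1e9, v_dd=1.0, v_oh=0.95, v_ol=0.05)

for pw_ps in (100, 150, 200, 250, 300):
    w_p = required_width(pw_ps * 1e-12, **CLOCK, **DEVICE)
    print(f"PW = {pw_ps} ps -> W_P = {w_p * 1e6:.0f} um")
```

With these placeholder parameters the required width falls as the pulse width grows, which is the qualitative trend expected from Equation 6.14.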

### 6.3 Simulation Results

Equations 6.8 to 6.10 were verified through simulations using Spectre on a minimum-sized PMOS transistor in the STMicroelectronics 90 nm technology, as shown in Figure 6.6. Compared to simulation, the equations have an average error of 5%.

In Table 6.1, Equation 6.14 was used to estimate the required driver strength in the LC resonant clock generator at different pulse widths. The table shows that Equation 6.14 is accurate for short pulse widths of the reference signals. The percentage error increases as the pulse width (*PW*) of *Vref\_P* increases and approaches  $PW_{edge}$, i.e., the edge of saturation.



Figure 6.6:  $I_D$  vs.  $V_{DS}$  for PMOS with  $W_p$ =120 nm, L=100 nm

Table 6.1 Estimated driver strength at different pulse widths (*PW*) for  $C_{clk}$ =30 pF,  $R_{clk}$ =0.5  $\Omega$ , f=1 GHz,  $V_{DD}$ =1 V,  $V_{OH}$ =0.95 V, and  $V_{OL}$ =0.05 V

| PW (ps) | $W_p(\mu m)$ | Absolute error compared to $V_{OH}$ - $V_{OL}$ = 0.9 V |
|---------|--------------|--------------------------------------------------------|
| 100     | 402          | 3%                                                     |
| 150     | 191          | 0%                                                     |
| 200     | 109          | 11%                                                    |
| 250     | 75           | 12%                                                    |
| 300     | 58           | 17%                                                    |

The increase in error arises because the current model overestimates the transistor driving capability at longer *PW*, as shown in Figure 6.6, i.e., as we move closer to the saturation region where  $V_{ds\_sat}$  = -0.69 V. As illustrated in Table 6.1, the PMOS and NMOS transistors of the clock generator are large. Hence they are driven by progressively sized inverters [5]. Taking these inverters into account, an efficient approach in terms of minimizing total power is to use smaller transistors in the clock generator with longer pulse widths for the reference signals.
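To see how close the pulse widths of Table 6.1 come to the edge of saturation, Equation 6.7 can be evaluated for the stated operating point. The sketch below uses $V_{ds\_sat}$ = -0.69 V from the text and the remaining values from the caption of Table 6.1; the function name is illustrative.

```python
import math


def pw_edge(vds_sat, v_dd, v_oh, v_ol, f):
    """Eq. 6.7: pulse width at which the PMOS reaches the edge of saturation.

    vds_sat is signed (negative for the PMOS), as in the text.
    """
    w0 = 2.0 * math.pi * f
    argument = 2.0 * (vds_sat - v_ol + v_dd) / (v_oh - v_ol) - 1.0
    return math.acos(argument) / w0


# V_ds_sat = -0.69 V from the text; remaining values from the caption of Table 6.1.
edge = pw_edge(-0.69, 1.0, 0.95, 0.05, 1e9)
print(f"PW_edge = {edge * 1e12:.0f} ps")
```

The result is roughly 320 ps, just beyond the largest pulse width in Table 6.1 (300 ps), which is consistent with the error growing as *PW* approaches  $PW_{edge}$.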

### 6.4 Conclusion

An analytical approach has been proposed to estimate the required driver strength in the LC resonant clock generator. Simulation results show that the derived equation is accurate compared to results obtained using Spectre, particularly at short pulse widths. Using the proposed approach early in the design stage would save chip area and resources by removing the need for switches with programmable widths and reference pulses with programmable duty cycles. Although the mathematical derivation was illustrated on a specific LC resonant clock generator, it can be extended and used to estimate the required driving strength in other resonant clock generators.

# Chapter 7 Conclusion

### 7.1 Summary and Contributions

In this thesis, various techniques were applied at the flip-flop level and at different levels of clock generation and distribution with the aim of reducing power consumption. Each technique was evaluated individually and shown to be effective.

Resonant clocking techniques have proven their ability to reduce the power of CDNs, which consume the largest portion of total power in synchronous digital systems. Of these techniques, the most practical is LC resonant clocking, which generates a clock signal with constant phase and amplitude, in contrast to the varying-amplitude signal generated in standing-wave oscillation or the varying-phase signal in rotary traveling-wave oscillation.

The CDN assumed in this dissertation is an LC fully-resonant clock network. Extending the resonance all the way down to the local tree driving the flip-flops results in maximum power savings, since the energy stored on the local clock capacitance, which accounts for around 2/3 of total clock power, is recycled.

We have introduced a new type of slack in the skew that can be compensated for to reduce the CDN routing complexity, wire elongations, total wire length, and power consumption. The slack in the skew can also be used for incremental routing adjustments. In our demonstration of the proposed technique, the slow rise time of the sinusoidal resonant clock signal and the different transistor threshold voltage levels available in the STMicroelectronics 90 nm technology were used to generate different flip-flop delays by separate means. Lower skew bounds for the proposed technique have been identified. Matched delay values for short and long delay paths were derived to compensate for positive and negative clock skew. The effects of process, power supply, and temperature (PVT) variations on flip-flop delay were investigated. CDNs with nominal zero skew have been constructed using the Modified Deferred-Merge Embedding Algorithm that takes advantage of the skew slack introduced by the new technique.

A new dual-edge triggering scheme has been proposed. This scheme allows the extension of dual-edge triggering to any dynamic logic circuit with precharge and evaluation phases. The delay of every precharge and evaluation interval generated by the sinusoidal resonant clock at positive and negative edges was characterized. The proposed scheme was tested on the SAFF with precharge and evaluation phases, and the flip-flop response at both edges of the sinusoidal clock was investigated. In addition, the effects of PVT variations on the generated precharge and evaluation intervals were examined, as well as the flip-flop behavior under worst-case scenarios. Furthermore, the pros and cons of inverter chain sharing were investigated and the potential power saving achievable through dual-edge clocking was highlighted.

Further reduction in LC fully-resonant CDN power consumption was achieved by reducing the clock swing. The DCCFF was modified to operate with a low-swing sinusoidal clock. The proposed low-swing LC fully-resonant clocking scheme operates from a single supply voltage. The feasibility of low-swing resonant clocking and its power advantages were investigated on-chip.

### 7.2 Future Work

1- The latest developments in 3-D integrated circuit design with multi-plane synchronization, as illustrated in Figure 2.1, show that the traditional approach of using clock trees will lead to a significant increase in power and metal overhead. For example, at least six metal layers will be dedicated to the clock network if H-Trees are used to distribute the clock signal in each of the three planes. This means that in addition to the increase in power, routing complexity will also increase.

In [58], a Globally Integrated Power and Clock (GIPAC) network has been proposed as a means of eliminating the on-chip global clock distribution network. The clock and power signals are integrated in the GIPAC and then separated into the local power and local clock networks using passive filters, as shown in Figure 7.1. The input signal to the splitter circuit is a sinusoidal wave with a DC component of 1.2 V and a sinusoidal voltage swing of 0.1 V. Simulation results show the feasibility of the proposed scheme. However, the proposed approach does not eliminate the need for the local clock distribution network.

The proposed scheme can be improved by investigating the feasibility of eliminating the clock network (both local and global) by using only the power network to distribute both the power and clock signals. A sinusoidal *power-clock* signal with suitable DC voltage level and sinusoidal swing is distributed through the power grid directly to the  $V_{DD}$  and the  $V_{DD}/Clock$  ports of combinational and sequential circuits, respectively. Correct functionality of the circuits should be verified with the ripples in the power-clock signal considered as noise.



Figure 7.1: Globally integrated power and clock (GIPAC) distribution network [58]

For the sequential circuits, both the  $V_{DD}$  and *clock* nodes in each circuit are connected to the power network. However, specially designed clock buffers like the one used in [16] need to be implemented at the *clock* node of each flip-flop to extract and generate a clock signal with suitable voltage levels from the *power-clock* signal. In this approach there will be no need for a clock network to distribute the clock signal leading to reduction in power consumption, metal overhead, and routing complexity. Correct functionality and feasibility of the proposed scheme can be verified through the fabrication and testing of a fully pipelined multiplier with only a power grid to distribute the *power-clock* signal.

2- Field Programmable Gate Arrays (FPGAs) compared to Application Specific Integrated Circuits (ASICs) provide high programming flexibility at the expense of power and area overhead. However, the uniform structure of the FPGA as well as the uniform distribution of sequential elements within the FPGA encourages the investigation of the feasibility of reducing the power of the CDN in the FPGA by applying resonant clocking techniques.

3- The matched delay values for long and short delay paths derived in the skew compensation technique in Chapter 3 are restricted to only three. This is because we took advantage of the three different threshold voltage levels available in the technology to generate three versions of the same flip-flop.

Choosing different types of flip-flops with different delays or varying the transistor sizes in one flip-flop to generate more than three versions of the same flip-flop would increase the number of matched delay values to more than just three, thus maximizing design flexibility, wire length reduction, and power savings.

4- In the dual-edge triggering scheme, the interval by which the charging elements in the flip-flop are being switched on was reduced causing a reduction in power consumption. However, the proposed scheme requires three PMOS/NMOS transistors to be added in the pull-up/pull-down networks of the flip-flop in addition to the inverter chain. This causes an increase in area. The dual-edge triggering scheme can be further modified to reduce area overhead in such a way that the structure of the single-edge triggered flip-flop is not affected. An external circuit that is independent of the flip-flop and is responsible for generating the evaluation and precharging pulses can be used. Furthermore, adding conditional capturing and low-swing clocking to the dual-edge triggering scheme are additional angles to be investigated.

### **Publications from This Research**

### Journal Papers

- 1. S. E. Esmaeili, A. J. Al-Khalili, G. E. R. Cowan, "Application of low-swing clocking to LC resonant clock distribution networks", *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, May 2011.
- 2. S. E. Esmaeili, A. J. Al-Khalili, G. E. R. Cowan, "Dual-edge triggered sense amplifier flip-flop for resonant clock distribution networks", *IET Computers and Digital Techniques*, vol. 4, issue 6, pp. 499-514, November 2010.
- 3. S. E. Esmaeili, A. M. Farhangi, A. J. Al-Khalili, G. E. R. Cowan, "Skew compensation in energy recovery clock distribution networks", *IET Computers and Digital Techniques*, vol. 4, issue 1, pp.56-72, January 2010.

### **Refereed Conference Papers**

- 1. S. E. Esmaeili, A. J. Al-Khalili, G. E. R. Cowan, "Estimating required driver strength in the resonant clock generator", *IEEE Asia Pacific Conference on Circuits and Systems*, accepted on publication, pp. 927-930, December 2010.
- 2. S. E. Esmaeili, A. J. Al-Khalili, G. E. R. Cowan, "Dual-edge triggered pulsed energy recovery flip-flops", 8<sup>th</sup> IEEE International NEWCAS Conference, pp. 345-348, June 2010.
- 3. S. E. Esmaeili, A. J. Al-Khalili, G. E. R. Cowan, "Dual-edge triggered energy recovery DCCER flip-flop for low energy applications", *European Conference on Circuit Theory and Design*, pp. 57-60, August 2009.
- 4. S. E. Esmaeili, A. J. Al-Khalili, G. E. R. Cowan, "A study on the effects of temperature and loading on the power consumption in energy recovery resonant clocking", *Proceedings of the International Conference on Information Science, Technology and Applications*, pp. 91-95, March 2009.
- 5. S. E. Esmaeili, A. J. Al-Khalili, G. E. R. Cowan, "A novel approach for skew compensation in energy recovery clock distribution networks", *IEEE 20<sup>th</sup> International Conference on Microelectronics*, pp. 365-368, December 2008.
- 6. S. E. Esmaeili, G. E. R. Cowan, A. J. Al-Khalili, "Power reduction in energy recovery and square-wave clock distribution networks operating at half frequency with dual-edge triggered flip-flops", *Joint 6th International IEEE Northeast Workshop on Circuits and Systems and TAISA Conference*, pp.125-128, June 2008.

## References

- [1] Nedovic, N., Walker, W. W., Oklobdzija, V. G., and Aleksic, M. A low power symmetrically pulsed dual edge-triggered flip-flop. *Proceedings of the 28th European Solid-State Circuits Conference*, pages 399-402, 2002.
- [2] Liu, Y. T., Chiou, L. Y., and Chang, S. J. Energy-efficient adaptive clocking dual edge sense-amplifier flip-flop. *IEEE International Symposium on Circuits and Systems*, pages 4329-4332, 2006.
- [3] Kim, S., Ziesler, C. H., and Papaefthymiou, M. C. Charge-recovery computing on silicon. *IEEE Transactions on Computers*, 54(6):651-659, 2005.
- [4] Naffziger, S. D., and Hammond, G. The implementation of the next-generation 64 b Itanium<sup>TM</sup> microprocessor. *Digest of Technical Papers, IEEE International Solid-State Circuits Conference*, pages 276-504, 2002.
- [5] Cooke, M., Mahmoodi-Meimand, H., and Roy, K. Energy recovery clocking scheme and flip-flops for ultra low-energy applications. *Proceedings of the International Symposium on Low Power Electronics and Design*, pages 54-59, 2003.
- [6] Sasaki, M. A high-frequency clock distribution network using inductively loaded standing-wave oscillators. *IEEE Journal of Solid-State Circuits*, 44(10):2800-2807, 2009.
- [7] Drake, A. J., Nowka, K. J., Nguyen, T. Y., Burns, J. L., and Brown, R. B. Resonant clocking using distributed parasitic capacitance. *IEEE Journal of Solid-State Circuits*, 39(9):1520-1528, 2004.

- [8] Friedman, E. G. Clock distribution networks in synchronous digital integrated circuits. *Proceedings of the IEEE*, 89(5):665-692, 2001.
- [9] Chan, S. C. Design of multi-GHz resonant global clock distributions. PhD thesis, Columbia University, 2005.
- [10] Gowan, M. K., Biro, L. L., and Jackson, D. B. Power considerations in the design of the alpha 21264 microprocessor. *Design Automation Conference*, pages 726-731, 1998.
- [11] Anderson, C. J., Petrovick, J., Keaty, J. M., Warnock, J., Nussbaum, G., Tendler, J. M., Carter, C., Chu, S., Clabes, J., DiLullo, J., Dudley, P., Harvey, P., Krauter, B., LeBlanc, J., Pong-Fei Lu, McCredie, B., Plum, G., Restle, P. J., Runyon, S., Scheuermann, M., Schmidt, S., Wagoner, J., Weiss, R., Weitzel, S., and Zoric, B. Physical design of a fourth-generation POWER GHz microprocessor. *Digest of Technical Papers, IEEE International Solid-State Circuits Conference*, pages 232-233, 2001.
- [12] Pavlidis, V. F., Savidis, I., and Friedman, E. G. Clock distribution networks in 3-D integrated systems. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, pages 1-11, 2010.
- [13] Restle, P. J., McNamara, T. G., Webber, D. A., Camporese, P. J., Eng, K. F., Jenkins, K. A., Allen, D. H., Rohn, M. J., Quaranta, M. P., Boerstler, D. W., Alpert, C. J., Carter, C. A., Bailey, R. N., Petrovick, J. G., Krauter, B. L., and McCredie, B. D. A clock distribution network for microprocessors. *IEEE Journal of Solid-State Circuits*, 36(5):792-799, 2001.

- [14] Taskin, B., Wood, J., and Kourtev, I. S. Timing-driven physical design for VLSI circuits using resonant rotary clocking. 49th IEEE International Midwest Symposium on Circuits and Systems, pages 261-265, 2006.
- [15] Chi, V. L. Salphasic distribution of clock signals for synchronous systems. *IEEE Transactions on Computers*, 43(5):597-602, 1994.
- [16] O'Mahony, F., Yue, C. P., Horowitz, M. A., and Wong, S. S. A 10-GHz global clock distribution using coupled standing-wave oscillators. *IEEE Journal of Solid-State Circuits*, 38(11):1813-1820, 2003.
- [17] Yu, Z., and Liu, X. Power analysis of rotary clock. *IEEE Computer Society Annual Symposium on VLSI*, pages 150-155, 2005.
- [18] Wood, J., Edwards, T. C., and Lipa, S. Rotary traveling-wave oscillator arrays: A new clock technology. *IEEE Journal of Solid-State Circuits*, 36(11):1654-1665, 2001.
- [19] Honkote, V., and Taskin, B. Custom rotary clock router. *IEEE International Conference on Computer Design*, pages 114-119, 2008.
- [20] Honkote, V., and Taskin, B. Zero clock skew synchronization with rotary clocking technology. *Quality of Electronic Design*, pages 588-593, 2009.
- [21] Rosenfeld, J., and Friedman, E. G. Design methodology for global resonant H-tree clock distribution network. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 15(2):135-148, 2007.
- [22] Chan, S. C., Shepard, K. L., and Restle, P. J. Design of resonant global clock distributions. 21st International Conference on Computer Design, pages 248-253, 2003.

- [23] Chan, S. C., Restle, P. J., Shepard, K. L., James, N. K., and Franch, R. L. A 4.6GHz resonant global clock distribution network. *Digest of Technical Papers, IEEE International Solid-State Circuits Conference,* pages 342-343, 2004.
- [24] Chan, S. C., Shepard, K. L., and Restle, P. J. Uniform-phase uniform-amplitude resonant-load global clock distributions. *IEEE Journal of Solid-State Circuits*, 40(1):102-109, 2005.
- [25] Chan, S. C., Shepard, K. L., and Restle, P. J. 1.1 to 1.6 GHz distributed differential oscillator global clock network. *Digest of Technical Papers, IEEE International Solid-State Circuits Conference*, pages 518-519, 2005.
- [26] Chan, S. C., Shepard, K. L., and Restle, P. J. Distributed differential oscillators for global clock networks. *IEEE Journal of Solid-State Circuits*, 41(9):2083-2094, 2006.
- [27] Chan, S. C., Restle, P. J., Bucelot, T. J., Liberty, J. S., Weitzel, S., Keaty, J. M., Flachs, B., Volant, R., Kapusta, P., and Zimmerman, J. S. A resonant global clock distribution for the cell broadband-engine processor. *IEEE Journal of Solid-State Circuits*, 44(1):64-72, 2009.
- [28] Mahmoodi, H., Tirumalashetty, V., Cooke, M., and Roy, K. Ultra low-power clocking scheme using energy recovery and clock gating. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 17(1): 33-44, 2009.
- [29] Carbognani, F., Buergin, F., Felber, N., Kaeslin, H., and Fichtner, W. Two-phase resonant clocking for ultra-low-power hearing aid applications, *Design Automation and Test in Europe*, pages 1-6, 2006.

- [30] Sarkar, P., and Koh, C. K. Repeater block planning under simultaneous delay and transition time constraints. *Design, Automation and Test in Europe*, pages 540-544, 2001.
- [31] Sinha, S., Xu, W., Velamala, J. B., Dastagir, T., Bakkaloglu, B., Yu, H., and Cao, Y. Enabling resonant clock distribution with scaled on-chip magnetic inductors. *IEEE International Conference on Computer Design*, pages 103-108, 2009.
- [32] Hansson, M., and Alvandpour, A. Power-performance analysis of sinusoidally clocked flip-flops. *NORCHIP Conference*, pages 153-156, 2005.
- [33] Alioto, M., Consoli, E., and Palumbo, G. Flip-flop Energy/Performance versus clock slope and impact on the clock network design. *IEEE Transactions on Circuits and Systems I: Regular Papers*, 57(6):1273-1286, 2010.
- [34] Rabaey, J. M., Chandrakasan, A., Nikolic, B. Digital integrated circuits: a design perspective. Prentice Hall Electronics and VLSI Series, 2nd edition, pages 491-498.
- [35] Chao, T. H., Hsu, Y. C., Ho, J. M., and Kahng, A. B. Zero skew clock routing with minimum wirelength. *IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing*, 39(11):799-814, 1992.
- [36] Gupta, R., Krauter, B., Tutuianu, B., Willis, J., Pileggi, L. T. The Elmore delay as a bound for RC trees with generalized input signals. *32nd Conference on Design Automation*, pages 364-369, 1995.
- [37] Stojanovic, V., and Oklobdzija, V. G. Comparative analysis of master-slave latches and flip-flops for high-performance and low-power systems. *IEEE Journal of Solid-State Circuits*, 34(4):536-548, 1999.

- [38] Lin, S. and Yang, H. Leakage power reduction in flip flops by using MTCMOS and ULP switch. 49th IEEE International Midwest Symposium on Circuits and Systems, pages 21-25, 2006.
- [39] Tawfik, S. A., and Kursun, V. Dual-V<sub>DD</sub> clock distribution for low power and minimum temperature fluctuations induced skew. 8th International Symposium on Quality Electronic Design, pages 73-78, 2007.
- [40] Wu, C. H., Lin, S. H., and Chiueh, H. Logical effort model extension with temperature and voltage variations. 14th International Workshop on Thermal Investigation of ICs and Systems, pages 85-88, 2008.
- [41] Jackson, M. A. B., Srinivasan, A., and Kuh, E. S. Clock routing for high-performance ICs. 27th ACM/IEEE Design Automation Conference, pages 573-579, 1990.
- [42] Tsay, R. S. An exact zero-skew clock routing algorithm. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 12(2):242-249, 1993.
- [43] Liu, Y. T., Chiou, L. Y., and Chang, S. J. Energy-efficient adaptive clocking dual edge sense-amplifier flip-flop. *IEEE International Symposium on Circuits and Systems*, pages 4329-4332, 2006.
- [44] Kim, C., and Kang, S. A low-swing clock double-edge triggered flip-flop. *IEEE Journal of Solid-State Circuits*, 37(5):648-652, 2002.
- [45] Nedovic, N., Aleksic, M., and Oklobdzija, V. G. Timing characterization of dual-edge triggered flip-flops. *International Conference on Computer Design*, pages 538-541, 2001.
- [46] Jie, Y. Manufacturability aware design. PhD thesis, University of Michigan, 2007.

- [47] Ghadiri, A., and Mahmoodi, H. Dual-edge triggered static pulsed flip-flops. 18th International Conference on VLSI Design, pages 846-849, 2005.
- [48] Nedovic, N., and Oklobdzija, V. G. Dual-edge triggered storage elements and clocking strategy for low-power systems. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 13(5):577-590, 2005.
- [49] Masuda, H., Okawa, S., and Aoki, M. Approach for physical design in sub-100 nm era. *IEEE International Symposium on Circuits and Systems*, pages 5934-5937, 2005.
- [50] Gettings, K. M. G. V., and Boning, D. S. Study of CMOS process variation by multiplexing analog characteristics. *IEEE Transactions on Semiconductor Manufacturing*, 21(4):513-525, 2008.
- [51] Kwon, Y. S., Park, B., Park, I., and Kyung, C. M. A new single-clock flip-flop for half-swing clocking. *Proceedings of the Asia and South Pacific Design Automation Conference*, pages 117-120, 1999.
- [52] Asgari, F. H. A., and Sachdev, M. A low-power reduced swing global clocking methodology. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 12(5):538-545, 2004.
- [53] Pangjun, J., and Sapatnekar, S. S. Low-power clock distribution using multiple voltages and reduced swings. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 10(3):309-318, 2002.
- [54] Xu, Z., and Shepard, K. L. Design and analysis of actively-deskewed resonant clock networks. *IEEE Journal of Solid-State Circuits*, 44(2):558-568, 2009.

- [55] Chueh, J. Y., Sathe, V., and Papaefthymiou, M. C. 900MHz to 1.2GHz two-phase resonant clock network with programmable driver and loading. *IEEE Custom Integrated Circuits Conference*, pages 777-780, 2006.
- [56] Sathe, V. S., Chueh, J. Y., and Papaefthymiou, M. C. Energy-efficient GHz-class charge-recovery logic. *IEEE Journal of Solid-State Circuits*, 42(1):38-47, 2007.
- [57] Rabaey, J. M., Chandrakasan, A., and Nikolic, B. Digital integrated circuits: a design perspective. Prentice Hall Electronics and VLSI Series, 2nd edition, pages 94-97.
- [58] Jakushokas, R., and Friedman, E. G. Globally integrated power and clock distribution network. *IEEE International Symposium on Circuits and Systems*, pages 1751-1754, 2010.
- [59] Canadian Microelectronics Corporation (CMC): www.cmc.ca

# Appendix A Multiply-Accumulate (MAC) Unit Design

### A.1 Multiply-Accumulate (MAC) Unit Design

Figure A.1 shows a simplified diagram of the MAC unit. A serial-in shift register at the input feeds the multiplier and multiplicand bits to the multiplier, and a serial-out shift register at the output delivers the result.



Figure A.1: MAC unit

### A.1.1 Serial-In Parallel-Out Shift Register

The serial-in parallel-out shift register used to feed the 32 bits of the multiplier and multiplicand is shown in Figure A.2. As illustrated in the figure, the Differential Conditional Capturing Flip-Flop (DCCFF) under LC resonant clocking with dual modes of operation, i.e., full- and low-swing clocking, is used. When the *FULL\_SWING* signal is high, the flip-flops operate in full-swing mode; when it is low, low-swing operation is enabled. The sinusoidal clock signal is fed to the flip-flops through a pass gate controlled by the *LOAD\_INPUT* signal. When *LOAD\_INPUT* is high, the clock drives the flip-flops, and the bits fed serially into the first flip-flop are shifted to the next flip-flop on each positive clock edge. When *LOAD\_INPUT* is low, the pass gate is off, the clock input of the flip-flops is grounded by the NMOS transistor, and the output of each flip-flop holds its current state. At the same time, the pass gate at the output of each flip-flop, also controlled by *LOAD\_INPUT*, is turned on, allowing the flip-flops to feed the multiplier and multiplicand bits to the 16 × 16-bit multiplier.

Figure A.2: Serial-in parallel-out shift register
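The gating behaviour described above can be summarised with a purely behavioural sketch. It models only the logical effect of *LOAD\_INPUT* (shift when high, hold and expose the outputs when low) and ignores the resonant clock waveform, the swing mode, and all timing. The class name `SipoRegister`, the helper `load_word`, and the bit ordering inside the list are assumptions made for illustration and are not taken from the design.

```python
from typing import List, Optional


class SipoRegister:
    """Behavioural model of a serial-in parallel-out shift register.

    Hypothetical illustration: bits[0] is the first flip-flop (nearest
    the serial input D); the ordering is an assumption.
    """

    def __init__(self, width: int = 32) -> None:
        self.bits: List[int] = [0] * width

    def clock_edge(self, d: int, load_input: int) -> None:
        """Apply one positive clock edge with serial input d."""
        if load_input:
            # LOAD_INPUT high: the clock reaches the flip-flops and
            # every stored bit moves one stage along the chain.
            self.bits = [d & 1] + self.bits[:-1]
        # LOAD_INPUT low: the clock is gated off, so the state is held.

    def parallel_out(self, load_input: int) -> Optional[List[int]]:
        """Return the bits passed on to the multiplier, or None if the
        output pass gates are off (LOAD_INPUT high)."""
        return None if load_input else list(self.bits)


def load_word(reg: SipoRegister, word_bits: List[int]) -> None:
    """Shift a sequence of bits into the register, one per clock edge."""
    for bit in word_bits:
        reg.clock_edge(bit, load_input=1)
```

In this model, a word is first shifted in with `load_word`, after which `LOAD_INPUT` is taken low and `parallel_out(0)` returns the held bits. The first bit shifted in ends up in the last stage of the chain.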

Post-layout simulation of the shift register showed that, under the 100 MHz resonant clock, the transparency interval of the flip-flops is long enough that the input was not shifted properly between the flip-flops. The problem was fixed by adding a delay stage between the output of each flip-flop and the input of the next one.
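One way to picture this failure is as a race-through: if a flip-flop's new output reaches the next flip-flop while that flip-flop is still transparent, the bit advances more than one stage on a single clock edge. The sketch below is a simplified timing model under stated assumptions, not the simulation used in this work. The function names, the use of a single clock-to-output delay `t_cq`, and all numeric values in the usage note are hypothetical.

```python
def stages_traversed(t_transparent: int, t_cq: int, t_delay: int) -> int:
    """Number of stages a bit advances on one clock edge.

    Simplified model (assumption): every flip-flop is transparent for
    t_transparent after the edge, and a new value needs t_cq + t_delay
    to travel from one flip-flop to the input of the next. All times
    are integers in the same unit (e.g. picoseconds).
    """
    hop = t_cq + t_delay
    if hop <= 0:
        raise ValueError("stage-to-stage delay must be positive")
    stages = 1  # the first flip-flop always captures its input
    arrival = hop
    while arrival < t_transparent:
        # The next flip-flop is still transparent when the new value
        # arrives, so the bit races through one more stage.
        stages += 1
        arrival += hop
    return stages


def min_inserted_delay(t_transparent: int, t_cq: int) -> int:
    """Smallest inserted delay for which a bit advances exactly one stage."""
    return max(0, t_transparent - t_cq)
```

With illustrative numbers only, `stages_traversed(300, 120, 0)` gives 3 (the bit skips ahead), while inserting a delay of `min_inserted_delay(300, 120)`, i.e. 180, brings the result back to 1. This is the effect the added delay stage is meant to achieve.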

### A.1.2 Parallel-In Serial-Out Shift Register



Figure A.3: Parallel-in serial-out shift register

Figure A.3 shows the 33-bit parallel-in serial-out shift register at the output of the multiply-accumulate unit. When the *SHIFT\_LOADDB* signal is low, the AND gates on the right side are on, and the outputs S0 to S32 are stored in the flip-flops. When the signal is high, the AND gates on the left side are active.
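The two modes selected by *SHIFT\_LOADDB* can be sketched behaviourally as follows. The load behaviour (signal low) follows the description above. The shift behaviour (signal high), the shift direction, the choice of the last stage as the serial output, and the zero shifted into the first stage are assumptions inferred from the signal name and the register type. The names `PisoRegister` and `unload` are hypothetical.

```python
from typing import List, Optional


class PisoRegister:
    """Behavioural model of a parallel-in serial-out shift register."""

    def __init__(self, width: int = 33) -> None:
        self.bits: List[int] = [0] * width

    def clock_edge(self, parallel_in: Optional[List[int]],
                   shift_loaddb: int) -> None:
        """Apply one clock edge in load or shift mode."""
        if shift_loaddb:
            # Assumed shift mode: each bit moves one stage toward the
            # serial output; a 0 enters the first stage (assumption).
            self.bits = [0] + self.bits[:-1]
        else:
            # Load mode: the parallel inputs are stored in the flip-flops.
            if parallel_in is None or len(parallel_in) != len(self.bits):
                raise ValueError("parallel_in must match the register width")
            self.bits = [b & 1 for b in parallel_in]

    @property
    def s_out(self) -> int:
        """Serial output, taken from the last stage (assumption)."""
        return self.bits[-1]


def unload(reg: PisoRegister) -> List[int]:
    """Read every stored bit from the serial output, one per clock edge."""
    out = []
    for _ in range(len(reg.bits)):
        out.append(reg.s_out)
        reg.clock_edge(None, shift_loaddb=1)
    return out
```

Under these assumptions, loading a word with `shift_loaddb=0` and then calling `unload` returns the stored bits in reverse list order, last stage first.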

### A.2 Test Chip

### A.2.1 Pad Description

Table A.1 lists the name, type, and number of the pads, together with the signal carried by each pad.

| Pad Name     | Type   | Number of Pads | Description                                                                       |
|--------------|--------|----------------|-----------------------------------------------------------------------------------|
| VDD          | Input  | 4              | Supply for the entire chip                                                        |
| GND          | Input  | 5              | Ground for the entire chip                                                        |
| CLK          | Input  | 1              | Clock signal                                                                      |
| D            | Input  | 1              | Data input to the serial-in/parallel-out register and to the two flip-flops       |
| FULL_SWING   | Input  | 1              | Selects between full- and low-swing flip-flop operation                           |
| LOAD_INPUT   | Input  | 1              | Loading input to the serial-in/parallel-out register                              |
| SHIFT_LOADDB | Input  | 1              | Loading input to the parallel-in/serial-out register                              |
| CLK_FF       | Input  | 1              | Clock feeding the two flip-flops                                                  |
| CLKD_FS      | Output | 1              | Inverted clock in the full-swing flip-flop                                        |
| CLKD_LS      | Output | 1              | Inverted clock in the low-swing flip-flop                                         |
| Q_FS         | Output | 1              | Output of the full-swing flip-flop                                                |
| Q_LS         | Output | 1              | Output of the low-swing flip-flop                                                 |
| S_OUT        | Output | 1              | Output of the serial register                                                     |
| Total        |        | 20             |                                                                                   |

Table A.1: Pad name, type, and description
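As a quick consistency check on the pad budget, the counts in Table A.1 can be tallied as below. The dictionary simply transcribes the table; the names `PADS` and `count_pads` are illustrative.

```python
from typing import Dict, Optional, Tuple

# (type, number of pads) for each pad name, transcribed from Table A.1.
PADS: Dict[str, Tuple[str, int]] = {
    "VDD": ("Input", 4),
    "GND": ("Input", 5),
    "CLK": ("Input", 1),
    "D": ("Input", 1),
    "FULL_SWING": ("Input", 1),
    "LOAD_INPUT": ("Input", 1),
    "SHIFT_LOADDB": ("Input", 1),
    "CLK_FF": ("Input", 1),
    "CLKD_FS": ("Output", 1),
    "CLKD_LS": ("Output", 1),
    "Q_FS": ("Output", 1),
    "Q_LS": ("Output", 1),
    "S_OUT": ("Output", 1),
}


def count_pads(pads: Dict[str, Tuple[str, int]],
               pad_type: Optional[str] = None) -> int:
    """Total number of pads, optionally restricted to one pad type."""
    return sum(n for t, n in pads.values() if pad_type in (None, t))
```

Here `count_pads(PADS)` agrees with the total of 20 in the table, split into 15 input pads (including supply and ground) and 5 output pads.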

### A.2.2 Chip Packaging and Test Fixture

The CFP80 package and the CFP80TF test fixture provided by CMC were chosen for the chip. The chip bonding diagram is shown in Figure A.4 and the test fixture is presented in Figure A.5 [59].



Figure A.4: Bonding diagram for the CFP80 package



Figure A.5: RF CFP80TF test fixture [59]