## Design of Stochastic Computing Architectures using Integrated Optics

Hassnaa El-Derhalli

A Thesis

in

The Department

of

Electrical and Computer Engineering

Presented in Partial Fulfillment of the Requirements

For the Degree of

Doctor of Philosophy (Electrical and Computer Engineering) at

Concordia University

Montréal, Québec, Canada

February 2021

© Hassnaa El-Derhalli, 2021

### **CONCORDIA UNIVERSITY**

#### SCHOOL OF GRADUATE STUDIES

This is to certify that the thesis prepared

By: Hassnaa El-Derhalli

Entitled: Design of Stochastic Computing Architectures using Integrated Optics

and submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy (Electrical and Computer Engineering)

complies with the regulations of the University and meets the accepted standards with respect to originality and quality.

Signed by the final examining committee:

| Name                            | Role                 |
|---------------------------------|----------------------|
| Dr. Constantinos Constantinides | Chair                |
| Dr. Sudeep Pasricha             | External Examiner    |
| Dr. Jamal Bentahar              | External to Program  |
| Dr. Anjali Agarwal              | Examiner             |
| Dr. Nawwaf Kharma               | Examiner             |
| Dr. Sebastien Le Beux           | Thesis Co-Supervisor |
| Dr. Sofiene Tahar               | Thesis Co-Supervisor |

| Approved by    |                                                                                  |
|----------------|----------------------------------------------------------------------------------|
|                | Dr. Wei-Ping Zhu, Graduate Program Director                                      |
| March 19, 2021 | Dr. Mourad Debbabi, Dean<br>Gina Cody School of Engineering and Computer Science |

### ABSTRACT

Design of Stochastic Computing Architectures using Integrated Optics

Hassnaa El-Derhalli, Ph.D.

Concordia University, 2021

Approximate computing (AC) is an emerging computing approach that trades computing accuracy for energy efficiency. It targets error-resilient applications, such as image processing, where energy consumption is a major concern. Stochastic computing (SC) is an approximate computing paradigm that leads to energy-efficient designs with reduced hardware complexity. In this approach, data is represented as probabilities encoded in bit streams. The main drawback of this paradigm is the intrinsically serial processing of bit streams, which negatively impacts processing time. Nanophotonic technology is characterized by high bandwidth and high signal propagation speed, and thus has the potential to support the electrical domain in computations and speed up processing. The major issue in optical computing (OC) remains the large size of silicon photonic devices, which limits design scalability. In this thesis, we propose, for the first time, an optical stochastic computing (OSC) approach, in which we design SC architectures using integrated optics. For this purpose, we propose a methodology that includes libraries for optical processing and interfaces, e.g., a bit stream generator. We design all-optical gates for the computation and develop transmission models for the architectures. The methodology allows for design space exploration of technological and system-level parameters to optimize design performance, i.e., energy efficiency, computing accuracy, and latency, for the targeted application. This exploration leads to multiple design options that satisfy different design requirements for the selected application.

The optical processing libraries include a polynomial architecture that can execute any arbitrary single-input function. We explore the design parameters by implementing a Gamma correction application for image processing. Results show that accepting a $4.5\times$ increase in error yields a $47\times$ energy saving and a $16\times$ faster processing speed. We propose a reconfigurable polynomial architecture that adapts the design order at run-time. The design allows the execution of high-order polynomial functions for better accuracy, or of multiple low-order functions to increase throughput and energy efficiency. Finally, we propose the design of combinational filters. The purpose is to investigate the design of cascaded-gate architectures using photonic crystal (PhC) nanocavities. We use this device to design a Sobel edge detection filter for image processing. The resulting architecture shows an energy consumption of 0.85 nJ/pixel and a processing time of 51.2 ns/pixel.

The optical interface libraries include different architectures of stochastic number generators (SNGs), either electrical-optical or all-optical, that generate the bit streams. We compare these SNGs in terms of computing accuracy and energy efficiency. The results show that all implementations can lead to the same level of computing accuracy. Moreover, using an all-optical SNG to design a fully optical 8-bit adder results in a 98% reduction in hardware complexity and a 70% energy saving compared to a conventional optical design.
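As an illustration of the stochastic representation described in this abstract, the following minimal Python sketch encodes two values as unipolar bit streams and multiplies them with a bitwise AND, which is the standard textbook SC multiplier. It is a software illustration only and does not model the optical architectures proposed in this thesis; the function names, the input values 0.5 and 0.4, and the fixed seed are assumptions made for the example.

```python
import random


def to_bitstream(p, length, rng):
    """Encode a probability p in [0, 1] as a unipolar stochastic bit stream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def from_bitstream(bits):
    """Decode a bit stream: the represented value is the fraction of 1s."""
    return sum(bits) / len(bits)


def sc_multiply(a, b):
    """Multiply two independent unipolar streams with a bitwise AND."""
    return [x & y for x, y in zip(a, b)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed (assumption), only for repeatability
    # Longer bit streams give a better estimate but take longer to process serially.
    for length in (2**4, 2**8, 2**12):
        a = to_bitstream(0.5, length, rng)
        b = to_bitstream(0.4, length, rng)
        estimate = from_bitstream(sc_multiply(a, b))
        print(f"BSL={length:5d}  estimate={estimate:.4f}  "
              f"error={abs(estimate - 0.2):.4f}")
```

The estimation error tends to shrink as the bit stream length grows, at the cost of more serial bit operations, which is the accuracy versus processing time trade-off that motivates the higher-speed optical implementation.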

To my father, my mother, my sister and brothers

### ACKNOWLEDGEMENTS

First, I would like to thank my supervisors, Dr. Sébastien Le Beux and Dr. Sofiène Tahar, for their help, guidance and encouragement throughout my Ph.D. thesis. Their insights and deep research expertise have strengthened this work significantly. They were always approachable and I have learned a lot from my discussions with them. I would like to express my gratitude to Dr. Sudeep Pasricha for agreeing to serve as my external Ph.D. thesis examiner. I am sincerely grateful to Dr. Jamal Bentahar, Dr. Anjali Agarwal, and Dr. Nawwaf Kharma for serving on my advisory thesis committee.

I am very grateful to Dr. Asim Al-Khalili for his continuous support and encouragement. He was always available to listen and give advice. Many thanks to my friends and colleagues at the Hardware Verification Group (HVG), especially Yassmeen Elderhalli, Mbarka Soualhia, Mahmoud Masadeh and Saif Najmeddin, for being kind and supportive. I am extremely thankful for having my sister Yassmeen with me in HVG as a Ph.D. student. Her companionship, care, and support made this journey unforgettable.

I am deeply grateful to my father and my brothers, Omar and Mohamed, for their endless love and support. I could not have reached this stage of my life without them.

## TABLE OF CONTENTS

- List of Tables
- List of Figures
- List of Acronyms
- 1 Introduction
  - 1.1 Motivation
  - 1.2 State-of-the-Art
    - 1.2.1 Stochastic Computing
    - 1.2.2 Optical Computing Architectures
  - 1.3 Proposed Methodology
  - 1.4 Thesis Contributions
  - 1.5 Thesis Organization
- 2 Optical Stochastic Computing Architecture for Polynomial Functions
  - 2.1 Overview
  - 2.2 Silicon Photonics
  - 2.3 Proposed Methodology
  - 2.4 Proposed Architecture
  - 2.5 Implementation and Modeling
    - 2.5.1 Error Evaluation
    - 2.5.2 Transmission Model
    - 2.5.3 Design Methods
  - 2.6
    - 2.6.1 Case Study: Gamma Correction Application
    - 2.6.2 Application-level Computing Accuracy
    - 2.6.3 Energy Efficiency Optimization
    - 2.6.4 Accuracy and Energy Design Trade-off
  - 2.7 Summary
- 3 Reconfigurable Optical Stochastic Computing Architecture for Polynomial Functions
  - 3.1 Proposed Methodology
  - 3.2 Proposed Architecture
    - 3.2.1 Directional Coupler
    - 3.2.2 Reconfigurable Bernstein Polynomial Architecture
    - 3.2.3 Design Method
  - 3.3 Implementation and Modeling
  - 3.4 Simulation Results
    - 3.4.1 Accuracy and Throughput Trade-off
    - 3.4.2 Static vs Reconfigurable Architectures
  - 3.5 Summary
- 4 Optical Stochastic Computing Architecture for Combinational Filters
  - 4.1 Overview
    - 4.1.1 All-optical Architecture
    - 4.1.2 Stochastic Computing Edge Detection Filter
  - 4.2 Proposed Methodology
  - 4.3 Photonics Crystal Nanocavity
    - 4.3.1 Nanocavity Device Overview
    - 4.3.2 All-optical NOT Gate
    - 4.3.3 Design of All-optical XOR Gate and MUX
  - 4.4 Nanocavity Model
  - 4.5 Proposed Edge Detection Filter Architecture
    - 4.5.1 Architecture Overview
    - 4.5.2 Design Challenges
  - 4.6 Implementation and Model
    - 4.6.1 Error Evaluation
    - 4.6.2 Edge Detection Transmission Model
    - 4.6.3 Nanocavity Design Parameters
  - 4.7 Simulation Results
    - 4.7.1 Model Calibration
    - 4.7.2 Design of XOR Gate
    - 4.7.3 Design of MUX
    - 4.7.4 Application-level Design Comparison
  - 4.8 Summary
- 5 Optical Stochastic Number Generator Architectures
  - 5.1 Overview
  - 5.2 Proposed Designs
  - 5.3 Optical SNGs Comparison
    - 5.3.1 Energy Consumption
  - 5.4 Towards All-optical Stochastic Computing Architectures
  - 5.5 Summary
- 6 Conclusions and Future Work
  - 6.1 Conclusions
  - 6.2 Future Work
- Bibliography
- Biography

## LIST OF TABLES

| Table | Title                                                                                         |
|-------|-----------------------------------------------------------------------------------------------|
| 2.1   | System-level and technological parameters                                                     |
| 3.1   | Energy and area overhead evaluation                                                           |
| 4.1   | Device parameters                                                                             |
| 4.2   | Device/system-level parameters, and performance of two designs target $PSNR_{Total} = 26.4$   |
| 5.1   | Hardware complexity and power consumption of SNG-based LFSR designs                           |
| 5.2   | Hardware complexity of $n$-bit adder proposed in [75] and our work                            |

# List of Figures

| 1.1 | SC blocks.                                                                 | 2  |
|-----|----------------------------------------------------------------------------|----|
| 1.2 | (a) SNG and (b) de-randomizer                                              | 6  |
| 1.3 | The basic structure of a neuron [36]                                       | 9  |
| 1.4 | Silicon photonics devices                                                  | 13 |
| 1.5 | (a) PCM as scalar multiplier and (b) two PCMs used to implement            |    |
|     | matrix vector multiplication [80]                                          | 15 |
| 1.6 | Proposed methodology                                                       | 17 |
| 2.1 | (a) ReSC architecture with (b) an example of $3^{\rm rd}$ order polynomial |    |
|     | function                                                                   | 24 |
| 2.2 | MZI device.                                                                | 25 |
| 2.3 | MRR device.                                                                | 26 |
| 2.4 | AOF device                                                                 | 27 |
| 2.5 | Proposed methodology for design space exploration of polynomial ar-        |    |
|     | chitectures.                                                               | 28 |
| 2.6 | OSC architecture of a $2^{nd}$                                             | 30 |
| 2.7 | The transmission of the output signal                                      | 32 |
| 2.8 | SNG-based LFSR + modulator                                                 | 33 |
| 2.9 | De-randomizer for OSC architecture.                                        | 34 |

| 2.10 | Generic architecture for OSC circuit.                                            | 35 |
|------|----------------------------------------------------------------------------------|----|
| 2.11 | Same accuracy level is reached                                                   | 37 |
| 2.12 | Minimum probe laser power according to                                           | 43 |
| 2.13 | The transmission of MRRs and AOF                                                 | 45 |
| 2.14 | Gamma correction application: (a) Output pixels according                        | 46 |
| 2.15 | Errors for data input ranging from 0 to 1                                        | 48 |
| 2.16 | $MED_{Total}$ of the processed image                                             | 49 |
| 2.17 | Laser energy consumption per computed bit according to a) $WLS$ and              |    |
|      | b) the polynomial degree                                                         | 51 |
| 2.18 | Designs that maximize the processing accuracy and energy efficiency              |    |
|      | for Gamma correction application                                                 | 52 |
| 3.1  | Proposed methodology for design space exploration of reconfigurable              |    |
|      | polynomial architectures                                                         | 57 |
| 3.2  | DC device.                                                                       | 58 |
| 3.3  | Proposed reconfigurable architecture for polynomial functions                    | 59 |
| 3.4  | $Cfg_{1\times 4}$ executes a single 4 <sup>th</sup> order function               | 60 |
| 3.5  | $Cfg_{2\times 2}$ executes two 2 <sup>nd</sup> order functions                   | 61 |
| 3.6  | Error free function $f(x)$ and approximate polynomial functions for              |    |
|      | $Cfg_{1\times 4}$ and $Cfg_{2\times 2}$ .                                        | 65 |
| 3.7  | Image processed for (a) $Cfg_{1\times 4}$                                        | 66 |
| 3.8  | Accuracy and energy efficiency results to process $160 \times 160$ pixels images |    |
|      | for <i>BSL</i> ranging from $2^8$ to $2^{12}$                                    | 69 |
| 4.1  | Stochastic implementation of edge detection filter using Robert's cross          |    |
|      | operator                                                                         | 73 |

| 4.2  | (a) XOR gate as absolute value subtractor and (b) $2{\times}1$ MUX as scaled                                                 |       |
|------|------------------------------------------------------------------------------------------------------------------------------|-------|
|      | adder                                                                                                                        | . 73  |
| 4.3  | Proposed methodology for design space exploration of combinational                                                           |       |
|      | filter architectures.                                                                                                        | . 75  |
| 4.4  | Photographs of the studied PhC nanocavity.                                                                                   | . 77  |
| 4.5  | An all-optical NOT gate implemented using nanocavity                                                                         | . 79  |
| 4.6  | Nanocavity operating as (a) a 2-input XOR gate                                                                               | . 81  |
| 4.7  | Nanocavity operating as (a) a $2 \times 1$ MUX | |
| 4.8  | Proposed model. | |
| 4.9  | (a) Transmission of nanocavity devices of ($Q_{S[gate]} = 700, 1500, 4000$, $M_{[gate]} = 1$) | |
| 4.10 | (a) Transmission of nanocavity devices of ($Q_{S[gate]} = 1050$, $M_{[gate]} = 1.5, 1, 0.5$) | |
| 4.11 | The OC architecture of edge detection filters | |
| 4.12 | SNGs for (a) XOR gates and (b) MUXs | |
| 4.13 | The transmission of two XOR gates and one MUX per stage | |
| 4.14 | Characterization results and model calibration | |
| 4.15 | For a nanocavity of $Q_{S[NOT]} = 2000$ and $M_{[NOT]} = 2$ | |
| 4.16 | Total laser powers of XOR gate | |
| 4.17 | Achievable <i>BER</i> at each stage for nanocavities with $M_{[MUX]} = 2$ | |
| 4.18 | Processed image: (a) error free, and $PSNR_{Total}$ | |
| 5.1  | Three implementations of SNG: | |
| 5.2  | SNG-based LFSR + modulated laser | |
| 5.3  | All-optical SNG using nanolasers | |

| 5.4 | Edge detection filter with (a) SNG-based LFSR + modulator with off-chip lasers |
|-----|----------------------------------------------------------------------------|
| 5.5 | Total energy consumption for processing images (A) and (B) |
| 5.6 | Processed images using (a) SNG-based LFSR and (b) a mix of SNG-based LFSR and all-optical SNG |
| 5.7 | The design of an $n$-bit adder proposed in (a) [75] and (b) our work. |
| 5.8 | The number of devices (solid lines), without interface, and the processing time (dashed lines) for an $n$-bit adder in [75] (blue color) and our work (red color). |
| 5.9 | Energy consumption for an $n$-bit adder implemented using the design in [75] (blue bars) and our work (red bars) |

## LIST OF ACRONYMS

| AC   | Approximate Computing                   |
|------|-----------------------------------------|
| ALU  | Arithmetic Logic Unit                   |
| AOF  | All-optical Add-drop Filter             |
| AOG  | All Optical Gate                        |
| BER  | Bit-Error Rate                          |
| BN   | Binary Number                           |
| BSL  | Bit Stream Length                       |
| CMOS | Complementary Metal Oxide Semiconductor |
| CNN  | Convolutional Neural Network            |
| CPU  | Central Processing Unit                 |
| CW   | Continuous Wave                         |
| D/A  | Digital/Analog                          |
| DC   | Directional Coupler                     |
| ED   | Error Distance                          |
| E/O  | Electro/Optics                          |
| ER   | Extinction Ratio                        |
| FIR  | Finite Impulse Response                 |
| FPGA | Field-Programmable Gate Array           |
| FSM  | Finite State Machine                    |
| FSR  | Free Spectral Range                     |
| IL   | Insertion Loss                          |
| IoT  | Internet-of-Things                      |
| LDPC | Low-Density Parity Check                |

| LFSR     | Linear Feedback Shift Register |
|----------|--------------------------------|
| LSB      | Least Significant Bit          |
| ML       | Machine Learning               |
| MTJ      | Magnetic Tunnel Junctions      |
| MED      | Mean Error Distance            |
| MRR      | Microring Resonator            |
| MSE      | Mean Square Error              |
| MUX      | Multiplexer                    |
| MZI      | Mach-Zehnder Interferometer    |
| NN       | Neural Networks                |
| OC       | Optical Computing              |
| OLUT     | Optical Lookup Table           |
| O/E      | Opto/Electronic                |
| ONoC     | Optical Networks on Chip       |
| OOK      | ON/OFF Keying                  |
| OSC      | Optical Stochastic Computing   |
| OTE      | Optical Tuning Efficiency      |
| PCM      | Phase Change Material          |
| PhC      | Photonic Crystal               |
| PIN      | Positive-Intrinsic-Negative    |
| PRN      | Pseudo Random Number           |
| PSNR     | Peak Signal-to-Noise Ratio     |
| Q factor | Quality factor                 |
| RAM      | Random Access Memory           |
| RDL      | Reconfigurable Directed Logic  |

| ReSC  | Reconfigurable Stochastic Computing    |
|-------|----------------------------------------|
| SN    | Stochastic Number                      |
| SNG   | Stochastic Number Generator            |
| SNR   | Signal-to-Noise Ratio                  |
| SOI   | Silicon-on-Insulator                   |
| SC    | Stochastic Computing                   |
| TIA   | Transimpedance Amplifier               |
| TMAC  | Tera Multiply-ACcumulate               |
| TPA   | Two-Photon Absorption                  |
| VCSEL | Vertical-Cavity Surface-Emitting Laser |
| WBG   | Weighted Binary Generator              |
| WDM   | Wavelength Division Multiplexing       |
| WLS   | Wavelength Spacing                     |

## Chapter 1

## Introduction

In this chapter, we first present the motivation of this PhD thesis and the problem statement. Then, we introduce the state-of-the-art for both stochastic and optical computing architectures. We present the proposed methodology and highlight the contributions of the thesis. Finally, we describe the thesis organization.

### 1.1 Motivation

Recent decades have witnessed a shift in the computing paradigm. Data-intensive applications, such as image processing and the internet of things (IoT), demand more computing resources, which in turn increases power consumption. Moreover, many computing systems nowadays are embedded and hence require energy-efficient hardware. For example, the number of IoT-connected devices worldwide increased from 3.8 billion in 2015 to 7 billion in 2018 [1]. Furthermore, it is expected to reach 21 billion devices by 2025 [2]. These devices include smartphones and tablets, which require a computing approach that saves processing energy.

Approximate computing (AC) is an energy-efficient technique that produces inexact results to reduce power consumption [3]. This technique is therefore suitable for error-tolerant applications, such as image processing and signal processing [4]. One of the commonly known AC approaches is stochastic computing (SC), which emerged in the 1960s [5]. In SC, numbers are represented as probabilities using stochastic bit streams [6]. The weight of all bits in a bit stream is the same, i.e., there are no least and most significant bits as in weighted binary numbers. In some conventional AC architectures [4], the approximation comes from truncating the least significant bits (LSB), and the accuracy can be enhanced by reducing the number of truncated bits. In SC, by contrast, the approximation results from generating bit streams, and the accuracy is improved by increasing the bit stream length (BSL). Unlike these conventional AC architectures, SC does not require any change in the design of the computing architecture, since the accuracy can be controlled by modifying only the BSL. Figure 1.1 shows the main building blocks of an SC architecture. It is composed of one processing unit and two interfaces, i.e., a stochastic number generator (SNG) and a de-randomizer. The SNG receives a binary number and generates the equivalent probability in a bit stream format. The probability is evaluated as the ratio of the number of '1's in the bit stream to the total number of bits, i.e., the BSL. The bit stream is processed serially by the computing unit. Then, the output bit stream is converted back to a binary number using a de-randomizer unit. Since all bits in the stream have the same weight, a bit flip results in only a small change in the probability; hence, SC is suitable for domains where soft and transient errors are of major concern [7].



Figure 1.1: SC blocks.

SC is an energy-efficient approach characterized by reduced hardware complexity. Elementary arithmetic operations can be implemented using simple logic gates. For example, 2-input multiplication and 2-input addition can be implemented using a single 2-input AND gate and a $2 \times 1$ multiplexer, respectively [8]. However, because bit streams are processed serially, the processing time is a major drawback. Moreover, increasing the computing accuracy requires longer bit streams, which significantly increases the computation latency. Researchers have investigated parallel design techniques to overcome the slow computation speed [9]. However, such approaches may lead to significant area and power overhead, and thus drastically limit the interest in this computing paradigm. Therefore, there is a need to find a technology that can accelerate the processing time of the SC approach.
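The gate-level arithmetic described above can be illustrated with a short Python sketch. It assumes unipolar encoding (a value is the fraction of '1's in the stream), independent input streams, and a MUX select stream of probability 0.5, so that the addition is scaled by 1/2. The helper names, the seed, and the BSL are illustrative choices, not part of any cited design.

```python
import random

def to_stream(p, length, rng):
    """Encode probability p as a list of 0/1 bits (unipolar format)."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def value(stream):
    """Estimate the encoded probability: number of 1s divided by the BSL."""
    return sum(stream) / len(stream)

def sc_multiply(a, b):
    """Bitwise AND of two independent streams estimates pa * pb."""
    return [x & y for x, y in zip(a, b)]

def sc_scaled_add(a, b, select):
    """2x1 MUX: pick a when select is 1, else b. With P(select)=0.5 this estimates (pa + pb) / 2."""
    return [x if s else y for x, y, s in zip(a, b, select)]

rng = random.Random(1)      # fixed seed so the run is repeatable
BSL = 4096                  # assumed bit stream length
a = to_stream(0.5, BSL, rng)
b = to_stream(0.25, BSL, rng)
s = to_stream(0.5, BSL, rng)

product = value(sc_multiply(a, b))          # close to 0.5 * 0.25 = 0.125
scaled_sum = value(sc_scaled_add(a, b, s))  # close to (0.5 + 0.25) / 2 = 0.375
print(product, scaled_sum)
```

Shortening the BSL in this sketch makes both estimates noisier, which is the accuracy versus latency trade-off discussed above.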

Due to light propagation characteristics, such as low latency and high bandwidth, nanophotonics technology is considered a good candidate to overcome the throughput limitations induced by the electrical domain [10]. Silicon photonics technology allows the combination of electrical and optical devices in the same design [11]. Moreover, silicon photonics devices can be manufactured using the same facilities available for CMOS [12]. Companies, such as Intel and IBM, have started using integrated optics for high-speed communications in data centers. For example, Intel 100G optical transceivers, available in the market, allow a data transfer rate of 100 Gbps [13]. In 2019, Intel announced the design of 400G transceivers that transfer data at a rate of 400 Gbps [14,15]. IBM offers optical transceivers that support speeds of up to 32 Gbps [16]. According to [17], optical interconnects can be considered a good candidate to be integrated in distributed and parallel computing systems for chip-to-chip or even on-chip communication. Recently, nanophotonics has been widely investigated in the design of optical interconnects, where different topologies based on system-level simulation have been proposed [18–21]. In these designs, wavelength division multiplexing (WDM) is exploited to allow the propagation of multiple wavelength signals on the same waveguide. This leads to an increase in the bandwidth and a reduction in hardware utilization. The first demonstration of on-chip optical interconnects was reported in [22]. In that work, a prototype of an on-chip electronic-photonic system was fabricated, containing a processor and a memory that communicate through optical transceivers. It is worth mentioning that optical interconnects can feature low energy dissipation per transmitted bit [10] due to the absence of the capacitive charging/discharging that occurs in the wires of electrical interconnects.
The design of approximate optical interconnects has been recently proposed, where the data that has a small impact on the accuracy can be transmitted with low power [23]. In [24], the energy efficiency can be further optimized by truncating data, which can be adapted at run-time.

Nanophotonics technology has also been investigated for use in computation. While CMOS-based architectures depend on the flow of electrons to perform the computation, optical technology relies on photon propagation. Nanophotonics cannot be considered a replacement for CMOS technology in the computing domain. Rather, it can be used to support the computation by accelerating the processing time for specific applications. WDM allows for parallel computation, which increases processing throughput since multiple signals are propagated and processed simultaneously. In [25], WDM is used to design an arithmetic logic unit (ALU) for an optical field-programmable gate array (FPGA), where multiple operations are executed in parallel. In 2020, Intel showed the design of a reconfigurable optical computer to accelerate solving partial differential equations in tens of picoseconds [6]. In 2017, researchers from MIT developed an optical accelerator for convolutional neural networks (CNN), which demonstrates a speedup of two orders of magnitude, i.e., a photodetection rate of 100 GHz, compared to an electronic implementation [26]. However, such an approach involves bulky optical devices, i.e., scaled in mm<sup>2</sup>, which does not allow the implementation of thousands of devices on a chip, hence limiting the scalability of nanophotonic accelerators. Architectures of cascaded devices, i.e., devices connected in series, also encounter another problem due to the power losses of the propagated signal. This raises the need for an approach that reduces the number of devices and hence enhances the scalability and cascadability of optical architectures. This can be achieved using the SC paradigm due to the reduced hardware complexity provided by the approach.

To sum up, the major bottleneck in the performance of SC architecture is the high latency induced by serial processing of bit streams. Nanophotonics technology could contribute to speeding up the computation. However, scalability remains one of the main issues in the optical computing (OC) domain that could be enhanced using the SC approach in designing optical architectures.

The objective of this thesis is to design, for the first time, SC architectures using integrated optics. Both SC and optical technology have a complementary nature. We mainly aim to benefit, on the one hand, from the acceleration provided by optical devices to overcome the slow processing in SC and, on the other hand, from the reduced number of devices used in SC to increase design scalability in optical computing. We propose the design of optical stochastic computing (OSC) architectures that can execute polynomial functions and combinational filters. We focus mainly on implementing the computing part using all-optical gates. We explore the design space at system-level and device-level parameters to optimize the power consumption and evaluate the computing accuracy and processing time of a given application.



Figure 1.2: (a) SNG and (b) de-randomizer.

### 1.2 State-of-the-Art

In this section, we present the most relevant related work of SC and OC architectures.

### 1.2.1 Stochastic Computing

In SC, numbers are represented as probabilities. The SNG converts a binary number to a bit stream of a given length. A common implementation of the SNG uses a linear feedback shift register (LFSR) [27] and a comparator, as shown in Figure 1.2(a). The generation of bit streams is performed as follows. A binary number (BN) of size $m$ requires a minimum BSL of $2^m$. An LFSR of $m$ bits is also needed; it generates a sequence of $2^m - 1$ pseudo-random numbers (PRNs), since the all-zeros state cannot be reached. Each PRN is compared against the BN: if PRN < BN, then bit '1' is generated; otherwise, bit '0' is generated. At each clock cycle, the LFSR is shifted, a comparison is performed, and a new bit in the bit stream is generated. To convert the bit stream back to a binary number, a de-randomizer unit is required. It is commonly implemented using a counter, shown in Figure 1.2(b), that counts the number of '1's in the bit stream.
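The SNG and de-randomizer steps above can be sketched in a few lines of Python. The sketch assumes a 4-bit Fibonacci LFSR whose feedback is the XOR of its two lowest bits, which cycles through all 15 nonzero states; the tap choice, the seed, and the function names are assumptions made for illustration and are not taken from [27].

```python
def lfsr4_states(seed=0b0001):
    """Yield the 15 nonzero states of a 4-bit maximal-length Fibonacci LFSR."""
    state = seed
    for _ in range(15):
        yield state
        feedback = (state ^ (state >> 1)) & 1    # XOR of the two lowest bits
        state = (state >> 1) | (feedback << 3)   # shift right, insert feedback at the top

def sng(bn, seed=0b0001):
    """Comparator-based SNG: emit '1' when PRN < BN, else '0'."""
    return [1 if prn < bn else 0 for prn in lfsr4_states(seed)]

def derandomize(stream):
    """Counter-based de-randomizer: count the '1's in the stream."""
    return sum(stream)

stream = sng(8)             # encode the 4-bit binary number 8
ones = derandomize(stream)
print(stream, ones, ones / len(stream))
```

For BN = 8, seven of the fifteen LFSR states are smaller than 8, so the counter returns 7 and the encoded value is 7/15 rather than exactly 8/16. This small bias comes from the missing all-zeros state.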

One of the state-of-the-art SC architectures is the reconfigurable SC (ReSC) proposed in [28]. The architecture is implemented using a combinational circuit that can execute any arbitrary single input function by converting it to a Bernstein polynomial [29]. The inputs must be uncorrelated, i.e., each bit stream is generated from a separate SNG. In [30], another reconfigurable SC architecture is proposed, which is based on sequential logic. It allows for a trade-off between hardware complexity and computing accuracy by changing some configurations, such as the number of states and the number of inputs. The design can reduce hardware complexity and improve energy efficiency by 30% and 40% compared to conventional design, respectively.
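To make the Bernstein-polynomial evaluation concrete, the following Python sketch simulates one common way to describe a ReSC-style datapath: an adder counts the '1's among $n$ independent copies of the input stream, and a MUX uses that count to select one of $n+1$ coefficient streams. This structure is stated here as an assumption rather than a detail taken from [28], and the cubic coefficients, seed, and BSL are illustrative values.

```python
import random
from math import comb

def bernstein_value(coeffs, x):
    """Exact value of sum_k b_k * C(n,k) * x^k * (1-x)^(n-k)."""
    n = len(coeffs) - 1
    return sum(b * comb(n, k) * x**k * (1 - x)**(n - k) for k, b in enumerate(coeffs))

def resc_estimate(coeffs, x, length, rng):
    """Stochastic estimate of the Bernstein polynomial at x."""
    n = len(coeffs) - 1
    ones = 0
    for _ in range(length):
        # Adder: count the 1s among n independent bits, each 1 with probability x.
        k = sum(rng.random() < x for _ in range(n))
        # MUX: output the current bit of the k-th coefficient stream.
        ones += rng.random() < coeffs[k]
    return ones / length

coeffs = [2/8, 5/8, 3/8, 6/8]   # assumed Bernstein coefficients, all within [0, 1]
rng = random.Random(7)
exact = bernstein_value(coeffs, 0.5)
estimate = resc_estimate(coeffs, 0.5, 8192, rng)
print(exact, estimate)
```

Because every coefficient lies in [0, 1], each one can itself be encoded as a bit stream, which is what allows the polynomial to be evaluated with only an adder and a MUX in this sketch.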

SC can also target application-specific architectures for different domains. Contrast stretching [31] and edge detection [32] are image processing filters that can be implemented using combinational and sequential logic elements. For example, the edge detection filter proposed in [32] is composed of MUXs and XOR gates to perform addition and absolute value subtraction operations, respectively. In the medical domain, the design of retinal implants for blind people using SC for image processing is proposed in [33]. The chip is located in the retina and receives a stream of data to be processed in real-time. The implementation of finite impulse response (FIR) filters using SC has also been widely studied. However, most designs suffer from limited scalability as the filter order increases, which is due to the low accuracy of stochastic scaled adders [6]. For example, an $m$-tap filter down-scales the result by $1/2^{m-1}$. To address this, a non-scaled adder was proposed in [34], where the design contains combinational circuits and a counter. This results in a large design area and hence high power consumption. In the communication domain, SC is proposed for decoding low-density parity-check (LDPC) codes [35], which are error correction codes used for reliable transmission over noisy channels. The parity check and equality check operations can be implemented using SC circuits [7].
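The use of an XOR gate for absolute value subtraction can be checked with a small Python sketch. It relies on a standard SC property, stated here as an assumption about how the operands are generated: the XOR output encodes $|p_a - p_b|$ only when both streams are derived from the same random source, i.e., when they are correlated. The function names, seed, and BSL are illustrative.

```python
import random

def correlated_streams(pa, pb, length, rng):
    """Generate two streams that share one random number per clock cycle."""
    a, b = [], []
    for _ in range(length):
        r = rng.random()
        a.append(1 if r < pa else 0)
        b.append(1 if r < pb else 0)
    return a, b

def independent_streams(pa, pb, length, rng):
    """Generate two streams from separate random numbers."""
    a = [1 if rng.random() < pa else 0 for _ in range(length)]
    b = [1 if rng.random() < pb else 0 for _ in range(length)]
    return a, b

def xor_value(a, b):
    """Fraction of 1s at the output of a bitwise XOR gate."""
    return sum(x ^ y for x, y in zip(a, b)) / len(a)

rng = random.Random(3)
BSL = 4096
abs_diff = xor_value(*correlated_streams(0.7, 0.4, BSL, rng))    # close to |0.7 - 0.4| = 0.3
uncorrelated = xor_value(*independent_streams(0.7, 0.4, BSL, rng))  # close to 0.7*0.6 + 0.4*0.3 = 0.54
print(abs_diff, uncorrelated)
```

The second result shows why the choice of SNG matters: with independent streams the same XOR gate computes $p_a(1-p_b) + p_b(1-p_a)$ instead of the absolute difference.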

Neural networks (NNs) are useful in many applications, such as character recognition, speech recognition, and spell checking. NNs are composed of multiple layers, where each layer contains multiple neurons. Therefore, the design of NNs requires significant hardware resources and results in high power consumption. Hence, their implementations have been investigated in the context of SC [36]. Figure 1.3 illustrates the structure of a neuron implemented using the SC approach, where three primary operations are presented. Multiplication and addition can be implemented using AND gates and MUXs, respectively. The activation function can be designed using a finite state machine (FSM) to implement tanh or exponentiation circuits. In [37], a design of a CNN using stochastic computing is proposed, where the summation is implemented using a parallel counter instead of a MUX. The output of the counter is a binary number used as an input to a non-stochastic activation function. The design is tested for handwriting recognition, where the results showed an increase in the area compared to the MUX implementation but with an enhancement in the accuracy. The results also demonstrated a $151 \times$ improvement in power consumption with a 2.86% increase in the error compared to a conventional binary design. A CNN relying on a hybrid bit stream-binary approach is proposed in [38]. The design of the first layer is based on deterministic bit streams for accurate and fast computing. The results show a $19 \times$ area reduction and $16 \times$ power saving compared to the non-pipelined fixed-point binary design.

As can be seen, SC can be integrated into many domains, such as image and signal processing, medical and communication applications. It can reduce design area and save energy; however, the high processing time, due to the long BSL required to control the accuracy, remains the major problem of this approach and shifts interest toward other, faster AC approaches.



Figure 1.3: The basic structure of a neuron [36].

### 1.2.2 Optical Computing Architectures

Remarkable achievements in OC date back more than 70 years. The main focus was on exploiting the Fourier transform properties of lenses to implement applications, such as pattern recognition [39]. The invention of lasers [40] enabled the development of coherent processors, e.g., for information processing [41]. However, the domain faced issues related to the performance and high fabrication cost of optical devices. At the same time, the rapid development in digital computing architectures and the high performance of electrical processing decreased the interest in OC. Nevertheless, the research in OC processors continued, and significant progress was achieved in areas such as optical memory and pattern recognition. Moreover, OC architectures were designed for other applications, such as matrix operation [42] and NNs [43]. Later on, much research targeted the design of digital OC [44, 45], where vertical-cavity surface-emitting lasers (VCSELs) were used as a source of light. Nowadays, there are companies specialized in designing accelerators for machine learning (ML) domains, such as LightOn [46] and Lightmatter [47]. For example, Lightmatter integrates optical devices with electrical circuits on the same chip, where optical modulators are used to perform matrix-vector multiplication. The modulators are arranged as a two-dimensional matrix. The configuration (operating state) of the modulators and the propagation of the input signals through the devices lead to matrix multiplication, i.e., the amplitude of the input signal is multiplied by the transfer matrix.

Silicon photonics is an emerging technology that uses silicon material to fabricate optical devices. For instance, a silicon-on-insulator (SOI) platform allows using the same manufacturing process as CMOS technology. Due to its compatibility with CMOS technology, silicon photonics devices can be integrated on the same chip with electronics devices [48]. Examples of silicon photonics devices are Mach-Zehnder interferometer (MZI) [49], microring resonator (MRR) [50], and directional coupler (DC) [51]. Another material that can be used to fabricate photonic devices is III/V semiconductors [52]. It can be bonded directly on SOI waveguides, which allows the fabrication of other devices, such as photonic crystal (PhC) nanocavities [53] and PhC nanolasers [54].

MZI, MRR, and PhC nanocavity can be used to either modulate, switch, or filter optical signals. They can be controlled either electrically or optically. Electrical control can be achieved by applying an external electrical signal that changes the refractive index of the material, and hence modulates the phase and amplitude of the propagated signal. This can be achieved in three different ways: i) electro-optic effect [55], where the refractive index of the material changes with the applied electrical field; ii) thermo-optic effect [56], where the refractive index changes by changing the temperature of the material; and iii) free-carrier induced electro-refractive effect [57], where the refractive index changes by modifying the carrier concentration in the material. It is worth mentioning that the electro-optic effect has the fastest response time. Optical devices can also be controlled optically using a high-power optical signal (usually in mW). This induces a nonlinear effect, such as the two-photon absorption (TPA) effect and the optical Kerr effect [58]. This type of control is faster and less power consuming than electrical control since it enables the optical signal to propagate from the input to the output of the architecture without the need for electro-optics (E/O) or opto-electronic (O/E) conversion [59].

In addition to the modulators and filters, any optical architecture requires: i) waveguides for light propagation and to connect between optical devices; ii) lasers as light sources to emit optical signals; and iii) photodetectors to receive the processed signal, which are detailed as follows:

- Waveguides: Silicon photonics waveguides are fabricated on SOI substrates. Recent waveguides exhibit low optical propagation losses, i.e., 0.1-1 dB/cm [60].
- Lasers: They are responsible for injecting an optical signal of sufficient power into the design for processing. Lasers can be off-chip or on-chip [61] to integrate all-optical devices on the same chip. However, heating issues still need more investigation [62]. Throughout this thesis, we consider using off-chip lasers.
- Photodetectors: They are used to convert the received optical signal to the electrical domain [63]. In order for the photodetector to be efficient for the OC domain, it needs to have: i) a high operating rate, i.e., in Gb/s [64]; ii) a low dark current (a few nA); iii) a high responsivity (0.6-1 A/W) [65,66]; and iv) a low bit-error rate (*BER*) (10<sup>-12</sup> to 10<sup>-18</sup>). In an OC architecture, a photodetector is normally connected to a transimpedance amplifier (TIA) to amplify the received signal. Then, a comparator is used to produce the equivalent binary bit of the received light power, i.e., '1' or '0'.
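The receiver chain described in the photodetector item (photodetector, TIA, comparator) can be sketched as follows, assuming an ideal, noiseless detector. The responsivity of 0.8 A/W lies within the 0.6-1 A/W range quoted above; the TIA gain, comparator threshold, and optical power levels are hypothetical values chosen only for illustration.

```python
def detect_bit(optical_power_w, responsivity_a_per_w=0.8,
               tia_gain_ohm=10_000.0, threshold_v=0.5):
    """Convert received optical power to a binary bit."""
    photocurrent_a = responsivity_a_per_w * optical_power_w  # photodetector: I = R * P
    voltage_v = photocurrent_a * tia_gain_ohm                # TIA: V = I * gain
    return 1 if voltage_v > threshold_v else 0               # comparator decision

def count_ones(received_powers_w):
    """Counter that accumulates the detected '1's of a received bit stream."""
    return sum(detect_bit(p) for p in received_powers_w)

# Hypothetical received powers: 100 uW for a '1', 10 uW for a '0'.
received = [100e-6, 10e-6, 100e-6, 100e-6]
bits = [detect_bit(p) for p in received]
ones = count_ones(received)
print(bits, ones)
```

In this sketch a 100 µW signal produces 0.8 V at the TIA output and is read as '1', while a 10 µW signal produces 0.08 V and is read as '0'.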

OC architectures using silicon photonics devices have been investigated in order to design accelerators. In the following, we present some of these architectures according to the optical devices used in the implementation:

- The MZI, shown in Figure 1.4(a), is commonly used as an electro-optics modulator and switch. It is the most mature device among silicon photonics devices. For instance, MZI is used in [67] to design a reconfigurable mesh required to enable different functionalities in the architecture of microwave processors, which can support fiber-wireless communication, especially in 5G and IoT domains. MZI is also used in the design of fully optical NN in [26]. The proposed design is composed of 56 MZIs, each of them has two phase shifters; one to split the input power between MZI waveguides and the other is used to control the phase of the input signal. The design shows 10× speed up compared to the electrical domain. It is also demonstrated in the design of any arbitrary linear function [68]. For example, the interference of the input signal through the mesh can represent a linear vector-matrix product, where the input signals and mesh represent the vector and matrix, respectively.
- The MRR, shown in Figure 1.4(b) and (c), is used as modulator and add-drop filter to design reconfigurable architectures. One of these architectures is the reconfigurable directed logic (RDL) [72], which is designed based on the sum of products concept of combinational circuits. This design has two stages; the calculation of the products of the function and then the sum of these products. The design requires O/E conversion between stages, which contributes to an increase in power consumption. In [25], an optical lookup table (OLUT) relying on MRRs is designed, where WDM allows executing multiple functions in parallel. The design reduces the number of MRRs used by two orders of magnitude compared to RDL. In [73], a $4 \times 4$ swirl reservoir topology designed using nonlinear MRRs is implemented. The design is composed of an input layer, a swirl topology that contains 16 nodes representing the reservoir, and the output layer. The performance of the design is evaluated by implementing a 2-bit delayed XOR task, where studies on the input power and wavelength detuning are conducted to choose an operating point of the reservoir. The results show that the design can reach a $2.5 \times 10^{-4}$ error rate.

Figure 1.4: Silicon photonics devices: (a) MZI [49], (b) MRR as modulator [69], (c) MRR as add-drop filter [70], (d) DC [71], and (e) PhC nanocavity [53].

- The DC, shown in Figure 1.4(d), is used in the design of full adders, where a carry-in signal remains propagating in the optical domain [74]. The design can be cascaded for an *n*-bit ripple carry adder, which requires duplicating hardware resources and hence limits the scalability of the design. An optical multiplier was proposed in [75], which relies on the optical full adder design in [74]. The performance of the design is compared to a CMOS Wallace tree multiplier [76]. The results show that the design in [75] is 3× faster than the electrical one.

- The PhC nanocavity, shown in Figure 1.4(e), is proposed in the design of an all-optical random access memory (RAM) [77], where the device is used as a bistable switch. The optical RAM reports a memory time of 1µs compared to 250ns demonstrated in [78]. Writing, storage, reading, and erasing operations are demonstrated. An all-optical-gate (AOG) is designed in [53] using PhC nanocavity. The nonlinear effect is based on TPA, which causes a blue shift in the resonant wavelength of the device. Hence the device can act as a switch to pass or block the transmitted signal.
- Phase change material (PCM) [79] has been studied recently in the implementation of photonics applications. PCM can switch between two states, i.e., amorphous and crystalline, by applying an electrical or optical signal. Switching between these two states involves changing material properties, which makes PCM suitable for designing non-volatile data storage [79]. By applying a pump signal above a threshold value, the PCM is put in the write (amorphous) state, while erasing (recrystallization) can be achieved by applying a train of decreasing-energy pulses, as proposed in [79]. This allows multi-level access by storing bits between the fully amorphous state and the fully crystalline state. The PCM has been recently investigated in the design of on-chip in-memory processing, such as the implementation of scalar multiplication using a single PCM cell [80], as shown in Figure 1.5(a). In this design, the input signal power is multiplied by the transmittance of the device, which is controlled by another signal representing the write operation. This led to the design of matrix-vector multiplication, i.e., using multiple PCM cells, as shown in Figure 1.5(b). In this case, the addition is performed by combining multiple scalar products using a power splitter, which causes a 50% loss in the resulting power. The main challenge of using this device for computing is the need to reduce the power consumption required to change the state (phase) of the cell, especially for multi-level operations. In [81], the design of a photonic hardware accelerator using PCM is proposed that can perform parallel matrix-vector multiplication operations at a rate of several Tera multiply-accumulate per second (TMAC/s) to process images using convolution filters.

Figure 1.5: (a) PCM as scalar multiplier and (b) two PCMs used to implement matrix vector multiplication [80].
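The PCM-based multiplication described above reduces to simple power arithmetic, sketched below in Python. The sketch assumes an idealized cell whose output power is the input power times its transmittance, and a combiner that passes 50% of the summed power, as stated above for the two-cell case; the function names and numeric values are illustrative and not taken from [80].

```python
def pcm_scalar_multiply(input_power_mw, transmittance):
    """One PCM cell: output power = input power * programmed transmittance."""
    if not 0.0 <= transmittance <= 1.0:
        raise ValueError("transmittance must lie in [0, 1]")
    return input_power_mw * transmittance

def pcm_dot_product(input_powers_mw, transmittances, combiner_efficiency=0.5):
    """Combine the scalar products of several cells; the combiner loses half the power."""
    products = [pcm_scalar_multiply(p, t)
                for p, t in zip(input_powers_mw, transmittances)]
    return combiner_efficiency * sum(products)

# Two cells: input powers (mW) act as the vector, transmittances as one matrix row.
result = pcm_dot_product([1.0, 2.0], [0.5, 0.25])
print(result)
```

Here the two scalar products are 0.5 mW each, and the combiner delivers half of their 1.0 mW sum, so any receiver in this idealized model would have to rescale the detected power by a factor of two.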

These OC architectures aim to accelerate the processing speed. However, they may rely on bulky devices, such as MZI (scaled in mm<sup>2</sup>). Although other architectures use MRR in their implementations, which has a smaller footprint than MZI, i.e., scaled in 100s of $\mu$m<sup>2</sup>, MRR's area is still relatively high for on-chip computing. Furthermore, in the above-stated architectures, the power losses of the propagated signals represent an issue, especially for designs that are composed of cascaded stages, such as the adder designed using DC [75]. This requires converting the signal to the electrical domain and regenerating it again (O/E and E/O conversions) or increasing the power of the data signal, which could result in triggering an undesired nonlinear effect. These solutions have a significant impact on the energy consumption and design area, and design scalability remains an issue with the currently available devices. Therefore, another computing approach that relies on serial data processing can help in reducing the hardware complexity. Moreover, optical devices with smaller footprints can further contribute to this reduction and eventually enhance design scalability.

### 1.3 Proposed Methodology

As mentioned earlier, the SC approach extensively reduces hardware complexity, while its intrinsic serial processing affects computing throughput. Therefore, the aim of this thesis is to investigate the use of integrated photonic devices to design SC architectures, which would contribute to accelerating the computing architectures. This would also help in achieving a scalable design for OC architectures. A general overview of our proposed methodology is depicted in Figure 1.6. The inputs to the methodology are: i) an application represented as a mathematical function; ii) the input data to be processed; and iii) the technological and system-level parameters to be explored. Based on the application, a computing architecture is selected and the design space is explored for a given set of design parameters. The methodology has two main phases. The first phase is the design of optical SC libraries. This includes the libraries for the OC part and the interfaces. We aim to design an architecture that executes polynomial functions and architectures that are composed of combinational gates, such as filters in image processing. The second phase is to explore the design space of an OSC architecture and evaluate the performance, i.e., energy efficiency, computing accuracy, and processing time. These phases are detailed as follows:

1. Optical Stochastic Computing Libraries: In general, libraries are reusable blocks that can be customized to fulfill different requirements. Hence the proposed designs should be generic. Moreover, the proposed architectures are independent of the devices, where the integration of a new modulator or switch requires using the transmission model of this device. In this thesis, we propose to design libraries for optical processing and optical interfaces as follows:



Figure 1.6: Proposed methodology.

#### • Optical processing libraries

- Architectures that execute polynomial functions. Our design is based on the state-of-the-art ReSC architecture [28] that executes Bernstein polynomial functions. The design is implemented using devices working under different physical effects.
- Reconfigurable architectures that can be configured to execute one or more Bernstein polynomial functions simultaneously. The design needs additional devices to switch between different configurations.
- Combinational filter architectures, such as edge detection filters. For this purpose, we investigate the use of PhC nanocavities to design all-optical cascaded gate architectures.

#### • Optical interfaces libraries

- SNG architectures to generate the required bit streams for stochastic processing. We propose different implementations of SNG dedicated to OSC.
- A De-randomizer circuit that converts an output optical signal to a binary number. For this purpose, we use a photodetector followed by a counter to generate the binary number.
2. **Design Space Exploration:** The design flow allows optimizing the energy efficiency of the architecture and evaluating the computing accuracy for the given application. This is detailed as follows:
   - Energy optimization: The aim is to optimize the design energy efficiency according to the targeted application-level computing accuracy through a reduction in the total energy consumption. Since the proposed architectures involve numerous parameters, at both system and device levels, that impact the energy consumption per computed bit, we develop a transmission model for each architecture that takes these parameters into account. We use the C programming language to develop the transmission models and evaluate the energy consumption of the designs.
   - Accuracy evaluation: There are three sources of error in the design of OSC architectures: errors related to i) the SC domain, due to the generation of bit streams; ii) the optical domain, due to transmission robustness; and iii) the architecture itself, such as the order of the polynomial function. We use MATLAB to calculate the computing accuracy for a given application.

As an output, the methodology provides multiple design options to execute a given application, each characterized by application-level energy efficiency, computing accuracy, and processing time.

# **1.4** Thesis Contributions

Stochastic computing is an energy efficient paradigm, where data is processed serially. This significantly impacts the processing time and hence designing an architecture using nanophotonics technology can accelerate the processing speed. On the other hand, SC can enhance the scalability of optical architectures due to the reduced number of devices used in the design. Therefore, the main objective of this thesis is to investigate the design of SC architectures using integrated optics. In the following, we list the main contributions of this thesis along with references to related publications provided in the Biography section at the end of this document.

- The design of an OSC architecture for polynomial functions. The proposed design can execute any arbitrary single input function by changing the polynomial coefficients. The proposed design is generic and can execute a polynomial function of different orders. We develop a transmission model to estimate energy consumption and propose two design methods to explore device-level parameters in order to optimize laser power consumption [Bio-Cf2].
- A framework to explore the design of OSC architectures for polynomial functions. The proposed framework allows the optimization of the design energy efficiency according to application-level computing accuracy. We evaluate the computing accuracy using an image processing application, i.e. Gamma correction function, considering multiple combinations of polynomial order, *BSL*, and *BER* [Bio-Jr2].
- A reconfigurable OSC architecture for polynomial functions. The order of the proposed architecture can be configured during execution time according to the design requirements. We explore the design space at device-level and system-level parameters. We estimate the energy consumption and evaluate the computing accuracy by implementing a Gamma correction application. We explore a trade-off between energy consumption, computing accuracy, processing time, and design throughput [Bio-Cf1].

- A cascaded gates OSC architecture based on PhC nanocavities. We develop a transmission model of the device that involves device parameters, such as resonant wavelength and wavelength detuning, and propose the design of all-optical logic gates using nanocavities. We exploit the different quality factors of nanocavities to design all-optical cascaded multiplexers that are useful in image processing filters. We explore laser power, *BER*, and *BSL* to optimize energy consumption and evaluate computing accuracy by implementing an edge detection application. This work is a collaboration with Thales in France, who provided us with the device characteristics and experimental results used to validate our transmission model [Bio-Jr1, Bio-Tr1].
- Different implementations of SNG architectures. The one based on LFSR with modulators is used from Chapters 2 to 4 in this thesis [Bio-Jr1]. Another two designs based on on-chip directly modulated lasers and all-optical SNG are also proposed. A comparison between these implementations is conducted in the context of edge detection application to estimate the energy efficiency and evaluate the computing accuracy. We use all-optical SNG to design an all-optical *n*-bit adder and compare its performance with conventional optical architecture.

# 1.5 Thesis Organization

The rest of the thesis is organized as follows: In Chapter 2, we present the design of OSC architectures for polynomial functions and propose a transmission model to optimize energy consumption. We explore system-level parameters, such as architecture order, *BSL* and *BER*. We estimate energy consumption according to computing accuracy by implementing a Gamma correction application. In Chapter 3, we describe a reconfigurable architecture for polynomial functions. The design is based on the architecture proposed in Chapter 2. In this design, the architecture order can be configured during run-time to enhance the computing accuracy or to increase the design throughput.

In Chapter 4, we present the design of all-optical gates, i.e., NOT gate, XOR gate, and multiplexer, using PhC nanocavities. We develop the device transmission model and investigate the design of cascaded gates architectures using nanocavities. We target the design of edge detection filters. System-level exploration of laser power, *BSL* and *BER* is carried out to process gray-scale images.

In Chapter 5, we discuss different SNG designs that can be used with OSC architectures. The designs are either based on LFSR (electrical part) or are fully optical (using lasers). We use an edge detection application to compare these designs in terms of computing accuracy and energy consumption. We also compare our proposed design of an all-optical n-bit adder with a conventional design in the optical domain in terms of energy consumption, processing time, and hardware complexity. Finally, we conclude the thesis in Chapter 6 and provide future research directions.

# Chapter 2

# Optical Stochastic Computing Architecture for Polynomial Functions

In this chapter, we present the design of the first OSC architecture in the optical processing library. The design aims to execute polynomial functions of order n. It is based on the ReSC architecture in the electrical domain that targets the implementation of Bernstein polynomial functions [29]. We first give an overview of the ReSC architecture, then present the silicon photonics devices used in the design. We propose a methodology to explore the system-level and device-level parameters, which allows optimizing energy efficiency according to application-level computing accuracy.

# 2.1 Overview

The ReSC architecture [28] corresponds to the implementation, in the stochastic domain, of the Bernstein polynomial function of order n given in Equation 2.1.

$$B(x) = \sum_{i=0}^{n} b_i B_{i,n}(x) \tag{2.1}$$

where x is the input, n is the polynomial order,  $B_{i,n}(x)$  is the Bernstein basis polynomial of order n:

$$B_{i,n}(x) = \binom{n}{i} x^{i} (1-x)^{n-i} \tag{2.2}$$

and  $b_i$  is the Bernstein polynomial coefficient:

$$b_i = \sum_{j=0}^{i} \frac{\binom{i}{j}}{\binom{n}{j}} a_j \tag{2.3}$$

As illustrated in Figure 2.1(a), the ReSC architecture is implemented using a combinational circuit composed of an adder and a multiplexer. The computation is carried out as follows: i) n SNGs generate n stochastic bit streams, $x_1$ to $x_n$, of the data input x; ii) n + 1 SNGs generate the bit streams $z_0$ to $z_n$ for the Bernstein polynomial coefficients; iii) the coefficient streams are multiplexed to the output according to the sum of the input data ($x_1$ to $x_n$); and iv) the number of received ones is counted to de-randomize the data. We illustrate the configuration of the ReSC architecture using the following 3<sup>rd</sup> order polynomial function:

$$f(x) = \frac{1}{4} + \frac{9}{8}x - \frac{15}{8}x^2 + \frac{5}{4}x^3 \tag{2.4}$$



Figure 2.1: (a) ReSC architecture with (b) an example of 3<sup>rd</sup> order polynomial function.

where  $a_0=1/4$ ,  $a_1=9/8$ ,  $a_2=-15/8$ , and  $a_3=5/4$  are the polynomial coefficients. By using Equation 2.3, the Bernstein polynomial coefficients are  $b_0=2/8$ ,  $b_1=5/8$ ,  $b_2=3/8$ , and  $b_3=6/8$ . From Equation 2.1, the 3<sup>rd</sup> order Bernstein polynomial function is:

$$f(x) = \frac{2}{8}B_{0,3}(x) + \frac{5}{8}B_{1,3}(x) + \frac{3}{8}B_{2,3}(x) + \frac{6}{8}B_{3,3}(x) \tag{2.5}$$
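The conversion from the power-form coefficients $a_j$ to the Bernstein coefficients $b_i$ can be checked numerically. The following is a minimal sketch of Equation 2.3 using exact rational arithmetic; the function names `power_to_bernstein` and `bernstein_eval` are illustrative and do not come from the thesis tooling.

```python
from fractions import Fraction
from math import comb


def power_to_bernstein(a):
    """Map power-form coefficients a_0..a_n to Bernstein coefficients b_0..b_n (Equation 2.3)."""
    n = len(a) - 1
    return [
        sum(Fraction(comb(i, j), comb(n, j)) * a[j] for j in range(i + 1))
        for i in range(n + 1)
    ]


def bernstein_eval(b, x):
    """Evaluate B(x) = sum_i b_i * C(n, i) * x^i * (1 - x)^(n - i) (Equations 2.1 and 2.2)."""
    n = len(b) - 1
    return sum(b[i] * comb(n, i) * x**i * (1 - x) ** (n - i) for i in range(n + 1))


# Coefficients of the 3rd order example in Equation 2.4.
a = [Fraction(1, 4), Fraction(9, 8), Fraction(-15, 8), Fraction(5, 4)]
b = power_to_bernstein(a)
print(", ".join(str(v) for v in b))          # fractions are printed in lowest terms
print(bernstein_eval(b, Fraction(1, 2)))     # value of the polynomial at x = 0.5
```

With the coefficients of Equation 2.4, the sketch reproduces the values 2/8, 5/8, 3/8, and 6/8 (printed in lowest terms) and evaluates to 1/2 at x=0.5.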

Figure 2.1(b) illustrates the implementation of the $3^{rd}$ order ReSC architecture assuming x=0.5 and BSL=8. Three SNGs convert x into bit streams $x_1$ to $x_3$, and four SNGs convert the polynomial coefficients $b_0$ to $b_3$ into bit streams $z_0$ to $z_3$. The coefficient streams are multiplexed to the output according to the sum of the input data ($x_1$ to $x_3$). Finally, the resulting bit stream is converted to a binary number equal to 0.5. Since the hardware complexity is low, i.e., two computation units, the design is well suited for transposing the SC architecture into the optical domain.
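The adder and multiplexer behaviour described above can be reproduced with a short software model. The sketch below is only an illustration: it assumes ideal SNGs built on Python's pseudo-random generator (not an LFSR), and the names `sng` and `resc` as well as the seed are hypothetical.

```python
import random


def sng(p, length, rng):
    """Idealized SNG: every bit is '1' with probability p (assumption: independent bits)."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def resc(x, b, bsl, rng):
    """Software model of an order-n ReSC unit for input x and Bernstein coefficients b."""
    n = len(b) - 1
    xs = [sng(x, bsl, rng) for _ in range(n)]    # n data streams x_1..x_n
    zs = [sng(bk, bsl, rng) for bk in b]         # n + 1 coefficient streams z_0..z_n
    ones = 0
    for t in range(bsl):
        k = sum(stream[t] for stream in xs)      # adder: number of ones among x_1..x_n
        ones += zs[k][t]                         # multiplexer: forward the bit of z_k
    return ones / bsl                            # de-randomizer: count of ones over BSL


rng = random.Random(2021)                        # arbitrary seed, for repeatability only
b = [2 / 8, 5 / 8, 3 / 8, 6 / 8]                 # coefficients of Equation 2.5
estimate = resc(0.5, b, 1 << 14, rng)
print(estimate)                                  # close to f(0.5) = 0.5 for a long stream
```

A long bit stream is used here so that the estimate is close to the exact value; with BSL=8, as in Figure 2.1(b), the result is quantized to multiples of 1/8 and fluctuates from run to run.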

## 2.2 Silicon Photonics

In the following, we introduce the silicon photonics devices that we use in designing the ReSC architecture.

• *MZI*: a $1 \times 1$ MZI modulator is shown in Figure 2.2. The input signal power is equally split and transmitted to two parallel waveguides. On one arm, the signal continues propagating at a speed determined by the silicon refractive index. On the other arm, the refractive index is modified using the electro-optic effect: when '1' is applied, the signal slows down and a $\pi$ phase shift is obtained. Hence, depending on the applied voltage, constructive or destructive interference is obtained when both signals are combined at the output, i.e., output='1' or '0', as shown in Figure 2.2(a) and (b), respectively. The transmission of the device is given by:

$$T^{MZI}[v] = \begin{cases} IL_{\%}, & v = 0 \quad \text{(constructive state)} \\ IL_{\%} \times ER_{\%}, & v = 1 \quad \text{(destructive state)} \end{cases} \tag{2.6}$$

where IL is the insertion loss, defined as the loss of signal power due to transmission through the device, and ER is the extinction ratio, defined as the ratio between the transmission of data as '1' and as '0'. $IL_{dB}$ and $ER_{dB}$ denote the values of $IL_{\%}$ and $ER_{\%}$ expressed in dB.



Figure 2.2: MZI device.

• MRR as Modulator: Figure 2.3 illustrates a modulator implemented using an MRR controlled by a voltage applied to its positive-intrinsic-negative (PIN) junction [82]. In the initial state, i.e., no voltage is applied as shown in Figure 2.3(a), the MRR resonant wavelength is set to $\lambda_2$. This leads to the coupling of the light at wavelength $\lambda_2$ into the ring, so that only a small fraction of the signal power is transmitted, i.e., output='0'. When a voltage is applied, as shown in Figure 2.3(b), the resonant wavelength of the MRR is blue shifted and most of the input signal power is transmitted to the output, i.e., output='1'. Equation 2.7 gives the through transmission $\varphi_t$ of the MRR modulator to the output, as defined in [83].

Figure 2.3: MRR device.

$$\varphi_t(\lambda_{signal}, \lambda_{res}) = \frac{a^2(\lambda_{res})r_2^2 - 2a(\lambda_{res})r_1r_2\cos[\theta(\lambda_{signal}, \lambda_{res})] + r_1^2}{1 - 2a(\lambda_{res})r_1r_2\cos[\theta(\lambda_{signal}, \lambda_{res})] + [a(\lambda_{res})r_1r_2]^2} \tag{2.7}$$

where $r_1$ and $r_2$ are the self-coupling coefficients, $\lambda_{res}$ and $\lambda_{signal}$ are the MRR resonant wavelength and the signal wavelength, respectively, a is the single-pass amplitude transmission, and $\theta$ is the single-pass phase shift. The wavelength shift between the OFF and ON states is denoted $\Delta\lambda$.

• All-optical Add-drop Filter (AOF): Figure 2.4 shows an MRR that is optically controlled using the two-photon absorption (TPA) effect [84]. A high-intensity pump signal at $\lambda_{pump}$ changes the refractive index of the ring. The wavelength of the pump signal is slightly detuned from the AOF resonance wavelength. The next resonance wavelength, $\lambda_{ref}$, is used for the filtering operation. The resonant wavelength is blue shifted when the pump signal is applied. In case no pump signal is applied (Figure 2.4(a)), the probe signals ($\lambda_1$ and $\lambda_2$) continue propagating on the same waveguide. In case a pump signal is applied (Figure 2.4(b)), the resonant wavelength of the AOF is shifted to $\lambda_2$, which leads to the transmission of the probe signal at $\lambda_2$ to the drop port. Equation 2.8 gives the transmission of the optical signal to the drop port [83].

$$\varphi_d(\lambda_{signal}, \lambda_{res}) = \frac{a(\lambda_{res})(1 - r_1^2)(1 - r_2^2)}{1 - 2a(\lambda_{res})r_1r_2\cos[\theta(\lambda_{signal}, \lambda_{res})] + [a(\lambda_{res})r_1r_2]^2} \tag{2.8}$$

Figure 2.4: AOF device.
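The device models of this section can be sketched in a few lines. The code below evaluates the MZI transmission of Equation 2.6 and the textbook add-drop ring expressions, in which the through port shows a dip and the drop port a peak on resonance. Two points are assumptions rather than thesis content: the dB-to-fraction convention ($10^{-dB/10}$) and the linearized phase model in `ring_phase`; all function names are illustrative.

```python
import math


def db_to_fraction(value_db):
    """ASSUMPTION: IL and ER are attenuations, so x dB maps to a power fraction 10^(-x/10)."""
    return 10 ** (-value_db / 10)


def mzi_transmission(v, il_db, er_db):
    """Equation 2.6: v = 0 is the constructive state, v = 1 the destructive state."""
    il = db_to_fraction(il_db)
    return il if v == 0 else il * db_to_fraction(er_db)


def ring_phase(lam_signal, lam_res, fsr):
    """ASSUMPTION: phase detuning is linear in wavelength, 0 on resonance, 2*pi per FSR."""
    return 2 * math.pi * (lam_signal - lam_res) / fsr


def ring_through(theta, a, r1, r2):
    """Through-port power transmission of an add-drop ring (minimum on resonance)."""
    num = (a * r2) ** 2 - 2 * a * r1 * r2 * math.cos(theta) + r1 ** 2
    den = 1 - 2 * a * r1 * r2 * math.cos(theta) + (a * r1 * r2) ** 2
    return num / den


def ring_drop(theta, a, r1, r2):
    """Drop-port power transmission of an add-drop ring (maximum on resonance)."""
    num = a * (1 - r1 ** 2) * (1 - r2 ** 2)
    den = 1 - 2 * a * r1 * r2 * math.cos(theta) + (a * r1 * r2) ** 2
    return num / den


# Hypothetical values: 1550 nm resonance, 10 nm FSR, a = 0.99, r1 = r2 = 0.95.
for lam in (1550.0, 1550.5, 1552.0):
    theta = ring_phase(lam, 1550.0, 10.0)
    print(lam, ring_through(theta, 0.99, 0.95, 0.95), ring_drop(theta, 0.99, 0.95, 0.95))
```

For a lossless ring (a=1) the two ports sum to one at every wavelength, which is a convenient sanity check when changing the coupling coefficients.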

# 2.3 Proposed Methodology

The proposed methodology relies on the design of a Bernstein polynomial architecture using integrated optics. The design shall be generic to target polynomial functions of different orders. As illustrated in Figure 2.5, the inputs of the design flow are: i) technological parameters of the optical devices; ii) system-level parameters (n, *BSL*, and *BER*); and iii) an application described as a mathematical function, e.g., the Gamma correction function. For each set of parameters, the energy is optimized and the computing accuracy is evaluated at the application level. The performance of the design is evaluated according to the total energy consumption, the computing accuracy (the distance between output data produced with approximation and with error-free processing), and the processing time. In the following, we detail the energy efficiency optimization and the computing accuracy evaluation.



Figure 2.5: Proposed methodology for design space exploration of polynomial architectures.

- Energy optimization: The aim is to optimize the energy efficiency through a reduction of laser power consumption, which is expected to consume most of the energy in the architecture. For this purpose, we propose two design methods in Section 2.4, i.e., *MRR-first* and *MZI-first*, that allow exploring the parameters of the devices, i.e., MZI and MRR, in order to optimize the lasers' energy. We use the *MRR-first* design method to explore the distance between resonance wavelengths (*WLS*), i.e.,  $\lambda_0$  to  $\lambda_n$  of the MRRs. It is worth mentioning that the SNG design is not taken into consideration when estimating the energy.
- Accuracy estimation: The purpose is to estimate the application-level computing accuracy of the architecture. Based on the mathematical function f(x) corresponding to the targeted application, the ReSC architecture is configured by defining the polynomial coefficients $b_0$ to $b_n$ of the function considering the order of the architecture. During the simulation, the polynomial coefficients are used to generate the corresponding n + 1 stochastic numbers (SN), i.e., $z_0$ to $z_n$, for an architecture of order n. The length of the generated bit streams is defined by *BSL*. In the context of image processing, input pixels are sequentially processed. Each pixel leads to n stochastic bit streams, i.e., $x_1$ to $x_n$, for an architecture of order n. We evaluate the mean error distance (MED), taking into account the transmission of the signals through the devices. MED is calculated with respect to the error-free image processed using f(x). We consider errors related to i) the Bernstein polynomial approximation ($MED_{Berns\_approx}$); ii) the generated stochastic numbers ($MED_{BSL}$); and iii) the optical transmission ($MED_{Trans}$). This allows quantifying the impact of each type of error on the computing accuracy.

# 2.4 Proposed Architecture

We first present the design of the optical ReSC architecture. Then, we introduce the design of the optical interfaces, i.e., SNG and de-randomizer, followed by the design methods.

#### 1. Optical Bernstein Polynomial Design

The optical circuit is composed of an adder and a multiplexer, similarly to the ReSC circuit introduced in Section 2.1. The adder is composed of MZI devices that are controlled by the input data $x_i$. A high-power optical signal $OP_{Laser\_pump}$, continuously emitted by a laser source, is equally distributed among the MZIs. Depending on the state of each MZI (constructive or destructive interference), the control signal $OP_{Control}$ is produced. The MZI is selected for this design because it is a mature, non-resonant device that is not affected by high-power signals.



**Figure 2.6:** OSC architecture of a  $2^{nd}$  order polynomial function. (a) The optical circuit. The transmissions of the signals at  $\lambda_2$ ,  $\lambda_1$ , and  $\lambda_0$  to the drop port of the AOF are shown in (b), (c), and (d), respectively.

The multiplexer is implemented using an all-optical add-drop filter (AOF) receiving optical signals modulated by the coefficients $z_j$. The resonant wavelength of the AOF depends on the intensity of the pump signal output by the adder. By controlling the resonant wavelength of the AOF, it is possible to extract a coefficient signal, thus implementing the multiplexing operation. The output signal is received by a photodetector, where O/E conversion is carried out.

The architecture for a 2<sup>nd</sup> order Bernstein polynomial is shown in Figure 2.6(a). It contains two MZIs, three MRR modulators, and an AOF. The MRRs are controlled by the coefficients $z_0$, $z_1$, and $z_2$ to modulate the probe signals at wavelengths $\lambda_0$, $\lambda_1$, and $\lambda_2$. Three different scenarios are obtained based on the values of $x_1$ and $x_2$.

- x<sub>1</sub>=x<sub>2</sub>=1: In Figure 2.6(b), both MZIs are in the destructive state. Therefore, OP<sub>Control</sub> is highly attenuated and the AOF resonance wavelength is tuned to the nearest wavelength; λ<sub>2</sub>. Hence, the optical probe signal at λ<sub>2</sub> is selected and dropped to the output.
- x<sub>1</sub> ≠x<sub>2</sub>: In Figure 2.6(c), one MZI is in the constructive state and the other is in the destructive state. Approximately half of the input power tunes the AOF to the resonance wavelength λ<sub>1</sub>. Therefore, the input signal at wavelength λ<sub>1</sub> is transmitted to the output.
- x<sub>1</sub>=x<sub>2</sub>=0: In Figure 2.6(d), both MZIs are in the constructive state. The maximum power is transmitted to control the AOF. Hence, the AOF resonance wavelength is tuned to λ<sub>0</sub>.

As can be seen from Figure 2.6(b), (c), and (d), the AOF is initially tuned to the resonance wavelength $\lambda_{ref}$, where $\lambda_{ref} = \lambda_{pump} + FSR$, in order to avoid crosstalk with the modulated signals. When $OP_{Control}$ is applied, a nonlinear effect changes the refractive index of the AOF and tunes its resonance wavelength to one of the probe signal wavelengths. Therefore, the right coefficient is selected and transmitted to the output. Figure 2.7 illustrates the transmission of the control and coefficient signals that correspond to the bit streams generated by the SNGs, which control the modulators, i.e., MZIs and MRRs. The figure also shows the transmission of the output signal received by the photodetector. In the following, we introduce the design of the SNG and the de-randomizer.



Figure 2.7: The transmission of the output signal according to the  $OP_{control}$  power and the coefficients;  $z_0, z_1$  and  $z_2$ .

#### 2. Optical Interfaces

• SNG: Our proposed SNG design is composed of electrical and optical parts. The electrical part contains an LFSR of size *m* and a comparator. An input binary number of *m* bits is compared against an *m*-bit random number generated by the LFSR. Hence, a bit stream of length $2^m$ is generated, which controls the operation of the optical part, i.e., a modulator. As a result, a CW signal emitted from an off-chip laser is either modulated or transmitted, which represents the stochastic bit stream in the optical domain. We shall call this design an SNG-based LFSR + modulator. This design is used in this thesis from Chapter 2 to Chapter 4. Figure 2.8 illustrates the design of an SNG that takes an 8-bit binary number as an input and generates a bit stream with BSL=256 to control an MZI modulator.



Figure 2.8: SNG-based LFSR + modulator.

• **De-randomizer:** Figure 2.9 shows the de-randomizer circuit required to convert the received light (equivalent to a stochastic bit stream) to a binary number. The received light is converted to the equivalent current using a photodetector. A TIA converts the current to a voltage, which is then compared against a threshold voltage to generate bit '0' or bit '1'. Then, the number of '1's is counted to generate the binary number.



Figure 2.9: De-randomizer for OSC architecture.
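The electrical side of the two interfaces can be mimicked in software. The following is a minimal sketch, assuming an 8-bit Fibonacci LFSR with feedback from bits 0, 2, 3, and 4 (one possible maximal-length choice, not necessarily the one used in this work) and a "less than" comparator; the names `lfsr8`, `sng_bits`, and `derandomize` are illustrative.

```python
def lfsr8(seed=1):
    """8-bit Fibonacci LFSR; recurrence u[t+8] = u[t+4] ^ u[t+3] ^ u[t+2] ^ u[t].

    The associated polynomial x^8 + x^4 + x^3 + x^2 + 1 is primitive, so any
    non-zero seed cycles through all 255 non-zero states.
    """
    state = seed & 0xFF
    while True:
        yield state
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
        state = (state >> 1) | (bit << 7)


def sng_bits(value, bsl=256, seed=1):
    """Comparator: output '1' whenever the LFSR state is below the 8-bit input value."""
    gen = lfsr8(seed)
    return [1 if next(gen) < value else 0 for _ in range(bsl)]


def derandomize(bits):
    """Counter after the photodetector and threshold stage: fraction of received ones."""
    return sum(bits) / len(bits)


stream = sng_bits(128)               # 8-bit input 128, i.e., the value 0.5
print(derandomize(stream))           # 0.5
```

Note that an m-bit maximal-length LFSR visits $2^m - 1$ states, so in this sketch the 256th bit reuses the first LFSR state; with the seed chosen here the recovered value stays within 1/256 of the input.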

#### 3. Generic Architecture and Design Parameters

The architecture we propose is generic and can be implemented for a Bernstein polynomial function of order n, as illustrated in Figure 2.10. It involves n MZIs and n+1 MRRs to modulate the data and the coefficients, respectively. The optical power of the pump laser is equally distributed to the MZIs using an n-output splitter, and the MZI outputs are merged using an n-input combiner. The use of WDM allows the propagation of probe signals on the same waveguide, separated by the wavelength spacing (*WLS*), i.e., the distance between two consecutive signals. *WLS* is a key parameter that is used to optimize the laser power consumption. A small *WLS* increases the crosstalk between probe signals, which requires high probe laser powers. However, it leads to a reduction in the pump laser power required to shift the AOF. Therefore, as described in Figure 2.5, the exploration of *WLS* is essential as it involves a trade-off between pump laser power and probe laser powers.

Figure 2.10: Generic architecture for OSC circuit.

Table 2.1 summarizes the design parameters. The system-level parameters, n, BSL, and BER, correspond to the order of the implemented polynomial function (ReSC specific), the length of the generated bit streams (SC domain specific), and the transmission error rate (optical domain specific), respectively. Since all parameters affect the computing accuracy, multiple combinations can lead to designs demonstrating the same computing accuracy but with different energy efficiency and processing time. As an example, we illustrate in Figure 2.11 the output optical signals (i.e., the signals received by the photodetectors) corresponding to two scenarios: a) high BSL / high BER and b) low BSL / low BER. We assume the same polynomial order and the same application-level computing accuracy for both scenarios. In scenario a), the targeted accuracy is obtained thanks to the high number of transmitted bits, which increases the processing time but relaxes the constraint on the transmission error rate (i.e., high BER). This allows reducing the wavelength spacing, thus leading to energy reduction opportunities. In scenario b), we assume that the application-level accuracy is reached thanks to the robust transmission (i.e., low BER). This allows reducing BSL, thus shortening the transmission time. Hence, while both scenarios lead to the same computing accuracy, they show different latency and energy efficiency figures, which are relevant options for system designers. However, the design of such an architecture is time consuming and challenging, since it involves heterogeneous devices that work under different physical effects and are characterized by their own parameters, which calls for a design space exploration.

| Category      | Name                | Description                                         | Unit  |
|---------------|---------------------|-----------------------------------------------------|-------|
| System-level  | n                   | Polynomial order                                    | -     |
|               | BSL                 | Bit stream length                                   | -     |
|               | BER                 | Bit error rate                                      | -     |
| MZI           | $IL_{dB}$           | Insertion loss                                      | dB    |
|               | $IL_{\%}$           | Insertion loss                                      | %     |
|               | $ER_{dB}$           | Extinction ratio                                    | dB    |
|               | $ER_{\%}$           | Extinction ratio                                    | %     |
| MRR           | $\lambda_i$         | Resonant wavelength in OFF state                    | nm    |
|               | $\Delta\lambda$     | Resonant wavelength shift between ON and OFF states | nm    |
|               | $\varphi_t$         | Through transmission (Equation 2.7)                 | %     |
|               | WLS                 | Wavelength spacing between probe signals            | nm    |
| AOF           | $\lambda_{ref}$     | Resonant wavelength when no pump power is injected  | nm    |
|               | FSR                 | Free spectral range                                 | nm    |
|               | OTE                 | Optical tuning efficiency                           | nm/mW |
|               | $\varphi_d$         | Drop transmission (Equation 2.8)                    | %     |
| Laser         | $\eta$              | Lasing efficiency                                   | %     |
| Photodetector | R                   | Responsivity                                        | A/W   |
|               | $i_n$               | Internal noise current                              | A     |

Table 2.1: System-level and technological parameters.



Figure 2.11: Same accuracy level is reached for (a) high BSL / high BER, and (b) low BSL / low BER.

# 2.5 Implementation and Modeling

We present the analytical models to evaluate the computing accuracy and estimate the transmission robustness required to evaluate the energy efficiency of the design. The models are developed considering the technological and system-level parameters.

#### 2.5.1 Error Evaluation

We consider the following three sources of error:

• $\varepsilon_{Berns\_approx}$: results from the approximation of the target function by a Bernstein polynomial, and depends on the polynomial order and the coefficients. The Bernstein polynomial coefficients of order n are computed by minimizing the objective function defined in [28]:

$$\int_{0}^{1} \Big(f(x) - \sum_{i=0}^{n} b_i B_{i,n}(x)\Big)^2 dx \tag{2.9}$$

The higher the order, the more accurate the approximated Bernstein polynomial function and hence the lower the errors. The distance between the approximated function B(x) and the input function f(x) is given by:

$$\varepsilon_{Berns\_approx} = B(x) - f(x) \tag{2.10}$$

•  $\varepsilon_{BSL}$ : is induced by the generation of stochastic bit streams using SNGs, where BSL drives the accuracy. This error is defined by the distance between processed data Y(x) (produced by the architecture for a given BSL and an error-free transmission) and B(x):

$$\varepsilon_{BSL} = Y(x) - B(x) \tag{2.11}$$

• $\varepsilon_{Trans}$: results from the transmission error of the signal using integrated optics technology. It occurs on the photodetector side and is defined by *BER*, i.e., the ratio of incorrectly transmitted bits. $\varepsilon_{Trans}$ is the distance between $\dot{Y}(x)$ (produced by the architecture for a given *BSL* and a given *BER*) and Y(x):

$$\varepsilon_{Trans} = \dot{Y}(x) - Y(x) \tag{2.12}$$

We use the MED metric to quantify the computing accuracy of the architecture when processing streams of data (e.g., pixel arrays in an image processing application). For this purpose, we define $MED_{\text{Total}}$ as the sum of the individual MED contributions, i.e., $MED_{\text{Berns\_approx}}$, $MED_{\text{BSL}}$, and $MED_{\text{Trans}}$, resulting from the three previously defined types of error, where M is the number of processed data and i is the position of the data in the stream:

$$MED_{\text{Total}} = \frac{1}{M} \Big( \sum_{i=1}^{M} | \varepsilon_{Berns\_approx(i)} | + \sum_{i=1}^{M} | \varepsilon_{BSL(i)} | + \sum_{i=1}^{M} | \varepsilon_{Trans(i)} | \Big) \tag{2.13}$$
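The three error terms and their aggregation can be illustrated as follows. This is a small sketch of Equations 2.10 to 2.13 in Python (the thesis itself relies on MATLAB for this step); the function name `med_contributions` and the four sample values per list are hypothetical.

```python
def med_contributions(f_vals, b_vals, y_vals, y_dot_vals):
    """Return (MED_Berns_approx, MED_BSL, MED_Trans, MED_Total) for M processed data.

    f_vals:     error-free results f(x)
    b_vals:     Bernstein approximation B(x)
    y_vals:     stochastic results Y(x) for a given BSL and an error-free transmission
    y_dot_vals: stochastic results for the same BSL and a given BER
    """
    m = len(f_vals)
    med_berns = sum(abs(b - f) for b, f in zip(b_vals, f_vals)) / m      # Equation 2.10
    med_bsl = sum(abs(y - b) for y, b in zip(y_vals, b_vals)) / m        # Equation 2.11
    med_trans = sum(abs(yd - y) for yd, y in zip(y_dot_vals, y_vals)) / m  # Equation 2.12
    return med_berns, med_bsl, med_trans, med_berns + med_bsl + med_trans  # Equation 2.13


# Hypothetical values for M = 2 processed data.
parts = med_contributions([0.5, 0.25], [0.5, 0.5], [0.75, 0.5], [0.75, 0.25])
print(parts)   # (0.125, 0.125, 0.125, 0.375)
```

Keeping the three contributions separate, rather than returning only the total, makes it possible to see which error source dominates for a given combination of n, BSL, and BER.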

#### 2.5.2 Transmission Model

We define the wavelength spacing WLS as the wavelength distance between two consecutive probe signals.

$$WLS = \lambda_{i+1} - \lambda_i \tag{2.14}$$

The transmission of the probe signal i to the output is given in Equation 2.15.

$$T_{s,z}[i] = \underbrace{\varphi_t(\lambda_i, \lambda_i - \Delta\lambda \times z_i)}_{\text{Transmission through } MRR_i} \times \underbrace{\prod_{w=0, w \neq i}^n \varphi_t(\lambda_i, \lambda_w - \Delta\lambda \times z_w)}_{\text{Transmission through the other MRRs}} \times \underbrace{\varphi_d(\lambda_i, \lambda_{ref} - \Delta AOF(x))}_{\text{Transmission through the AOF}} \tag{2.15}$$

For example, if coefficient  $z_i$  is '1', this value will detune the  $MRR_i$  by  $\Delta\lambda$ . Hence,  $MRR_i$  is in ON state and a maximum power of the probe signal at  $\lambda_i$  is passed through the MRR. The signal will also experience different attenuation by the other MRRs depending on their states, which are determined by their coefficients  $z_w$ . Then, the signal is dropped to the output by the AOF where the transmission depends on the detuning value  $\Delta AOF$ . When  $z_i$  is '0', the  $MRR_i$  is tuned to the resonance wavelength of  $\lambda_i$  (OFF state). The probe signal at  $\lambda_i$  experiences high attenuation and the same transmission steps are repeated. The transmission of the probe signals to the output depends on the AOF for which the initial resonant wavelength is  $\lambda_{ref}$  (i.e., the resonant wavelength in case no control power is applied). The AOF detuning is defined as:

$$\Delta AOF = OP_{Laser\_pump} \times OTE \times \frac{1}{n} \sum_{i=1}^{n} T_i^{MZI}[x_i] \tag{2.16}$$

where OTE is the optical tuning efficiency, expressed in nm/mW. The detuning of the AOF depends on the transmission of $OP_{Laser\_pump}$ through *n* parallel MZIs, which in turn depends on the corresponding modulated data $X_i$ as defined in Equation (2.6). The $MZI_i$ exhibits constructive interference when $x_i$ is '0' and destructive interference when $x_i$ is '1'.
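The following Python sketch evaluates Equation 2.16 for a few data inputs. The MZI transmission model used here is an assumption (insertion loss only for $x_i$ = '0', insertion loss plus extinction ratio for $x_i$ = '1'); the exact form is the one given by Equation (2.6). The numerical values are those quoted in Section 2.6.1, and the function names are hypothetical.

```python
def db_to_linear(loss_db):
    """Convert a loss in dB into a linear power transmission factor."""
    return 10 ** (-loss_db / 10)


def mzi_transmission(x, il_db, er_db):
    """Assumed MZI model: x = 0 is constructive (insertion loss only),
    x = 1 is destructive (insertion loss plus extinction ratio)."""
    t = db_to_linear(il_db)
    return t * db_to_linear(er_db) if x == 1 else t


def aof_detuning_nm(x_bits, pump_mw, ote_nm_per_mw, il_db, er_db):
    """Eq. 2.16: pump power x OTE x mean transmission of the n parallel MZIs."""
    n = len(x_bits)
    mean_t = sum(mzi_transmission(x, il_db, er_db) for x in x_bits) / n
    return pump_mw * ote_nm_per_mw * mean_t


# Values quoted in Section 2.6.1 (2nd order architecture, two MZIs).
PUMP_MW, OTE, IL_DB, ER_DB = 574.0, 0.01, 4.5, 23.0
for bits in ([0, 0], [0, 1], [1, 1]):
    shift = aof_detuning_nm(bits, PUMP_MW, OTE, IL_DB, ER_DB)
    print(bits, f"detuning = {shift:.3f} nm")
```

Under this assumed model, the detuning grows with the number of MZIs in the constructive state, which is what lets the data bits select among the probe wavelengths.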

The signal-to-noise ratio (SNR) of the probe signal at $\lambda_i$ is given by:

$$SNR = OP_{Laser\_probe} \times \frac{R}{i_n} \times \Big(T_{s,z_i=1}[i] - \sum_{w=0, w \neq i}^n T_{s,z_w=1}[w]\Big) \tag{2.17}$$

where  $i_n$  and R are the photodetector internal noise and responsivity, respectively.  $T_{s,z_i=1}[i]$  is the transmission of the signal at  $\lambda_i$  as '1' while the remaining signals are transmitted as '0'. On the other hand,  $T_{s,z_w=1}[w]$  is the transmission of the crosstalk signals as '1' while the signal at  $\lambda_i$  is transmitted as '0'.

The BER is given in Equation 2.18, assuming on-off keying (OOK) modulation of the probe signals.

$$BER = \frac{1}{2} erfc(\frac{SNR}{2\sqrt{2}}) \tag{2.18}$$
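To make the link between SNR and BER concrete, the sketch below implements Equation 2.18 with the standard library and inverts it numerically by bisection, which is one way to find the SNR needed for a targeted BER. The bisection helper and its search bounds are assumptions made for illustration.

```python
import math


def ber_from_snr(snr):
    """Eq. 2.18: bit error rate for OOK modulation."""
    return 0.5 * math.erfc(snr / (2 * math.sqrt(2)))


def snr_for_ber(target_ber, lo=0.0, hi=50.0, iters=200):
    """Smallest SNR reaching target_ber, found by bisection.

    Relies on ber_from_snr being strictly decreasing in snr.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if ber_from_snr(mid) > target_ber:
            lo = mid  # still too many errors: a higher SNR is needed
        else:
            hi = mid
    return (lo + hi) / 2


for target in (1e-1, 1e-2, 1e-3, 1e-6):
    print(f"BER = {target:.0e} needs SNR of about {snr_for_ber(target):.2f}")
```

Since the required SNR scales linearly with the probe laser power in Equation 2.17, relaxing the targeted BER directly lowers the probe power that has to be injected.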

#### 2.5.3 Design Methods

The performance and energy efficiency of the architecture depend on many device characteristics and related parameters that need to be explored. We propose two methods that can be used to optimize the laser powers, namely *MRR-first* and *MZI-first*. A brief description of each method follows:

• *MRR-first:* This method allows exploring MZI characteristics and minimizing the required pump laser power $OP_{Laser\_pump}$ according to the MRR parameters. For this purpose, the MRR resonant wavelengths $\lambda_i$ are first defined according to *WLS*. The transmission $T_{s,z}[i]$ then allows estimating the worst-case *SNR* for a given probe laser power $OP_{Laser\_probe}$, or finding the minimum laser power needed to reach a given *BER*. Then, according to the AOF resonant wavelength $\lambda_{ref}$ and the MZI's *IL*, the minimum pump power is computed. Eventually, *ER* is given by the pump signal attenuation required to tune the AOF to $\lambda_n$, i.e., the right-most signal wavelength.

• *MZI-first:* This method allows exploring MRR characteristics and minimizing the required probe laser power. For this purpose, the pump laser power is specified and the MZI's *IL* and *ER* are selected. This allows estimating the power level of the control signal that tunes the AOF. For a given $\lambda_{ref}$, it is possible to define $\lambda_i$ and vice versa. Eventually, *BER* and the probe laser power can be defined according to the objective (power, robustness, speed).

### 2.6 Simulation Results

In this section, we explore the design of the OSC architecture for polynomial functions by implementing a Gamma correction application. We illustrate the processing at the bit level and at the application level. We explore the impact of the design parameters on the application-level computing accuracy. Then, we optimize the energy efficiency by exploring the *WLS* for different polynomial orders. Following the Gamma correction application, we explore the design space for optimizing energy efficiency according to the computing accuracy.

#### 2.6.1 Case Study: Gamma Correction Application

Gamma correction function [85] is a nonlinear operation that controls the luminance of an image, which is defined by:

$$f(x) = x^{\gamma} \tag{2.19}$$

where $\gamma$ is the correction value. A $\gamma$ value less than one maps dark pixels to a larger range of values, which enhances the details in the dark areas of the source images. In the following, we assume $\gamma=0.45$ for all simulation results, which is one of the values used in modern TV systems to correct the luminance of videos and images [86]. We use this application to illustrate the design space exploration proposed in the methodology (Section 2.3). For this purpose, we implement a 2<sup>nd</sup> order polynomial architecture and define the design parameters. We illustrate the simulation for processing one bit and a single image.

#### 1. Definition of Technological Parameters

To design a 2<sup>nd</sup> order polynomial architecture, three MRRs are needed. We assume wavelengths around 1550nm, since silicon is transparent at this wavelength, which leads to low propagation losses [87]. We also assume WLS=1nm for illustration purposes. Then, $MRR_2$, $MRR_1$, and $MRR_0$ are tuned at resonance wavelengths $\lambda_2 = 1550$ nm, $\lambda_1 = 1549$ nm, and $\lambda_0 = 1548$ nm, respectively. For the AOF, we select $\lambda_{ref}$ to be detuned by 0.1nm from $\lambda_2$, i.e., $\lambda_{ref} = 1550.1$nm. We also assume OTE=0.01 nm/mW [84] and $IL_{dB}=4.5$ dB [88]. Following the MRR-first design method presented in Section 2.5.3, we define $OP_{Laser\_probe} = 1 \text{mW}$ and we set $OP_{Laser\_pump}$ to 574mW, which is the minimum power required to detune the AOF to $\lambda_0$ (the signal farthest from $\lambda_{ref}$). $ER_{dB}=23$dB is required to detune the AOF to $\lambda_1$ and $\lambda_2$. On the other hand, the minimum probe laser power can be evaluated according to the MZI-first method by considering ranges of values for IL and ER typically observed in the literature [89, 90]. In this study, we assume a 2<sup>nd</sup> order polynomial function. Figure 2.12(a) illustrates the results for $OP_{Laser\_pump}=0.6W$ and $BER=10^{-6}$. By assuming the MZI device in [90] ($IL_{dB}=6.5$ and $ER_{dB}=7.5$), the required probe laser power would be 0.26mW. The minimum value of $OP_{Laser\_probe}$ rises with the increase in $IL_{dB}$ and the reduction of $ER_{dB}$, which is explained as follows: the lower the total transmission in the MZIs, the smaller the wavelength spacing and the higher the signal crosstalk. Increasing the probe



Figure 2.12: Minimum probe laser power according to (a)  $IL_{dB}$  and  $ER_{dB}$  for  $10^{-6}$  BER, (b)targeted BER, and (c) MZIs speed and phase shifter length.

laser power not only has a negative effect on the circuit energy efficiency, but it can also induce a nonlinear effect in the filter, which would lead to an undesired shift in its resonant wavelength. This could be avoided by increasing the pump power instead, which leads to a design trade-off involving the power of the pump and probe signals. We also evaluate the opportunities for laser power reduction by leveraging constraints of the optical signal transmission robustness.

As illustrated in Figure 2.12(b), targeting $10^{-2}$ *BER* instead of $10^{-6}$ leads to a 50% power reduction. The lack of accuracy in the optical domain could be alleviated by transmitting longer streams of bits in the stochastic domain. This also allows exploring a trade-off between computing accuracy and transmission robustness, which involves device characteristics related to the speed and the area (Figure 2.12(c)). For instance, a high modulation speed (e.g., 60Gb/s [90]) and a high laser power could be combined to reduce the bit stream transmission time, thus maximizing the circuit throughput.

#### 2. Bit-Level Processing

We illustrate the transmission of one bit for the given application. First, we use Equation 2.9 to evaluate the polynomial coefficients, which leads to $b_0=0.209$, $b_1=0.8927$, and $b_2=0.969$. Then, the stochastic bit streams of the coefficients ($z_0$ to $z_2$) are generated to control the MRRs. For processing a pixel, n bit streams ($x_1$ to $x_n$) are generated to control the MZIs.
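As a quick numerical check of these coefficients, the sketch below evaluates the 2<sup>nd</sup> order polynomial in Bernstein form, $\sum_i b_i \binom{n}{i} x^i (1-x)^{n-i}$, and compares it with $x^{0.45}$. It assumes the coefficients are used in this standard Bernstein basis; how they are obtained (Equation 2.9) is not reproduced here, and the function name `bernstein` is illustrative.

```python
from math import comb


def bernstein(x, coeffs):
    """Evaluate sum_i b_i * C(n, i) * x^i * (1 - x)^(n - i) for x in [0, 1]."""
    n = len(coeffs) - 1
    return sum(
        b * comb(n, i) * x**i * (1 - x) ** (n - i) for i, b in enumerate(coeffs)
    )


B2 = [0.209, 0.8927, 0.969]  # coefficients quoted in the text
GAMMA = 0.45

for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    exact = x**GAMMA
    approx = bernstein(x, B2)
    print(f"x = {x:.2f}  x^gamma = {exact:.3f}  Bernstein = {approx:.3f}")
```

At the end points the polynomial simply returns $b_0$ and $b_n$, so the gap between $b_0$ and $f(0) = 0$ gives a first idea of the approximation error for the darkest pixels.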

Figure 2.13 illustrates the transmission through the three MRRs and the AOF, as well as the transmission of the probe signals represented by the vertical arrows. We assume different combinations of coefficients and data inputs (pixel) as follows:

• $z_0=0, z_1=1, z_2=0$ and $x_1=x_2=1$ (Figure 2.13(a))

The coefficients lead to detuning the resonance wavelength of  $MRR_1$ , hence probe signal at  $\lambda_1$  has high transmission, while probe signals at  $\lambda_0$  and  $\lambda_2$  are attenuated. Since  $x_1=x_2=1$ , the resonance wavelength of the AOF is shifted to  $\lambda_2$ , and thus '0' is transmitted to the output. The total power received at the photodetector is 0.0952mW.

• $z_0 = 1, z_1 = 1, z_2 = 0$ and $x_1 = x_2 = 0$ (Figure 2.13(b))

The coefficients result in detuning $MRR_0$ and $MRR_1$, which leads to high transmission of the probe signals at $\lambda_0$ and $\lambda_1$, while the probe signal at $\lambda_2$ is attenuated. The data inputs $x_1=x_2=0$ result in shifting the AOF resonance wavelength to $\lambda_0$. Therefore, '1' is transmitted to the receiver side with a total power of 0.482mW.



**Figure 2.13:** The transmission of MRRs and AOF. a) probe signal at  $\lambda_2$  is transmitted as '0', b) probe signal at  $\lambda_0$  is transmitted as '1', and c) the optical power received by the photodetector for all input combinations.

Figure 2.13(c) reports the power received by the photodetector for all combinations of data inputs and coefficient signals. The optical power ranges (0.092-0.099 mW) and (0.477-0.482 mW) correspond to the transmission of '0' and '1', respectively. For a targeted *BER*, we can compute the laser powers according to the transmission of the architecture to ensure a proper detection of '0' and '1' at the receiver side.

#### 3. Execution of Image Processing Application

We illustrate the processing of a Gamma correction application at a scale of one image by implementing a 4<sup>th</sup> order polynomial architecture. The corresponding coefficients are  $b_0=0.129$ ,  $b_1=0.797$ ,  $b_2=0.613$ ,  $b_3=0.95$ , and  $b_4=0.988$ . We run the simulation assuming BSL = 1024 and  $BER = 10^{-1}$ . We compare the simulation result with i) the error-free results ( $f(x) = x^{0.45}$ ); and ii) the approximated results corresponding to the 4<sup>th</sup> order Bernstein polynomial function. Figure 2.14(a) shows



Figure 2.14: Gamma correction application: (a) Output pixels according to input pixels in the range [0, 1] for 1) $\gamma$ =0.45, 2) 4<sup>th</sup> order polynomial function, and 3) Simulation results for n = 4, BSL = 1024, and  $BER = 10^{-1}$ . (b) The corresponding results for processing an image.

the resulting output pixels according to input pixels in the range [0, 1]. This range corresponds to the probabilities of the stochastic numbers (for instance, a probability of 0.5 corresponds to a stream of bits composed of 512 ones and 512 zeroes). In the graph, curve (1) represents the error-free output and curve (2) is the result provided by the approximated 4<sup>th</sup> order Bernstein polynomial function. Curve (3) is the simulation result, which integrates the errors induced by the SNG and the transmission errors in the optical domain. The additional approximation observed in simulation is clearly visible in the range [0, 0.01]. The approximation thus depends on the application, the design parameters and the input data. For this purpose, we run simulations on a standard $160 \times 160$ pixel image (in Figure 2.14(b)) assuming error-free transmission (i.e., $BER = 0$) and $BER = 10^{-1}$, and we calculate their respective MED with respect to error-free processing (image(b-1)). This allows the evaluation of the approximation induced by the Bernstein polynomial function only (image(b-2)), which is low compared to the additional approximation generated by combining SC and optical technology (image(b-3)). In the following, we carry out a comprehensive study of application-level computing accuracy, taking into account the impact of system-level parameters.
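The stochastic evaluation of the 4<sup>th</sup> order function can be sketched in a few lines of Python. This is an illustrative software model only: it uses Python's `random` module instead of LFSR-based SNGs, assumes error-free transmission, and assumes that the number of ones among the n data bits selects the coefficient bit sent to the output. All function names are hypothetical.

```python
import random
from math import comb


def bernstein(x, coeffs):
    """Reference value: Bernstein polynomial with the given coefficients."""
    n = len(coeffs) - 1
    return sum(
        b * comb(n, i) * x**i * (1 - x) ** (n - i) for i, b in enumerate(coeffs)
    )


def resc_simulate(x, coeffs, bsl, rng):
    """Estimate the polynomial at x from bsl stochastic bits."""
    n = len(coeffs) - 1
    ones = 0
    for _ in range(bsl):
        # n independent data bits, each equal to 1 with probability x
        k = sum(1 for _ in range(n) if rng.random() < x)
        # one bit per coefficient stream, each equal to 1 with probability b_i
        z = [rng.random() < b for b in coeffs]
        ones += z[k]  # the count of ones in the data selects the coefficient bit
    return ones / bsl


B4 = [0.129, 0.797, 0.613, 0.95, 0.988]  # coefficients quoted in the text
rng = random.Random(2021)  # fixed seed so that runs are repeatable
for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    est = resc_simulate(x, B4, 1024, rng)
    print(f"x = {x:.2f}  stochastic = {est:.3f}  Bernstein = {bernstein(x, B4):.3f}")
```

The gap between the two printed columns plays the role of the error induced by the finite bit-stream length; it shrinks as `bsl` increases.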

#### 2.6.2 Application-level Computing Accuracy

We study the impact of the three types of errors, namely $\varepsilon_{Berns\_approx}$, $\varepsilon_{BSL}$, and $\varepsilon_{Trans}$ (see Section 2.5.1), using the Gamma correction application. For this purpose, we assume a 4<sup>th</sup> order ReSC architecture with BSL = 1024 and $BER = 10^{-1}$. Exhaustive simulations are carried out for input pixels ranging from 0 to 1, assuming a step of 1/1024, which corresponds to the minimal reachable quantum for the assumed BSL. Figure 2.15(a) reports $\varepsilon_{Berns\_approx}$, the relative errors induced by the use of an approximated 4<sup>th</sup> order polynomial with respect to the application function. As already observed in Figure 2.14, the approximation is less accurate for darker pixels, which can be improved using higher order architectures. Figure 2.15(b) reports $\varepsilon_{BSL}$, the distance between pixels processed using error-free optical transmission and the approximated function (Figure 2.15(a)). As can be seen in the figure, the error follows the pseudo-random generation of stochastic bit streams using LFSR. It can be reduced by optimizing the seed value for the LFSR [91] or by increasing BSL, which, however, will impact the processing time. Both $\varepsilon_{Berns\_approx}$ and $\varepsilon_{BSL}$ depend on the application and the order of the approximated polynomial function. Figure 2.15(c) reports $\varepsilon_{Trans}$, which corresponds to the error distance between the data processed when taking the optical transmission into account and the result assuming error-free transmission (Figure 2.15(b)). Since we assume $BER = 10^{-1}$, the worst-case error occurs for an input value of 0. For this value, the bit stream contains only zeros and each transmission error induces a bit flip to one, which leads to a maximum positive error of 0.1. The opposite situation occurs for an input value of 1. For an input value of 0.5, the error is minimized since, in our model, bit flips to zero and bit flips to one tend to compensate each other. The error can be reduced by decreasing *BER*, which, however, significantly impacts the energy efficiency, as discussed in the following.
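The compensation effect described above can be illustrated with a simple expected-value calculation. The sketch assumes that every bit of a stream flips independently with probability equal to the BER, and that the stream carries a fraction p of ones; under this assumption the expected shift is $BER \times (1 - 2p)$. The function name is hypothetical.

```python
def expected_trans_error(p, ber):
    """Expected change of the decoded value when each bit flips with probability ber.

    p is the fraction of ones in the transmitted stream.
    """
    # ones that survive, plus zeros that are flipped to one
    received = p * (1 - ber) + (1 - p) * ber
    return received - p


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p = {p:.2f}  expected error = {expected_trans_error(p, 0.1):+.3f}")
```

For an all-zero stream the expected error equals +BER, for an all-one stream it equals -BER, and it vanishes at p = 0.5, in line with the three cases discussed in the text.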



Figure 2.15: Errors for data input ranging from 0 to 1: (a)  $\varepsilon_{Berns\_approx}$  for  $4^{th}$  order, (b)  $\varepsilon_{BSL}$  for  $4^{th}$  order and BSL=1024, (c)  $\varepsilon_{Trans}$  for  $BER = 10^{-1}$ .

In order to evaluate the impact of system-level parameters, i.e., n, BSL, and BER, on the computing accuracy, we simulate the processing of $160 \times 160$ pixel images and we evaluate the errors using the $MED_{\text{Total}}$ metric. We assume $2 \le n \le 6$ and $256 \le BSL \leq 4096$. Figure 2.16(a), (b), and (c) provide the results for $BER = 10^{-1}$, $3 \times 10^{-2}$ and $10^{-3}$, respectively. Figure 2.16(d) illustrates the resulting processed images for selected parameter combinations with the corresponding $MED_{Total}$, which helps define the acceptable range of $MED_{Total}$, a choice that can be subjective and depends on the application. As illustrated in the figure, $MED_{Total}$ ranges from 0.04 to 0.077 for $BER = 10^{-1}$ and it decreases to [0.027 - 0.058] and [0.015 - 0.05] for $BER = 3 \times 10^{-2}$ and $10^{-3}$, respectively, thus highlighting the impact of transmission errors on the computing accuracy. We also notice an overlap between the ranges, which indicates that the same computing accuracy can be reached for multiple combinations of BER, BSL and n. As an example, $MED_{Total} = 0.05$ is reached for the following combinations: i) n = 4, BSL = 4096, $BER = 10^{-1}$; and ii) n = 3, BSL = 256, $BER = 3 \times 10^{-2}$. In some cases, it is also possible to



Figure 2.16: MED<sub>Total</sub> of the processed image with BER= a) 10<sup>-1</sup>, b) 3× 10<sup>-2</sup>, and c) 10<sup>-3</sup>. d) The resulting images for i) n=2 / BSL=256; and ii) n=6 / BSL=4096.

reach the same accuracy while keeping one of the parameters fixed. For instance, the 2<sup>nd</sup> order architecture leads to $MED_{Total} = 0.058$ for the combinations $[BSL = 4096, BER = 10^{-1}]$ and $[BSL = 256, BER = 3 \times 10^{-2}]$. Similarly, the use of $BER = 10^{-3}$ leads to $MED_{Total} = 0.02$ for the combinations [n = 6, BSL = 1024] and [n = 3, BSL = 4096]. This validates the ability of the methodology to explore the trade-off between energy efficiency, processing time and accuracy, which helps reduce the design effort needed to satisfy application-level requirements.

Indeed, *BER* directly depends on the circuit power consumption (i.e., laser power and modulation power), which can be tuned at run-time [92] without any modification of the hardware, whereas changing the order requires different hardware. Furthermore, increasing the accuracy through an adaptation of the *BER* can be obtained without impacting the computing latency, as opposed to the *BSL*, for which the latency increases linearly with the length of the streams. Overall, while the main design parameters (i.e., n, *BSL* and *BER*) equally contribute to the computing errors, they have different impacts on the hardware complexity, power consumption and processing time. In the following, we optimize the design energy efficiency according to the proposed methodology.

#### 2.6.3 Energy Efficiency Optimization

In this experiment, we optimize the energy efficiency per computed bit for a targeted *BER*. This calls for *WLS* exploration to find the minimum total laser power consumption for *n*-th order polynomial architectures. For this purpose, we assume $BER = 10^{-3}$, 1Gb/s modulation speed and 20% lasing efficiency. Figure 2.17(a) reports the energy consumption of the pump laser and of the n+1 probe lasers, as well as the total laser energy, for n = 2, 4, and 6. As can be seen, for WLS < 0.125nm, the total energy consumption is dominated by the probe lasers. This is due to the high crosstalk between probe signals for low WLS. For WLS > 0.125nm, in contrast, the pump laser dominates the energy consumption, since the AOF must cover a larger wavelength shift. We search for the optimal WLS for architectures with n ranging from 2 to 16. We observe that the optimal WLS ranges from 0.158nm to 0.151nm. Considering the small variation, we assume 0.155nm as the optimal WLS for all orders. In Figure 2.17(b), we evaluate the energy per bit for different design orders. The results show that for the optimal WLS, an energy saving of up to 79.8% can be obtained, which validates the scalability of the design for higher orders. This allows the exploration of the resulting designs in order to optimize the performance metrics.



Figure 2.17: Laser energy consumption per computed bit according to a) *WLS* and b) the polynomial degree.



Figure 2.18: Designs that maximize the processing accuracy and energy efficiency for Gamma correction application.

#### 2.6.4 Accuracy and Energy Design Trade-off

We aim to optimize the accuracy and energy efficiency to execute the Gamma correction application. For this purpose, all the parameter combinations shown in Figure 2.16 are considered, which leads to 8 designs. For each design, we estimate the laser energy consumption per processed pixel. The pixel processing time is also estimated, taking into account BSL and assuming a 1Gbit/s modulation speed.

As illustrated in Figure 2.18, eight designs (called D1 to D8 in the following) are on the Pareto front: D1 is the most energy efficient solution with 4.17 nJ/pixel and $MED_{\text{Total}} = 0.077$, while D8 is the most accurate one (196nJ/pixel and $MED_{\text{Total}} = 0.017$). The $47 \times$ increase in the energy consumption is due to the use of i) a higher order (6 for D8 against 2 for D1); ii) a lower BER (10<sup>-3</sup> for D8 against 10<sup>-1</sup> for D1); and iii) a higher BSL (4096 for D8 against 256 for D1). It is also worth mentioning that the processing of D1 is $16 \times$ faster than that of D8 (256ns/pixel for D1 against $4\mu$s/pixel for D8), which is due to the shorter BSL. Compared to D1, D2 allows reducing the error from 0.077 to 0.058 (-25%) at the cost of a 4.7% energy consumption increase. Since this can be achieved by adapting the laser power, no modification of the hardware is needed, thus allowing the user to switch between D1 and D2 at run-time. However, further reducing the error (i.e., using D3 instead of D2) calls for a third order polynomial function, which involves different hardware since D1 to D8 are static architectures; hence the order cannot be adapted at run-time. Interestingly, switching at run-time between D4 and D5 is also possible in case reconfigurable SNGs are used. Indeed, both designs involve the same hardware for the OC part; the only difference is in the interfaces, since D4 and D5 rely on 512 and 1024 bit-stream lengths, respectively. Using D5 instead of D4 leads to a $2\times$ increase in the energy consumption while offering a 27% reduction in the error.
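The processing time and energy ratios quoted for D1 and D8 can be re-derived with simple arithmetic, as sketched below. The per-pixel figures are those given in the text; the per-bit energies printed at the end are obtained here by dividing by BSL and are not values reported in this work. The function name is hypothetical.

```python
def pixel_time_ns(bsl, rate_gbps=1.0):
    """Time to stream bsl bits at rate_gbps Gbit/s, in nanoseconds."""
    return bsl / rate_gbps


# (BSL, laser energy in nJ/pixel) as quoted for the two extreme designs
D1 = {"bsl": 256, "energy_nj": 4.17}
D8 = {"bsl": 4096, "energy_nj": 196.0}

t1 = pixel_time_ns(D1["bsl"])
t8 = pixel_time_ns(D8["bsl"])
print(f"D1: {t1:.0f} ns/pixel, D8: {t8:.0f} ns/pixel, ratio {t8 / t1:.0f}x")
print(f"energy ratio D8/D1: {D8['energy_nj'] / D1['energy_nj']:.1f}x")

# Derived quantity: laser energy per transmitted bit, in pJ
for name, d in (("D1", D1), ("D8", D8)):
    print(f"{name}: {1000 * d['energy_nj'] / d['bsl']:.1f} pJ/bit")
```

The time ratio follows from BSL alone, whereas the energy ratio also reflects the higher order and the lower BER of D8, which is why it is larger than the ratio of stream lengths.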

### 2.7 Summary

In this chapter, we proposed the design of an OSC architecture for polynomial functions. We presented the design of a generic Bernstein polynomial architecture for *n*-th order polynomial functions. In addition, we defined the analytical model for the signal transmission. We proposed design methods for optimizing the laser power consumption according to device characteristics. The computing accuracy is evaluated by considering errors induced by i) polynomial function approximation; ii) stochastic number generation; and iii) the transmission of optical signals. The latter depends on device parameters, such as the resonance wavelengths, which can be explored by the designer. We simulated the execution of Gamma correction to process $160 \times 160$ pixel images. Results showed that reducing the mean error from 0.077 to 0.017 can be achieved at the cost of $47 \times$ energy consumption and $16 \times$ processing time. The results also showed that it is possible to reach the same computing accuracy for different polynomial orders by compensating the reduced accuracy of a lower order polynomial with a higher *BSL* and a lower *BER*.

Overall, we found that the order n is the main design parameter to consider when both accuracy and energy efficiency need to be optimized. Based on our observations, and by considering the technological parameters and design method, *BER* and *BSL* should inherently be maximized and minimized, respectively, for energy efficiency purposes. However, they become key design parameters to explore when processing speed and laser power consumption are to be optimized. All in all, these observations call for a reconfigurable architecture, in which the order of the polynomial function can be adapted, together with the *BSL* and/or the laser power consumption, according to users' constraints and objectives, as will be explained in the next chapter.
## Chapter 3

# Reconfigurable Optical Stochastic Computing Architecture for Polynomial Functions

The architecture proposed in Chapter 2 is static: any change in the design requirements calls for the design of a new architecture. Since designing computing architectures using silicon photonics devices remains costly, this calls for an adaptable design able to meet application-specific objectives. Therefore, in this chapter, we propose a reconfigurable OSC architecture allowing online adaptation of computing accuracy, energy efficiency, and throughput to meet different requirements. The reconfigurable architecture is based on the Bernstein polynomial design proposed in Chapter 2, where the order can be configured at run-time. This allows the execution of a single function of high order to enhance the accuracy, or of multiple functions simultaneously for higher throughput and energy efficiency.

## 3.1 Proposed Methodology

In the proposed methodology, shown in Figure 3.1, one or more applications represented as mathematical functions, i.e., $f_1(x_1)$ to $f_m(x_m)$, can be implemented simultaneously using polynomial architectures. This requires a reconfigurable design, where the order can be adapted according to the requirements, i.e., computing accuracy, energy efficiency, and design throughput. This allows for parallelism at instruction and data levels, where different applications can be executed at the same time to process the same input (instruction parallelism) or the same application can be implemented multiple times to process different input data in parallel (data parallelism). Based on the selected scenario, we explore the technological and system-level parameters. In order to optimize the energy consumption, we use the same approach proposed in the methodology of Chapter 2, where WLS is explored using the MRR-first design method to estimate the minimum energy consumption of the lasers. The computing accuracy is evaluated using a Gamma correction application, where we consider the same sources of error introduced in Chapter 2, i.e., $\varepsilon_{Berns\_approx}$, $\varepsilon_{BSL}$ and $\varepsilon_{Trans}$. As a result, we evaluate the total energy consumption, the computing accuracy (using the MED metric), and the processing time for a given set of design parameters. The design can be generic to implement *n*-th order functions. To illustrate the efficiency of the proposed methodology, we propose the design of an architecture that can be reconfigured to execute: i) a 4<sup>th</sup> order function for high accuracy processing; or ii) two 2<sup>nd</sup> order functions for energy efficiency and high throughput purposes, as detailed in the following.



Figure 3.1: Proposed methodology for design space exploration of reconfigurable polynomial architectures.

## 3.2 Proposed Architecture

In this section, we introduce the optical devices used in the design and the proposed reconfigurable architecture. We present a design method to explore different design parameters for energy optimization.

### 3.2.1 Directional Coupler

Since the design is based on the optical polynomial architecture proposed in Chapter 2, the same optical devices are used, i.e., MZIs, MRRs and AOFs. For the configuration process, an additional device is required to direct the data signals to the correct output. We propose to use a directional coupler (DC) for this purpose. We first give a brief description of the device operation.



Figure 3.2: DC device.

Figure 3.2 shows a DC composed of two parallel arms implemented using waveguides. The device operates in two states: when no voltage is applied, the intrinsic refractive index of the waveguides leads to coupling of the signal from a waveguide to another, i.e., cross state (Figure 3.2(a)). When a voltage is applied, the change in the refractive index leads to a 50% reduction of the coupling length. Thus, the signals continue propagating on the same waveguide, i.e., bar state (Figure 3.2(b)). The device transmission is defined by:

$$T^{DC}[v] = \begin{cases} IL_{cross\%}, & v = 0 \text{ (cross state)} \\ IL_{bar\%}, & v = 1 \text{ (bar state)} \end{cases} \tag{3.1}$$

In the following, we propose the design of the reconfigurable architecture and detail the functionality of the DC in the design.

#### 3.2.2 Reconfigurable Bernstein Polynomial Architecture

Figure 3.3 illustrates the proposed reconfigurable architecture. It allows executing polynomial functions on input data  $X_A$  and  $X_B$  according to Bernstein coefficients (input  $b_0..b_2$  and  $b_3..b_5$ ). Two configurations are available:  $Cfg_{1\times4}$  allows executing a 4<sup>th</sup> order function on the data (i.e.,  $X_A = X_B$ ) and  $Cfg_{2\times2}$  leads to two 2<sup>nd</sup> order functions processed in parallel (i.e.,  $X_A \neq X_B$ ). Depending on the selected configuration, the results are output either on  $Y_{1\times4}$  (for  $Cfg_{1\times4}$ ) or  $Y_{2\times2.A}$  and  $Y_{2\times2.B}$  (for  $Cfg_{2\times2}$ ).



Figure 3.3: Proposed reconfigurable architecture for polynomial functions.

The reconfigurability involves a symmetrical architecture: two sets of adders and modulators are designed using MZIs and MRRs, respectively. Each one is responsible for generating optical signals corresponding to the related input data (i.e.,  $X_A$  or  $X_B$ ) and coefficients ( $b_0..b_2$  or  $b_3..b_5$ ). The data signals are generated as follows: from data  $X_A$  (resp.  $X_B$ ), streams of bits  $X_{A1}$  and  $X_{A2}$  (resp.  $X_{B1}$  and  $X_{B2}$ ) are generated using independent SNGs; their outputs modulate MZIs, thus leading to constructive state ('1') or destructive state ('0') on signals at  $\lambda_{pump}$  (see mark (1) in Figure 3.3). Eventually, for each pair of MZIs, three optical power levels can be obtained: 0 for 00, 1 for 01/10 and 2 for 11 (see (2)). The optical signals corresponding to coefficients  $b_i$  are obtained through modulation of MRRs at  $\lambda_i$  using SNGs, where  $0 \le i \le 5$  (see (3)). Data and coefficient signals are combined into a waveguide prior



**Figure 3.4:**  $Cfg_{1\times 4}$  executes a single 4<sup>th</sup> order function.

to entering a reconfigurable multiplexer implemented using DCs and AOFs (see (4)). The configuration depends on the states of the DCs, as detailed in the following:

- Configuration $Cfg_{1\times4}$ involves both DCs in the cross state, i.e., cfg=1, as shown in Figure 3.4. The two groups of data and coefficient signals are combined into the same waveguide as follows (see (5)): while the coefficient signals are combined without interfering thanks to WDM, the data signals accumulate since they both propagate at $\lambda_{pump}$. This leads to five pump power levels able to detune the AOF to the five wavelengths at which the coefficient signals propagate. The signal at the wavelength selected by the AOF is dropped to output $Y_{1\times4}$. This configuration thus allows executing a 4<sup>th</sup> order function.
- Configuration $Cfg_{2\times2}$ involves both DCs in the bar state, i.e., cfg=0, as shown in Figure 3.5. The two groups of data and coefficient signals continue propagating independently from each other (see (6)). For each group, the pump signal detunes the corresponding AOF to one of the three wavelengths propagating the coefficient signals (i.e., $\lambda_0..\lambda_2$ for $Y_{2\times2.A}$ and $\lambda_3..\lambda_5$ for $Y_{2\times2.B}$). This allows simultaneous execution of two 2<sup>nd</sup> order functions.
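The selection logic of the two configurations can be summarized by a small behavioral model. This is a purely logical sketch under stated assumptions: optical powers and losses are ignored, the pump power level is taken as the number of ones among the data bits, and level k is assumed to select the k-th coefficient bit. Which five coefficient wavelengths are used in $Cfg_{1\times4}$ is not specified here, so the sketch takes them as a parameter. All function names are hypothetical.

```python
def cfg_1x4(xa_bits, xb_bits, z_bits):
    """Cross state: the two data groups accumulate into five pump levels (0..4).

    z_bits holds the five coefficient bits of the 4th order function.
    Returns the bit dropped to output Y_1x4.
    """
    level = sum(xa_bits) + sum(xb_bits)
    return z_bits[level]


def cfg_2x2(xa_bits, xb_bits, z_a, z_b):
    """Bar state: each group keeps its own three pump levels (0..2).

    z_a and z_b hold three coefficient bits each.
    Returns the bits dropped to outputs Y_2x2.A and Y_2x2.B.
    """
    return z_a[sum(xa_bits)], z_b[sum(xb_bits)]


# One clock cycle with made-up bit values
xa, xb = (1, 1), (1, 0)
print("Cfg 1x4:", cfg_1x4(xa, xb, [0, 0, 0, 1, 0]))
print("Cfg 2x2:", cfg_2x2(xa, xb, [1, 0, 1], [0, 1, 0]))
```

In the first call the four data bits contain three ones, so the fourth coefficient bit is selected; in the second call each pair of data bits selects independently among its own three coefficient bits.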



Figure 3.5:  $Cfg_{2\times 2}$  executes two 2<sup>nd</sup> order functions.

Since DCs enable the switching between a single 4<sup>th</sup> order function ($Cfg_{1\times4}$) and two 2<sup>nd</sup> order functions ($Cfg_{2\times2}$), the architecture allows exploring the accuracy and throughput trade-off at run-time. For image processing applications, the high polynomial order available in the $Cfg_{1\times4}$ configuration is suitable to meet objectives related to computing accuracy. On the other hand, the parallelism available in the $Cfg_{2\times2}$ configuration accelerates the processing, either using data-level parallelism (by applying the same filter on multiple images simultaneously) or instruction-level parallelism (by applying multiple filters on the same image). However, compared to the static architecture proposed in Chapter 2, this adaptability leads to area and energy overhead. This calls for design optimization, whose key challenges are introduced in the following.

#### 3.2.3 Design Method

The laser powers are key design parameters to optimize. Indeed, while the laser powers should be minimized for energy efficiency purposes, enough optical power should be injected to ensure that the design works properly and the computations are correct. The reconfigurability of the architecture leads to additional constraints, since the same injected pump power should control either a single AOF ($Cfg_{1\times4}$) or two AOFs ($Cfg_{2\times2}$). While existing methods allow adapting laser powers at run-time [92], they lead to a significant control overhead that we intend to avoid in the context of SC, as they would impact both latency and area. Instead, we aim to optimize, at design time, the laser powers taking into account the characteristics of the involved devices, i.e., MZI, MRR, DC and AOF, and system-level parameters, such as *BER*. For this purpose, we investigate the wavelengths of the coefficient signals, since they affect both the pump and the probe laser powers.

First, we define two groups of wavelengths to be processed in parallel under the $Cfg_{2\times 2}$ configuration. Each group contains consecutive wavelengths; hence the pump power is equally distributed to two AOFs. The total wavelength range (i.e., from $\lambda_0$ to $\lambda_5$) is also equally distributed, which allows using the same optical tuning efficiency for all the AOFs. Second, we define for each AOF an initial resonant wavelength $\lambda_{ref}$ that minimizes the covered wavelength distance. For $Cfg_{2\times 2}$, $\lambda_{ref}$ is defined as close as possible to the right-most wavelength in the group ($\lambda_2$ and $\lambda_5$ for $Y_{2\times 2.A}$ and $Y_{2\times 2.B}$, respectively), which is given by the minimum optical power received by the AOF (i.e., 00), hence it depends on the MZI and DC insertion losses. Finally, a large *WLS* leads to low crosstalk between the coefficient signals, which minimizes the required probe laser powers. On the other hand, this requires higher pump power for the AOF to cover a larger wavelength distance. Therefore, the optimal spacing, i.e., the spacing minimizing the total laser power, is searched analytically by exploring the *WLS*. The design calls for a transmission model we define in the following.

## 3.3 Implementation and Modeling

The configuration scheme proposed in Section 3.2.2 allows run-time adaptation of accuracy, energy efficiency and throughput, which comes with a power overhead. In this section, we detail the signal transmission model of the reconfigurable architecture. It allows evaluating SNR, from which the laser energy consumption is estimated. The model is unified and is thus applicable to the two configurations. The configuration is defined by cfg, which controls the state of the DCs (i.e., $Cfg_{2\times 2}$ and $Cfg_{1\times 4}$ lead to bar state and cross state, respectively). The coefficient signal $\lambda_i$ propagates through i) the modulating $MRR_i$; ii) modulators $MRR_w$ dedicated to other signals; iii) a DC; and iv) an AOF, as defined by:

$$T_{s,z}[i] = \underbrace{\varphi_t(\lambda_i, \lambda_i - \Delta\lambda \times z_i)}_{\text{Modulating MRR transmission}} \times \underbrace{\prod_{w=0, w \neq i}^n \varphi_t(\lambda_i, \lambda_w - \Delta\lambda \times z_w)}_{\text{Other MRRs transmission}} \times \underbrace{T^{DC}[cfg]}_{\text{DC transmission}} \times \underbrace{\varphi_d(\lambda_i, \lambda_{ref} - \Delta AOF)}_{\text{AOF transmission}} \tag{3.2}$$

The detuning of the AOF depends on the transmission of the pump signal through the MZIs and the DCs. It is given by:

$$\Delta AOF = OP_{Laser\_pump} \times OTE \times \frac{1}{n} \sum_{h=j}^{k} T^{MZI}[x_h] \times T^{DC}[cfg]$$
(3.3a)

$$h \in \begin{cases} \{A_1, A_2\} \text{ or } \{B_1, B_2\}, & Cfg_{2\times 2} \\ \{A_1, A_2, B_1, B_2\}, & Cfg_{1\times 4} \end{cases} \tag{3.3b}$$

where OTE is assumed to be 0.01nm/mW as in Chapter 2. $T^{MZI}[x_h]$ is the transmission through the MZIs, for which the states (constructive or destructive) depend on the data input $x_h$. Equation 3.3b indicates which MZIs are considered in the transmission according to the selected configuration: either the pump signals are separated ($Cfg_{2\times 2}$), or they remain combined ($Cfg_{1\times 4}$). We use Equations 2.17 and 2.18 to calculate *SNR* and *BER*, respectively.
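The detuning of Equation 3.3 can be sketched numerically as follows. This is only an illustrative sketch: the MZI and DC transmission values are hypothetical placeholders (the device values of Chapter 2 are not reproduced here), $n$ is taken as the number of MZIs in the selected group, and any splitting of the pump power between the two groups is assumed to be folded into $T^{DC}[cfg]$.

```python
# Sketch of Equation 3.3: AOF detuning as a function of the pump path.
# All transmission values are hypothetical placeholders, not thesis data.

OTE_NM_PER_MW = 0.01  # optical tuning efficiency quoted in the text (nm/mW)

# Hypothetical linear-scale transmissions, for illustration only.
T_MZI = {1: 0.9, 0: 0.05}                   # constructive ('1') / destructive ('0')
T_DC = {"cfg_2x2": 0.95, "cfg_1x4": 0.95}   # bar state / cross state

# MZIs entering the sum for each configuration (Equation 3.3b).
MZI_GROUPS = {
    "cfg_2x2": (("A1", "A2"), ("B1", "B2")),
    "cfg_1x4": (("A1", "A2", "B1", "B2"),),
}


def aof_detuning(pump_mw, x, group, cfg):
    """Return the AOF detuning in nm (Equation 3.3a) for one group of MZIs.

    pump_mw: injected pump laser power in mW.
    x:       dict mapping an MZI name to its data bit (0 or 1).
    group:   tuple of MZI names taken from MZI_GROUPS[cfg].
    """
    n = len(group)  # assumption: n equals the number of MZIs in the group
    mzi_sum = sum(T_MZI[x[h]] for h in group)
    return pump_mw * OTE_NM_PER_MW * (mzi_sum / n) * T_DC[cfg]


if __name__ == "__main__":
    data = {"A1": 1, "A2": 0, "B1": 0, "B2": 1}
    for cfg, groups in MZI_GROUPS.items():
        for group in groups:
            print(cfg, group, aof_detuning(100.0, data, group, cfg))
```

The same function serves both configurations; only the group of MZIs entering the sum changes, which mirrors the unified model described above.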

Regarding computing accuracy, we assume the three sources of error introduced in Chapter 2, i.e., $\varepsilon_{Berns\_approx}$, $\varepsilon_{BSSL}$ and $\varepsilon_{Trans}$. Moreover, we use the MED metric to evaluate the total computing accuracy, i.e., $MED_{Total}$, at the application level as presented in Section 2.5.1.

## 3.4 Simulation Results

In this section, we evaluate the performance of the proposed reconfigurable architecture using the Gamma correction application. We also evaluate the energy and area overhead compared to a non-reconfigurable version of the architecture.

#### 3.4.1 Accuracy and Throughput Trade-off

We use the Gamma correction image processing application with $\gamma=0.45$ to execute 2<sup>nd</sup> order ($Cfg_{2\times 2}$) and 4<sup>th</sup> order ($Cfg_{1\times 4}$) functions. For this purpose, the Bernstein coefficients ($b_0$ to $b_2$) and ($b_0$ to $b_4$) are calculated for $Cfg_{2\times 2}$ and $Cfg_{1\times 4}$, respectively, using Equation 2.9. Figure 3.6 shows the outputs from processing input data $x \in [0, 1]$ using an error free function f(x), and approximated 2<sup>nd</sup> and 4<sup>th</sup> order polynomial functions. As expected, the approximation error increases as the polynomial order is reduced, which leads to a design trade-off we explore in the following.
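The effect of the polynomial order can be illustrated with a short sketch. Equation 2.9 is not reproduced in this chapter, so the sketch below swaps in the classical Bernstein operator, $b_i = f(i/n)$, as a stand-in; the coefficients used in the thesis may differ. The function names are hypothetical.

```python
from math import comb


def gamma_correction(x, gamma=0.45):
    """Error free target function f(x) = x ** gamma on [0, 1]."""
    return x ** gamma


def bernstein_coefficients(f, n):
    # Assumption: classical Bernstein operator, b_i = f(i / n).
    # Equation 2.9 of the thesis may define the coefficients differently.
    return [f(i / n) for i in range(n + 1)]


def bernstein_poly(coeffs, x):
    """Evaluate sum_i b_i * C(n, i) * x^i * (1 - x)^(n - i)."""
    n = len(coeffs) - 1
    return sum(b * comb(n, i) * x ** i * (1 - x) ** (n - i)
               for i, b in enumerate(coeffs))


def mean_abs_error(f, n, samples=101):
    """Mean absolute approximation error over a uniform grid on [0, 1]."""
    coeffs = bernstein_coefficients(f, n)
    xs = [k / (samples - 1) for k in range(samples)]
    return sum(abs(f(x) - bernstein_poly(coeffs, x)) for x in xs) / samples


err_order_2 = mean_abs_error(gamma_correction, 2)
err_order_4 = mean_abs_error(gamma_correction, 4)
print(f"order 2: {err_order_2:.4f}, order 4: {err_order_4:.4f}")
```

With this choice all coefficients lie in [0, 1], so they can be encoded as stochastic bit streams, and the 4<sup>th</sup> order approximation is closer to f(x) than the 2<sup>nd</sup> order one.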



**Figure 3.6:** Error free function f(x) and approximate polynomial functions for  $Cfg_{1\times 4}$ and  $Cfg_{2\times 2}$ .

To evaluate the architecture, we simulate the processing of $160 \times 160$ pixel images for BSL ranging from $2^8$ to $2^{12}$ and $BER = 10^{-3}$. We explore the WLS, which leads to an optimal WLS=0.155nm. The computing accuracy is calculated using MED, which is obtained by comparing the pixels processed using our architecture with the error free results. $Cfg_{1\times4}$ leads to sequential processing of the pixels (Figure 3.7(a)). For this purpose, each pixel is sent to $X_A$ and $X_B$ and the 5 coefficients are distributed to the MRRs. $Cfg_{2\times2}$ is used to process two pixels simultaneously for high throughput purposes (Figure 3.7(c)). In this case, $X_A$ and $X_B$ receive different pixels and the same coefficients are sent to the two groups of MRRs. By assuming 1Gbps modulation speed and $BSL = 2^{10}$, the average processing times per pixel are 1024ns and 512ns for $Cfg_{1\times4}$ and $Cfg_{2\times2}$, respectively.

**Figure 3.7:** Image processed for (a) $Cfg_{1\times4}$: pixels are serially processed, and (c) $Cfg_{2\times2}$: pixels are processed in parallel. (b) marks ① ② and (d) marks ④ ⑤ are the transmissions through MRRs for $Cfg_{1\times4}$ and $Cfg_{2\times2}$, respectively. (b) mark ③ and (d) marks ⑥ ⑦ are the transmissions towards the photodetectors for $Cfg_{1\times4}$ and $Cfg_{2\times2}$, respectively.

Figure 3.7(b) and (d) show the signal transmissions for $Cfg_{1\times4}$ and $Cfg_{2\times2}$, respectively. For $Cfg_{1\times4}$, we assume a value '1' for the coefficient signals at $\lambda_2$ and $\lambda_4$ and a value '0' for the remaining $\lambda$, thus leading to the transmissions illustrated in (1) and (2). The signals are merged and propagate to the same AOF. We assume a received 53mW pump signal power (corresponding to $X_{A1} = X_{B2} = 1$ and $X_{A2} = X_{B1} = 0$), which detunes the AOF from $\lambda_{ref}$ to $\lambda_2$ (see (3)), thus leading to the transmission of 110µW to $Y_{1\times4}$. For $Cfg_{2\times2}$, we assume the transmission of '1' at $\lambda_2$, $\lambda_4$, and $\lambda_5$ (see (4) and (5)). The groups of signals propagate to two AOFs, which are detuned independently from each other. The assumed data input values lead to the transmission of the signals at $\lambda_1$ and $\lambda_5$ to $Y_{2\times2.A}$ (10µW) and $Y_{2\times2.B}$ (90µW), respectively (see (6) and (7)).
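The per-pixel processing times quoted for the two configurations follow directly from the bit stream length, the modulation speed and the number of pixels processed in parallel. A minimal sketch, with hypothetical function and variable names:

```python
def time_per_pixel_ns(bsl, bitrate_gbps=1.0, pixels_in_parallel=1):
    """Average processing time per pixel in ns.

    One bit is modulated per bit period and each pixel needs `bsl` bits;
    processing several pixels in parallel divides the average time.
    """
    bit_time_ns = 1.0 / bitrate_gbps  # 1 Gbps -> 1 ns per bit
    return bsl * bit_time_ns / pixels_in_parallel


t_cfg_1x4 = time_per_pixel_ns(2 ** 10, pixels_in_parallel=1)
t_cfg_2x2 = time_per_pixel_ns(2 ** 10, pixels_in_parallel=2)
print(t_cfg_1x4, t_cfg_2x2)  # 1024.0 512.0
```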

#### 3.4.2 Static vs Reconfigurable Architectures

Table 3.1 reports the energy and area overheads of the reconfigurable architecture compared to the static architecture defined in Chapter 2. For a fair comparison, we design our architecture to ensure that  $Cfg_{1\times4}$  and  $Cfg_{2\times2}$  achieve the same accuracy as the 4<sup>th</sup> and 2<sup>nd</sup> order static architectures, respectively. We also assume that both static and reconfigurable architectures can adapt the *BSL* during run-time [93]. The simulation results show that they lead to 53% and 36.8% energy overhead, respectively, which is mainly due to the losses induced by the DCs on the propagation path.

We also evaluate the impact of BSL on the computing accuracy and energy efficiency. For this purpose, we evaluate the error and the energy efficiency of all architectures for BSL ranging from $2^8$ to $2^{12}$. As can be seen in Figure 3.8, the proposed architecture allows covering $MED_{Total}$ ranging from 0.05 to 0.017, while static architectures cover [0.05 - 0.03] and [0.04 - 0.017] for $2^{nd}$ and $4^{th}$ order, respectively. The improvement in the reachable range of accuracy (+65% and +43.5%) demonstrates the benefits of adapting the polynomial order to satisfy application-level requirements.

|                              |                | Static (n=4) | Static (n=2) | Reconfigurable (abs.)    | Wrt. n=4 | Wrt. n=2 |
|------------------------------|----------------|--------------|--------------|--------------------------|----------|----------|
| Energy efficiency            | nJ/pixel       | 34           | 19           | $Cfg_{1\times4}$: 52     | +53%     | +173%    |
|                              |                |              |              | $Cfg_{2\times2}$: 26     | -23.5%   | +36.8%   |
| Accuracy                     | $MED_{Total}$  | 0.023        | 0.034        | $Cfg_{1\times4}$: 0.023  | -        | -32.4%   |
|                              |                |              |              | $Cfg_{2\times2}$: 0.034  | +47.8%   | -        |
| No. of optical devices       | Pump laser     | 1            | 1            | 1                        |          |          |
|                              | Probe laser    | 5            | 3            | 6                        |          |          |
|                              | MZI            | 4            | 2            | 4                        |          |          |
|                              | DC             | 0            | 0            | 2                        |          |          |
|                              | MRR            | 5            | 3            | 6                        |          |          |
|                              | AOF            | 1            | 1            | 3                        |          |          |
|                              | Photodetector  | 1            | 1            | 3                        |          |          |
| Accuracy/energy adaptability | Order          | -            | -            | $\checkmark$             |          |          |
|                              | BSL            | $\checkmark$ | $\checkmark$ | $\checkmark$             |          |          |

Table 3.1: Energy and area overhead evaluation.

Interestingly, adapting the polynomial order is, in some cases, more energy efficient than adapting the *BSL*. For instance, assuming a 2<sup>nd</sup> order static architecture in Figure 3.8, reducing the error from 0.04 to 0.03 can be achieved by increasing the *BSL* from 2<sup>9</sup> (see (1) in the figure) to 2<sup>12</sup> (see (2)), which results in 67nJ/pixel. Using the proposed architecture, the same accuracy can be achieved by switching from $Cfg_{2\times 2}$ (see (3)) to $Cfg_{1\times 4}$ (see (4)), which leads to 26nJ/pixel. It is worth noting that, in addition to the 61.2% energy saving, an 8× throughput is achieved thanks to a lower *BSL* (2<sup>9</sup> for (4) compared to 2<sup>12</sup> for (2)).
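The percentages reported in this section can be cross-checked from the absolute values given in the text and in Table 3.1 (34 and 19 nJ/pixel for the static designs, 52 and 26 nJ/pixel for the reconfigurable one). The sketch below is only an arithmetic check; the variable names are hypothetical.

```python
def relative_change_pct(new, ref):
    """Relative change of `new` with respect to `ref`, in percent."""
    return 100.0 * (new - ref) / ref


# Energy overhead of each configuration wrt. the static design of same order.
overhead_1x4 = relative_change_pct(52.0, 34.0)   # Cfg_1x4 vs static n=4
overhead_2x2 = relative_change_pct(26.0, 19.0)   # Cfg_2x2 vs static n=2


def span(med_max, med_min):
    """Width of a reachable MED_Total range."""
    return med_max - med_min


# Reachable MED_Total ranges quoted from Figure 3.8.
reconf_span = span(0.05, 0.017)
range_gain_vs_n2 = relative_change_pct(reconf_span, span(0.05, 0.03))
range_gain_vs_n4 = relative_change_pct(reconf_span, span(0.04, 0.017))

# Switching the order (26 nJ/pixel) instead of raising the BSL (67 nJ/pixel).
energy_saving = -relative_change_pct(26.0, 67.0)
throughput_gain = 2 ** 12 / 2 ** 9

print(round(overhead_1x4, 1), round(overhead_2x2, 1))
print(round(range_gain_vs_n2, 1), round(range_gain_vs_n4, 1))
print(round(energy_saving, 1), throughput_gain)
```

The first overhead evaluates to about 52.9%, which the text rounds to 53%.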

As can be observed, although the proposed architecture leads to area overhead, it covers a large range of computing accuracy, which is needed to adapt to user requirements. This high adaptability allows, depending on the targeted accuracy, to improve the energy efficiency or the throughput compared to the static architecture.



Figure 3.8: Accuracy and energy efficiency results to process  $160 \times 160$  pixels images for BSL ranging from  $2^8$  to  $2^{12}$ .

## 3.5 Summary

In this chapter, we proposed a reconfigurable OSC architecture that allows adapting the order of the executed polynomial functions for accuracy, energy efficiency, and throughput purposes. Compared to a static architecture, in which the order is defined at design time, the reconfigurable architecture increases the range of reachable accuracy by up to 65%, which is key to meeting users' requirements. However, it leads to up to 53% energy overhead. We also demonstrated that, in some cases, adapting the polynomial order is more energy efficient than adapting the BSL.

While the Bernstein polynomial architecture is limited to single-input function implementation, other applications may be based on processing multiple inputs, where the design can be composed of cascaded gates, such as combinational filters. This requires the exploration of other devices' characteristics, as will be discussed in the next chapter.

## Chapter 4

# Optical Stochastic Computing Architecture for Combinational Filters

In this chapter, we investigate the use of PhC nanocavities to design SC architectures. PhC nanocavity is an energy efficient device of a small footprint. It is characterized by different quality factors around resonant wavelengths. We aim to take advantage of this feature to explore the design of SC architectures that involve cascaded gates and multi-wavelength signaling, such as combinational filters. In order to implement such architectures, we aim to design all-optical logic gates using PhC nanocavities.

## 4.1 Overview

In this section, we give a brief overview of the objective of this chapter and introduce the state-of-the-art SC edge detection filter architecture.

#### 4.1.1 All-optical Architecture

Silicon photonics devices, such as MZI and MRR, have been widely investigated in the design of OC architectures [25, 68]. In these approaches, optical signals are modulated by electrical signals, which calls for E/O and O/E converters. To cope with this limitation, the design of all-optical gates using MRRs has been investigated in [84]. The switching operation is obtained by applying a high power (typically a few mW) optical control signal in order to modulate a lower power optical data signal (typically a few 100s of $\mu$W). In MRRs, this is achieved by injecting control and data signals on different resonant wavelengths, where the wavelength detuning obtained from the control signal modifies the transmission of the data signal. This way, the data signals remain in the optical domain during their processing from the inputs to the outputs, which removes the need for E/O and O/E converters. Therefore, all-optical architectures have the potential to operate at higher speeds compared to optical architectures involving electrically controlled devices. However, to trigger the nonlinear effects needed for all-OC architectures, one has to take into account the wavelength detuning achievable in the MRR, which mostly depends on the quality factor (Q factor). Since the Q factor is intrinsically the same for all resonances, the modulation obtained on the data signal is necessarily limited by the shift triggered by the control signal. PhC nanocavities do not share this limitation since each resonance can show a different Q factor. Hence, using such a device can lead to an extinction ratio (ER) unreachable with MRRs, which is essential for the design of computing architectures involving cascaded gates. Furthermore, PhC demonstrates 10ps switching speed, 100fJ switching energy consumption and $10 \times$ compactness compared to MRRs [94], which makes the device an ideal candidate for all-OC architectures.

The design of all-optical gates is necessary to implement all-OC architectures. In the context of SC, the design of an all-optical XOR gate and Multiplexer (MUX) is essential since they represent an absolute value subtractor and an adder, respectively. The implementation of an architecture that involves cascaded gates, such as a stochastic edge detection filter with cascaded multiplexers, in the optical domain is challenging. It requires a device with different Q factors and wavelength detuning to transmit a group of signals propagating at multiple wavelengths. The design of such an architecture involves a large design space to explore at both device and system levels, such as Q factors, resonance wavelength, and wavelength detuning. In this chapter, we investigate the use of PhC nanocavities to design all-optical cascaded gates for SC architectures in the context of image processing applications, such as edge detection filters. For this purpose, we develop an all-optical XOR gate and a multiplexer (MUX) using nanocavities. We propose a transmission model of the nanocavities taking into account Q factors and resonance wavelengths, which allows exploring the design space. Thales in France provided us with the device's characteristics. We use their experimental results to validate our transmission model. As a case study, we implement a Sobel edge detection filter, which involves a cascaded XOR gate and MUX for absolute value subtraction and addition. The design of the cavities is explored to trade off power consumption, computing accuracy and processing time.

#### 4.1.2 Stochastic Computing Edge Detection Filter

In [33], the design of a stochastic edge detection filter is proposed. It is based on the Roberts cross operator (Figure 4.1), where two $2 \times 2$ filters are applied to an image to find the gradient vector at each pixel. The filters rely on absolute value subtraction and addition implemented using an XOR gate and a MUX, respectively, as follows:



Figure 4.1: Stochastic implementation of edge detection filter using the Roberts cross operator [33] with XOR as an absolute value subtractor and MUX as an adder.

• Absolute Value Subtractor: Figure 4.2(a) illustrates an XOR gate implementing a subtractor. This operation requires positively correlated bit streams with maximum overlap between '1's and '0's [95]. In the example, bit streams A=01010110 and B=01110110 are positively correlated with probability  $p_A=4/8$ and  $p_B=5/8$ , respectively, which leads to  $p_Y=1/8$ . In general, the output of the XOR gate can be written as:

$$p_{Y} = \begin{cases} p_{A} - p_{B}, & p_{A} > p_{B} \\ p_{B} - p_{A}, & p_{B} > p_{A} \end{cases}$$
(4.1)

which can be expressed as:

$$p_Y = |p_A - p_B| \tag{4.2}$$



Figure 4.2: (a) XOR gate as absolute value subtractor and (b) 2×1 MUX as scaled adder.

• Scaled-adder: This operation can be implemented using 2×1 MUX, as shown in Figure 4.2(b). The selection line has a probability of 1/2, which allows to downscale the output in order to keep the probability in the range [0,1]. While the bit streams to be added can be either uncorrelated or correlated [6], the selection line needs to be uncorrelated with the inputs. The output of the MUX is given as:

$$p_Y = (1 - p_{sel})p_A + p_{sel}p_B \tag{4.3}$$

since  $p_{sel}=1/2$ , the equation can be written as:

$$p_Y = \frac{1}{2}(p_A + p_B) \tag{4.4}$$
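The two stochastic operations can be reproduced on the bit streams of the example above. The sketch below is a software illustration only; the selection stream `SEL` is a hypothetical alternating pattern with probability 1/2, chosen for illustration.

```python
def probability(bits):
    """Value encoded by a unipolar stochastic bit stream."""
    return sum(bits) / len(bits)


def sc_xor(a, b):
    """Bitwise XOR: absolute value subtractor for positively correlated streams."""
    return [x ^ y for x, y in zip(a, b)]


def sc_mux(a, b, sel):
    """2x1 MUX: sel = 0 selects a, sel = 1 selects b (Equation 4.3)."""
    return [y if s else x for x, y, s in zip(a, b, sel)]


A = [int(c) for c in "01010110"]    # p_A = 4/8
B = [int(c) for c in "01110110"]    # p_B = 5/8
SEL = [int(c) for c in "01010101"]  # hypothetical select stream, p_sel = 1/2

p_sub = probability(sc_xor(A, B))       # |p_A - p_B| = 1/8
p_add = probability(sc_mux(A, B, SEL))  # approximates (p_A + p_B) / 2
print(p_sub, p_add)
```

With only 8 bits, the MUX output can only take multiples of 1/8, so it approximates the ideal value of 0.5625 within one quantization step, which illustrates the precision loss caused by downscaling.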

The main drawback of this implementation is the reduced accuracy of the output due to downscaling the results by half. This can be overcome by doubling the *BSL*, which, however, increases the latency. The design proposed in this chapter relies on cascaded MUXs, which induce precision loss but allow to maintain low hardware complexity. The impact of the precision loss on the application accuracy is evaluated, which allows choosing the most suitable *BSL*.

A common issue in SC architectures is the overhead induced by SNGs in terms of area and power. To overcome this issue, an adder allowing to reduce the number of LFSRs has been proposed in [32]. The selection line of the MUX is connected to the least significant bit (LSB) of the LFSR used to generate the MUX data inputs. The optical adder we propose relies on this efficient design. Since the same LFSR is used to generate correlated inputs [33], our design contains only a single LFSR to generate the bit streams for the XOR inputs and the selection lines of the MUXs.
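The principle of sharing a single LFSR can be sketched in software. The following is a minimal illustration under stated assumptions, not the circuit of [32] or [33]: it assumes an 8-bit maximal-length LFSR (feedback polynomial $x^8 + x^4 + x^3 + x^2 + 1$), a comparator-based SNG (a bit is '1' when the LFSR state does not exceed the encoded value), and a selection line taken from the LSB of the state. All names are hypothetical.

```python
def lfsr8_states(seed=1):
    """Yield the 255 states of an 8-bit maximal-length Fibonacci LFSR."""
    state = seed
    for _ in range(255):
        yield state
        # Feedback from bits 0, 2, 3 and 4 (x^8 + x^4 + x^3 + x^2 + 1).
        fb = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
        state = (state >> 1) | (fb << 7)


def shared_lfsr_streams(a, b, seed=1):
    """Generate correlated streams for values a and b (0..255) and a select stream.

    The same LFSR state drives both comparators, so the data streams are
    positively correlated; the select stream is the LSB of that state.
    """
    stream_a, stream_b, sel = [], [], []
    for s in lfsr8_states(seed):
        stream_a.append(1 if s <= a else 0)
        stream_b.append(1 if s <= b else 0)
        sel.append(s & 1)
    return stream_a, stream_b, sel


sa, sb, sel = shared_lfsr_streams(100, 160)
xor_ones = sum(x ^ y for x, y in zip(sa, sb))                      # subtractor
mux_ones = sum(y if s else x for x, y, s in zip(sa, sb, sel))      # scaled adder
print(xor_ones / 255, mux_ones / 255)
```

Over one full LFSR period, the XOR output counts exactly $|a - b|$ ones and the MUX output counts about $(a + b)/2$ ones, even though all three streams originate from the same register.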

## 4.2 Proposed Methodology

The proposed methodology, shown in Figure 4.3, targets combinational filters architectures. For a given application, we plan to explore the technological parameters, such as Q factors and resonant wavelength of the devices, and system-level parameters, i.e., laser powers, *BER*, and *BSL*, where energy consumption is optimized and computing accuracy is evaluated as follows:



Figure 4.3: Proposed methodology for design space exploration of combinational filter architectures.

• Energy Optimization: The energy is optimized by exploring the WLS between input signals at each stage in order to find the minimum BER. This requires exploring the Q factors and wavelength detuning of the gates in each stage. The exploration is repeated for each stage until the last stage of the cascaded architecture is reached, where the targeted BER at the photodetector is satisfied.

• Accuracy Evaluation: This requires the generation of bit streams that control the logic gates. This can be divided into: i) generation of bit streams that correspond to the input data, i.e., the pixels of an image, which control the operation of logic gates, such as XOR in edge detection filter; and ii) generation of bit streams for the selection lines of the MUXs, which have a probability of 0.5. As discussed in the previous section, these bit streams can be taken directly from the LFSR register. We use the error distance (ED) metric to compute the accuracy. Unlike polynomial architectures, there are only two sources of error for combinational filter architecture: i) error from the generated bit streams $ED_{BSL}$; and ii) error from optical transmission $ED_{Trans}$. For the total error, we use the peak signal-to-noise ratio (PSNR) metric to evaluate computing accuracy.
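For reference, PSNR can be computed as sketched below. This uses the standard definition, $PSNR = 10 \log_{10}(MAX^2 / MSE)$, and assumes 8-bit pixels (MAX = 255); the exact settings used in the thesis are not stated in this section.

```python
import math


def psnr(reference, processed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(processed) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - p) ** 2 for r, p in zip(reference, processed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel example: the processed image is off by one gray level.
print(psnr([10, 20, 30, 40], [11, 19, 31, 39]))
```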

## 4.3 Photonic Crystal Nanocavity

In this section, we introduce the PhC nanocavity device used to implement all-optical logic gates. The physical properties of the device and the implementation of an inverter are first detailed. Then, the design of the XOR gate and MUX are presented. Finally, a transmission model of the nanocavity is proposed.

#### 4.3.1 Nanocavity Device Overview

PhC nanocavities are emerging devices that feature higher switching speed, improved compactness and higher energy efficiency compared to MRRs. Unlike MRRs, nanocavities can demonstrate different Q factors around resonance wavelengths, which would contribute to increasing wavelength detuning and hence *SNR*. This leads to new opportunities to design architectures involving cascaded gates and data propagating through multiple wavelengths.



Figure 4.4: Photographs of the studied PhC nanocavity. a) A III-V semiconductor PhC cavity bonded on top of a silicon waveguide. b) Scanning electron microscope top-view images of III-V PhC cavities.

In this section, we use PhC nanocavities to implement all-optical logic gates. The structure is made of a III-V semiconductor bonded on top of a silicon waveguide, as illustrated in Figure 4.4(a). The PhC cavity itself consists of a waveguide drilled with holes (Figure 4.4(b)). A PhC nanocavity is a resonator that can act as a filter allowing only the resonant optical frequency to pass through. The implementation of fully optical gates using such a cavity involves the triggering of nonlinear effects. This can be achieved using a high power optical signal to control the transmission of lower power optical signals. It has been shown that a fast (10ps) nonlinear response is possible with only about 100fJ of energy [94], substantially outperforming MRRs [96].

#### 4.3.2 All-optical NOT Gate

As previously mentioned, the design of all-optical logic gates using nanocavities involves triggering nonlinear effects. We illustrate this principle using the implementation of an all-optical NOT gate. As shown in Figure 4.5(a), the NOT gate has an input In, which corresponds to the pump signal injected into the nanocavity. The value of In is given by its optical power $P_{[NOT]}$ (i.e., low power means '0' and high power means '1'). Therefore, the input signal In controls the value of the output signal, which corresponds to the output Out of the NOT gate. The design of the nanocavity allows two (or more) resonances separated by FSR. One resonance, in this case $\hat{\lambda}_{P[NOT]}$, is used to effectively inject a pump signal at $\lambda_P$, which induces the spectral shift of the other resonances, i.e., $\hat{\lambda}_{S[NOT]}$. This modifies the transmission of the output signal at $\lambda_S$. The signal at $\lambda_S$ is always injected into the cavity as '1', as shown in Figure 4.5(a). The operation of the all-optical NOT gate is explained as follows:

- In='0' corresponds to  $P_{[NOT]}='Low'$  (Figure 4.5(b)): in this case, the nanocavity is off-resonance, i.e.,  $\hat{\lambda}_{S[NOT]} \neq \lambda_S$ . Thus, the transmission of the signal at  $\lambda_S$  to the output is maximized, which leads to Out='1'.
- In='1' corresponds to  $P_{[NOT]}='$ High' (Figure 4.5(c)): The pump power detunes the resonance of the nanocavity by  $\Delta\lambda_{[NOT]}$ . The resonance of the cavity is then aligned to the output signal wavelength at  $\lambda_S$ , i.e.,  $\hat{\lambda}_{S[NOT]} = \lambda_S$ . This leads to a strong attenuation of the signal and hence Out='0'.

The fabrication process allows controlling numerous parameters, such as Q factors and resonance wavelengths. The design allows defining different Q factors for each resonance, as shown in Figure 4.5. Since we assume one pump signal and one output signal, it is possible to define Q factors $Q_{P[NOT]}$ and $Q_{S[NOT]}$ at resonances $\hat{\lambda}_{P[NOT]}$ and $\hat{\lambda}_{S[NOT]}$, respectively. We define the ratio between $Q_{S[NOT]}$ and $Q_{P[NOT]}$ as the figure of merit $(M_{[NOT]})$ of the cavity $(M_{[NOT]} = Q_{S[NOT]}/Q_{P[NOT]})$. A nanocavity with a large figure of merit would allow maintaining efficient coupling of the pump signal power into the device, while significantly changing the transmission around the output signal wavelength. This would result in a large gap between the cavity transmission for data '1' (i.e., no pump is applied) and data '0' (i.e., a pump signal is applied), i.e., a large extinction ratio. The impact of the figure of merit is further discussed in Section 4.4. In the following, we propose the implementation of an all-optical XOR gate and a MUX, which we use for the design of the edge detection filter.

Figure 4.5: An all-optical NOT gate implemented using nanocavity: (a) logic gate and the equivalent nanocavity device representation, (b) gate transmission for logic input '0' and (c) gate transmission for logic input '1'.

#### 4.3.3 Design of All-optical XOR Gate and MUX

The design of an edge detection circuit requires XOR gate and MUX. The following introduces their implementation using nanocavity devices.

- 2-input XOR gate: A 2-input XOR gate is implemented using two cascaded nanocavities, as illustrated in Figure 4.6(a). They are equal in Q factors but different in the FSR. Nanocavities marked (1) and (2) resonate at $\hat{\lambda}_{S[X1]}$ and $\hat{\lambda}_{S[X2]}$, respectively. Inputs $In_1$ and $In_2$, common for both cavities, are injected as pump signals into the cavities. The pump signals propagating at $\lambda_{p[1]}$ and $\lambda_{p[2]}$ are close in values to achieve the desired detuning. The signal at $\lambda_S$ is always '1'. It is tuned to match the resonance wavelength of the nanocavity marked (1) $(\hat{\lambda}_{S[X1]} = \lambda_S)$ and hence initially, when no pump signal is injected $(In_1=In_2=0)$, the signal is attenuated leading to Out='0', as shown in Figure 4.6(b). When one of the pump signals is high (i.e., $In_1 \neq In_2$), the resonance wavelengths of both cavities are shifted by $\Delta \lambda_{[XOR]} \approx 1/2(\hat{\lambda}_{S[X2]} - \hat{\lambda}_{S[X1]})$. Since none of the resonance wavelengths is aligned with $\lambda_S$, this leads to the transmission of the signal at $\lambda_S$ with maximized power, i.e., Out='1', as shown in Figure 4.6(c). When the two pump signals are high $(In_1=In_2='1')$, as shown in Figure 4.6(d), the resonance wavelengths of both cavities are detuned by $\Delta \lambda_{[XOR]} = (\hat{\lambda}_{S[X2]} - \hat{\lambda}_{S[X1]})$. Therefore, resonance wavelength $\hat{\lambda}_{S[X2]}$ is tuned to $\lambda_S$. Since $\hat{\lambda}_{S[X1]} \neq \lambda_S$, this leads to the transmission of the signal at $\lambda_S$ by the first device marked (1) and to its attenuation by the second device marked (2), hence Out='0'.

Figure 4.6: Nanocavity operating as (a) a 2-input XOR gate implemented using two cascaded nanocavities. (b), (c), and (d) are the gate transmissions for different inputs scenarios.

- $2 \times 1$ MUX: A $2 \times 1$ MUX is composed of a nanocavity resonating at $\hat{\lambda}_{S[MUX]}$ and controlled by the pump signal *Sel*, as illustrated in Figure 4.7(a). The pump signal allows selecting the input signal (i.e., $In_1$ or $In_2$) to be transmitted to the output *Out*. The selection is achieved by detuning the resonance of the nanocavity away from the required input signal. For this purpose, when no pump signal is injected (Sel='0'), the resonance wavelength of the nanocavity is aligned with $\lambda_{S[1]}$, i.e., the wavelength of $In_1$, hence signal $In_1$ is attenuated and signal $In_2$ is transmitted to the output, as shown in Figure 4.7(b), i.e., $Out=In_2$. When a pump signal is injected (Sel='1'), the nanocavity is detuned to $\lambda_{S[2]}$ ($\Delta\lambda_{[MUX]} = \lambda_{S[1]} - \lambda_{S[2]}$), thus leading to $Out=In_1$, as illustrated in Figure 4.7(c).



Figure 4.7: Nanocavity operating as (a) a 2×1 MUX. (b) and (c) MUX transmission.

The MUXs operate on multiple signals at different wavelengths and with multiple spacing. The nanocavities implementing MUXs thus need to be carefully defined, taking into account the resonant wavelength, the transmission bandwidth (i.e., Q factor) and the detuning. In the following, we propose a model estimating the wavelength detuning and the transmission of a nanocavity, taking into account key device parameters and the applied pump power.
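The behavior of the NOT gate, XOR gate and MUX described above can be checked with a simple behavioral sketch. It is not the model proposed in the next section: it assumes an ideal Lorentzian notch response for each cavity, a detuning toward shorter wavelengths that exactly equals the targeted shift, and hypothetical values for the signal wavelength, Q factor and resonance spacing.

```python
def notch_transmission(wavelength_nm, resonance_nm, q_factor):
    """Lorentzian notch (assumed form): 0 at resonance, close to 1 far from it."""
    x = 2.0 * q_factor * (wavelength_nm - resonance_nm) / resonance_nm
    return 1.0 - 1.0 / (1.0 + x * x)


# Hypothetical device values, for illustration only.
LAMBDA_S = 1550.0  # signal wavelength (nm)
Q_S = 4000.0       # Q factor around the signal resonance
SPACING = 2.0      # spacing between resonances / full detuning (nm)


def not_gate(pump_high):
    """Resonance is off lambda_S without pump and shifted onto it by the pump."""
    resonance = LAMBDA_S + SPACING - (SPACING if pump_high else 0.0)
    return notch_transmission(LAMBDA_S, resonance, Q_S)


def xor_gate(in1, in2):
    """Two cascaded cavities; each high pump shifts both by SPACING / 2."""
    shift = 0.5 * SPACING * (in1 + in2)
    res1 = LAMBDA_S - shift            # cavity (1), initially on lambda_S
    res2 = LAMBDA_S + SPACING - shift  # cavity (2), on lambda_S when both pumps are high
    return (notch_transmission(LAMBDA_S, res1, Q_S)
            * notch_transmission(LAMBDA_S, res2, Q_S))


def mux(in1_power, in2_power, sel):
    """Single cavity aligned with In1 (sel = 0) or detuned onto In2 (sel = 1)."""
    lam1, lam2 = LAMBDA_S, LAMBDA_S - SPACING
    resonance = lam2 if sel else lam1
    return (in1_power * notch_transmission(lam1, resonance, Q_S)
            + in2_power * notch_transmission(lam2, resonance, Q_S))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor_gate(a, b), 3))
```

Thresholding the normalized output power at 0.5 reproduces the truth tables of the three gates. In a real device the detuning depends on the pump power and on the Q factor at the pump resonance, which is what the model of the next section addresses.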

## 4.4 Nanocavity Model

We propose a model allowing to design nanocavity based logic gates. The model allows i) estimating the wavelength detuning  $(\Delta \lambda_{[gate]})$  according to the applied pump power  $(P_{[gate]})$ ; and ii) the calculation of signal transmission  $(T_{[gate]})$ . Table 4.1 summarizes the device parameters, where [gate] indicates the logic gate that is implemented using nanocavity, i.e., NOT, XOR, MUX, etc.

Input device parameters $\hat{\lambda}_{P[gate]}$ and FSR, shown in Figure 4.8, allow evaluating $\hat{\lambda}_{S[gate]}$ (mark (1)), when no pump power is applied. $Q_{P[gate]}$ (mark (2)) is obtained from $Q_{S[gate]}$ and $M_{[gate]}$, which depend on the fabrication process and the cavity layout (e.g., width and length). The optical tuning efficiency $(OTE_{[gate]})$ is obtained through device characterizations (mark (3)) and through linear extrapolation to a polynomial function (mark (4)), which requires the targeted device parameters. The detuning (mark (5)) is calculated by taking into account $Q_{P[gate]}$, the applied pump power $(P_{[gate]})$, and the $OTE_{[gate]}$. Finally, the transmission of the nanocavity is evaluated using Lorentzian approximation (mark (6)) [92].

| Parameter                 | Description                                                                                    | Unit |
|---------------------------|------------------------------------------------------------------------------------------------|------|
| $\hat{\lambda}_{P[gate]}$ | Resonance wavelength around pump signal (when no pump power is injected)                       | nm   |
| $\hat{\lambda}_{S[gate]}$ | Resonance wavelength around input signal (when no pump power is injected)                      | nm   |
| FSR                       | Free spectral range (FSR = $\hat{\lambda}_{P[gate]} - \hat{\lambda}_{S[gate]}$)                | nm   |
| $Q_{P[gate]}$             | Quality factor around $\hat{\lambda}_{P[gate]}$                                                | -    |
| $Q_{S[gate]}$             | Quality factor around $\hat{\lambda}_{S[gate]}$                                                | -    |
| $M_{[gate]}$              | Figure of merit ($M_{[gate]} = Q_{S[gate]} / Q_{P[gate]}$)                                     | -    |
| $OTE_{[gate]}$            | Optical tuning efficiency (the detuning of the nanocavity according to the applied pump power) | -    |

Table 4.1: Device parameters.

Figure 4.8: Proposed model.

We illustrate in Figures 4.9 and 4.10 two scenarios using our model: i) different  $Q_{S[gate]}$ /same  $M_{[gate]}$ ; and ii) same  $Q_{S[gate]}$ /different  $M_{[gate]}$ , respectively.

•  $M_{[gate]}=1$  leads to the same Q factor at the pump and input signal resonances, as illustrated in Figure 4.9(a) for  $Q_{P[gate]} = Q_{S[gate]} = 700$ , 1500, and 4000. The corresponding detuning  $(\Delta \lambda_{[gate]})$  of the cavity is plotted for pump power ranging from 0 to  $300\mu$W, as shown in Figure 4.9(b). As can be observed, the higher  $Q_{P[gate]}$ , the smaller the maximum detuning  $\Delta \lambda_{[gate]\_max}$ , which is due to the reduced coupling of the pump with the cavity. The transmission of the input signal at  $\lambda_S$  according to the applied power is shown in Figure 4.9(c). While 70% signal transmission can be obtained for all  $Q_{S[gate]}$ , the use of a high  $Q_{P[gate]}$  can reduce the pump power since the maximum transmission is reached earlier (50 $\mu$ W



**Figure 4.9:** (a) Transmission of nanocavity devices of  $(Q_{S[gate]} = 700, 1500, 4000, M_{[gate]} = 1)$ . (b) The corresponding wavelength detuning  $(\Delta \lambda_{[gate]})$  of (a). (c) The corresponding transmission of input signal according to the applied pump power.

and 270 $\mu$ W for  $Q_{S[gate]}$ =4000 and  $Q_{S[gate]}$ =700, respectively).

•  $M_{[gate]} \neq 1$  leads to  $Q_{P[gate]}=700$  and  $Q_{P[gate]}=2100$  for  $M_{[gate]}=1.5$  and  $M_{[gate]}=0.5$ , respectively, assuming  $Q_{S[gate]}=1050$  (Figure 4.10(a)). As can be seen in Figure 4.10(c), the maximum signal transmission reaches 0.3 and 0.8 for  $M_{[gate]}=0.5$  and  $M_{[gate]}=1.5$ , respectively. A high *ER* of the input signal can thus be reached for a high  $M_{[gate]}$ , which creates opportunities to reduce the data signal power.



Figure 4.10: (a) Transmission of nanocavity devices of  $(Q_{S[gate]}=1050, M_{[gate]}=1.5, 1, 0.5)$ . (b) The corresponding wavelength detuning  $(\Delta \lambda_{[gate]})$  of (a). (c) The corresponding transmission of input signal according to the applied pump power.

## 4.5 Proposed Edge Detection Filter Architecture

In this section, we investigate the design of a stochastic filter application using photonic nanocavities. Detecting edges in an image can be implemented using first derivatives by sliding two dimensional filters over the pixels. The application of the filters involves subtracting and adding the input pixels with each other. In SC, absolute value subtraction and addition can be implemented using XOR gates and MUXs, respectively. The implementation of the gates in the optical domain has been discussed in the previous section. We then discuss the main design challenges related to computing accuracy and energy consumption.

#### 4.5.1 Architecture Overview

The architecture we propose is generic and characterized by a size N. It is composed of one stage of  $2^N$  XOR gates (for the subtraction) followed by N MUX stages (for the addition). Each MUX stage is composed of  $2^N/2^n$  MUXs, where n is the stage position in the addition tree  $(1 \le n \le N)$ .

#### 1. Design Patterns

The architecture involves the following design patterns:

- Two XOR gates followed by a MUX allow implementing a sub-sum function. As illustrated in Figure 4.11(a), two input signals at  $\lambda_{S[i]}$  and  $\lambda_{S[i+1]}$  are injected into  $XOR_{[i]}$  and  $XOR_{[i+1]}$ , respectively (mark (1) in the figure), where *i* is the position of the XOR in the range  $1 \leq i \leq 2^N$ . For each gate, the transmission of the input signal to the output is controlled by a pump signal (mark (2)) generated by an SNG (mark (3)), as detailed later. The multiplexer  $MUX_{[j_1,1]}$  receives the signals transmitted through the XORs (mark (4)), where  $[j_1, 1]$  is the MUX at position  $j_1$  in stage n = 1 and  $1 \leq j_1 \leq 2^N/2$ . Depending on the pump signal generated from SNG<sub>5</sub> (mark (5)), the multiplexer either transmits the signal at  $\lambda_{S[i]}$  or  $\lambda_{S[i+1]}$ .
- Three MUXs allow implementing a sum function, as shown in Figure 4.11(b). The aim of the MUXs is to sum signals propagating at several wavelengths: a MUX at stage n receives two sets of  $2^n/2$  signals (marks (6) and (7)) and outputs a



Figure 4.11: The OC architecture of edge detection filters: a) proposed design pattern to implement subtraction and addition using XOR gates and MUXs; b) design pattern to implement a tree adder; c) architecture for the  $3 \times 3$  Sobel operator example.

single set of  $2^n$  signals. For example, each input of the MUX at n=3 is composed of 4 signal wavelengths and its output is composed of 8 wavelengths. In this design, only one signal propagates to the output; the other signals are filtered by the MUXs. However, the number of wavelengths that can potentially carry the signal increases with the MUX stage. This calls for a MUX design that takes into account the number of signals to process and the distance between the wavelengths.

#### 2. Sobel Filter Architecture Example

Figure 4.11(c) illustrates the design of a Sobel filter, where a  $3\times3$  window slides over the entire image to compute the gradient vector of the image. As shown in the figure, the design patterns are repeated throughout the architecture (see the blue and green dashed boxes). Each XOR receives two input pixels as pump signals, thus leading to a subtraction. The resulting signals propagate to the MUXs (that implement an adder tree) and the output signal is transmitted to the photodetector. In order to keep the architecture symmetrical, we duplicate the input pixels for which coefficients 2 and -2 are applied in the Sobel filter. For instance,  $XOR_{[7]}$  and  $XOR_{[8]}$  are duplicated from  $XOR_{[5]}$  and  $XOR_{[6]}$ , respectively. In the optical domain, the design of the architecture requires i) eight lasers (i.e., one per XOR gate) emitting input signals at different wavelengths; and ii) 23 pump lasers (i.e., two per XOR gate and one per MUX).
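The component counts quoted above follow directly from the architecture size N. As an illustration, a minimal Python sketch (the function and key names are hypothetical, not taken from the thesis) reproduces the figures for the Sobel example, assuming N=3, one input laser per XOR gate, two pump lasers per XOR gate and one pump laser per MUX:

```python
def architecture_counts(N):
    """Component counts for an edge detection architecture of size N."""
    n_xor = 2 ** N                                   # one stage of 2^N XOR gates
    mux_per_stage = [2 ** N // 2 ** n for n in range(1, N + 1)]
    n_mux = sum(mux_per_stage)
    return {
        "xor_gates": n_xor,
        "mux_per_stage": mux_per_stage,              # stage n holds 2^N / 2^n MUXs
        "mux_total": n_mux,
        "input_lasers": n_xor,                       # one per XOR gate
        "pump_lasers": 2 * n_xor + n_mux,            # two per XOR gate, one per MUX
    }


sobel = architecture_counts(3)
print(sobel)
```

For N=3 this yields 8 XOR gates, MUX stages of 4, 2 and 1 devices, 8 input lasers and 23 pump lasers, matching the numbers stated for Figure 4.11(c).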

#### 3. Stochastic Number Generators (SNG)

The cavities are controlled by pump signals corresponding to stochastic numbers. As illustrated in Figure 4.12, different SNGs are used for the XOR gates and the MUXs. However, in the proposed architecture, the same LFSR is used for the SNGs of all logic gates. The operation of the SNG according to the logic gates is detailed as follows:

• As shown in Figure 4.12(a), the XOR gates require the LFSR-based SNG with a modulator, introduced in Chapter 2, where each SNG has its own comparator and modulator. The pump signal  $(P_{[XOR,i]})$  emitted from an off-chip laser is either injected into the nanocavity or modulated depending on the value of the bit in the bit streams that controls the modulators. In order to avoid crosstalk, each



Figure 4.12: SNGs for (a) XOR gates and (b) MUXs.

pump signal uses a dedicated wavelength. To generate correlated inputs, the same LFSR is used to generate the bit streams inputs for all XOR gates.

• As shown in Figure 4.12(b), the selection line of the MUX only requires the generation of bit streams with the same number of zeros and ones (probability of 0.5) to generate  $P_{[MUX,j_n,n]}$  values. For this purpose, a modulator is directly controlled by a bit in the LFSR, i.e., no comparator is needed. In order to reduce the area and power overhead, the same LFSR (used for the XOR gates) is used to control several MUXs. This can be achieved without loss of accuracy by selecting bits at different positions.
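The sharing of a single LFSR between the XOR and MUX SNGs can be illustrated in software. The following Python sketch is purely behavioral and rests on several assumptions that are not taken from the thesis: an 8-bit Fibonacci LFSR with taps 8, 6, 5, 4 (a commonly tabulated maximal-length choice), a bit stream covering one full LFSR period (255 bits), 8-bit pixel values, and the least significant LFSR bit as the MUX selection line. All function names are hypothetical.

```python
def lfsr8_states(seed=0x01, length=255):
    """Yield successive states of an 8-bit Fibonacci LFSR (assumed taps 8, 6, 5, 4).

    The seed must be non-zero; the register then visits all 255 non-zero states.
    """
    state = seed & 0xFF
    for _ in range(length):
        yield state
        fb = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
        state = ((state << 1) | fb) & 0xFF


def sng_bit(state, value):
    """Comparator-based SNG: emit 1 when the LFSR state is below the value."""
    return 1 if state < value else 0


def sub_sum(a, b, c, d, sel_bit=0, length=255):
    """Estimate (|a-b| + |c-d|) / 2, normalized to [0, 1], for 8-bit inputs.

    Both XORs use the same LFSR state, so their inputs are correlated and
    each XOR output encodes an absolute difference. The MUX selection line
    is a raw LFSR bit (no comparator), i.e., a probability of about 0.5.
    """
    ones = 0
    for state in lfsr8_states(length=length):
        x1 = sng_bit(state, a) ^ sng_bit(state, b)   # stream for |a - b|
        x2 = sng_bit(state, c) ^ sng_bit(state, d)   # stream for |c - d|
        sel = (state >> sel_bit) & 1
        ones += x1 if sel else x2
    return ones / length


estimate = sub_sum(200, 50, 120, 100)
expected = (abs(200 - 50) + abs(120 - 100)) / 2 / 256
print(estimate, expected)
```

The estimate approaches the scaled sum of absolute differences because correlated inputs make each XOR output equal to 1 exactly when the shared random value falls between the two compared pixels.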

#### 4. Transmission Spectrum and Device Characteristics

As previously explained, the number of signals crossing the cavities increases with the stages. Figure 4.13 illustrates transmission examples corresponding to the architecture in Figure 4.11(c), where eight signals propagate using eight wavelengths. As detailed in the following, i) the distance between the wavelengths; and ii) the Q factor are key design parameters as they directly impact crosstalk and switching energy:

•  $WLS_n$  corresponds to the wavelength spacing at stage n of the MUX. The wavelengths are then regularly spaced following a hierarchy that suits the MUX tree.


Figure 4.13: The transmission of two XOR gates and one MUX per stage.

In the example,  $WLS_1$  is the distance between two consecutive signals in the first MUX stage, e.g., between  $\lambda_{S[1]}$  and  $\lambda_{S[2]}$ ,  $\lambda_{S[3]}$  and  $\lambda_{S[4]}$ , etc.  $WLS_2$  is the distance between two consecutive sets of wavelengths in the second stage, e.g., between  $\{\lambda_{S[1]}, \lambda_{S[2]}\}$  and  $\{\lambda_{S[3]}, \lambda_{S[4]}\}$ ,  $\{\lambda_{S[5]}, \lambda_{S[6]}\}$  and  $\{\lambda_{S[7]}, \lambda_{S[8]}\}$ , etc.

•  $Q_{S[gate,n]}$  corresponds to the cavity Q factor at stage n. Indeed, assuming the same Q factor for all cavities in a stage allows using the same laser power per stage. Moreover, we assume both XOR gates and the MUXs in the first stage to have the same Q factor. We define  $Q_{S[XOR]}$ , without n, as the Q factor of the XOR gate around the input signal. Moreover, as the wavelength distance between signals to be multiplexed increases, the bandwidth of the cavity increases (i.e.,  $Q_{S[MUX,n]} > Q_{S[MUX,n+1]}$ ).

To summarize, the design of the proposed architecture involves exploring numerous parameters, such as laser powers, wavelength distances and Q factors. In the following, we further discuss their optimization with respect to computing accuracy and power consumption.

### 4.5.2 Design Challenges

The design of such an architecture involves the optimization of computing accuracy, power consumption and processing time. The following summarizes key technological and system-level parameters we consider for the optimization of the architecture:

- **BSL and BER:** computing accuracy depends on *BSL* (stochastic domain specific) and *BER* (optical domain specific). While increasing the *BSL* and reducing the *BER* both increase the power consumption, reducing the *BER* should be preferred, since it can be achieved without impacting the processing time.
- Input signal power: the architecture is composed of cascaded gates, which results in signal attenuation. In order to ensure a proper operation of the design, an input signal should be injected with a high enough optical power (typically  $3\mu W$ to  $10\mu W$ ).
- Pump signal power: it controls the wavelength detuning of the nanocavity and ranges from  $100\mu$ W to 10mW scale. To prevent the input signal from detuning the cavity, we assume that its power should not exceed 10% of the pump power.

• Wavelength spacing: it impacts the power consumption as follows: small WLS increases crosstalk and hence results in high *BER*. This requires high laser powers for the input signals to overcome the crosstalk. On the contrary, larger *WLS* contributes to a reduction in input signal power but calls for higher pump power to cover the larger wavelength detuning.

## 4.6 Implementation and Model

In this section, we present an analytical model to evaluate the error induced from the SC technique and the optical transmission. Moreover, we develop a transmission model for the edge detection filter to estimate the power consumption. We also define the required design parameters for the exploration methodology.

### 4.6.1 Error Evaluation

Two types of errors are considered: i) errors related to the SC domain; and ii) errors related to the optical domain, as discussed in the following:

•  $ED_{BSL}$ : an error distance induced by the approximation when generating stochastic bit streams. This error is defined as:

$$ED_{BSL} = |\acute{Y} - Y| \tag{4.5}$$

where Y is the error-free result and  $\acute{Y}$  is the approximated result for a given BSL.

•  $ED_{Trans}$ : an error distance induced by the optical transmission. It is given as:

$$ED_{Trans} = |\acute{\acute{Y}} - \acute{Y}| \tag{4.6}$$

where  $\acute{\acute{Y}}$  is the approximated result considering both the given BSL (related to  $\acute{Y}$ ) and the BER. As a result, the total error (worst-case error) can be defined as:

$$ED_{Total} = ED_{BSL} + ED_{Trans} \tag{4.7}$$

We use PSNR as a metric to evaluate the computing accuracy when processing an image as follows:

$$PSNR_{Total} = 10 \times \log_{10} \left(\frac{MAX_I^2}{MSE_{Total}}\right) \tag{4.8}$$

where  $MAX_I$  is the maximum pixel value in the error-free image, defined as 255 for 8-bit pixels.  $MSE_{Total}$  is the Mean Square Error given as:

$$MSE_{Total} = \frac{1}{M \times K} \sum_{i=1}^{M} \sum_{j=1}^{K} ED_{Total}(i, j)^{2} \tag{4.9}$$

where M and K are the number of rows and columns in the image, respectively.  $ED_{Total}(i,j)$  is the total error distance from processing a pixel at position (i,j) in the image.
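Equations 4.5 to 4.9 can be illustrated with a short Python sketch. It is only an illustration: the function names and the arguments `y_bsl` (result approximated with a given BSL) and `y_trans` (result additionally affected by transmission errors) are hypothetical, and images are assumed to be given as lists of rows.

```python
import math


def ed_total(y, y_bsl, y_trans):
    """Worst-case total error distance for one pixel (Equations 4.5 to 4.7)."""
    ed_bsl = abs(y_bsl - y)            # error due to the finite bit stream length
    ed_trans = abs(y_trans - y_bsl)    # error due to the optical transmission
    return ed_bsl + ed_trans


def psnr_total(errors, max_i=255):
    """PSNR in dB from an M x K grid of total error distances (Equations 4.8, 4.9)."""
    rows, cols = len(errors), len(errors[0])
    mse = sum(e * e for row in errors for e in row) / (rows * cols)
    if mse == 0:
        return math.inf                # identical images: PSNR is unbounded
    return 10 * math.log10(max_i ** 2 / mse)


# Toy 2 x 2 example with hypothetical pixel values (error-free, BSL, BSL + BER).
pixels = [[(100, 101, 102), (50, 50, 49)],
          [(200, 198, 198), (0, 1, 1)]]
errors = [[ed_total(*p) for p in row] for row in pixels]
print(errors, psnr_total(errors))
```

Because the two error distances are added before squaring, the resulting PSNR is a pessimistic (worst-case) figure, in line with the definition of the total error.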

### 4.6.2 Edge Detection Transmission Model

In order to estimate the *BER* of the architecture, we need to define the transmission of the signals. As defined in Section 4.5, an edge detection architecture of size N is composed of  $2^N$  XOR gates, where each gate is designed using two nanocavities connected in series. Each XOR gate transmits one of the  $2^N$  input signals through N MUXs. The transmission  $(T_{[i]})$  of input signal i, propagating at  $\lambda_{S[i]}$  through the two nanocavities of the XOR gate and N MUXs, is given as:

$$T_{[i]} = \underbrace{T_{[X1]}(\lambda_{S[i]}, \hat{\lambda}_{S[X1,i]}, P_{\{1,2\}[XOR,i]})}_{\text{Transmission through the first cavity in XOR gate}} \times \underbrace{T_{[X2]}(\lambda_{S[i]}, \hat{\lambda}_{S[X2,i]}, P_{\{1,2\}[XOR,i]})}_{\text{Transmission through the second cavity in XOR gate}} \times \underbrace{\prod_{n=1}^{N} T_{[MUX]}(\lambda_{S[i]}, \hat{\lambda}_{S[MUX,j_n,n]}, P_{[MUX,j_n,n]})}_{\text{Transmission through N MUXs}} \tag{4.10}$$

where  $j_n = \lceil i/2^n \rceil$  is the MUX position in stage n and  $1 \leq j_n \leq 2^N/2^n$ . From the signal transmission, SNR is calculated as follows:

$$SNR = OLP_{Input} \times \frac{R}{I} \times \left(T_{[i]} - \sum_{\substack{k=1\\k \neq i}}^{M} T_{[k]}\right) \tag{4.11}$$

where  $OLP_{Input}$  is the laser power of the input signal at  $\lambda_{S[i]}$  injected into the XOR gate, R and I are the photodetector responsivity and internal noise, respectively, and  $M = 2^N$ .  $T_{[i]}$ , in this case, is the transmission of signal i as '1', while the other crosstalk signals k are transmitted as '0'.  $T_{[k]}$  is the transmission of the crosstalk signals k as '1' while signal i is transmitted as '0'. The *BER* assuming On-Off Keying (*OOK*) modulation of the input signals is given in Equation 2.18.
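As a numerical illustration of Equation 4.11, the Python sketch below evaluates the SNR for one signal and its crosstalk terms. The function name and all numerical values (responsivity, noise, transmissions) are hypothetical placeholders chosen for readability; they are not device data from the thesis.

```python
def snr(olp_input, responsivity, noise, t_signal, t_crosstalk):
    """SNR following Equation 4.11.

    olp_input    -- input laser power (W)
    responsivity -- photodetector responsivity R (A/W)
    noise        -- photodetector internal noise I (A)
    t_signal     -- transmission T[i] of the selected signal sent as '1'
    t_crosstalk  -- transmissions T[k] of the other signals (k != i)
    """
    return olp_input * (responsivity / noise) * (t_signal - sum(t_crosstalk))


# Hypothetical values: 3 uW input, R = 1 A/W, I = 0.1 uA,
# 40% signal transmission and seven crosstalk signals at 1% each (N = 3).
value = snr(3e-6, 1.0, 1e-7, 0.4, [0.01] * 7)
print(value)
```

The expression makes the design trade-off explicit: crosstalk transmissions subtract directly from the useful signal, so a smaller wavelength spacing must be compensated by a higher input laser power.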

### 4.6.3 Nanocavity Design Parameters

The evaluation of  $T_{[i]}$  depends on  $\lambda_{S[i]}$ ,  $\hat{\lambda}_{S[gate]}$ , and  $P_{[gate]}$  parameters, which we define in the following:

• Signal Wavelengths, Cavity Resonances and Spacing: As previously explained,  $WLS_n$  corresponds to the shifting distance of the cavities located in stage n. Based on Figure 4.13, we assume  $WLS_3 > WLS_2 > WLS_1$ . In the XOR stage, each gate operates on a signal propagating at  $\lambda_{S[i]}$ , where *i* is the input row number  $(1 \leq i \leq 2^N)$ . We set the baseline wavelength  $\lambda_{S[1]}$  to 1542nm (i.e., the signal used to propagate the top input signal in Figure 4.11(c)). The subsequent signal wavelengths are assigned as follows:

$$\lambda_{S[i]} = \lambda_{S[1]} - \sum_{n=1}^{N} \left( \left\lfloor \frac{i-1}{2^{n-1}} \right\rfloor \bmod 2 \right) \times WLS_n \tag{4.12}$$

For each XOR gate, we set the first and second resonance (i.e.,  $\hat{\lambda}_{S[X1,i]}$  and  $\hat{\lambda}_{S[X2,i]}$ ) according to the signal wavelength  $\lambda_{S[i]}$  and the assumed detuning  $\Delta \lambda_{[XOR]}$ :

$$\hat{\lambda}_{S[X1,i]} = \lambda_{S[i]} \tag{4.13}$$

$$\hat{\lambda}_{S[X2,i]} = \hat{\lambda}_{S[X1,i]} + \Delta \lambda_{[XOR]} \tag{4.14}$$

The resonance at rest of each MUX is defined by the mean wavelength of the first set of input signals:

$$\hat{\lambda}_{S[MUX,j_n,n]} = \frac{\lambda_{S[2^n(j_n-1)+1]} + \lambda_{S[2^n(j_n-1)+2^{n-1}]}}{2} \tag{4.15}$$

where  $j_n$  is the MUX position in stage n  $(1 \leq j_n \leq 2^N/2^n)$ .

• **Pump Power:** we assume the same pump laser powers  $(OLP_P)$  injected into the cavities located in the same stage. The pump powers received by XOR gates are defined by:

$$P_{\{1,2\}[XOR,i]} = \begin{cases} OLP_{P[XOR]} \times IL_{\%}, & z_v = 1\\ OLP_{P[XOR]} \times IL_{\%} \times ER_{\%}, & z_v = 0 \end{cases} \tag{4.16}$$

where  $z_v$  is the current bit of the stochastic bit stream of an input pixel of the XOR gate. The pump powers received by the MUXs are given as:

$$P_{[MUX,j_n,n]} = \begin{cases} OLP_{P[MUX,n]} \times IL_{\%}, & \text{LFSR bit}_{n-1} = 1\\ OLP_{P[MUX,n]} \times IL_{\%} \times ER_{\%}, & \text{LFSR bit}_{n-1} = 0 \end{cases} \tag{4.17}$$

To ensure that the input power signal does not contribute to the detuning of the nanocavity, we set the maximum power of the input signal to 10% of the cavity pump power.

- Algorithm: We summarize the steps we follow to explore the design space as follows:
  - 1. Define the input parameters: figures of merit  $(M_{[gate]})$ , wavelength of the input signal  $(\lambda_{S[1]})$ , and targeted *BER* at the photodetector.
  - 2. From the experimental results, use  $Q_{S[gate]}$ ,  $Q_{P[gate]}$ ,  $\hat{\lambda}_{S[gate]}$  and  $\hat{\lambda}_{P[gate]}$  to calibrate the PhC nanocavity model. Validate that the modeled and measured transmissions are well correlated.
  - 3. For the XOR gate design, explore  $\Delta \lambda_{[XOR]}$  and  $Q_{S[XOR]}$  to minimize the laser power. This requires setting the resonance wavelengths of the XOR gate,  $\hat{\lambda}_{S[X1,1]}$  and  $\hat{\lambda}_{S[X2,1]}$ , according to Equations 4.13 and 4.14, respectively.
  - 4. For the MUX design, iterate from stage 1 to N to:
    - (a) Set the resonance wavelength of  $MUX_{[1,1]}$  to  $\lambda_{S[1]}$  (Equation 4.15).
    - (b) Explore  $WLS_1$  and  $Q_{S[MUX,1]}$  to minimize the BER at the output stage, and select the desired BER. This allows defining  $\lambda_{S[2]}$  according to Equation 4.12 and the resonance wavelengths of  $XOR_{[2]}$  and  $MUX_{[1,2]}$  according to Equations 4.13 - 4.15.
    - (c) Repeat step 4.b to explore  $WLS_2$  and  $Q_{S[MUX,2]}$ . By selecting a BER,  $\lambda_{S[3]}$  and  $\lambda_{S[4]}$  are now evaluated using the corresponding  $WLS_2$  (Equation 4.12). Accordingly, the resonance wavelengths of  $XOR_{[3]}$ ,  $XOR_{[4]}$ , and  $MUX_{[2,1]}$  are defined (Equations 4.13 - 4.15).
    - (d) Repeat step 4.b again for the next stage until stage N. At this point, all WLS are defined. This allows calculating the wavelengths of the remaining input signals and the resonance wavelengths of the remaining devices.
  - 5. According to the input laser powers and pump laser powers, estimate the energy per bit (Equations 4.16 and 4.17).
  - 6. Process an image and evaluate the application-level computing accuracy for a given BSL and input laser powers (Equations 4.5 - 4.9).
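The wavelength plan defined by Equations 4.12 and 4.15 can be checked numerically. The following Python sketch is illustrative only: the function names are hypothetical, it assumes that the divisor in Equation 4.12 is $2^{n-1}$ (so that each bit of $i-1$ selects one spacing $WLS_n$), and it assumes that the index in Equation 4.15 is the MUX position $j_n$. The spacings used in the example (0.215nm, 1.19nm and 4.35nm) are the ones selected in Section 4.7.3.

```python
def signal_wavelengths(lambda_1, wls):
    """Return [lambda_S[1], ..., lambda_S[2^N]] in nm, with wls = [WLS_1, ..., WLS_N]."""
    n_stages = len(wls)
    wavelengths = []
    for i in range(1, 2 ** n_stages + 1):
        # Bit (n-1) of (i-1) decides whether the spacing WLS_n is subtracted.
        shift = sum((((i - 1) >> (n - 1)) & 1) * wls[n - 1]
                    for n in range(1, n_stages + 1))
        wavelengths.append(lambda_1 - shift)
    return wavelengths


def mux_resonance(wavelengths, j, n):
    """Resonance at rest of the MUX at position j in stage n.

    Mean of the first and last wavelengths of its first input set.
    """
    first = wavelengths[2 ** n * (j - 1)]                      # lambda_S[2^n (j-1) + 1]
    last = wavelengths[2 ** n * (j - 1) + 2 ** (n - 1) - 1]    # lambda_S[2^n (j-1) + 2^(n-1)]
    return (first + last) / 2


lams = signal_wavelengths(1542.0, [0.215, 1.19, 4.35])
print([round(w, 3) for w in lams])
print(round(mux_resonance(lams, 1, 1), 4), round(mux_resonance(lams, 1, 2), 4))
```

Under these assumptions the eight wavelengths come out as 1542, 1541.785, 1540.81, 1540.595, 1537.65, 1537.435, 1536.46 and 1536.245nm, and the first-stage MUX at position 1 rests at the baseline wavelength.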

# 4.7 Simulation Results

In this section, we target a NOT gate of a given Q factor and compare the transmission and detuning using our proposed model and the experimental characteristics provided by Thales in France. We evaluate the laser powers for a NOT gate and present the valid range of wavelength detuning. We introduce the design of XOR gate and MUX by exploring the design space in each stage. We process an image using the proposed architecture and we evaluate the computing accuracy, energy consumption and processing time.

### 4.7.1 Model Calibration

In the following, we detail the model calibration according to the experimental results for a NOT gate. As can be observed from the transmission results reported in Figure 4.14(a), the gate is characterized by resonance wavelengths at  $\hat{\lambda}_{S[NOT]}=1592.5$ nm (around the input signal) and  $\hat{\lambda}_{P[NOT]}=1568.8$ nm (around the pump signal), which leads to FSR $\approx$ 24nm. At the  $\hat{\lambda}_{S[NOT]}$  and  $\hat{\lambda}_{P[NOT]}$  resonances, the 3dB bandwidth of the nanocavity is 1.44nm and 0.65nm, respectively, which gives  $M_{[NOT]} \approx 0.5$ . We calibrate the model using these parameters and, as can be seen in the figure, a good correlation is obtained.
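The calibration figures above can be cross-checked with a few lines of Python. This is a sketch under one stated assumption: the Q factor is taken as the resonance wavelength divided by the 3dB bandwidth, which is the usual definition but is not spelled out in this section. The function name is hypothetical.

```python
def q_factor(resonance_nm, bandwidth_3db_nm):
    """Q factor from resonance wavelength and 3dB bandwidth (assumed definition)."""
    return resonance_nm / bandwidth_3db_nm


lam_s, lam_p = 1592.5, 1568.8      # measured resonances (nm)
q_s = q_factor(lam_s, 1.44)        # quality factor around the input signal
q_p = q_factor(lam_p, 0.65)        # quality factor around the pump signal
m = q_s / q_p                      # figure of merit M = Q_S / Q_P
fsr = abs(lam_s - lam_p)           # spacing between the two resonances (nm)
print(round(q_s), round(q_p), round(m, 2), round(fsr, 1))
```

The ratio evaluates to about 0.46 and the resonance spacing to 23.7nm, consistent with the rounded values of 0.5 and 24nm.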

Figure 4.14(b) shows the measured nonlinear cavity detuning  $(\Delta\lambda_{[NOT]})$  corresponding to pump power ranging from 0 to 250 $\mu$ W for a cavity Q factor=700. Depending on the Q factor and the material used, these numbers might change. In fact, the resonator here has been designed for maximized speed, hence low Q, trading off with energy efficiency. A different balance would target an order of magnitude larger Q. Figure 4.14(c) illustrates the transmission of the cavity at  $\hat{\lambda}_{S[NOT]}$  under 178 $\mu$ W pump power. This leads to around 1.6nm blue shift of the resonance, which we observe for both measurement and model, thus validating the calibration.

In the following, we explore the impact of the signal detuning  $(\Delta \lambda_{[NOT]} = \hat{\lambda}_{S[NOT]} - \lambda_S)$  on the laser powers, where  $\hat{\lambda}_{S[NOT]}$  is the cavity resonance at rest. We consider a nanocavity with  $Q_{S[NOT]}=2000$ ,  $M_{[NOT]}=2$  and  $\hat{\lambda}_{S[NOT]}=1542$ nm. In Figure 4.15(a), we assume transmission scenarios for  $\Delta \lambda_{[NOT]}=0.05$ nm, 0.1nm, 0.19nm, and 0.35nm. Two optical signals are injected:  $OLP_{Input}$  and  $OLP_P$  correspond to the optical power of the input signal and the pump signal, respectively. As illustrated in Figure 4.15(a),  $\Delta \lambda_{[NOT]}=0.05$ nm (mark (1)) requires the lowest  $OLP_P$  value due to the small shift in the resonant wavelength. On the other hand, this results in a rather low



Figure 4.14: Characterization results and model calibration: (a) Transmission when no pump power is applied, (b) wavelength detuning according to the average pump power, and (c) transmission when a pump power is injected.

0.7dB ER, which is compensated by using a high  $OLP_{Input}$  value. Higher  $\Delta\lambda_{[NOT]}$  values, such as 0.1nm (mark (2)), 0.19nm (mark (3)), and 0.35nm (mark (4)), increase the ER to 1.7dB, 4.3dB, and 6.9dB, respectively. This contributes to a lower  $OLP_{Input}$  but requires a higher  $OLP_P$  due to the larger wavelength detuning distance.



Figure 4.15: For a nanocavity of  $Q_{S[NOT]} = 2000$  and  $M_{[NOT]} = 2$ : (a) The transmission assuming  $\Delta \lambda_{[NOT]} = 0.05, 0.1, 0.19$ , and 0.35nm. (b) Laser powers for  $\Delta \lambda_{[NOT]}$  ranging from 0 to 0.5nm.

To further explore the design space, we investigate the power consumption of the design by considering the laser powers, i.e.,  $OLP_{Input}$  and  $OLP_P$ . We assume  $BER = 10^{-1}$  and  $\Delta \lambda_{[NOT]}$  ranging from 0 to 0.5nm. We define the valid range as the range in which  $OLP_{Input}$  accounts for 10% or less of  $OLP_P$ . As can be seen in Figure 4.15(b), the power consumption is dominated by  $OLP_{Input}$  for  $\Delta\lambda_{[NOT]} < 0.1$ nm. At  $\Delta\lambda_{[NOT]} = 0.05$ nm (mark (1)), we obtain  $OLP_P=2.9\mu W$  and  $OLP_{Input}=19.1\mu W$  (for a total power of  $22\mu W$ ). This implies an input signal power (injected by  $OLP_{Input}$ ) exceeding 10% of the pump signal power (injected by  $OLP_P$ ). Therefore,  $\Delta\lambda_{[NOT]}=0.05$ nm is an invalid option. Although  $\Delta \lambda_{[NOT]} = 0.1$ nm (mark (2)) leads to the optimal total power consumption, it is not a valid design option either, since  $OLP_{Input}$  accounts for 39% of the total power received by the cavity. From  $\Delta \lambda_{[NOT]} = 0.19$ nm (mark (3)) to  $\Delta \lambda_{[NOT]} = 1.13$ nm, the design becomes valid but leads to a power overhead. Here, the power is dominated by  $OLP_P$  due to the large wavelength distance needed to reach the input signal. For example,  $\Delta \lambda_{[NOT]} = 0.35$ nm (mark (4)) involves  $OLP_P = 33.9 \mu$ W and  $OLP_{Input} = 0.7 \mu$ W, which increases the power consumption by  $2.7 \times$  compared to the optimal  $\Delta \lambda_{[NOT]}$ . Each nanocavity of a given  $Q_{S[NOT]}$  has a unique range of wavelength detuning that varies between 0 and  $\Delta \lambda_{[NOT]\_max}$ . However, the minimum detuning is determined by the ratio of the injected input signal power to the pump signal power. In the following, we explore the power consumption in the design of XOR gates considering nanocavities of different Q factors.
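The validity criterion can be expressed as a simple check. The Python sketch below is illustrative: the function and variable names are hypothetical, and only the two design points whose laser powers are quoted in the text (marks (1) and (4)) are evaluated.

```python
def is_valid(olp_input, olp_p, max_ratio=0.10):
    """A design point is valid when the input power is at most 10% of the pump power."""
    return olp_input <= max_ratio * olp_p


# Detuning (nm) -> (OLP_P, OLP_Input) in microwatts, as quoted in the text.
design_points = {
    0.05: (2.9, 19.1),
    0.35: (33.9, 0.7),
}

report = {}
for detuning, (olp_p, olp_input) in design_points.items():
    report[detuning] = {
        "total_uW": round(olp_p + olp_input, 1),
        "valid": is_valid(olp_input, olp_p),
    }
print(report)
```

The 0.05nm point is rejected because the input signal is far stronger than the pump, whereas the 0.35nm point is valid at the cost of a higher total laser power (34.6µW instead of 22µW).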

#### 4.7.2 Design of XOR Gate

As previously defined, an XOR gate is composed of two cascaded nanocavities with the same Q factor but with resonances separated by  $\Delta \lambda_{[XOR]}$ . We assume  $M_{[XOR]}=2$  and  $Q_{S[XOR]}=[2000; 3500; 5000; 8000]$ . Figure 4.16(a) illustrates the total power



Figure 4.16: Total laser powers of the XOR gate assuming  $BER = 10^{-1}$ ,  $M_{[XOR]} = 2$  and: (a)  $Q_{S[XOR]} = 2000$ , 3500, 5000, and 8000. (b)  $Q_{S[XOR]}$  ranging from 1 to 10000.

consumption for  $\Delta\lambda_{[XOR]}$  ranging from 0 to 1nm and for a targeted  $BER = 10^{-1}$ . As can be seen in the figure,  $Q_{S[XOR]} = 8000$  and 2000 lead to a valid  $\Delta\lambda_{[XOR]}$  range of [0.17-0.28]nm and [0.45-1.13]nm, respectively, and involve a total power consumption ranging from  $39\mu$W to  $94\mu$W and from  $104\mu$W to  $276\mu$W, respectively. Hence, the lower  $Q_{S[XOR]}$ , the larger the valid range of  $\Delta\lambda_{[XOR]}$  and the higher the power overhead. As can also be observed from the figure, a total power of 82.5 $\mu$W can be obtained for  $Q_{S[XOR]}$ =8000, 5000, and 3500 under  $\Delta\lambda_{[XOR]}$ =0.27nm, 0.335nm, and 0.365nm, respectively (see mark (1)). This demonstrates that the same power efficiency can be obtained for different cavities ( $Q_{S[XOR]}$ ) and wavelength detunings ( $\Delta\lambda_{[XOR]}$ ).

In the following, we explore  $Q_{S[XOR]}$  and  $\Delta\lambda_{[XOR]}$  with the aim of finding design parameters that minimize the XOR power consumption. The results are reported in Figure 4.16(b). For the sake of clarity, the design parameters corresponding to the cavities detailed in Figure 4.16(a) are highlighted in Figure 4.16(b) (mark (1)). As a first observation, we note that the higher  $Q_{S[XOR]}$  and the lower  $\Delta\lambda_{[XOR]}$ , the lower the power consumption, which is due to the reduced amount of energy needed to shift the cavity. Overall, the laser power consumption of the cavities ranges from  $34.7\mu$W (at  $\Delta\lambda_{[XOR]}$ = 0.14nm and  $Q_{S[XOR]}$ =10000) to  $398.2\mu$W (at  $\Delta\lambda_{[XOR]}$ =1nm and  $Q_{S[XOR]}$ =2000). As discussed earlier, we use the same parameters for the cavities located in the XOR stage and the first MUX stage. In the following, we explore the remaining design parameters for the MUX stages.

### 4.7.3 Design of MUX

In the following, we explore the MUX design parameters. For this purpose, we target a  $BER = 5 \times 10^{-1}$  at the photodetector, which corresponds to BER at stage n=3 of the MUX ( $BER_{[MUX,3]}$ ), and we explore the design space from the first stage to the last stage, by defining the inter-stage BER to be reached. We use the corresponding parameters ( $Q_{S[MUX,n]}$ ,  $WLS_n$ ) from stage n to explore the design space of stage n+1.

• Stage n=1: we assume  $3\mu$W input signal powers  $(OLP_{Input})$  injected into the XOR gates. We also assume the following ranges for the Q factor and  $WLS_1$ : 1 <  $Q_{S[MUX,1]} < 10000$  and 0 <  $WLS_1 < 1.2$ nm. As shown in Figure 4.17(a), the exploration results in  $BER_{[MUX,1]}$  ranging between  $10^{-4}$  and  $5 \times 10^{-1}$ . As can be seen, a high  $Q_{S[MUX,1]}$  leads to more accurate designs. For example,  $Q_{S[MUX,1]}=10000$  and 5000 result in  $BER_{[MUX,1]}=[10^{-4} - 4 \times 10^{-4}]$  and  $[4 \times 10^{-4} - 5 \times 10^{-2}]$ , respectively. Moreover, the higher  $WLS_1$ , the lower  $BER_{[MUX,1]}$ , which is due to the reduced crosstalk. We choose  $Q_{S[MUX,1]}=10000$  and  $WLS_1=0.215$ nm, which lead to the lowest possible BER for the covered design space ( $BER_{[MUX,1]} = 10^{-4}$ ). The corresponding transmission is plotted in the caption of Figure 4.17(a). The data signals propagate at  $\lambda_{S[1]}=1542$ nm (i.e., the baseline wavelength obtained through experimental results) and  $\lambda_{S[2]}=1541.785$ nm (i.e., the baseline wavelength minus the 0.215nm spacing). The detuning of the cavity to  $\lambda_{S[2]}$  is obtained with a  $32\mu$W pump power. The selected signal is transmitted to the MUX output with a power of  $1.2\mu$W.

- Stage n=2: We assume the parameters defined in stage n=1 (i.e.,  $Q_{S[MUX,1]}=10000$  and  $WLS_1=0.215$ nm) and we explore the same ranges of values for  $Q_{S[MUX,2]}$  and  $WLS_2$ . Figure 4.17(b) shows the resulting *BER* at stage n=2 ( $BER_{[MUX,2]}$ ), which is overall higher than  $BER_{[MUX,1]}$  due to: i) the higher crosstalk induced by additional input signals to process (2 and 4 input signals at n=1 and n=2, respectively); and ii) the lower received data signal power ( $3\mu$ W and  $1.2\mu$ W at n=1 and n=2, respectively). We target  $10^{-2}$  for  $BER_{[MUX,2]}$ , which we obtain with  $Q_{S[MUX,2]}=1900$  and  $WLS_2=1.19$ nm (for a  $210\mu$ W pump power). The resulting transmission is shown in the caption. In addition to the input signals at  $\lambda_{S[1]}$  and  $\lambda_{S[2]}$ , we inject signals at  $\lambda_{S[3]}=1540.81$ nm and  $\lambda_{S[4]}=1540.595$ nm: the distance between  $\lambda_{S[3]}$  and  $\lambda_{S[4]}$  is 0.215nm and the distance between  $\{\lambda_{S[1]}, \lambda_{S[2]}\}$  and  $\{\lambda_{S[3]}, \lambda_{S[4]}\}$  is 1.19nm.
- Stage n=3: The design of the MUX at stage n=3 ( $MUX_{[1,3]}$ ) is explored assuming  $Q_{S[MUX,2]}=1900$  and  $WLS_2=1.19$ nm. As reported in Figure 4.17(c),  $Q_{S[MUX,3]}=500$  and  $WLS_3=4.35$ nm lead to the targeted  $5 \times 10^{-1}$  BER. The 8 signals received



**Figure 4.17:** Achievable BER at each stage for nanocavities with  $M_{[MUX]}=2$ : (a) Stage n=1 with  $1 < Q_{S[MUX,1]} < 10000$  and  $0 < WLS_1 < 1.2$ nm. (b) Stage n=2 with  $1 < Q_{S[MUX,2]} < 10000$  and  $0 < WLS_2 < 1.2$ nm. (c) Stage n=3 with  $1 < Q_{S[MUX,3]} < 1000$  and  $3 < WLS_3 < 10$ nm.

by  $MUX_{[1,3]}$  and the corresponding cavity transmission are illustrated in the caption. The selected value for  $WLS_3$  leads to  $\lambda_{S[5]}=1537.65$ nm,  $\lambda_{S[6]}=1537.435$ nm,  $\lambda_{S[7]}=1536.46$ nm, and  $\lambda_{S[8]}=1536.245$ nm. The selection of signals  $\lambda_{S[5-8]}$  is achieved by applying a  $670\mu$W pump power.

As observed, the design space shrinks considerably from one stage to the next, mostly because of the increasing number of signals to process. This calls for increasing the wavelength spacing and thus reducing $Q_{S[gate]}$. In fact, we found that the highest possible Q factor should be preferred for the design of the XOR gates. The error rate, which inevitably increases as signals propagate through the stages, can be compensated by increasing the laser power and the BSL, as discussed in the following.

### 4.7.4 Application-level Design Comparison

In the following, we evaluate the application-level computing accuracy, energy consumption, and processing time of the architecture. For comparison purposes, we assume injected input signal powers of $3\mu$W and $4\mu$W, and we target $5 \times 10^{-1}$ and $10^{-1}$ BER, respectively. By following the algorithm defined in Section 4.6.3, we obtain Design A and Design B, for which the Q factors and wavelength spacings are reported in Table 4.2.

In order to evaluate the computing accuracy at the application level, we process $512 \times 512$ pixel images assuming BSL=256, 512, and 1024. This results in three designs for each set of parameters, as illustrated in Figure 4.18(b) and (c). The error is calculated with respect to the error-free image shown in Figure 4.18(a). As expected, the accuracy increases with BSL. For instance, in Figure 4.18(b), $PSNR_{Total}$ increases from 20 to 26.4 when BSL is increased from 256 to 1024. Furthermore, the



**Figure 4.18:** Processed image: (a) error free, and  $PSNR_{Total}$  for (b)  $OLP_{Input}=3\mu W$  and (c)  $OLP_{Input}=4\mu W$  assuming BSL=256, 512, and 1024.

Table 4.2: Device/system-level parameters and performance of two designs targeting $PSNR_{Total}=26.4$.

|                         |                               | Design A           | Design B  |
|-------------------------|-------------------------------|--------------------|-----------|
| Computing accuracy      | $PSNR_{Total}$                | 26.4               | 26.4      |
| Input parameters        | $OLP_{Input}$                 | $3\mu W$           | $4\mu W$  |
|                         | BSL                           | 1024               | 512       |
| Device parameters       | $Q_{S[XOR]} = Q_{S[MUX,1]}$   | 10000              | 7700      |
|                         | $Q_{S[MUX,2]}$                | 1900               | 1600      |
|                         | $Q_{S[MUX,3]}$                | 500                | 200       |
| System-level parameters | $WLS_1$ (nm)                  | 0.215              | 0.275     |
|                         | $WLS_2$ (nm)                  | 1.19               | 1.41      |
|                         | $WLS_3$ (nm)                  | 4.35               | 11.3      |
|                         | BER                           | $5 \times 10^{-1}$ | $10^{-1}$ |
| Performance             | Energy consumption (nJ/pixel) | 0.9                | 0.85      |
|                         | Processing time (ns/pixel)    | 102.4              | 51.2      |

use of BSL=1024 for Design A and BSL=512 for Design B results in  $PSNR_{Total}=26.4$ , thus leading to opportunities to explore the power and processing time trade-off.

For this purpose, we evaluate the energy per computed pixel assuming a 10ps pump pulse width at a 10GHz repetition rate and 20% lasing efficiency. As reported in Table 4.2, Design B results in a 5.6% energy saving and a $2\times$ reduction in processing time compared to Design A. This indicates that, for the assumed set of device parameters, *BSL* has a higher negative impact on energy consumption than *BER* due to the higher static energy. Therefore, a small *BSL* is preferred for a more energy-efficient and faster architecture. Furthermore, while a higher injected input signal power helps reduce the *BER*, it also significantly reduces the design space due to the higher crosstalk. This calls for cavities with a higher figure of merit ($M_{[gate]}$), as will be discussed in the future work chapter.
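As a quick consistency check of the performance rows in Table 4.2, the processing time per pixel and the relative energy saving can be recomputed from the BSL and the 10GHz repetition rate. The sketch below assumes that one stochastic bit is processed per pump period and that one pixel requires exactly BSL bits; the variable and function names are illustrative only.

```python
# Values taken from Table 4.2; the 10 GHz rate is the pump repetition rate from the text.
# Assumption: one stochastic bit per pump period, BSL bits per pixel.
BIT_RATE_HZ = 10e9

designs = {
    "A": {"bsl": 1024, "energy_nj_per_pixel": 0.90},
    "B": {"bsl": 512, "energy_nj_per_pixel": 0.85},
}


def processing_time_ns(bsl, bit_rate_hz=BIT_RATE_HZ):
    """Time needed to stream one bsl-bit stochastic number, in nanoseconds."""
    return bsl / bit_rate_hz * 1e9


time_a = processing_time_ns(designs["A"]["bsl"])
time_b = processing_time_ns(designs["B"]["bsl"])

# Relative energy saving of Design B with respect to Design A.
saving = 1 - designs["B"]["energy_nj_per_pixel"] / designs["A"]["energy_nj_per_pixel"]

print(f"Design A: {time_a:.1f} ns/pixel, Design B: {time_b:.1f} ns/pixel")
print(f"Energy saving of B over A: {saving * 100:.1f}%")
```

Under these assumptions the sketch reproduces the 102.4ns/pixel and 51.2ns/pixel processing times and the 5.6% energy saving reported above.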

## 4.8 Summary

In this chapter, we investigated the use of PhC nanocavities to design an OSC architecture. We proposed a generic transmission model for the nanocavity, which showed a good correlation with experimental measurements for a NOT gate with $Q_{P[NOT]}=2400$ and $M_{[NOT]}=0.5$, hence validating the proposed model. We used the model to design XOR gates and MUXs with different device parameters. We showed that an XOR gate with $Q_{S[XOR]}=10000$ and a wavelength detuning of 0.14nm leads to $34.7\mu$W power consumption. We designed an edge detection filter that relies on the proposed XOR gate and MUX. At the application level, images were processed for various laser powers and BSL values. The results showed that, for the assumed set of device parameters, BSL has a higher negative effect on the energy consumption than BER. The resulting architecture showed 0.85nJ/pixel energy consumption and 51.2ns/pixel processing time. So far, we generated the bit streams using off-chip lasers and an LFSR. In the next chapter, we will introduce other SNG designs for OSC architectures.

# Chapter 5

# Optical Stochastic Number Generator Architectures

In this chapter, we propose two additional SNG designs that can be used to generate the stochastic bit streams. One design is based on an on-chip directly modulated laser controlled by the electrical bit streams, and the other is an all-optical SNG using a single laser. We consider all the proposed SNG designs, including the one proposed in Chapter 2, and compare them in terms of energy consumption and computing accuracy using the edge detection application proposed in Chapter 4.

# 5.1 Overview

An SNG is responsible for generating stochastic bit streams that represent the data to be processed. Many works address the design of SNGs for SC. The most popular design uses an LFSR and a comparator [6], as presented in Chapter 1. In [97], the bit streams are generated using a weighted binary generator (WBG). It is composed of an LFSR and a stage of inverters followed by two stages of AND gates to generate weighted bit streams of non-overlapping '1's. Eventually, all these weighted bit streams are ORed to generate the final bit stream. Emerging technologies can also be used to generate random numbers. For example, memristors can randomly switch their state (OFF/ON) when a bias voltage below the device threshold voltage is applied [98]. Random numbers can also be generated by amplifying thermal noise in magnetic tunnel junction (MTJ) devices [99]. In the optical domain, the design of random number generators has been widely investigated. The proposed designs rely on the use of chaotic lasers [100,101] to generate random numbers. However, they are designed for encryption in the communication domain and are not suitable for unconventional computations, such as SC.

In Chapters 2 to 4 of this thesis, we consider the design of SNG-based LFSR + modulators, shown in Figure 5.1(a), where off-chip CW lasers are considered. In this chapter, we propose two additional SNG designs: i) SNG-based LFSR + on-chip directly modulated lasers, shown in Figure 5.1(b), and ii) all-optical SNG, shown in Figure 5.1(c). The SNG-based LFSR + on-chip directly modulated lasers relies on the use of on-chip lasers that can be modulated directly by the bit streams generated from the electrical part of the SNG, i.e., LFSR and comparator. All-optical SNG uses lasers to directly generate random pulses, where an analog signal controls the laser's operation. Therefore, when the input is a binary number, a D/A conversion is required to generate the analog signal. In the following, we explain in detail the operation of each design.



**Figure 5.1:** Three implementations of SNG: (a) SNG-based LFSR + modulator, (b) SNG-based LFSR + on-chip directly modulated laser, and (c) all-optical SNG.

# 5.2 Proposed Designs

In this section, we introduce the implementations of three SNGs that can be used with OSC architectures.

- SNG-based LFSR + modulator: As mentioned in Chapter 2, this design relies on an electrical SNG to generate bit streams. An *m*-bit input binary number (*BN*) is compared against an *m*-bit pseudo-random number (*PRN*) generated by an LFSR. Accordingly, bit '0' or '1' is generated to control the operation of a modulator, i.e., either to transmit or to modulate a CW optical signal injected from an off-chip laser.
- SNG-based LFSR + on-chip directly modulated laser: In this design, the bit streams generated from the electrical part of the SNG control the emission of the on-chip laser [102]. Bit='0' results in no signal emission (OFF state); otherwise, a signal is emitted (ON state), as shown in Figure 5.2. The use of on-chip lasers eliminates the need for modulators, allowing the implementation of the entire design on the same chip. Unlike the first SNG implementation with continuous laser

emission, here, the power consumption depends on the value of the data to be processed. For example, the number of '1's in the bit stream of a small data value is low. Hence, the device is OFF during most of the processing time compared to a high data value. This is beneficial, especially when used to process dark images.



Figure 5.2: SNG-based LFSR + modulated laser

- All-optical SNG: As shown in Figure 5.3, the design relies on using lasers to generate random optical pulses. The bias power specifies the density of the generated pulses, i.e., a small power leads to a low pulse density, which increases with the bias value. In the scope of our collaboration with Thales in France, who fabricate on-chip nanolasers based on PhC nanocavities [54], we were provided with the simulation results of a stream of random pulses for probability 0.5, which needs 38fJ/bit excitation energy. The nanolaser is optically pumped with a CW signal and the excitation energy is controlled by applying an electrical bias power. Since we only have the probability of 0.5, in this chapter, we assume a fixed value for the bias and hence do not consider the design of a D/A converter. It is worth mentioning that the probability of the emitted pulses is evaluated as the ratio of the number of high emitted pulses to the total number of pulses in the stream.



Figure 5.3: All-optical SNG using nanolasers

Regarding computing accuracy, while in SNG-based LFSR designs, the approximation in the results depends on the LFSR, in all-optical SNG, the approximation depends on the randomness of the pulses generated by the lasers. In the following, we use the edge detection architecture, proposed in Chapter 4, with different SNG implementations to evaluate the computing accuracy and estimate energy consumption. It is worth mentioning that the same study could be performed on the optical ReSC architecture for polynomial functions proposed in Chapter 2.
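To make the LFSR-and-comparator principle shared by the first two designs concrete, the following minimal Python sketch generates a stochastic bit stream from a binary number. It is an illustration under stated assumptions rather than the circuit used in this thesis: the 8-bit width, the feedback taps (a standard maximal-length configuration), the seed, and the convention that a '1' is emitted when *PRN* < *BN* are all choices made for the example.

```python
from itertools import islice


def lfsr8(seed=0x01):
    """8-bit Fibonacci LFSR (assumed taps 8, 6, 5, 4); cycles through 255 nonzero states."""
    state = seed & 0xFF
    if state == 0:
        raise ValueError("the all-zero state locks up an LFSR; use a nonzero seed")
    while True:
        yield state
        # XOR of the tapped bits becomes the new most significant bit.
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
        state = (state >> 1) | (bit << 7)


def sng(bn, bsl, seed=0x01):
    """Return bsl bits; a bit is 1 when the pseudo-random number is below bn (assumed convention)."""
    return [1 if prn < bn else 0 for prn in islice(lfsr8(seed), bsl)]


# Encode BN = 128 (probability close to 0.5) over one full LFSR period.
stream = sng(bn=128, bsl=255)
estimate = sum(stream) / len(stream)
print(f"ones: {sum(stream)} / {len(stream)}, estimated probability: {estimate:.3f}")
```

In the first design each generated bit would drive a modulator, whereas in the second it would switch the on-chip laser ON or OFF; the bit stream itself is the same in both cases.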

## 5.3 Optical SNGs Comparison

In the following, we use all the proposed implementations of SNG to process a $512 \times 512$ pixel image using an edge detection filter. We estimate the energy consumption and evaluate the computing accuracy of the entire architecture using different SNG designs. The electrical part of the SNG, i.e., LFSR and comparators, is designed using TSMC 65nm CMOS technology [103].

### 5.3.1 Energy Consumption

We estimate the energy consumption of an edge detection filter assuming two scenarios for SNG: i) SNG-based LFSR + modulator with off-chip lasers (Chapter 4); and ii) SNG-based LFSR + on-chip directly modulated lasers. For this purpose, we consider the optical pump power from Section 4.7.3 used to detune the nanocavities, i.e.,  $32\mu$ W,  $210\mu$ W, and  $670\mu$ W for XOR gates and MUXs in stage 1, 2 and the last stage, respectively. Figure 5.4(a) shows these optical powers for off-chip lasers with lasing efficiency of 25% [104]. For on-chip lasers, we assume lasers of 30mW power



Figure 5.4: Edge detection filter with (a) SNG-based LFSR + modulator with off-chip lasers of 25% lasing efficiency and SNG-based LFSR + on-chip directly modulated lasers of 30mW power consumption and 5% lasing efficiency. (b) A scale-down of on-chip directly modulated lasers.

consumption with 5% lasing efficiency [102], as shown in Figure 5.4(a). As can be seen, the on-chip lasers can provide enough power for detuning the nanocavities. This indicates that the SNG-based LFSR + on-chip directly modulated lasers can be used to drive the proposed OSC architecture. The 30mW on-chip laser in [104] provides more optical power than what is required for each gate. In Figure 5.4(b), we scale down the optical power of the on-chip lasers to match the optical power required by the nanocavities. We keep the lasing efficiency of 25% for off-chip lasers and 5% for on-chip lasers. In this case, the energy consumption of on-chip lasers is $5 \times$ higher than that of off-chip lasers, i.e., the total laser power is 34.6mW for on-chip lasers compared to 6.92mW for off-chip lasers. In order to compute the total energy consumption of the architecture using different SNG implementations, we need to break down the hardware complexity as discussed in the following.
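The $5\times$ factor follows directly from the ratio of the two lasing efficiencies once both laser types deliver the same optical power. The short sketch below illustrates this, assuming that the 6.92mW and 34.6mW figures denote consumed (wall-plug) power and that lasing efficiency is the ratio of emitted optical power to consumed power; the intermediate optical power is derived here and is not a value quoted in the text.

```python
# Lasing efficiencies and the off-chip total are taken from the text.
# Assumption: efficiency = emitted optical power / consumed power.
OFF_CHIP_EFFICIENCY = 0.25
ON_CHIP_EFFICIENCY = 0.05
off_chip_consumed_mw = 6.92

# Optical power that actually reaches the nanocavities (derived, not quoted).
optical_mw = off_chip_consumed_mw * OFF_CHIP_EFFICIENCY

# Consumed power needed by on-chip lasers to deliver the same optical power.
on_chip_consumed_mw = optical_mw / ON_CHIP_EFFICIENCY
ratio = on_chip_consumed_mw / off_chip_consumed_mw

print(f"optical power: {optical_mw:.2f} mW")
print(f"on-chip consumed power: {on_chip_consumed_mw:.1f} mW ({ratio:.0f}x off-chip)")
```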

Table 5.1 reports the breakdown of both designs' hardware complexity. Each design requires one LFSR and eight comparators to generate correlated inputs. SNG-based LFSR + modulators requires an additional 23 modulators to control 23 nanocavities, where we assume plasmonic modulators of 110fJ/bit energy consumption at 72Gbps [105]. It is worth mentioning that the lasers used to generate the CW for the input signals ( $\lambda_1$ to $\lambda_8$ ), not shown in the table, are taken into account when calculating the total energy consumption. As can be seen, SNG-based LFSR + modulators consumes slightly less energy per bit than SNG-based LFSR + on-chip directly modulated lasers, i.e., 3.3pJ/bit compared to 3.6pJ/bit at 10GHz (the operating frequency of the nanocavities and the electrical part of the SNG). However, this energy per bit is the worst-case scenario for an on-chip laser, since the laser is assumed to emit an optical signal in this case. As mentioned earlier, the on-chip laser is only

|                        | Off-chip CW laser: SNG-based LFSR + modulator | SNG-based LFSR + on-chip directly modulated laser |
|------------------------|-----------------------------------------------|---------------------------------------------------|
| Hardware complexity    | 1: LFSR                                       |                                                   |
|                        | 8: Comparators                                |                                                   |
|                        | 23: Nanocavities                              |                                                   |
|                        | 23: Lasers                                    |                                                   |
|                        | 23: Modulators                                | -                                                 |
| Power consumption (mW) | 0.8: LFSR + comparators                       |                                                   |
|                        | 6.92: Lasers                                  | 34.6: Lasers                                      |
|                        | 184: Modulators                               | -                                                 |

Table 5.1: Hardware complexity and power consumption of SNG-based LFSR designs.

ON when the processed data='1'. Therefore, in order to demonstrate whether on-chip lasers can outperform off-chip lasers, in the following, we evaluate the energy consumption at the scale of an image, where a large number of bits, i.e., '0's and '1's, are processed.

We process two source images of $512 \times 512$ pixels, shown in Figure 5.5. Image (A) is brighter, with 25% of the total bits in the bit streams of the image pixels being '1's (assuming BSL=256), while in image (B), '1's account for 10% of all the bits for the same BSL. The total energy consumption of the architecture using SNG-based LFSR + modulators is $222\mu$J/image for both images, since off-chip lasers continuously emit the signals. In contrast, the total energy consumption using SNG-based LFSR + on-chip directly modulated lasers is $61\mu$J/image and $24\mu$J/image for images (A) and (B), respectively, since on-chip lasers only emit when the processed data is '1'. This indicates that the design with on-chip directly modulated lasers is more energy efficient than using off-chip CW lasers when processing a set of data at the application level. Hence, it is suitable for AC architectures where reducing energy consumption is a crucial design requirement.
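The image-level figures can be approximated from the per-bit energies with a back-of-the-envelope estimate. The sketch below assumes that every bit slot of every pixel costs the quoted energy per bit and, as a simplification, that the whole per-bit energy of the on-chip design scales with the fraction of '1's (in practice the LFSR and comparators run continuously); function and variable names are illustrative.

```python
# Image size, BSL, per-bit energies and '1' densities are taken from the text.
PIXELS = 512 * 512
BSL = 256
BITS_PER_IMAGE = PIXELS * BSL

E_OFF_CHIP_PJ_PER_BIT = 3.3  # SNG-based LFSR + modulators
E_ON_CHIP_PJ_PER_BIT = 3.6   # SNG-based LFSR + on-chip lasers (worst case, always ON)


def image_energy_uj(pj_per_bit, ones_fraction=1.0):
    """Energy per image in microjoules; ones_fraction models lasers that emit only for '1's."""
    return BITS_PER_IMAGE * pj_per_bit * ones_fraction * 1e-6  # pJ -> uJ


e_off = image_energy_uj(E_OFF_CHIP_PJ_PER_BIT)      # lasers always emitting
e_on_a = image_energy_uj(E_ON_CHIP_PJ_PER_BIT, 0.25)  # image (A): 25% of bits are '1'
e_on_b = image_energy_uj(E_ON_CHIP_PJ_PER_BIT, 0.10)  # image (B): 10% of bits are '1'

print(f"off-chip: {e_off:.1f} uJ, on-chip (A): {e_on_a:.1f} uJ, on-chip (B): {e_on_b:.1f} uJ")
```

Under these simplifications the estimate lands within about one microjoule of the reported $222\mu$J, $61\mu$J, and $24\mu$J per image.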

An on-chip directly modulated laser is a good option to have a fully on-chip integrated architecture. However, it is still composed of an electrical part to generate the



Figure 5.5: Total energy consumption for processing images (A) and (B).

bit streams, whereas an all-optical SNG provides another alternative of a fully on-chip integrated design without involving any electrical part, i.e., when analog input is assumed. In the following, we discuss an all-optical SNG implementation, where we first compare the resulting computing accuracy at the application-level with the accuracy evaluated using SNG-based LFSR architecture. Then, we discuss an enhancement of design scalability using a fully OSC architecture compared to a conventional optical design from the literature.

# 5.4 Towards All-optical Stochastic Computing Architectures

For edge detection filters of larger size, i.e., $7 \times 7$ and $9 \times 9$, the designs are composed of seven and nine stages of cascaded MUXs, respectively. This involves an increase in the number of SNGs and the number of gates needed for the designs. Hence, finding an implementation that reduces hardware complexity in the SNG circuit becomes essential. Therefore, an all-optical SNG that includes only one laser is worth investigating for all-OSC architectures.

In the following, we first evaluate the computing accuracy of all-optical SNG for detecting the edges of an image processed using the optical Sobel filter proposed in Chapter 4. For this purpose, we assume two scenarios for SNG implementations:

- An electrical SNG-based LFSR that can be implemented either using LFSR + modulators or LFSR + on-chip directly modulated lasers.
- A combination of SNG-based LFSR and all-optical SNG to inject bit streams into the input of the XOR gates and the selection lines of the MUXs, respectively.
   Since we were only provided with random pulses of p=0.5 for the all-optical SNG, we cannot use it to represent the pixels of the image.

Figure 5.6 shows the processed images for the two scenarios. As can be seen, these scenarios have the same level of accuracy, which indicates that the randomness of the generated pulses by nanolasers is close to the randomness of the LFSR. Hence, the all-optical SNG is a good candidate for SC architectures and a perfect replacement for LFSR to design all-OSC architectures. It can also be considered a key factor to increase the design scalability of OSC architectures as it is composed of a single laser.



Figure 5.6: Processed images using (a) SNG-based LFSR ($PSNR_{Total}=23$) and (b) a mix of SNG-based LFSR and all-optical SNG ($PSNR_{Total}=23.3$).

As mentioned earlier, OC suffers from scalability problems due to the large size and the high number of devices the data signal has to propagate through. The SC approach can overcome this issue since it contributes to reducing the hardware complexity of the design. To further illustrate this point, we compare the design of an *n*-bit adder using our work, i.e., all-optical SNG and all-optical adder, and the work proposed in [75] (ripple carry adder). For the adder in [75], at each stage i  $(1 \le i \le n)$ , two inputs,  $A_i$  and  $B_i$ , electrically control MZI and DC devices. Increasing the size of the adder to *n*-bit requires duplicating the stage design *n* times, where the carry-out (*Cout<sub>i</sub>*) from one stage propagates as carry-in (*Cin<sub>i+1</sub>*) to the next stage. As can be

Table 5.2: Hardware complexity of n-bit adder proposed in [75] and our work.

| Devices        | Proposed design in [75] | Our work |
|----------------|-------------------------|----------|
| Lasers         | 2n+1                    | 3        |
| Photodetectors | n+1                     | 1        |
| OR gates       | n                       | -        |
| MZIs           | 2n                      | -        |
| DCs            | 3n                      | -        |
| Nanocavities   | -                       | 1        |



Figure 5.7: The design of an n-bit adder proposed in (a) [75] and (b) our work.

seen in Figure 5.7(a),  $Cout_i$  remains propagating in the optical domain from one stage to the next. On the other hand, to design an *n*-bit adder using our work, we need one all-optical MUX for the adder using PhC nanocavity, as proposed in Chapter 4, and three all-optical SNGs to generate the random pulses for inputs A and B, and the selection line *Sel*, as shown in Figure 5.7(b). Table 5.2 reports the number of devices, including lasers and photodetectors, required to design an *n*-bit adder using both designs. As can be seen, the number of devices using the OSC approach remains constant as opposed to the design proposed in [75], where the number of devices, including the interfaces, linearly increases with the data size.

Figure 5.8 shows the number of optical devices required for the computations (solid lines), i.e., without the interfaces. In order to design an 8-bit adder, 50 devices are required for the work in [75] (blue color) as opposed to one nanocavity for our design (red color). The figure also illustrates the processing time (dashed lines) required to perform a two-input *n*-bit addition assuming a 10GHz operating rate for both designs.
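The MUX-based stochastic addition behind our adder, and the reason its processing time grows with the data size, can be sketched in a few lines of Python. The sketch assumes that an *n*-bit operand is encoded with a bit stream of length $2^n$ and that one bit is processed per 10GHz cycle; it uses a software pseudo-random generator in place of the SNGs, and all function names are hypothetical.

```python
import random


def stochastic_stream(p, length, rng):
    """Bit stream in which each bit is 1 with probability p (software stand-in for an SNG)."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def mux_add(a, b, sel):
    """Bitwise 2-to-1 MUX: pass the bit of b when sel is 1, otherwise the bit of a."""
    return [bi if s else ai for ai, bi, s in zip(a, b, sel)]


def processing_time_ns(n_bits, rate_hz=10e9):
    """Stream duration assuming a bit-stream length of 2**n_bits (assumption of this sketch)."""
    return (2 ** n_bits) / rate_hz * 1e9


rng = random.Random(0)
length = 2 ** 8
a = stochastic_stream(0.25, length, rng)
b = stochastic_stream(0.75, length, rng)
sel = stochastic_stream(0.5, length, rng)

out = mux_add(a, b, sel)
estimate = sum(out) / length  # approximates the scaled sum (0.25 + 0.75) / 2

print(f"scaled-sum estimate: {estimate:.3f}")
print(f"8-bit processing time: {processing_time_ns(8):.1f} ns")
```

With a selection stream of probability 0.5, the MUX output encodes the scaled sum $(p_A + p_B)/2$ regardless of *n*, so a single device suffices while the stream length, and hence the processing time, doubles with each additional bit.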

Moreover, we assume the same processing time for SNG-based LFSR and all-optical SNG. Hence, by increasing the adder size, the number of pulses generated by the all-optical SNG increases, which negatively impacts the processing time of our design. For example, an 8-bit addition results in 0.1ns processing time for the design in [75] compared to 25.6ns for our design. It is worth mentioning that for the design in [75], the *Cin* signal has to propagate through *n* stages and its power needs to be divided equally between two DCs at each stage, which leads to a significant increase in the signal power losses. In order to ensure correct transmission of the signal, *Cin* has to be injected with enough power, which impacts the energy efficiency of the design. For example, assuming an 8-bit adder and a DC with IL=2dB [106], in order to receive the *Cout* at the receiver side with a total power of $20\mu$W, *Cin* has to be injected with 305mW from a laser. This value will increase with the data size. In our design, to receive the same output power of $20\mu$W, the total injected laser power remains constant, i.e., 0.36mW, for any data size, since there is only one device used for an *n*-bit data. Figure 5.9 illustrates the energy consumption per *n*-bit adder. Our design



Figure 5.8: The number of devices (solid lines), without interface, and the processing time (dashed lines) for an *n*-bit adder in [75] (blue color) and our work (red color).



Figure 5.9: Energy consumption for an n-bit adder implemented using the design in [75] (blue bars) and our work (red bars).

(red bars) consumes more energy compared to the design with propagated *Cin* (blue bars) for small data sizes, i.e., 1-bit to 5-bit adders. As the adder size increases, the design with propagated *Cin* starts to consume more energy due to the increase in the number of stages *Cin* has to propagate through. Assuming a 20% lasing efficiency for both designs, the energy consumption of an 8-bit adder is 140pJ/operation and 47pJ/operation for the design in [75] and our design, respectively. The increase in the *Cin* laser power with the adder size could eventually trigger undesired nonlinear effects. Therefore, O/E and E/O conversions may be needed instead of increasing the laser power, which also increases the energy consumption of the design.

To sum up, an all-OSC architecture results in the same computing accuracy as an architecture designed using SNG-based LFSR. It also leads to a significant reduction in the number of devices compared to its counterpart designs in the literature. Finally, it saves energy when larger data sizes are processed, allowing the design to scale up at the cost of a longer processing time.

# 5.5 Summary

In this chapter, we proposed three designs of SNG suitable for OSC architectures. Their implementation involves SNG-based LFSR and all-optical SNG. The SNG-based LFSR is composed of electronic components, i.e., LFSR and comparator, while the all-optical SNG involves lasers that generate random pulses. Hence, the all-optical SNG can further reduce the hardware complexity of the design. The results demonstrated that the edge detection design using the SNG-based LFSR + on-chip directly modulated lasers saves more energy compared to the design implemented using the SNG-based LFSR + modulator, i.e., $61\mu$J/image energy consumption compared to $222\mu$J/image, respectively. Hence, an SNG using on-chip directly modulated lasers is more suitable for SC since it reduces energy consumption. In terms of computing accuracy, the all-optical SNG implementation achieves an accuracy close to that of the SNG-based LFSR, i.e., $PSNR_{Total}$ around 23. This indicates that the randomness of the sequence generated using nanolasers is suitable for SC architectures. Assuming an *n*-bit adder, the proposed design of all-OSC architecture maintains the same number of devices, i.e., one nanocavity and three lasers. Moreover, it saves energy as the adder size increases, which enhances scalability at the cost of a longer processing time.

# Chapter 6

# **Conclusions and Future Work**

## 6.1 Conclusions

In this PhD thesis, we proposed a novel computing domain based on integrated optics to design stochastic computing (SC) architectures for processing time and scalability enhancement. Our aim is to combine the positive features of each domain: the high signal propagation speed of the optical domain with the energy efficiency and small design footprint of the SC approach. In order to investigate the design of optical stochastic computing (OSC) architectures, we proposed a methodology that contains libraries of architectures. Each library includes the design implementation along with the transmission and error evaluation models. Furthermore, we proposed libraries for the architecture interfaces with three different implementations of stochastic number generator (SNG), i.e., based on linear feedback shift register (LFSR) and fully optical. In the proposed architectures, we designed the computing parts to be fully optical since we aim to design all-OSC architectures. The methodology allows exploring the design of OSC by taking into account the technological and system-level parameters. The exploration targets a given application, leading to multiple design options that satisfy different design requirements. In the following, we briefly summarize each of the main contributions of this thesis.

The first contribution is the design of an *n*-th order Bernstein polynomial architecture. The design can execute any single-input function by changing the coefficient inputs. The design is composed of Mach-Zehnder interferometers (MZIs) and microring resonators (MRRs) as modulators and an MRR as an all-optical add-drop filter (AOF) that works as a MUX. We developed the transmission model in order to optimize energy consumption. For this purpose, we proposed design methods that allow exploring the technological parameters of the devices in order to find the optimal WLS. In order to evaluate the computing accuracy, we implemented a Gamma correction application for image processing. We considered three sources of error: i) error related to the order of the architecture; ii) error related to the generated bit streams in SC; and iii) error related to the optical transmission. We explored the design space at the system level, i.e., n, BSL and BER, where the results showed that it is possible to reach the same computing accuracy for different polynomial orders. This is achieved by compensating the reduced accuracy of a lower-order polynomial with a higher BSL and a lower BER. In addition to reducing the hardware complexity, this result demonstrates that maintaining a certain level of accuracy can be achieved by increasing the processing time (higher BSL) or by increasing the laser powers (lower BER). It is worth mentioning that the designs resulting from the exploration are considered static, which means that each design requires a different architecture.

In the second contribution, we proposed the design of a reconfigurable architecture of the Bernstein polynomial. In this design, we can reconfigure the design
order at run-time to trade off computing accuracy with design throughput. The architecture can execute one 4<sup>th</sup> order function for higher accuracy or two 2<sup>nd</sup> order functions for higher throughput. For this purpose, an architecture of a 2<sup>nd</sup> order is duplicated, more AOFs and photodetectors are added, and DCs configured at run-time are used to direct the coefficient signals to the correct output. The same exploration method used in the static architecture is used in the reconfigurable architecture to find the optimal *WLS*. Furthermore, a Gamma correction application has been used to evaluate the computing accuracy, assuming the same three sources of error presented in the first contribution. The reconfigurable architecture leads to up to 53% energy overhead compared to the static design due to the additional devices. However, it increases the range of reachable accuracy by 65%, which is key to meeting users' requirements.

In the third contribution, we investigated the design of cascaded combinational filters using PhC nanocavities. The interest in the nanocavity is due to its physical characteristics, i.e., energy efficiency of 100fJ, compact size $< 10\mu$m<sup>2</sup>, and high modulation speed of 10GHz. Moreover, it has different Q factors around resonance wavelengths, which allows designing cascaded gates, such as cascaded MUXs, where multiple signals propagate at different wavelengths. For this purpose, we proposed a transmission model for PhC nanocavities. The model takes into account key device parameters, such as Q factors, the figure of merit, resonance wavelengths, and the detuning induced by optical signals. We calibrated our model with the experimental results and explored the device parameters by implementing XOR gates and MUXs using nanocavities. We proposed the implementation of an optical stochastic Sobel edge detection filter. This requires implementing multiple stages of MUXs using nanocavities. Moreover, we explored system-level parameters, i.e., laser power, *BER* and *BSL*, and

evaluated the computing accuracy by processing an image. We showed that it is possible to implement the filter using a design of Q factors=7700 for XOR gates and 7700, 1600, and 200 for MUXs. The resulting architecture showed 0.85nJ/pixel energy consumption and 51.2ns/pixel processing time.

Finally, we proposed three implementations for SNG design: i) SNG-based LFSR + modulators; ii) SNG-based LFSR + on-chip directly modulated lasers; and iii) all-optical SNG. When processing an image using a Sobel filter, the results showed that an SNG implementation using on-chip directly modulated lasers led to an energy saving of 72% compared to off-chip lasers. Regarding the all-optical SNG, the randomness of the pulses generated by lasers maintains the same accuracy as an SNG designed using an LFSR for detecting the edges of an image. Towards the design of all-OSC, the results showed that for an 8-bit adder, a 70% energy saving and a 98% reduction in the number of devices can be achieved compared to a conventional optical adder, at the cost of an increase in the processing time.
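
For readers unfamiliar with LFSR-based SNGs, the electrical part shared by implementations i) and ii) can be sketched as a pseudo-random state sequence followed by a comparator. This is a minimal sketch, not the design used in the thesis: the 8-bit width, the tap positions (8, 6, 5, 4), the seed and the function names `lfsr8_states` and `sng` are assumptions made for illustration, and the optical modulation stage is not modeled.

```python
def lfsr8_states(seed=0x5A, length=255):
    """Yield successive states of an 8-bit Fibonacci LFSR.

    Taps at bit positions 8, 6, 5 and 4 (an assumed, commonly tabulated
    maximal-length choice); the seed must be nonzero.
    """
    state = seed & 0xFF
    if state == 0:
        raise ValueError("LFSR seed must be nonzero")
    for _ in range(length):
        yield state
        # XOR of the tapped bits forms the bit shifted in on the right.
        feedback = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
        state = ((state << 1) | feedback) & 0xFF


def sng(value, seed=0x5A, length=255):
    """Comparator stage: emit 1 whenever the LFSR state is <= value (0..255)."""
    return [1 if s <= value else 0 for s in lfsr8_states(seed, length)]


stream = sng(64)                      # unipolar stream encoding 64/255
estimate = sum(stream) / len(stream)  # fraction of ones in the stream
```

Over one full period the LFSR visits every nonzero 8-bit state once, so the fraction of ones equals `value / 255`; shorter streams only approximate it.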

Our study led to the conclusion that it is beneficial to integrate both SC and OC in the same domain (OSC). SC's features enable expanding the design in the optical domain to accommodate data of larger size with significant savings in energy and area. Compared to conventional OC architectures, an increase in the processing time is expected due to the serial processing. However, this increase can be considered a fair trade-off to gain scalability, which is a primary issue in OC architectures. This scalability is not limited to designs with a large data size but also extends to complex architectures, such as neural networks, which have so far been implemented only on a small scale in the optical domain.

The study also highlights the fact that optical devices, in general, are still immature. However, in our proposed designs, we focused on demonstrating the feasibility of the OSC domain and exploring the design parameters that impact energy consumption, computing accuracy and processing time. For silicon photonics to suit the OC domain, it requires devices with an energy efficiency of tens of fJ, a footprint of a few $\mu m^2$, and a modulation speed greater than 50GHz. Based on the progress in the optical domain in the last few decades, a leap in the development of new devices that meet different computing requirements is expected within a few years. It is worth mentioning that the computing architectures proposed in this thesis are independent of the optical devices. A higher-performance device can be easily integrated into the design since our architectures' transmission models are generic. This only requires replacing the transmission model of the device with the new one and exploring its technological parameters for energy optimization. The steps of the design space exploration in the methodology remain the same.

### 6.2 Future Work

Combining the SC approach with integrated optics can reduce the processing time of the design and thus overcome the main limitation of SC. However, the processing time remains longer than that of conventional OC architectures, since maintaining an acceptable computing accuracy requires increasing the *BSL* or the number of generated pulses. In return, the proposed computing paradigm enhances design scalability compared to conventional OC. Hence, it is more energy efficient when the hardware needs to be extended to execute more complex functions or to process larger data.
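
The link between accuracy and *BSL* can be made concrete with a standard estimate. As an idealization that is not stated in the thesis, assume the $N$ bits $b_k$ of a stream (with $N$ the *BSL*) are independent Bernoulli trials with probability $p$. The decoded value and its variance are then

$$\hat{p} = \frac{1}{N}\sum_{k=1}^{N} b_k, \qquad \mathrm{Var}(\hat{p}) = \frac{p(1-p)}{N} \le \frac{1}{4N}.$$

Under this assumption the standard error shrinks as $1/\sqrt{N}$, so halving the error requires roughly four times the *BSL* and, with serial processing, four times the processing time. Streams generated by an LFSR are not independent, so this relation should be read as a rough guide only.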

Based on the study presented in this thesis, we propose several future work directions that can be divided into short-term, medium-term and long-term plans as follows:

#### • Short-term

- Regarding PhC nanocavities, our results emphasized the importance of exploring the figure of merit $M_{[gate]}$ parameter, as increasing this parameter can enhance the energy efficiency of the design by avoiding the need to modify the *BER* and *BSL* parameters. It allows increasing the range of wavelength detuning, which enlarges the design space, i.e., higher Q factors can be explored. Our proposed design space exploration can be used; however, the transmission model needs to be made more accurate by taking into account the transient response of the devices, hence working closer to the physical level. Moreover, our transmission model of the PhC nanocavity should take into consideration the impact of the probe signals' powers on the wavelength detuning range. In addition to the pump power, increasing the probe power would shift the resonance wavelength of the device away from the required data wavelength and hence can lead to a low transmission power of the input signal.
- Regarding lasers, the use of pulse-based lasers reduces energy consumption, as presented in Chapter 2; however, this requires synchronization at the photodetector, especially for high operating frequencies, i.e., 10GHz and above. This is challenging due to the lower responsivity of the photodetector at such rates. To address this, we need a photodetector model that takes into account the technological parameters and is able to calculate the responsivity according to the operating rate. Extra hardware is needed to monitor the photodetector responsivity and compare it with a threshold value selected at design time. When the responsivity falls below the threshold value, the laser power should be increased accordingly.
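
The monitoring rule described for pulse-based lasers can be sketched as follows. This is a minimal illustration under stated assumptions: the single-pole roll-off used in `responsivity`, its parameters `r0` and `f3db_ghz`, and the function `adjust_laser_power` are hypothetical and not taken from the thesis. The only physical relation relied on is that the photocurrent equals the responsivity times the received optical power.

```python
import math


def responsivity(rate_ghz, r0=1.0, f3db_ghz=10.0):
    """Hypothetical model: responsivity (A/W) rolling off with operating rate."""
    return r0 / math.sqrt(1.0 + (rate_ghz / f3db_ghz) ** 2)


def adjust_laser_power(power_mw, rate_ghz, r_threshold):
    """Return the laser power to use at the given operating rate.

    If the modeled responsivity is at or above the design-time threshold,
    the power is left unchanged. Otherwise it is scaled by r_threshold / r
    so that the photocurrent matches the value expected at the threshold.
    """
    r = responsivity(rate_ghz)
    if r >= r_threshold:
        return power_mw
    return power_mw * r_threshold / r


low_rate_power = adjust_laser_power(1.0, rate_ghz=1.0, r_threshold=0.8)
high_rate_power = adjust_laser_power(1.0, rate_ghz=10.0, r_threshold=0.8)
```

With these assumed parameters the power is unchanged at 1GHz and raised at 10GHz. A real controller would rely on a measured or calibrated responsivity rather than this closed-form model.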

#### • Medium-term

- The finite impulse response (FIR) filter is important in signal processing, as it can be used to implement low-pass filters. The FIR filter architectures currently found in the literature have several drawbacks, including large design area and long processing time. Investigating the design of an OSC FIR filter is an interesting way to overcome these limitations. To this end, an all-optical SNG will be used to generate the pulses for the inputs. The FIR filter is composed of addition and multiplication operations, which will be implemented using MUXs and XNOR gates, respectively. These gates can be designed using PhC nanocavities, where each one requires only a single nanocavity, unlike the design of XOR gates, where two cascaded nanocavities are needed, as proposed in Chapter 4. Moreover, optical delay elements are required; one of the designs proposed in the literature can be used. For example, in [107], a reconfigurable optical delay is designed using cascaded MZIs. The OSC FIR filter can be compared to existing optical FIR filters in terms of energy consumption, area, processing time and computing accuracy.
- Neural networks (NNs) allow modeling nonlinear processes. The number of layers and neurons in the network adds significant overhead in design area and power consumption, which limits the design scalability. According to [108], the basic components of an NN can reach up to a few millimeters in dimension, which raises the need for the OSC approach. We propose using PhC nanocavities to design XNOR gates and MUXs for multiplication and addition, respectively. An all-optical SNG with analog input will be used to generate the pulses for the inputs and the weights. Nanolasers can also be investigated for the activation function design, i.e., when the accumulated energy at the nanolaser reaches a threshold value, a pulse is generated.
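
Both medium-term directions rely on the same stochastic primitives: XNOR gates for multiplication and MUXs for scaled addition. The sketch below illustrates the logic only, assuming bipolar encoding (a value $x \in [-1, 1]$ is represented by a stream with probability of one equal to $(x+1)/2$) and independent streams. Python's `random` module stands in for the all-optical SNG, and the helper names and numeric values are illustrative, not taken from the thesis.

```python
import random


def bipolar_stream(x, n, rng):
    """Encode x in [-1, 1] as n bits with P(bit = 1) = (x + 1) / 2."""
    p = (x + 1) / 2
    return [1 if rng.random() < p else 0 for _ in range(n)]


def decode_bipolar(bits):
    """Recover the bipolar value from the fraction of ones."""
    return 2 * sum(bits) / len(bits) - 1


def xnor_multiply(a, b):
    """Bitwise XNOR: multiplies two independent bipolar streams."""
    return [1 - (i ^ j) for i, j in zip(a, b)]


def mux_add(a, b, select):
    """MUX: outputs b when select is 1, else a; computes (a + b) / 2."""
    return [j if s else i for i, j, s in zip(a, b, select)]


rng = random.Random(1)  # fixed seed for repeatability
n = 1 << 14             # bit-stream length (BSL)
x1, w1, x2, w2 = 0.6, -0.5, -0.2, 0.8

prod1 = xnor_multiply(bipolar_stream(x1, n, rng), bipolar_stream(w1, n, rng))
prod2 = xnor_multiply(bipolar_stream(x2, n, rng), bipolar_stream(w2, n, rng))
select = [rng.getrandbits(1) for _ in range(n)]  # select stream with P(1) = 0.5

result = decode_bipolar(mux_add(prod1, prod2, select))
expected = (x1 * w1 + x2 * w2) / 2  # the MUX scales the sum by 1/2
```

The decoded `result` approximates `expected`, with an error that decreases as the stream length grows. A two-tap FIR stage or a two-input neuron corresponds to this structure, with delay elements or an activation function added around it.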

#### • Long-term

- Integration of all-optical accelerators in optical networks on chip (ONoC). Indeed, optical accelerators can be considered part of optical interconnects, where data is optically processed at a high rate during its propagation from one module to the next. Reconfigurable ONoC designs from the literature [109] can be used to connect the proposed computing architectures. Moreover, a controller will be designed to configure the optical interconnect (the data path) in order to direct the data to the correct computing architecture based on the targeted application. Then, depending on the design requirements, i.e., computing accuracy and energy efficiency, the configuration of the design can be adapted. These designs involve costly EO/OE conversions, which the ONoC and the accelerators would share in order to reduce energy and area overhead.

# Bibliography

- [1] State of the IoT 2018: Number of IoT Devices Now at 7B - Market Accelerating. https://iot-analytics.com/state-of-the-iot-update-q1-q2-2018-number-of-iot-devices-now-7b/, 2021.
- [2] Number of Internet of Things (IoT) Connected Devices Worldwide from 2019 to 2030. https://www.statista.com/statistics/1183457/iot-connected-devices-worldwide/, 2021.
- [3] J. Han and M. Orshansky. Approximate Computing: An Emerging Paradigm for Energy-Efficient Design. In *European Test Symposium*, pages 1–6. IEEE, 2013.
- [4] Q. Xu, T. Mytkowicz, and N. S. Kim. Approximate Computing: A Survey. IEEE Design & Test, 33(1):8–22, 2015.
- [5] B. R. Gaines. Stochastic Computing Systems. In Advances in Information Systems Science, pages 37–172. Springer, 1969.
- [6] A. Alaghi and J. P. Hayes. Survey of Stochastic Computing. ACM Transactions on Embedded Computing Systems, 12(2s):92, 2013.

- [7] A. Alaghi, W. Qian, and J. P. Hayes. The Promise and Challenge of Stochastic Computing. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 37(8):1515–1531, 2017.
- [8] J. P. Hayes. Introduction to Stochastic Computing and its Challenges. In Design Automation Conference, pages 1–3. IEEE, 2015.
- [9] L. Miao and C. Chakrabarti. A Parallel Stochastic Computing System with Improved Accuracy. In Signal Processing Systems, pages 195–200. IEEE, 2013.
- [10] R. G. Beausoleil, P. J. Kuekes, G. S. Snider, S. Y. Wang, and R. S. Williams. Nanoelectronic and Nanophotonic Interconnect. *Proceedings of the IEEE*, 96(2):230–247, 2008.
- [11] A. Shacham, K. Bergman, and L. P. Carloni. Photonic Networks-on-Chip for Future Generations of Chip Multiprocessors. *IEEE Transactions on Computers*, 57(9):1246–1260, 2008.
- [12] D. A. B. Miller. Device Requirements for Optical Interconnects to Silicon Chips. Proceedings of the IEEE, 97(7):1166–1185, 2009.
- [13] Intel. Silicon Photonics Overview. https://www.intel.ca/content/www/ca/en/architecture-and-technology/silicon-photonics/silicon-photonics-overview.html, 2021.
- [14] Intel Demos the 400G, a 400Gbps Transceiver for the Datacenter. https://www.techspot.com/news/79640-intel-demos-400g-400gbps-transceiver-datacenter.html, 2021.

- [15] Intel Silicon Photonics Update at Interconnect Day 2019. https://www.servethehome.com/intel-silicon-photonics-update-at-interconnect-day-2019/, 2021.
- [16] IBM, SAN Switch Hardware Features. https://www.ibm.com/support/knowledgecenter/HW29A/san64b6.doc/hardware\_features.html, 2021.
- [17] R. G. Beausoleil, J. Ahn, N. Binkert, A. Davis, D. Fattal, M. Fiorentino, N. P. Jouppi, M. McLaren, C. Santori, R. S. Schreiber, et al. A Nanophotonic Interconnect for High-Performance Many-Core Computation. In *IEEE Symposium on High Performance Interconnects*, pages 182–189. IEEE, 2008.
- [18] M. Haurylau, G. Chen, H. Chen, J. Zhang, N. A Nelson, D. H. Albonesi, E. G. Friedman, and P. M. Fauchet. On-Chip Optical Interconnect Roadmap: Challenges and Critical Directions. *IEEE Journal of Selected Topics in Quantum Electronics*, 12(6):1699–1705, 2006.
- [19] K. Ohashi, K. Nishi, T. Shimizu, M. Nakada, J. Fujikata, J. Ushida, S. Torii, K. Nose, M. Mizuno, H. Yukawa, et al. On-Chip Optical Interconnect. *Proceedings of the IEEE*, 97(7):1186–1198, 2009.
- [20] Russel J Baker and Brent Keeth. Optical Interconnect in High-Speed Memory Systems, May 10 2011. US Patent 7,941,056.
- [21] A. Benner. Optical Interconnect Opportunities in Supercomputers and High End Computing. In Optical Fiber Communication Conference, pages 1–60. IEEE, 2012.

- [22] C. Sun, M. T. Wade, Y. Lee, J. S. Orcutt, L. Alloatti, M. S. Georgas, A. S. Waterman, J. M. Shainline, R. R. Avizienis, S. Lin, et al. Single-Chip Microprocessor that Communicates Directly using Light. *Nature*, 528(7583):534–538, 2015.
- [23] J. Lee, C. Killian, S.L. Beux, and D. Chillet. Approximate Nanophotonic Interconnects. In *IEEE/ACM International Symposium on Networks-on-Chip*, pages 1–7, 2019.
- [24] F. Sunny, A. Mirza, I. Thakkar, S. Pasricha, and M. Nikdast. LORAX: Loss-Aware Approximations for Energy-Efficient Silicon Photonic Networks-on-Chip. In *Great Lakes Symposium on VLSI*, pages 235–240, 2020.
- [25] Z. Li, S. Le Beux, C. Monat, X. Letartre, and I. O'Connor. Optical Look Up Table. In Design, Automation and Test in Europe, pages 873–876, 2013.
- [26] Y. Shen, N. C Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, et al. Deep Learning with Coherent Nanophotonic Circuits. *Nature Photonics*, 11(7):441, 2017.
- [27] Makoto Murase. Linear Feedback Shift Register, February 18 1992. US Patent 5,090,035.
- [28] W. Qian, X. Li, M. D. Riedel, K. Bazargan, and D. J. Lilja. An Architecture for Fault-tolerant Computation with Stochastic Logic. *IEEE Transactions on Computers*, 60(1):93–105, 2011.
- [29] G. M. Phillips. Bernstein Polynomials. In Interpolation and Approximation by Polynomials, pages 247–290. Springer, 2003.

- [30] M. H. Najafi, P. Li, D. J. Lilja, W. Qian, K. Bazargan, and M. Riedel. A Reconfigurable Architecture with Sequential Logic-based Stochastic Computing. *ACM Journal on Emerging Technologies in Computing Systems*, 13(4):1–28, 2017.
- [31] P. Li and D. J. Lilja. Using Stochastic Computing to Implement Digital Image Processing Algorithms. In International Conference on Computer Design, pages 154–161. IEEE, 2011.
- [32] R. K. Budhwani, R. Ragavan, and O. Sentieys. Taking Advantage of Correlation in Stochastic Computing. In *International Symposium on Circuits and Systems*, pages 1–4. IEEE, 2017.
- [33] A. Alaghi, C. Li, and J. P. Hayes. Stochastic Circuits for Real-time Image-processing Applications. In *Design Automation Conference*, pages 1–6. IEEE, 2013.
- [34] K. J. Ahmed, B. Yuan, and M. J. Lee. High-Accuracy Stochastic Computing-based FIR Filter Design. In International Conference on Acoustics, Speech and Signal Processing, pages 1140–1144. IEEE, 2018.
- [35] X. Lee, C. Chen, H. Chang, and C. Lee. A 7.92 Gb/s 437.2 mW Stochastic LDPC Decoder Chip for IEEE 802.15.3c Applications. *IEEE Transactions on Circuits and Systems*, 62(2):507–516, 2014.
- [36] Y. Liu, S. Liu, Y. Wang, F. Lombardi, and J. Han. A Survey of Stochastic Computing Neural Networks for Machine Learning Applications. *IEEE Transactions on Neural Networks and Learning Systems*, 2020.

- [37] J. Li, A. Ren, Z. Li, C. Ding, B. Yuan, Q. Qiu, and Y. Wang. Towards Acceleration of Deep Convolutional Neural Networks using Stochastic Computing. In Asia and South Pacific Design Automation Conference (ASP-DAC), pages 115–120. IEEE, 2017.
- [38] S. R. Faraji, M. H. Najafi, B. Li, D. J. Lilja, and K. Bazargan. Energy-efficient Convolutional Neural Networks with Deterministic Bit-stream Processing. In Design, Automation & Test in Europe Conference & Exhibition, pages 1757–1762. IEEE, 2019.
- [39] E. O'Neill. Spatial Filtering in Optics. IRE Transactions on Information Theory, 2(2):56–65, 1956.
- [40] T. H. Maiman. Optical and Microwave-Optical Experiments in Ruby. Physical Review Letters, 4(11):564, 1960.
- [41] A. Lugt. Coherent Optical Processing. Proceedings of the IEEE, 62(10):1300–1319, 1974.
- [42] H. Rajbenbach, Y. Fainman, and S.H. Lee. Optical Implementation of an Iterative Algorithm for Matrix Inversion. *Applied Optics*, 26(6):1024–1031, 1987.
- [43] D. Psaltis, D. Brady, and K. Wagner. Adaptive Optical Networks using Photorefractive Crystals. Applied Optics, 27(9):1752–1759, 1988.
- [44] P.S. Guilfoyle. Digital Optical Computer II. In Optical Enhancements to Computing Technology, volume 1563, page 214. International Society for Optics and Photonics, 1991.
- [45] R.S. Rudokas and P.S. Guilfoyle. A Digital Optical Implementation of RISC. In COMPCON Spring'91 Digest of Papers, pages 436–441. IEEE, 1991.

- [46] LightON. https://lighton.ai/, 2021.
- [47] Lightmatter. https://medium.com/lightmatter, 2021.
- [48] R. Soref. The Past, Present, and Future of Silicon Photonics. IEEE Journal of Selected Topics in Quantum Electronics, 12(6):1678–1687, 2006.
- [49] D.J. Thomson, F.Y. Gardes, Y. Hu, G. Mashanovich, M. Fournier, P. Grosse, J.M. Fedeli, and G.T. Reed. High Contrast 40Gbit/s Optical Modulation in Silicon. Optics Express, 19(12):11507–11516, 2011.
- [50] W. Bogaerts, P. De Heyn, T. Van Vaerenbergh, K. De Vos, S. Kumar Selvaraja, T. Claes, P. Dumon, P. Bienstman, D. Van Thourhout, and R. Baets. Silicon Microring Resonators. Laser & Photonics Reviews, 6(1):47–73, 2012.
- [51] K. Kubota, J. Noda, and O. Mikami. Traveling Wave Optical Modulator using a Directional Coupler LiNbO3 Waveguide. *IEEE Journal of Quantum Electronics*, 16(7):754–760, 1980.
- [52] E. F. Schubert. Doping in III-V Semiconductors. E. Fred Schubert, 2015.
- [53] L. Constans, S. Combrié, X. Checoury, G. Beaudoin, I. Sagnes, F. Raineri, and A. De Rossi. III-V/Silicon Hybrid Nonlinear Nanophotonics in The Context of On-chip Optical Signal Processing and Analog Computing. *Frontiers in Physics*, 7:133, 2019.
- [54] G. Crosnier, D. Sanchez, S. Bouchoule, P. Monnier, G. Beaudoin, I. Sagnes, R. Raj, and F. Raineri. Hybrid Indium Phosphide-on-Silicon Nanolaser Diode. Nature Photonics, 11(5):297–300, 2017.

- [55] R. Soref and B. Bennett. Electrooptical Effects in Silicon. IEEE Journal of Quantum Electronics, 23(1):123–129, 1987.
- [56] G. Cocorullo and I. Rendina. Thermo-Optical Modulation at 1.5 μm in Silicon Etalon. *Electronics Letters*, 28(1):83–85, 1992.
- [57] M. Nedeljkovic, R. Soref, and G. Z. Mashanovich. Free-Carrier Electrorefraction and Electroabsorption Modulation Predictions for Silicon Over the 1–14-μm Infrared Wavelength Range. *IEEE Photonics Journal*, 3(6):1171–1180, 2011.
- [58] P. Günter. Nonlinear Optical Effects and Materials, volume 72. Springer, 2012.
- [59] J. Hardy and J. Shamir. Optics Inspired Logic Architecture. Optics Express, 15(1):150–165, 2007.
- [60] M. JR. Heck, J. F. Bauters, M. L. Davenport, D. T. Spencer, and J. E. Bowers. Ultra-low Loss Waveguide Platform and its Integration with Silicon Photonics. *Laser & Photonics Reviews*, 8(5):667–686, 2014.
- [61] G. Roelkens, L. Liu, D. Liang, R. Jones, A. Fang, B. Koch, and J. Bowers. III-V/Silicon Photonics for On-Chip and Intra-Chip Optical Interconnects. *Laser & Photonics Reviews*, 4(6):751–779, 2010.
- [62] Z. Zhou, B. Yin, and J. Michel. On-Chip Light Sources for Silicon Photonics. Light: Science & Applications, 4(11):e358, 2015.
- [63] F. Xia, T. Mueller, Y. Lin, A. Valdes-Garcia, and P. Avouris. Ultrafast Graphene Photodetector. *Nature Nanotechnology*, 4(12):839–843, 2009.
- [64] H. T. Chen, J. Verbist, P. Verheyen, P. De Heyn, G. Lepage, J. De Coster, P. Absil, X. Yin, J. Bauwelinck, J. Van Campenhout, et al. High Sensitivity 10Gb/s Si Photonic Receiver Based on a Low-Voltage Waveguide-Coupled Ge Avalanche Photodetector. *Optics Express*, 23(2):815–822, 2015.

- [65] L. Chen, K. Preston, S. Manipatruni, and M. Lipson. Integrated GHz Silicon Photonic Interconnect with Micrometer-Scale Modulators and Detectors. *Optics Express*, 17(17):15248–15256, 2009.
- [66] C. Gunn. CMOS Photonics for High-Speed Interconnects. IEEE Micro, 26(2):58–66, 2006.
- [67] D. Pérez, I. Gasulla, and J. Capmany. Toward Programmable Microwave Photonics Processors. *Journal of Lightwave Technology*, 36(2):519–532, 2018.
- [68] D. A. B. Miller. Self-configuring Universal Linear Optical Component. Photonics Research, 1(1):1–15, 2013.
- [69] Q. Xu and M. Lipson. All-optical Logic Based on Silicon Micro-ring Resonators. Optics Express, 15(3):924–929, 2007.
- [70] A. L. Giesecke, A. Prinzen, J. Bolten, C. Porschatis, B. Chmielak, C. Matheisen, T. Wahlbrink, H. Lerch, M. Waldow, and H. Kurz. Add-Drop Microring Resonator for Electro-Optical Switching and Optical Power Monitoring. In *Conference on Lasers and Electro-Optics*, pages 1–2. IEEE, 2014.
- [71] P. Xu, J. Zheng, J. K Doylend, and A. Majumdar. Low-Loss and Broadband Nonvolatile Phase-Change Directional Coupler Switches. ACS Photonics, 6(2):553–557, 2019.
- [72] Q. Xu and R. Soref. Reconfigurable Optical Directed-logic Circuits using Microresonator-based Optical Switches. Optics Express, 19(6):5244–5259, 2011.

- [73] F. Denis-Le Coarer, M. Sciamanna, A. Katumba, M. Freiberger, J. Dambre, P. Bienstman, and D. Rontani. All-optical Reservoir Computing on a Photonic Chip Using Silicon-based Ring Resonators. *IEEE Journal of Selected Topics in Quantum Electronics*, 24(6):1–8, 2018.
- [74] T. Ishihara, A. Shinya, K. Inoue, K. Nozaki, and M. Notomi. An Integrated Optical Parallel Adder as a First Step Towards Light Speed Data Processing. In *International SoC Design Conference*, pages 123–124. IEEE, 2016.
- [75] Y. Imai, T. Ishihara, H. Onodera, A. Shinya, S. Kita, K. Nozaki, K. Takata, and M. Notomi. An Optical Parallel Multiplier using Nanophotonic Analog Adders and Optoelectronic Analog-to-Digital Converters. In *Conference on Lasers and Electro-Optics: Science and Innovations*, pages JW2A–50. Optical Society of America, 2018.
- [76] P.C. Meier, R.A. Rutenbar, and L.R. Carley. Exploring Multiplier Architecture and Layout for Low Power. In *Proceedings of Custom Integrated Circuits Conference*, pages 513–516. IEEE, 1996.
- [77] K. Nozaki, A. Shinya, S. Matsuo, Y. Suzaki, T. Segawa, T. Sato, Y. Kawaguchi, R. Takahashi, and M. Notomi. Ultralow-power All-optical RAM Based on Nanocavities. *Nature Photonics*, 6(4):248, 2012.
- [78] M. Notomi, A. Shinya, K. Nozaki, T. Tanabe, S. Matsuo, E. Kuramochi, T. Sato, H. Taniyama, and H. Sumikura. Low-Power Nanophotonic Devices Based on Photonic Crystals Towards Dense Photonic Network on Chip. *IET Circuits, Devices & Systems*, 5(2):84–93, 2011.

- [79] C. Ríos, M. Stegmaier, P. Hosseini, D. Wang, T. Scherer, C. D. Wright, H. Bhaskaran, and W. H. P. Pernice. Integrated All-photonic Non-Volatile Multi-Level Memory. *Nature Photonics*, 9(11):725–732, 2015.
- [80] C. Ríos, N. Youngblood, Z. Cheng, M. Le Gallo, W. H. P. Pernice, C. D. Wright, A. Sebastian, and H. Bhaskaran. In-Memory Computing on a Photonic Platform. *Science Advances*, 5(2):5759, 2019.
- [81] J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M.L. Gallo, X. Fu, A. Lukashchuk, A. Raja, J. Liu, et al. Parallel Convolution Processing using An Integrated Photonic Tensor Core. arXiv preprint arXiv:2002.00281, 2020.
- [82] Q. Xu, B. Schmidt, S. Pradhan, and M. Lipson. Micrometre-scale Silicon Electro-optic Modulator. *Nature*, 435(7040):325, 2005.
- [83] H. Li, S. Le Beux, Y. Thonnart, and I. O'Connor. Complementary Communication Path for Energy Efficient On-chip Optical Interconnects. In *Design Automation Conference*, pages 1–6. IEEE, 2015.
- [84] V. Van, T. Ibrahim, P. Absil, F. Johnson, R. Grover, and P. Ho. Optical Signal Processing using Nonlinear Semiconductor Microring Resonators. *IEEE Journal of Selected Topics in Quantum Electronics*, 8(3):705–713, 2002.
- [85] R. C. Gonzalez, R. E. Woods, and S. L. Eddins. *Digital Image Processing using MATLAB*. Pearson Education India, 2004.
- [86] D. R. Bull. Communicating Pictures: A Course in Image and Video Coding. Academic Press, 2014.
- [87] R. Wang, A. Vasiliev, M. Muneeb, A. Malik, S. Sprengel, G. Boehm, M. C. Amann, I. Šimonytė, A. Vizbaras, K. Vizbaras, et al. III-V-on-Silicon Photonic Integrated Circuits for Spectroscopic Sensing in the 2–4 μm Wavelength Range. Sensors, 17(8):1788, 2017.

- [88] M. Ziebell, D. Marris-Morini, G. Rasigade, J. Fédéli, P. Crozat, E. Cassan, D. Bouville, and L. Vivien. 40 Gbit/s Low-loss Silicon Optical Modulator Based on a Pipin Diode. *Optics Express*, 20(10):10591–10596, 2012.
- [89] M. Streshinsky, R. Ding, Y. Liu, A. Novack, Y. Yang, Y. Ma, X. Tu, E. Chee, A. Lim, P. Lo, et al. Low Power 50 Gb/s Silicon Traveling Wave Mach-Zehnder Modulator Near 1300 nm. Optics Express, 21(25):30350–30357, 2013.
- [90] X. Xiao, H. Xu, X. Li, Z. Li, T. Chu, Y. Yu, and J. Yu. High-speed, Low-loss Silicon Mach–Zehnder Modulators with Doping Optimization. *Optics Express*, 21(4):4116–4125, 2013.
- [91] J. H. Anderson, Y. Hara-Azumi, and S. Yamashita. Effect of LFSR Seeding, Scrambling and Feedback Polynomial on Stochastic Computing Accuracy. In Design, Automation & Test in Europe Conference & Exhibition, pages 1550–1555. IEEE, 2016.
- [92] R. Wu, C. H. Chen, C. Li, T. C. Huang, F. Lan, C. Zhang, Y. Pan, J. E. Bowers, R. G. Beausoleil, and K. T. Cheng. Variation-Aware Adaptive Tuning for Nanophotonic Interconnects. In *International Conference on Computer-Aided Design*, pages 487–493. IEEE, 2015.
- [93] K. Devika and R. Bhakthavatchalu. Design of Reconfigurable LFSR for VLSI IC Testing in ASIC and FPGA. In International Conference on Communication and Signal Processing (ICCSP), pages 0928–0932. IEEE, 2017.

- [94] A. Bazin, K. Lenglé, M. Gay, P. Monnier, L. Bramerie, R. Braive, G. Beaudoin, I. Sagnes, R. Raj, and F. Raineri. Ultrafast All-Optical Switching and Error-Free 10 Gbit/s Wavelength Conversion in Hybrid InP-Silicon on Insulator Nanocavities Using Surface Quantum Wells. *Applied Physics Letters*, 104(1):011102, 2014.
- [95] A. Alaghi and J. P. Hayes. Exploiting Correlation in Stochastic Circuit Design. In International Conference on Computer Design, pages 39–46. IEEE, 2013.
- [96] G. Moille, S. Combrié, L. Morgenroth, G. Lehoucq, F. Neuilly, B. Hu, D. Decoster, and A. de Rossi. Integrated All-Optical Switch with 10 ps Time Resolution Enabled by ALD. Laser & Photonics Reviews, 10(3):409–419, 2016.
- [97] P. K. Gupta and R. Kumaresan. Binary Multiplication with PN Sequences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(4):603–606, 1988.
- [98] Y. Wang, W. Wen, H. Li, and M. Hu. A Novel True Random Number Generator Design Leveraging Emerging Memristor Technology. In *Great Lakes Symposium on VLSI*, pages 271–276, 2015.
- [99] Damir Vodenicarevic, Nicolas Locatelli, Alice Mizrahi, Joseph S Friedman, Adrien F Vincent, Miguel Romera, Akio Fukushima, Kay Yakushiji, Hitoshi Kubota, Shinji Yuasa, et al. Low-energy Truly Random Number Generation with Superparamagnetic Tunnel Junctions for Unconventional Computing. *Physical Review Applied*, 8(5):054045, 2017.

- [100] I. Reidler, Y. Aviad, M. Rosenbluh, and I. Kanter. Ultrahigh-Speed Random Number Generation Based on a Chaotic Semiconductor Laser. *Physical Review Letters*, 103(2):024102, 2009.
- [101] A. Uchida, K. Amano, M. Inoue, K. Hirano, S. Naito, H. Someya, I. Oowada, T. Kurashige, M. Shiki, S. Yoshimori, et al. Fast Physical Random Bit Generation with Chaotic Semiconductor Lasers. *Nature Photonics*, 2(12):728–732, 2008.
- [102] Y. Li, Y. Zhang, L. Zhang, and A. W. Poon. Silicon and Hybrid Silicon Photonic Devices for Intra-datacenter Applications: State of The Art and Perspectives. *Photonics Research*, 3(5):B10–B27, 2015.
- [103] 65nm Technology. https://www.tsmc.com/english/dedicatedFoundry/technology/logic/1\_65nm, 2021.
- [104] C. Sun, C. H. O. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L. S. Peh, and V. Stojanovic. DSENT-A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling. In *International Symposium* on Networks-on-Chip, pages 201–210. IEEE, 2012.
- [105] M. Ayata, Y. Fedoryshyn, W. Heni, B. Baeuerle, A. Josten, M. Zahner, U. Koch, Y. Salamin, C. Hoessbacher, C. Haffner, et al. High-Speed Plasmonic Modulator in a Single Metal Layer. *Science*, 358(6363):630–632, 2017.
- [106] K. Ramesh. Broadband Silicon Photonics Devices with Wavelength Independent Directional Couplers. PhD thesis, Indian Institute of Technology Madras, 2019.

- [107] X. Wang, L. Zhou, R. Li, J. Xie, L. Lu, K. Wu, and J. Chen. Continuously Tunable Ultra-thin Silicon Waveguide Optical Delay Line. Optica, 4(5):507–515, 2017.
- [108] F.P. Sunny, E. Taheri, M. Nikdast, and S. Pasricha. A Survey on Silicon Photonics for Deep Learning. arXiv preprint arXiv:2101.01751, 2021.
- [109] X. Wu, J. Xu, Y. Ye, Z. Wang, M. Nikdast, and X. Wang. SUOR: Sectioned Undirectional Optical Ring for Chip Multiprocessor. ACM Journal on Emerging Technologies in Computing Systems (JETC), 10(4):1–25, 2014.

# Biography

## Education

- Concordia University: Montreal, Quebec, Canada
   Ph.D., Electrical and Computer Engineering (Jan. 2017 - Mar. 2021)
- New York Institute of Technology: Amman, Jordan
   M.Sc., Electrical and Computer Engineering (Aug. 2002 - Jun. 2003)
- Al Ahliyya Amman University: Amman, Jordan
   B.Sc., Computer Engineering (Oct. 1996 - Jun. 2001)

## Awards

- Concordia Accelerator Award, Canada (2020).
- Concordia University Conference and Exposition Award, Canada (2019).

## Work History

- Concordia University: Montreal, Quebec, Canada
   Research Assistant, Electrical and Computer Engineering (2017-2021)
- Al Ahliyya Amman University: Amman, Jordan
   Lecturer, Computer Engineering (2006-2015)
- Al Ahliyya Amman University: Amman, Jordan
   Lab Engineer, Computer Engineering (2001-2006)

## Publications

#### • Journal Papers

- Bio-Jr1 H. El-Derhalli, L. Constans, S. Le Beux, A. De Rossi, F. Raineri, and S. Tahar. "Towards All-optical Stochastic Computing Using Photonic Crystal Nanocavities", ACM Journal on Emerging Technologies in Computing Systems, *Submitted*.
- Bio-Jr2 H. El-Derhalli, S. Le Beux, and S. Tahar. "Design Space Exploration of Stochastic Computing Architectures Implemented using Integrated Optics", IEEE Transactions on Emerging Topics in Computing, DOI: 10.1109/TETC.2020.2969435, January 2020.

#### • Refereed Conference Papers

- Bio-Cf1 H. El-Derhalli, S. Le Beux, and S. Tahar. "OSCAR: An Optical Stochastic Computing AcceleRator for Polynomial Functions", Proc. IEEE/ACM Design Automation and Test in Europe (DATE'20), March 2020, pp. 1450-1455.
- Bio-Cf2 H. El-Derhalli, S. Le Beux, and S. Tahar. "Stochastic Computing with Integrated Optics", Proc. IEEE/ACM Design Automation and Test in Europe (DATE'19), March 2019, pp. 1342-1347.

#### • Technical Reports

- Bio-Tr1 H. El-Derhalli, L. Constans, S. Le Beux, A. De Rossi, F. Raineri, and S. Tahar. "Optical Stochastic Computing Architectures Using Photonic Crystal Nanocavities", Technical Report, Department of Electrical and Computer Engineering, Concordia University, February 2021. http://arxiv.org/abs/2102.02064

#### • Workshops

- Bio-WS1 H. El-Derhalli, S. Le Beux, and S. Tahar. "Designing Reconfigurable Stochastic Computing Architecture Using Integrated Optics", 4<sup>th</sup> Montreal Photonics Networking Event, Montreal, Canada, October 2020.
- Bio-WS2 H. El-Derhalli, S. Le Beux, and S. Tahar. "Design of Stochastic Computing Architectures Using Integrated Optics", École d'hiver Francophone sur les Technologies de Conception des Systèmes Embarqués Hétérogènes (FETCH), Montreal, Canada, February 2020.

- Bio-WS3 H. El-Derhalli, S. Le Beux, and S. Tahar. "Stochastic Computing with Integrated Optics", Optical/Photonics Interconnect for Computing Systems (OPTICS) Workshop in Conjunction with Design, Automation, and Test in Europe Conference, Florence, Italy, March 2019.