

Turkish Journal of Electrical Engineering & Computer Sciences

http://journals.tubitak.gov.tr/elektrik/

Research Article

Turk J Elec Eng & Comp Sci (2019) 27: 153 – 166 © TÜBİTAK doi:10.3906/elk-1711-211

# Pipelined adders for ultralow-power wearables

Mansi JHAMB, Tejaswini DHALL<sup>\*</sup>, Tamish VERMA, Hinduja PUDI

University School of Information Communication and Technology, GGSIPU, New Delhi, India

Received: 17.11.2017 • Accepted/Published Online: 18.06.2018 • Final Version: 22.01.2019

Abstract: Wearable devices are indispensable for continuous real-time monitoring of personal health. Cost, power consumption, and limited device dimensions are the critical constraints that must be handled carefully when designing these battery-powered devices. Wearables employ high-end processors dedicated to complex signal processing. The core of every digital signal processor is its data path, and arithmetic units such as adders constitute the core of the data path and addressing unit. This work proposes a novel low-complexity asynchronous pipelined adder. The proposed design offers savings in power and latency, which makes it a suitable candidate for low-power, high-speed wearables. It consumes a minimum power of 33.46 µW and offers a minimum propagation delay of 0.04 ns, and it is compared with state-of-art adders such as the ripple carry adder (RCA), carry look ahead adder (CLA), and carry select adder (CSA). Thus, an area-delay-power efficient adder design guarantees high-end performance for wearables.

**Key words:** C-element, asynchronous, adders, pipeline, completion detection, process voltage temperature, completion detection circuit, complementary metal oxide semiconductor, integrated circuits

## 1. Introduction

A wireless body area network (WBAN) is a wireless sensor network supporting a wide range of the latest wearable devices for health care and biomedical applications. These WBANs [IEEE 802.15.6] comprise sensors, batteries, transceivers, and embedded digital signal processors (DSPs). The core of every digital signal processor is its data path. The data path and addressing units are primarily arithmetic units involving adders. Hence, designing an area-delay-power efficient adder guarantees high-end performance for wearables [1–5]. Binary addition is the most commonly used operation in wearable technology [6], and a large number of algorithms have been implemented for it [7–9]. Asynchronous circuit design is inherently low-power due to the absence of a global synchronizing signal [10]. It offers high throughput and, because there is no clock, is robust across all process-voltage-temperature (PVT) corners [11]. This work proposes a pipelined implementation of an asynchronous adder. The proposed adder is implemented using the dual-rail domino logic technique [12]. Section 2 gives a brief overview of state-of-the-art adder designs. The proposed design is explained in Section 3, and simulation results are provided in Section 4.

## 2. State-of-art adders

Extensive variants of adders have been investigated by industrial and academic research communities. There exists a large variety of adders based on different algorithms. The conventional adders are the ripple carry adder [13], carry look ahead adder [14], and carry select adder [15].

\*Correspondence: tejaswini.dhall1994@gmail.com

A full adder operates on the inputs A, B, and $C_{in}$ and produces the outputs sum and carry:

$$SUM = A'B'C_{in} + A'BC'_{in} + AB'C'_{in} + ABC_{in},$$
$$CARRY = AB + AC_{in} + BC_{in}.$$

The gate-level diagram of the basic full adder is shown in Figure 1. The circuit adds the three inputs A, B, and $C_{in}$ and produces the sum and carry outputs.
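As a quick sanity check, the SUM and CARRY expressions can be evaluated for all eight input combinations and compared with ordinary integer addition. The following is a minimal behavioral sketch in Python, not the circuit itself; the function name `full_adder` is chosen here for illustration.

```python
from itertools import product

def full_adder(a, b, c_in):
    """Return (sum, carry) using the sum-of-products equations for a full adder."""
    na, nb, nc = 1 - a, 1 - b, 1 - c_in  # complements A', B', C_in'
    s = (na & nb & c_in) | (na & b & nc) | (a & nb & nc) | (a & b & c_in)
    carry = (a & b) | (a & c_in) | (b & c_in)
    return s, carry

# Exhaustive check: carry has weight 2, sum has weight 1.
for a, b, c in product((0, 1), repeat=3):
    s, carry = full_adder(a, b, c)
    assert 2 * carry + s == a + b + c
```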



Figure 1. Basic structure of full adder.

## 2.1. Ripple carry adder (RCA)

In a ripple carry adder, full adders are connected in cascade, and the output carry of each adder is fed to the $C_{in}$ input of the next one. The basic block diagram of the ripple carry adder is shown in Figure 2, where a 4-bit ripple carry adder is implemented using 1-bit full adders. The disadvantage of the ripple carry adder is that the carry input of each stage depends on the previous stage; a stage can complete its calculation only after the previous stage has delivered its carry.
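The cascade can be modeled in a few lines. The sketch below is a behavioral Python model under the assumption of little-endian bit lists; the helper names (`ripple_carry_add`, `to_bits`, `from_bits`) are illustrative, and the sum is written with XOR, which is equivalent to the sum-of-products form of the full adder.

```python
def full_adder(a, b, c_in):
    """1-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c_in
    carry = (a & b) | (a & c_in) | (b & c_in)
    return s, carry

def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Add two little-endian bit lists; each stage consumes the previous carry."""
    sum_bits = []
    carry = c_in
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)  # serial dependency on the carry
        sum_bits.append(s)
    return sum_bits, carry

def to_bits(value, width=4):
    """Integer to little-endian bit list."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    """Little-endian bit list to integer."""
    return sum(bit << i for i, bit in enumerate(bits))

bits, carry_out = ripple_carry_add(to_bits(9), to_bits(7))
print(from_bits(bits), carry_out)
```

The loop makes the drawback visible: iteration `i` cannot start until iteration `i - 1` has produced `carry`.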

## 2.2. Carry look ahead adder (CLA)

Carry look ahead adder solves the issue of ripple carry adder through carry look ahead logic. Here, the carry signal is calculated in advance based on input signal as follows:

(Here,  $P_i$  and  $G_i$  denote the carry propagate and carry generate respectively.  $A_i$  and  $B_i$  are the inputs.)

$$C_1 = G_0 + P_0 C_0,$$

$$C_2 = G_1 + P_1 C_1 = G_1 + P_1 (G_0 + P_0 C_0),$$

$$C_3 = G_2 + P_2 C_2 = G_2 + P_2 (G_1 + P_1 C_1),$$

$$C_4 = G_3 + P_3 C_3 = G_3 + P_3 (G_2 + P_2 G_1 + P_2 P_1 C_1),$$




Figure 2. Ripple carry adder.

$$P_i = A_i \oplus B_i, \qquad G_i = A_i B_i.$$

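From the above equations, it is observed that, once expanded, each carry depends only on the inputs $A_i$, $B_i$ and the initial carry $C_0$ rather than on a carry that must first ripple through the preceding stages. Thus, the carry look ahead adder alleviates the problem of the ripple carry adder and reduces the system delay.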
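To make the look-ahead idea concrete, the sketch below computes all four carries of a 4-bit adder directly from the propagate and generate terms and the initial carry, with every $C_i$ fully expanded so that no carry waits on another. It is a behavioral Python illustration; the function names are chosen here and are not from the paper.

```python
def cla_carries(a_bits, b_bits, c0=0):
    """Return [C1, C2, C3, C4] for 4-bit little-endian inputs from P, G and C0 only."""
    p = [a ^ b for a, b in zip(a_bits, b_bits)]  # propagate: P_i = A_i XOR B_i
    g = [a & b for a, b in zip(a_bits, b_bits)]  # generate:  G_i = A_i AND B_i
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]

def cla_add(a, b, c0=0):
    """4-bit addition using look-ahead carries; returns (sum, carry_out)."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    carries = [c0] + cla_carries(a_bits, b_bits, c0)
    sum_bits = [a_bits[i] ^ b_bits[i] ^ carries[i] for i in range(4)]
    return sum(bit << i for i, bit in enumerate(sum_bits)), carries[4]

print(cla_add(9, 7))
```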

## 2.3. Carry select adder (CSA)

In a ripple carry adder, each block has to wait for the previous block to complete its processing. The carry select adder, whose block diagram is shown in Figure 3, removes this wait. Here, two 4-bit RCAs are multiplexed together and the resulting sum and carry are selected by the input carry. The carry select adder uses two ripple carry adder blocks that process the same input bits, one assuming a carry-in of 0 and the other a carry-in of 1. Both blocks operate in parallel. When the actual carry arrives, multiplexers select the matching pre-evaluated result and pass it on to the next block. This greatly reduces the carry propagation delay.
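The selection mechanism can be sketched behaviorally as follows. This Python model assumes an 8-bit operand split into two 4-bit blocks (the overall width is an assumption for illustration; the text only specifies 4-bit RCA blocks), and `ripple_block` stands in for a ripple carry block without modeling its internal gates.

```python
def ripple_block(a, b, c_in, width=4):
    """Behavioral stand-in for a width-bit ripple carry block: (sum, carry_out)."""
    total = a + b + c_in
    return total & ((1 << width) - 1), total >> width

def carry_select_add(a, b, c_in=0, width=8, block=4):
    """Carry select addition; returns (sum, carry_out)."""
    result, carry = 0, c_in
    mask = (1 << block) - 1
    for shift in range(0, width, block):
        a_blk = (a >> shift) & mask
        b_blk = (b >> shift) & mask
        # Both candidates are evaluated before the real carry is known.
        sum0, carry0 = ripple_block(a_blk, b_blk, 0, block)
        sum1, carry1 = ripple_block(a_blk, b_blk, 1, block)
        # The incoming carry acts as the multiplexer select line.
        s, carry = (sum1, carry1) if carry else (sum0, carry0)
        result |= s << shift
    return result, carry

print(carry_select_add(200, 100))
```

In hardware the two candidate results of every block are computed concurrently; the sequential loop here only models the final multiplexer selection.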

## 3. The proposed adder

This work proposes an asynchronous pipelined adder targeting high-speed integrated circuit design applications. Asynchronous designs inherently possess an advantage over their synchronous counterparts: they consume less power and are faster. An asynchronous system operates according to the actual delays of the system elements. In the asynchronous implementation of the system, $T_{plh}$ and $T_{phl}$ denote the time to process the input when the output goes from low to high and from high to low, respectively.



Figure 3. Carry select adder.

The complete processing time for one cycle is given by (here,  $T_{pa}$  represents processing time of asynchronous device):

$$T_{pa} = T_{plh} + T_{phl}$$

For synchronous devices, processing time is represented by  $T_{ps}\colon$ 

$$T_{ps} = T_{plh} + T_{plh} = 2T_{plh}.$$

The asynchronous cycle time is therefore shorter than the synchronous one, provided that $T_{phl} < T_{plh}$:

$$T_{pa} < T_{ps}.$$

The block diagram of the proposed adder is shown in Figure 4. This structure comprises a 1-bit full adder and a double-edge-triggered D flip-flop. Here, the full adder is implemented using dual-rail domino logic, which offers the advantage of a low transistor count. The dual-rail protocol belongs to the asynchronous paradigm, so there is no clock that needs to be distributed with minimal skew across the circuit. Hence, no clock-skew issues exist in dual-rail designs. The proposed adder constitutes the functional block. The characteristics of the adder are shown in Table 1.
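The dual-rail convention can be illustrated with a small behavioral model. The sketch below assumes the common encoding in which each bit is carried on a true rail and a false rail, with both rails low meaning "precharged, no data yet"; this encoding and the helper names are assumptions made for illustration, and the model ignores transistor-level behavior such as early evaluation.

```python
SPACER = (0, 0)  # both rails low: precharged, no data present (assumed encoding)

def encode(bit):
    """Map a Boolean bit to (true_rail, false_rail)."""
    return (1, 0) if bit else (0, 1)

def is_valid(rails):
    """A dual-rail signal carries data when exactly one rail is high."""
    return rails in ((1, 0), (0, 1))

def dual_rail_full_adder(a, b, c_in):
    """Return ((sumtrue, sumfalse), (carryouttrue, carryoutfalse))."""
    if not all(is_valid(x) for x in (a, b, c_in)):
        return SPACER, SPACER  # outputs stay precharged until all inputs are valid
    a_t, b_t, c_t = a[0], b[0], c_in[0]
    s = a_t ^ b_t ^ c_t
    carry = (a_t & b_t) | (a_t & c_t) | (b_t & c_t)
    return encode(s), encode(carry)

print(dual_rail_full_adder(encode(1), encode(1), encode(0)))
```

Because a valid output is distinguishable from the precharged state, downstream logic can tell when the result is ready without a clock.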



Figure 4. Block diagram of the proposed adder.

Table 1. Characteristics of the proposed adder (Vdd = 1.2 V, Temp = 298 K).

| Parameter     | Value    |
|---------------|----------|
| $T_{phl}$     | 0.049 ns |
| $T_{plh}$     | 0.045 ns |
| $T_{pd}$      | 0.046 ns |
| Average power | 27.63 µW |

The schematic diagram of the adder is shown in Figure 5. Here, a dynamic complementary metal-oxide-semiconductor (CMOS) implementation is used for the generation of carryouttrue, carryoutfalse, sumtrue, and sumfalse. The output carry of this adder is the input of the double-edge-triggered D flip-flop, and the flip-flop output, carryouttrue, is fed back to the input of the adder. The schematic diagram of the D flip-flop is shown separately in Figure 6. The flip-flop has two inputs, i.e. the data input "D" and the clock, along with two outputs, Q and Q'. This element responds to both the leading and trailing edges of the clock and thus scores in terms of energy and speed.

Digital electronic systems implement pipelines to deliver high speed and to increase throughput. The pipelined design involves a register between stages. The asynchronous approach offers great modularity and flexibility to the design. In synchronous systems, the clock period must be longer than the worst-case stage delay and all the stages work simultaneously. In contrast, stages with unequal static delays can be connected to form a functional asynchronous design. This per-stage variability is exploited for improvements in throughput and system latency. Moreover, asynchronous designs consume power only on demand. The handshaking signals ackpre and acknxt are used for the control circuitry of the pipeline. Asynchronous pipelines can process multiple data items at the same time while consuming little power. The domino logic further improves the throughput and latency of the pipeline. The block diagram given in Figure 7 shows a basic three-stage asynchronous pipeline. The pipeline works in the following way:

1. Evaluation of stage 1,
2. Evaluation of stage 2,
3. Evaluation of stage 3.

Figure 5. Schematic diagram of full adder.



Figure 6. Block diagram of D flip-flop.



Figure 7. Block diagram of PS0 pipeline.

The completion detector circuit (CDC) of stage 3 forwards the acknowledgment signal indicating that evaluation is complete. This signal commences the precharging procedure for the second stage. The pipeline that we have proposed is a modified form of the PS0 pipeline. A single stage of the pipeline is shown in Figure 8. Here, a completion detector circuit is also used. A two-input NOR gate is used as a one-bit completion detector, and the completion detector for the whole data path is formed by combining all the bit signals with a Muller C-element in its van Berkel implementation.

In a synchronous design, the entire activity occurs at a specific frequency, so the energy is concentrated in a narrow spectral band centered around the clock frequency, causing a significant amount of electrical noise. In the proposed asynchronous design, the activity is uncorrelated. This contributes to a relatively small peak value of noise and a comparatively distributed noise spectrum. The asynchronous approach minimizes the high current peaks caused by simultaneous clock-induced circuit switching, thereby reducing on-chip power supply noise. In terms of switching noise, the proposed asynchronous implementation behaves better than its synchronous equivalents because the operation is less centralized and different blocks do not need supply current at the same time. Therefore, current peaks are wider and their maximum value is lower than in the synchronous case. The frequency spectrum of the supply current of the proposed asynchronous circuit does not exhibit peaks at the clock frequency and multiples thereof. There may be spikes, but they tend to fade away when a longer integration interval is considered.

Data-dependent delays in the design may cause jitter. Sources such as environmental or power supply noise contribute to this timing uncertainty. In a clocked design, jitter is contributed by the synchronous clock source, e.g., a crystal oscillator, where it prevails as an obvious undesired phenomenon. However, being asynchronous, our designs are different, as there is in general no specified reference value for timing. Besides all this, it is a highly desirable attribute of asynchrony that the speed of operation adapts to the prevailing conditions.
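The completion detection described above (a two-input NOR per dual-rail bit, combined across the data path through a C-element) can be sketched behaviorally as follows. This is an illustrative Python model, not the transistor-level van Berkel circuit; the class and function names are chosen here, and the polarity (inverting the NOR output so that 1 means "bit evaluated") is an assumption.

```python
class CElement:
    """Muller C-element: output changes only when all inputs agree, else it holds."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, *inputs):
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out  # mixed inputs: previous value is held

def bit_precharged(true_rail, false_rail):
    """Two-input NOR: 1 while the bit is precharged, 0 once either rail is high."""
    return int(not (true_rail or false_rail))

def data_path_done(c_element, rails):
    """Return 1 once every bit has evaluated and 0 once every bit has precharged."""
    bit_done = [1 - bit_precharged(t, f) for t, f in rails]  # assumed polarity
    return c_element.update(*bit_done)

c = CElement()
for state in ([(0, 0), (0, 0)], [(1, 0), (0, 0)], [(1, 0), (0, 1)],
              [(0, 0), (0, 1)], [(0, 0), (0, 0)]):
    print(state, data_path_done(c, state))
```

The hold behavior is what makes the C-element an event synchronizer: the done signal rises only after the slowest bit has evaluated and falls only after the slowest bit has precharged.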



Figure 8. Single-stage pipeline.

Rising operating frequencies and shrinking technology in high-density designs make the design of interconnects a complicated issue. Shape variations and signal-integrity degradation owing to nonideal interconnect effects can spoil circuit behavior. Several methods exist for determining wire constraints in design performance. The eye diagram is the most widely accepted approach for estimating the quality of a signal transmitted through communication links. In IC design, wires are visualized as high-speed links among various nodes, and thus an eye diagram may be obtained. The eye diagram, being a tool for behavioral analysis, includes the overall specifications of the system. Since there is no dedicated clock in the proposed asynchronous circuit, only random jitter may exist, due to aging, temperature, local voltage fluctuations, etc. This random jitter cannot be controlled by the system setup and is highly unpredictable. Therefore, the proposed model does not accommodate the inevitable random jitter. Hence, an eye diagram may be investigated in future work. Additionally, the random jitter may be predicted by considering the stochastic nature of the input and cumulative wire responses.

## 4. Simulation and results

The SPICE-level simulations were carried out in HSPICE using a 65 nm TSMC CMOS process. The power supply voltage was varied from 0.8 V to 1.2 V. All the designs were simulated with extracted wire and layout parasitics. To satisfy the delay constraints, minimum-size MOSFETs are employed in the design. Although increasing the transistor size improves speed, it also increases power dissipation because the load capacitances increase. Thus, we used the minimum size of the TSMC 65 nm CMOS process (W/L = 120 nm/60 nm). The proposed pipeline is a modified version of the PS0 pipeline. State-of-art pipeline designs employ a dynamic C-element, whereas the van Berkel C-element was used in our three-stage pipelined circuit. The C-element is indispensable because it acts as an event synchronizer. Designing circuits with the dynamic C-element is quite complicated: the behavior of such circuits is constrained as well as convoluted and may result in unspecified outputs [16], a condition that becomes difficult to handle when left untreated. On the other hand, van Berkel's C-element is superior in terms of delay, energy, and area [17]. The van Berkel C-element design offers the least overhead for latching, with no resistance against output switching, along with a topology that is symmetric in terms of its inputs. Therefore, we used the van Berkel C-element instead of the dynamic C-element.

The characteristics of the pipelined adder are shown in Table 1. The output waveform, which depicts the behavior of the proposed adder, is shown in Figure 9. Ripples exist in the carryouttrue and sumtrue waveforms due to data-dependent delays. The data may exhibit ripples arbitrarily between transactions as long as they are valid and stable at the setup time before the arrival of the next request. The data should remain stable from this time until ack is deasserted. In clocked systems, the data obey a fixed relationship with the synchronizing signal. When the setup and hold constraints are satisfied, the output becomes valid within a specific propagation delay. Thus, for synchronous systems, the input signal always satisfies the flip-flop's timing constraints; therefore, metastability has the least scope of occurrence. On the contrary, in an asynchronous system, the relationship between clock and data is not fixed; thus, frequent violations of hold and setup times may occur. When this occurs, the output (carryouttrue) goes to an intermediate level between the two valid states and dwells there for an indefinite time before resolving itself. Hence, Figure 9 shows three voltage levels in the carryouttrue waveform. Figure 10 shows the pipeline's worst-case and best-case latency obtained through timing analysis at the fast, slow, and typical process corners.

Besides physical scaling, application-dependent power dissipation is a major factor hampering the integration of scaled devices, thus limiting performance. Figure 11 shows the delay and average power at different voltage values. In our circuits, there are primarily three power dissipation phenomena: switching or dynamic power due to charging and discharging of the capacitive load on each device of the circuit, short-circuit power dissipation due to the circuit architecture (this component is greatly reduced in our design), and static power dissipation due to cumulative leakage currents. In the proposed design, the highest power consumption is due to the dynamic activity of the devices. As shown in Figure 11, there exists a quadratic relationship between dynamic power dissipation and power supply voltage. Power dissipation is highest at 1.2 V. As the voltage is decreased from 1.2 V to 1 V, the power dissipation reduces by 21.44%. When the voltage is further reduced to 0.9 V, the power reduces by 47.56%, and at 0.8 V it reduces by 73.61%. The critical data-path delay is the average propagation delay at each voltage and is obtained from $T_{phl}$ and $T_{plh}$. Figure 11 also reveals that supply voltage reduction ensures lower power at the cost of higher latency. As the voltage reduces from 1.2 V, delays of 0.079 ns, 0.075 ns, and 0.12 ns are observed at 1.0 V, 0.9 V, and 0.8 V, respectively.
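For orientation, the quadratic dependence of dynamic power on supply voltage can be evaluated in isolation. The sketch below assumes the textbook model in which dynamic power is proportional to the square of the supply voltage with capacitance, activity, and operating rate held fixed, and it assumes that the quoted reductions are all measured relative to 1.2 V. The measured figures also include leakage, short-circuit power, and voltage-dependent event rates, so they are not expected to coincide with this idealized curve.

```python
def ideal_dynamic_reduction(v_ref, v_new):
    """Percent reduction in dynamic power if P is proportional to V squared."""
    return (1.0 - (v_new / v_ref) ** 2) * 100.0

# Percent reductions quoted in the text (assumed to be relative to 1.2 V).
reported = {1.0: 21.44, 0.9: 47.56, 0.8: 73.61}

for v, measured in reported.items():
    ideal = ideal_dynamic_reduction(1.2, v)
    print(f"{v:.1f} V: ideal V^2 model {ideal:.2f}%, reported {measured:.2f}%")
```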



Figure 9. Waveform of the proposed adder.



The number of transistors contributes to the chip area and load capacitance of the design. The transistor count of the proposed adder is therefore compared with that of the state-of-art adders in Figure 12. The proposed adder has a lower gate count than RCA [13] and CLA [14], but CSA [15] has the lowest gate count. To establish an impartial testing environment, the circuits were tested on input patterns covering all possible combinations of the input streams. After the physical layout design, post-layout simulations were performed with extracted parasitics. Our adder is implemented with a minimum number of transistors because this reduces the chip area and allows the chip's throughput to be increased. The layout of the proposed adder is shown in Figure 13. The complete comparison of the proposed design with state-of-art designs is provided in Table 2. The transistor count of the proposed adder is reduced by 17.65% with respect to CLA and by 22.2% with respect to RCA.



Figure 12. Transistor count.

Table 2. Performance analyses of different adders (Vdd = 1.2 V, Temp = 298 K).

| Types of adder     | Gate count | Power (mW) | Delay (ns) |
|--------------------|------------|------------|------------|
| RCA [13]           | 288        | 0.206      | 4.20       |
| CLA [14]           | 272        | 0.302      | 3.14       |
| CSA [15]           | 152        | 0.194      | 1.37       |
| The proposed adder | 224        | 0.293      | 1.97       |
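The percentage reductions quoted for the transistor count can be reproduced from the gate-count column of Table 2. The short Python sketch below does only that arithmetic; a negative value means the proposed adder uses more gates than the baseline.

```python
# Gate counts taken from Table 2.
gate_count = {"RCA": 288, "CLA": 272, "CSA": 152, "Proposed": 224}

def reduction_percent(baseline, proposed):
    """Percent reduction of `proposed` relative to `baseline`."""
    return (baseline - proposed) / baseline * 100.0

for name in ("RCA", "CLA", "CSA"):
    change = reduction_percent(gate_count[name], gate_count["Proposed"])
    print(f"vs {name}: {change:+.2f}%")
```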

Figure 13. Layout of adder.

Similar types of parallel long wires can have a negative effect on noise injection and may lead to crosstalk. The design rule constraints are satisfied by appropriate insertion of buffers at different stages of physical design. For functional operation of a design, it is critical to ensure that the input-output pads have sufficient power and ground connections and that these connections are properly located to eliminate issues related to current switching noise. For reducing or eliminating the noise coupling effects, the following points are taken into consideration:

- Sensitive asynchronous inputs are isolated from other switching signal pads.
- All bidirectional pads are grouped together so that all are either in output or input mode.
- Slow input pads are grouped together.

In order to control the inductive switching noise, ground and power pads are appropriately placed. This limits the magnitude of noise. The proposed adder has smaller area compared to the state-of-art adders. The layout area for different adders is also compared in Table 3.

Table 3. Layout area comparison of different adders ($\mu m^2$).

| Types of adder | Layout area ($\mu m^2$) |
|----------------|-------------------------|
| RCA [13]       | 2214                    |
| CLA [14]       | 2160                    |
| CSA [15]       | 6201                    |
| Proposed adder | 298.172                 |

The latency and throughput of the pipeline, as well as its power delay product (PDP), are calculated. Throughput and latency are essential design parameters of the pipeline. Latency is calculated at various voltage levels from 0.8 V to 1.2 V with the temperature fixed at 298 K. The throughput obtained is 1.9 Gsps.

The cycle time is the time taken by one complete cycle of computation. It determines the throughput of the pipeline, which is its reciprocal, and is given by the following formula (where $T_{CB}$ represents the forward latency per stage, $T_{pre}$ represents the precharge time, and $T_{csc}$ represents the time required by csc):

$$T_{cycle} = 3T_{CB} + T_{pre} + 2T_{csc}, \qquad Throughput = \frac{1}{T_{cycle}}.$$
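The relation between the cycle-time expression above and the throughput can be sketched numerically. The paper does not list the individual values of $T_{CB}$, $T_{pre}$, and $T_{csc}$, so the timings below are hypothetical placeholders chosen only to exercise the formula; the only figure taken from the text is the reported throughput of 1.9 Gsps.

```python
def cycle_time(t_cb, t_pre, t_csc):
    """Cycle time in seconds: 3*T_CB + T_pre + 2*T_csc."""
    return 3 * t_cb + t_pre + 2 * t_csc

def throughput(t_cycle):
    """Samples per second, assuming one result per cycle."""
    return 1.0 / t_cycle

# Hypothetical timings in seconds (not from the paper).
t_cycle = cycle_time(t_cb=0.05e-9, t_pre=0.05e-9, t_csc=0.05e-9)
print(f"cycle time = {t_cycle * 1e9:.3f} ns, "
      f"throughput = {throughput(t_cycle) / 1e9:.2f} Gsps")

# Working backwards from the reported throughput gives the implied cycle time.
print(f"implied cycle time at 1.9 Gsps = {1.0 / 1.9e9 * 1e9:.3f} ns")
```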

Figure 14 illustrates the PDP variations with the supply voltage. The formula for calculation of PDP is (here  $T_{pd}$  represents propagation delay):

$$PDP = 2T_{pd} \cdot Power.$$
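As a worked instance of this expression, the Table 1 values at Vdd = 1.2 V ($T_{pd}$ = 0.046 ns, average power = 27.63 µW) can be substituted directly. The Python sketch below only performs this arithmetic; it uses the paper's expression including the factor of 2, and the function name is illustrative.

```python
def power_delay_product(t_pd_s, power_w):
    """PDP in joules using PDP = 2 * T_pd * Power."""
    return 2.0 * t_pd_s * power_w

# Table 1 values: T_pd = 0.046 ns, average power = 27.63 microwatts.
pdp = power_delay_product(0.046e-9, 27.63e-6)
print(f"PDP = {pdp * 1e15:.3f} fJ")
```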




## 5. Conclusion

This work proposes a pipelined adder design targeting low-power applications. The circuit exhibits its highest speed at 1.2 V, with a latency as low as 0.04 ns. The worst-case latency of the pipeline occurs at 0.8 V. The proposed pipeline achieves low power compared to the state-of-art designs. The throughput achieved is 1.9 Gsps. The lowest power dissipation is achieved at 0.8 V. Thus, an area-delay-power efficient pipelined adder design guarantees high-end performance for wearable devices.

## References

- [1] Abbasi QH, UrRehman M, Qaraqe K, Alomainy A. Advances in Body-Centric Wireless Communication: Applications and State-of-the-Art. Stevenage, UK: IET, 2016.
- [2] Daniel C, Navid N. Wireless Public Safety Networks 2: A Systematic Approach. Oxford, UK: Elsevier, 2016.
- [3] Jens M, Restituto MD. Ultra Low Power Transceiver for Wireless Body Area Networks. New York, NY, USA: Springer, 2013.
- [4] Yang GZ. Body Sensor Networks. London, UK: Springer, 2014.
- [5] Alomainy A, Bari RD, Abbasi QH, Chen Y, Mucchi L. Co-operative and Energy Efficient Body Area and Wireless Sensor Networks for Healthcare Applications. Amsterdam, the Netherlands: Academic Press Library, 2014.
- [6] Chew ES, Phyu MW, Goh WL. Ultra low-power full-adder for biomedical applications. IEEE C Elec Devices; 5-27 December 2009; Xi'an, China: IEEE. pp. 115-182.
- [7] Koren I. Computer Arithmetic Algorithms. 2nd ed. Natick, MA, USA: A K Peters, 2002.
- [8] Parhami B. Computer Arithmetic: Algorithms and Hardware Designs. 2nd ed. New York, NY, USA: Oxford University Press, 2000.
- [9] Ercegovac M, Lang T. Digital Arithmetic. Burlington, MA, USA: Morgan Kaufmann, Elsevier, 2003.
- [10] Jhamb M, Gitanjali. Efficient adders for assistive devices. Engineering Science and Technology 2016; 20: 95-104.
- [11] Sparso J. Asynchronous Circuit Design: A tutorial. Technical University of Denmark, 2006.
- [12] Xia Z, Hariyama M, Kameyama M. Asynchronous domino logic pipeline design based on constructed critical data path. IEEE T Vlsi Syst 2015; 23: 619-630.
- [13] Ahmad MM, Manjunathanachari K, Lalkishore K. Static power reduction in 32-bit ripple carry adder using dual threshold voltage assignment. International Journal of Computer Applications 2015; 4: 2278-1021.

- [14] Aradhya HVR, Lakshmesha J, Muralidhara KN. Reduced Complexity Hybrid Ripple Carry Look ahead adder. International Journal of Computer Applications 2013; 70: 0975-8887.
- [15] Kaur R, Grover A. Different Techniques used for Carry Select Adder-A Review. International Journal of Computer Applications 2015; 116: 0975-8887.
- [16] Moreira MT, Moraes FG, Calazans NLV. Beware the Dynamic C-Element. IEEE T Vlsi Syst 2014; 22: 1644-1647.
- [17] Shams M, Ebergen JC, Elmasry MI. A comparison of CMOS implementations of an asynchronous circuits primitive: the C-element. ISLPED '1996; Monterey, CA, USA: IEEE. pp. 93-96.