Turk J Elec Eng \& Comp Sci
(2016) 24: $5224-5237$
© TÜBİTAK
doi:10.3906/elk-1412-140

# An optimized embedded adder for digital signal processing applications 

Kala BHARATHAN*[^0], Seshasayanan RAMACHANDRAN<br>ECE Department, Anna University, Chennai, Tamil Nadu, India

| Received: 22.12.2014 | Accepted/Published Online: 07.11.2015 | Final Version: 06.12.2016 |
| :--- | :--- | :--- |


#### Abstract

In this paper, an embedded logic full adder (PRO-FA) circuit is proposed at the transistor level that reduces logic complexity, consumes low power, and occupies a small area. The design is implemented for 1 bit and then extended to 64 bits. The 1-bit PRO-FA is built using only 13 transistors and occupies an area of $2.85 \mu m^{2}$. The PDP of the proposed adder is $459.4 \times 10^{-18}$ Ws and its ADP is $128.25 \mu m^{2}$ ps; both are compared with the earlier reported designs. Furthermore, 16-, 32-, and 64-bit linear and square-root carry select adder/subtractor (CSLAS) structures are proposed. Realistic testing in terms of power and delay is performed for the proposed logic by implementing it on $8 \times 8$ modified Booth, array, and Wallace tree multiplier architectures. The efficiency of the proposed circuits in a DSP architecture, a 4-tap FIR filter, is demonstrated. Overall delay for CSLAS is reduced to $70 \%$ when compared to the conventional one. The implementations are done using the Cadence Virtuoso tool with TSMC 28 nm LP CMOS technology and are found to have power savings of up to $76 \%$. The proposed architectures offer significant improvement in terms of power and speed in comparison to other reported architectures.


Key words: Embedded logic, adder, low power, Booth multiplier, XOR gate, delay

## 1. Introduction

Adders are the main element for a wide range of arithmetic units like ALUs and multipliers. Adder architectures have been under investigation for a very long time, as the critical path determines the overall synchronous system performance used for digital signal processors in embedded applications. There are various factors that play a vital role in the optimization of circuit performances like power consumption and noise margins. The propagation delay and power are reduced by reducing $V_{D D}$ or by changing the sizes of the transistors along with the choice of logic function and implementation styles. When devices are cascaded, overconsumption of resources occurs. Hence they must be tackled at the fundamental design level. A great improvement to an adder (or any other scalable device) is the reduction of power consumption. One way to reduce the dynamic power consumption of a design is by decreasing the supply voltage.

$$
\begin{equation*}
\text{POWER}=C_{L} \times I \times V_{D D}^{2} \times f \tag{1}
\end{equation*}
$$

From Eq. (1), power depends on the square of $V_{D D}$, so lowering $V_{D D}$ reduces power quadratically, at the expense of overall speed (increased delay). In certain cases when $V_{D D}$ is greatly reduced, the logical operation of the circuit may be disturbed. The new direction in the technology is towards programmable $V_{D D}$ states. The three states of programmable $V_{D D}$ operation are high $V_{D D}$, low $V_{D D}$, and power gating. This technology helps in increasing the lifetime of any confined power source; unwanted draws and excess dissipation would be largely eliminated. An additional benefit of $V_{D D}$ reduction is an improved power delay product (PDP). Many techniques are used for transistor optimization, such as buffering, size progression, and ordering. The most effective one is transistor sizing, which minimizes propagation delay and power consumption. Sizing modifications alter the resistive characteristics of the transistor, but they also introduce parasitic capacitances that act against the optimization gains. The main objective is to build an optimized embedded adder design by reducing the number of transistors. The contributions in this paper are summarized as follows:

1. A 1-bit full adder structure (PRO-FA) with an alternative internal logic consisting of transmission gates (TGs) and pass transistors (PTs) for all arithmetic applications is proposed. Minimum sizes on all the transistors are maintained initially and all basic gate structures are modified to reduce the number of transistors when compared to standard ones. The length and width of the transistors K1 to K13 are $L=30$ nm, $W_{p}=800$ nm, and $W_{n}=600$ nm. The circuits are implemented using the Cadence Virtuoso tool with 28 nm TSMC CMOS LP technology.
2. A 16-, 32-, 64-bit linear and square-root carry select adder/subtractor (CSLAS) architecture that uses the PRO-FA as the basic block is proposed.
3. A radix-2 $8 \times 8$ modified Booth multiplier that consists of both Booth encoder and decoder circuits along with a Wallace tree and CSLA designed using the proposed circuits is evaluated in terms of power and delay. Finally, the optimized proposed circuits are tested in a DSP application, a finite impulse response (FIR) filter, and a comparison is done.

This paper is organized as follows.
Section 2 presents an overall literature survey. Section 3 introduces the alternative proposed embedded logic structure used to build the 1-bit full adder (PRO-FA) and also explains the proposed CSLAS architecture in depth. Section 4 reviews the results obtained from the simulation and the performance comparison done in terms of PDP. Finally, Section 5 concludes this work.

## 2. Previous works

The static CMOS full adder is based on the regular CMOS structure with PMOS pull-up and NMOS pull-down transistors. This circuit provides full swing voltage while the layout is symmetric and efficient due to the complementary transistor pairs. The disadvantages are the large input capacitance due to the large number of transistors and the increased area caused by the sized-up PMOS transistors.

The SR-CPL full adder with swing restoration and the transmission gate (TG) full adder are other classic circuits [1]. The SR-CPL has both true and complement gate structures with 26 transistors. It provides high speed and good driving capability due to the level restorers used in the circuit. However, it consumes high power due to the number of internal nodes and static inverters, which are the main sources of leakage and static power dissipation. TG full adders have no voltage drop problem, consume low power, and are mainly used for designing xor or xnor gates. Their main disadvantage is weak driving capability, so additional buffers are required.

The multiplexer based adder (MBA) [2] consists of fewer transistors compared to the static adder but the main disadvantage is it does not operate for low voltage devices. Dynamic implementations (D3L) [3] on the other hand yield an extremely fast design but end up paying higher costs in overall power consumption.

The main drawback of these adders is the high power consumption, due to the large number of transistors as well as the multiple paths to ground present in the sp-D3L implementations.

The static energy recovery full (SERF) adder [4] uses an energy recovery technique to reduce power consumption. It is constructed using 10 transistors. Energy recovery logic reuses charge and therefore consumes less energy than the other full adders. This circuit has several disadvantages. First, the sum is generated from two cascaded xnor gates, which leads to long delays. Second, it cannot work correctly at low supply voltage: in the worst case, when $A=B=1$, there is a $2 V_{t n}$ threshold loss in output voltage, so the suitable operating supply voltage is limited to $V_{D D}>2 V_{t n}+\left|V_{t p}\right|$. Third, there are five gate capacitances on a particular node, which causes a longer delay to generate the intermediate $A \oplus B$ signal.

ULPFA [5] is used to design low power xor-xnor gates that are implemented using pass transistors to produce the sum. The drawbacks of previously reported level restorers, namely delay, noise, and power consumption, are eliminated when the ULPD voltage level restorer is used. This circuit is robust against voltage scaling and transistor sizing. The disadvantages include high input capacitance and high area due to the use of low-mobility large pmos transistors in its structure; in addition, the series transistors in the output port create a weak driver. A 14-transistor full adder that employs more than one logic style for implementation is reported by Vesterbacka [6].

In [7], Zhang has proposed a hybrid pass-logic with static CMOS output drive full adder (HPSC). Here xor and xnor functions are simultaneously generated by pass transistor logic using only 6 transistors. This is employed in the CMOS module to produce full-swing outputs of the full adder at the cost of increased transistor count and decreased speed. Hybrid logic styles offer promising performance but these adders suffer from poor driving capability and their performance degrades drastically while cascading, unless suitably designed buffers are included.

In the proposed full adder circuit PRO-FA, the xor gate is the main element responsible for most of the power consumption of the entire adder circuit. Therefore, this module is redesigned using pass transistors and TGs to minimize the power to the best possible extent avoiding the voltage degradation possibility.

## 3. Proposed architectures

### 3.1. Proposed full adder (PRO-FA)

In this paper, a transmission gate along with the pass transistor concept is used to build a xor gate [8] that is further used to propose a 1-bit full adder. The main objective is to build an optimized embedded adder design with low power dissipation and low area by reducing the number of transistors. In the proposed full adder circuit PRO-FA, the xor gate is the main element that controls both the sum and carry outputs. An alternative method proposed in terms of rewriting the expressions of the sum and carry of full adder using xor gates only is given as follows:

$$
\begin{equation*}
\text{SUM}=A \oplus B \oplus \text{CIN}, \quad \text{CARRY}=[(A \times \bar{F})+(\text{CIN} \times F)] \tag{2}
\end{equation*}
$$

The 1-bit proposed full adder (PRO-FA) implemented using Eq. (2) as shown in Figure 1 has 13 transistors (K1 to K13) composed of TGs and pass transistors. This circuit gives full swing voltage due to the use of a TG for xor gate 1 (K1 to K4) and the parasitic capacitances are reduced due to the minimum sizing of the transistors. This structure is used to build fast adder circuits and registers. This implementation of the xor gate requires only 6 transistors, which is shown in Figure 1, when compared to the standard CMOS structure. The operation of the xor gate 1 is given below:

## BHARATHAN and RAMACHANDRAN/Turk J Elec Eng \& Comp Sci



Figure 1. Circuit diagram of PRO-FA.

i) When $B=1$, K1 and K2 act as an inverter while the transmission gates K3/K4 are off; hence the equation is

$$
\begin{equation*}
F=\overline{\mathrm{A}} \times B \tag{3}
\end{equation*}
$$

ii) When $B=0$, K1 and K2 are disabled and the transmission gates K3/K4 are operational; hence the equation is

$$
\begin{equation*}
F=A \times \bar{B} \tag{4}
\end{equation*}
$$

iii) When both the equations are combined at the F node, the respective xor gate 1 output obtained is

$$
\begin{equation*}
F=[(\bar{A} \times B)+(A \times \bar{B})] \tag{5}
\end{equation*}
$$
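The two operating cases of xor gate 1 can be checked with a small behavioural model. The sketch below is only a logic-level illustration (the function name `xor_gate1` is hypothetical and no transistor-level effects are modelled): it treats B as the selector between the inverter path and the transmission-gate path and confirms that the result matches Eq. (5) for every input pair.

```python
from itertools import product


def xor_gate1(a: int, b: int) -> int:
    """Logic-level model of xor gate 1: B selects which path drives node F."""
    if b == 1:
        return 1 - a  # inverter path (K1/K2) is active: F = NOT A
    return a          # transmission-gate path (K3/K4) is active: F = A


for a, b in product((0, 1), repeat=2):
    sum_of_products = ((1 - a) & b) | (a & (1 - b))  # right-hand side of Eq. (5)
    assert xor_gate1(a, b) == sum_of_products == (a ^ b)
    print(f"A={a} B={b} F={xor_gate1(a, b)}")
```
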

To build the complete 1-bit full adder structure we use this xor gate 1 output node F as in-out port for the next modules. This F in-out port is further given as one input to the next stage xor gate 2 (K5 to K7), which is implemented using 3 pass transistors only to produce the complete sum of the full adder. This in-out port F of PRO-FA shown in Figure 1 produces a strong output voltage (partial sum) that is used to implement the second stage xor gate 2. Increasing the $W / L$ ratio of transistor K7 minimizes the voltage degradation due to threshold drop. Some of the earlier reported adders are implemented using only pass transistors. This reduces the area and power due to the minimum number of transistors but does not produce a strong 1 or 0 at the output node. This becomes a problem when cascading is done for higher bits. Hence the use of a transmission gate to produce the partial sum (at F node) in PRO-FA provides a strong input for the second stage operation and for the generation of carry. The XOR gate 2 operates as follows:

iv) When $\mathrm{CIN}=1$, K5 and K6 act as an inverter while the K7 transistor is off; hence the equation is

$$
\begin{equation*}
\text{SUM}=\bar{F} \times \text{CIN} \tag{6}
\end{equation*}
$$

v) When $\mathrm{CIN}=0$, K5 and K6 are disabled and only transistor K7 is operational; hence the equation is

$$
\begin{equation*}
\text{SUM}=F \times \overline{\text{CIN}} \tag{7}
\end{equation*}
$$

vi) The SUM of the full adder is obtained by combining the above equations

$$
\begin{equation*}
\text{SUM}=[(\bar{F} \times \text{CIN})+(F \times \overline{\text{CIN}})] \tag{8}
\end{equation*}
$$

The carry of the full adder is produced using only 2 pass transistors K12 and K13 where the F acts as the control signal. Hence the CARRY expression is

$$
\begin{equation*}
\text{CARRY}=[(A \times \bar{F})+(\text{CIN} \times F)] \tag{9}
\end{equation*}
$$
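As a quick sanity check of Eqs. (5), (8), and (9), the following sketch evaluates them for all eight input combinations and compares the result with ordinary binary addition. It is a Boolean model only; the function name `pro_fa` is chosen here for illustration and nothing about voltage levels or timing is implied.

```python
from itertools import product


def pro_fa(a: int, b: int, cin: int) -> tuple[int, int]:
    """Boolean model of the full adder built from Eqs. (5), (8), and (9)."""
    f = ((1 - a) & b) | (a & (1 - b))          # Eq. (5): partial sum at node F
    total = ((1 - f) & cin) | (f & (1 - cin))  # Eq. (8): SUM
    carry = (a & (1 - f)) | (cin & f)          # Eq. (9): F selects A or CIN
    return total, carry


for a, b, cin in product((0, 1), repeat=3):
    s, c = pro_fa(a, b, cin)
    assert 2 * c + s == a + b + cin  # matches arithmetic addition of three bits
    print(f"A={a} B={b} CIN={cin} -> SUM={s} CARRY={c}")
```

The carry expression works because $F=0$ means $A=B$, so the carry equals $A$, while $F=1$ means exactly one of $A$, $B$ is set, so the carry equals CIN.
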

As soon as the partial sum F of PRO-FA is obtained, the final sum and carry are obtained. The propagation delay of the carry is larger than that of the sum, but the advantage is low static power. When transmission gates are used, the delay in the carry chain increases quadratically. The delay of $n$ transmission gates is calculated using the Elmore approximation given as

$$
\begin{equation*}
t_{p}=0.69 \times C \times R_{e q} \times \frac{n(n+1)}{2} \tag{10}
\end{equation*}
$$
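To illustrate the quadratic growth stated by Eq. (10), the sketch below evaluates the formula for a few chain lengths. The resistance and capacitance values are placeholders chosen for illustration only, not values from this work.

```python
def tg_chain_delay(n: int, r_eq: float, c: float) -> float:
    """Elmore delay of n cascaded transmission gates, Eq. (10)."""
    return 0.69 * c * r_eq * n * (n + 1) / 2


# Hypothetical per-stage values (assumptions, not taken from the paper).
R_EQ_OHM = 10e3
C_FARAD = 1e-15

for n in (1, 2, 4, 8, 16):
    delay_ps = tg_chain_delay(n, R_EQ_OHM, C_FARAD) * 1e12
    print(f"n={n:2d}  t_p={delay_ps:8.2f} ps")
```

Doubling the chain length from 8 to 16 stages multiplies the factor $n(n+1)/2$ from 36 to 136, which is why long transmission-gate chains are usually broken up by buffers.
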

The $C_{\text {load }}$ incurs a voltage change $V$, drawing energy $\left(C_{l o a d} \times V \times V_{D D}\right)$ from the supply voltage $V_{D D}$ for every low-to-high logic transition in an adder. These transitions occur at a fraction $\alpha_{j}$ with the clock frequency $f_{c}$ for each node $j$ belonging to $N$. The summation of all $N$ nodes in the circuit gives the total dynamic switching power. Hence, transistor size is an effective parameter for reducing dynamic power consumption as shown in the equation

$$
\begin{equation*}
\text{POWER}=V_{D D} \times f_{c} \times \sum_{j=1}^{N} \alpha_{j} \times C_{\text{load},j} \times V_{j} \tag{11}
\end{equation*}
$$
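Eq. (11) can be evaluated directly once per-node activity, load capacitance, and voltage swing are known. The sketch below uses made-up node values purely to show the mechanics and the linear dependence on load capacitance; none of the numbers come from the simulations reported here.

```python
def dynamic_power(vdd: float, f_clk: float, nodes) -> float:
    """Eq. (11): nodes is an iterable of (alpha_j, c_load_j, v_j) tuples."""
    return vdd * f_clk * sum(alpha * c_load * v for alpha, c_load, v in nodes)


# Hypothetical nodes: (switching activity, load capacitance in F, voltage swing in V).
nodes = [(0.5, 2e-15, 1.0), (0.25, 1e-15, 1.0), (0.1, 4e-15, 0.9)]

p_full = dynamic_power(vdd=1.0, f_clk=2e9, nodes=nodes)
# Halving every load capacitance (e.g. through smaller transistors) halves the power.
p_half = dynamic_power(1.0, 2e9, [(a, c / 2, v) for a, c, v in nodes])
print(f"{p_full * 1e6:.3f} uW -> {p_half * 1e6:.3f} uW")
```
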

The inverters are weak and the transmission gates are strong. The total propagation delay of the PRO-FA is calculated as

$$
\begin{equation*}
T=(N-1) \times\left(t_{\text {sum }}+t_{\text {carry }}\right) \tag{12}
\end{equation*}
$$

The transistor sizes are chosen on the theoretical background of the design initially. Later, the sizes are varied (through simulations) in the vicinity of the previously set values to obtain the best performance in terms of power and delay.

### 3.2. Proposed carry select adder/subtractor (CSLAS)

A carry select adder (CSLA) is the most challenging one due to its implementation complexity. The CSLA structure is used in many computational systems to alleviate the problem of carry propagation delay by independently generating multiple carries and then selecting a carry to generate the sum. However, the disadvantage of CSLA is that it is not area efficient, as it uses multiple pairs of ripple carry adders (RCAs) to generate the partial sum and carry for carry input $\mathrm{Cin}=0$ and $\mathrm{Cin}=1$. In this paper an approach to optimize the PDP, i.e. the energy consumption, throughout the entire system design is proposed. With the implementation of PRO-FA in the CSLAS, the area and power are greatly reduced. The existing standard adder structures are implemented without a subtractor depending on the application.

Here, a linear and square-root 16-bit carry select adder/subtractor (CSLAS) is proposed as shown in Figures 2 and 3. The subtractor module takes care of signed bits, which is further useful for computing signed numbers in the multiplication process. This is extended to 64 bits to test the performance. The internal modules of CSLAS are RCA, ripple carry adder with half adder (RCAHA), binary to excess-1 converter (BEC-1), and XOR, AND, and OR gates. The working principle of CSLAS is split into 4 phases of computation. The 1st phase consists of ADDSUB blocks. The ADDSUB block shown in Figure 4 is implemented using XOR gates only and performs subtraction only when $\mathrm{Cin}=1$. The XOR gate consists of 6 transistors (K1-K4, K8, K9) of the PRO-FA. The 2nd phase consists of RCA and RCAHA blocks. Here, a 4-bit RCAHA is designed using a 3-bit PRO-FA chain and a 1-bit proposed half adder. The 1-bit half adder is built using 8 transistors. The half adder carry is obtained using a TG only, and xor gate 1 is used to obtain the half adder sum. The RCAHA is used wherever $\mathrm{Cin}=0$. The 4-bit RCA block consists of a 4-bit PRO-FA circuit. The 3rd phase consists of BEC-1. For the linear 16-bit CSLAS, a 5-bit BEC is implemented using the proposed XOR and AND gates. For the 16-bit SQRT CSLAS, BECs of different bit widths are implemented. Using BEC-1 instead of a ripple carry adder (RCA) reduces the power and area of the carry select adder for the second stage of computation [9]. The main advantage of this transistor-level modified BEC-1 is that it needs fewer transistors than the ordinary BEC. The last (4th) phase consists of multiplexers (MUX). The MUX, built using TGs, is used as the final block to select the final sum and carry. The AND gate is implemented using TGs with 4 transistors, while the OR gate is implemented using 3 pass transistors with a restorer.
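The role of the BEC-1 can be made concrete with a gate-level sketch. The model below assumes the usual XOR/AND formulation of a binary to excess-1 converter, in which each output bit is the input bit XORed with the AND of all lower input bits; the helper names are illustrative, and the transistor-level realisation of the gates is not modelled.

```python
def bec1(bits: list[int]) -> list[int]:
    """Binary to excess-1 conversion on LSB-first bits: returns (value + 1) mod 2^n."""
    out = []
    all_lower_ones = 1  # AND of all lower input bits; taken as 1 for bit 0
    for b in bits:
        out.append(b ^ all_lower_ones)  # XOR stage (bit 0 is simply inverted)
        all_lower_ones &= b             # AND stage feeding the next bit
    return out


def to_bits(value: int, width: int) -> list[int]:
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits: list[int]) -> int:
    return sum(b << i for i, b in enumerate(bits))


# Exhaustive check of a 5-bit BEC (4 sum bits plus the group carry).
for value in range(32):
    assert from_bits(bec1(to_bits(value, 5))) == (value + 1) % 32

print(bec1([1, 1, 0, 0, 0]))  # 3 + 1 = 4, i.e. [0, 0, 1, 0, 0]
```

Because the converter needs only one XOR per bit plus a short AND chain, it is cheaper than a second ripple carry adder computing the same group with a carry-in of 1.
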


Figure 2. Proposed linear CSLAS.



Figure 3. Proposed square-root CSLAS.


Figure 4. ADDSUB of CSLAS.

The operation of CSLAS is as follows:

The augend $B[15:0]$ along with Cin is given to all 4 ADDSUB blocks. Subtraction is performed only when $\mathrm{Cin}=1$. The output $Y[15:0]$ obtained from ADDSUB is added to the addend $A[15:0]$ using the RCAHA blocks when $\mathrm{Cin}=0$; otherwise the RCA block is used to perform the addition. In the third stage, both the sum and carry are given to the BEC-1 and MUX blocks. The select line of the MUX is the carry. Depending on the select line, the output sum and the final carry are produced. Compared to the existing CSLA structure, an ADDSUB block is introduced, which reduces the delay in computation and also performs a subtraction operation. Hence, this design is an efficient one in terms of power and area as the number of transistors is reduced greatly. Further, to understand the PDP performance, 32-bit and 64-bit linear and square-root CSLAS are also designed and compared with the conventional ones.
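The data flow described above can be summarised in a behavioural model. The sketch below is a simplification under stated assumptions: it models the linear 16-bit case with four 4-bit groups, uses integer arithmetic in place of the RCA/RCAHA cells, and lets every group (including the first) go through the BEC-1 and MUX path. The function names are hypothetical.

```python
def group_add_cin0(a: int, y: int, width: int) -> int:
    """Group result for carry-in 0, returned as one (width+1)-bit word: carry|sum."""
    return a + y  # stands in for the RCA/RCAHA built from PRO-FA cells


def cslas(a: int, b: int, cin: int, width: int = 16, group: int = 4):
    """Behavioural linear carry select adder/subtractor: returns (result, carry_out)."""
    mask = (1 << width) - 1
    gmask = (1 << group) - 1
    y = (b ^ (mask if cin else 0)) & mask  # ADDSUB: XOR each bit of B with Cin
    result, carry = 0, cin
    for pos in range(0, width, group):
        word0 = group_add_cin0((a >> pos) & gmask, (y >> pos) & gmask, group)
        word1 = word0 + 1                    # BEC-1: same group with carry-in 1
        word = word1 if carry else word0     # MUX selected by the incoming carry
        result |= (word & gmask) << pos
        carry = word >> group
    return result, carry


print(cslas(100, 58, 0))  # addition
print(cslas(100, 58, 1))  # subtraction via two's complement (A + NOT B + 1)
```

With $\mathrm{Cin}=1$ the XOR stage inverts B and the same Cin acts as the initial carry, so the structure computes $A-B$ modulo $2^{16}$ without a separate subtractor.
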

### 3.3. Modified radix-2 Booth multiplier

Another objective of this work is to provide better performance signed and unsigned multipliers that can be used to design high-end processors. This multiplier is chosen due to its complexity and as it is used to multiply both signed and unsigned numbers. In this multiplier, the XOR gate structure is used for both encoder and decoder computation of partial products as shown in Figure 5. After obtaining the partial products, a Wallace tree adder circuit designed using PRO-FA is used to obtain the final product.


Figure 5. Partial product generator.

The MB algorithm reduces the number of partial products by half in the first step. To multiply X (multiplicand) by Y (multiplier), the MB algorithm starts by grouping the bits of Y into 3-bit groups and encoding each group into one of $\{-2,-1,0,1,2\}$. This algorithm is based on the fact that fewer partial products need to be generated for runs of consecutive 0's and 1's. For a group of consecutive 0's, there is no need to generate any new partial products. The partial products generated by the MB decoder are added in parallel using the Wallace tree adder structure. The final multiplication results are obtained using CSLA by adding the last two rows. The results of this MB multiplier are again compared with other multiplier architectures.
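The recoding step can be sketched in software. The example below follows the description above (overlapping 3-bit groups mapped to digits in $\{-2,-1,0,1,2\}$), which is the scheme commonly called radix-4 Booth recoding; it assumes two's-complement 8-bit operands, and the function names are illustrative. The Wallace tree and CSLA are replaced by a plain sum.

```python
def booth_digits(y: int, bits: int = 8) -> list[int]:
    """Recode a two's-complement multiplier into digits from {-2, -1, 0, 1, 2}."""
    padded = (y & ((1 << bits) - 1)) << 1  # append the implicit bit y[-1] = 0
    digits = []
    for i in range(0, bits, 2):
        group = (padded >> i) & 0b111      # bits y[i+1], y[i], y[i-1]
        y_hi, y_mid, y_lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
        digits.append(-2 * y_hi + y_mid + y_lo)
    return digits


def booth_multiply(x: int, y: int, bits: int = 8) -> int:
    """Sum the recoded partial products; each digit has weight 4**i."""
    return sum(d * x * 4**i for i, d in enumerate(booth_digits(y, bits)))


print(booth_digits(3))        # [-1, 1, 0, 0], since 3 = -1 + 1*4
print(booth_multiply(-7, 3))  # -21
```

An 8-bit multiplier yields only four digits, so four partial products are generated instead of eight, and a zero digit (for example inside a run of identical bits) contributes nothing.
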

### 3.4. FIR filter

A digital filter is a discrete-time, discrete-amplitude convolver. Basic Fourier transform theory states that the linear convolution of 2 sequences in the time domain is the same as multiplication of the 2 corresponding spectral sequences in the frequency domain. A filter mainly consists of registers (flip-flops), multipliers, and adder circuits. Filtering is in essence the multiplication of the signal spectrum by the frequency domain impulse response of the filter. A finite impulse response (FIR) filter performs a weighted average of a finite number of samples of the input sequence. The basic input-output structure of the FIR filter is a time-domain computation based on a feed-forward difference equation. Pipelining reduces the effective critical path by introducing latches along the data path. The number of coefficients generated for designing the FIR filter determines the order of the filter. If the order of the filter is $N$, then $(N+1)$ coefficient terms are required. The speed of the FIR filter depends on the rate at which input samples are processed. To increase the speed of the FIR filter, the critical path from input to output is reduced. The sampling period $T_{\text {sample }}$ of the FIR filter [10] is stated as

$$
\begin{equation*}
T_{\text {sample }} \geq\left[T_{m}+(N-1) T_{a}\right] \tag{13}
\end{equation*}
$$

Hence the sampling frequency $f_{\text {sample }}$ is stated as

$$
\begin{equation*}
f_{\text {sample }} \leq \frac{1}{\left[T_{m}+(N-1) T_{a}\right]} \tag{14}
\end{equation*}
$$

For the low pass 4-tap FIR filter using data broadcast form the sampling period is $\left[T_{m}+3 T_{a}\right]$ and for the transposed structure $\left[T_{m}+T_{a}\right]$. Finally, a 4-tap transposed FIR filter is realized with 3 D-FFs, 4 radix-2 MB multipliers, and PRO-FA circuits. The D-FF used in the design is implemented using transmission gates. The filter is simulated and verified using the Filter Design and Analysis (FDA) tool of MATLAB.
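The structure of a 4-tap transposed filter with 3 registers and 4 multipliers can be illustrated with a cycle-by-cycle model. The sketch below is a functional illustration only: the coefficients are arbitrary integers chosen for the example, not the coefficients generated for this design, and the multipliers and adders are ordinary Python arithmetic.

```python
def fir_transposed(samples, coeffs):
    """Transposed-form FIR: one multiplier per tap and len(coeffs) - 1 registers."""
    regs = [0] * (len(coeffs) - 1)
    out = []
    for x in samples:
        products = [h * x for h in coeffs]  # the same sample feeds every multiplier
        out.append(products[0] + (regs[0] if regs else 0))
        # Each register stores its tap product plus the old value of the next register.
        regs = [products[k + 1] + (regs[k + 1] if k + 1 < len(regs) else 0)
                for k in range(len(regs))]
    return out


def fir_direct(samples, coeffs):
    """Reference: y[n] = sum_k h[k] * x[n - k]."""
    return [sum(h * samples[n - k] for k, h in enumerate(coeffs) if n - k >= 0)
            for n in range(len(samples))]


COEFFS = [1, 2, 3, 4]  # hypothetical 4-tap coefficients
x = [1, 0, 0, 0, 0, 5, -2, 7]
print(fir_transposed(x, COEFFS))
assert fir_transposed(x, COEFFS) == fir_direct(x, COEFFS)
```

In this form only one multiplication and one addition lie between any two registers, which is consistent with the $\left[T_{m}+T_{a}\right]$ sampling period quoted for the transposed structure.
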

## 4. Simulation results and analysis

A single bit adder cell designed for optimum performance cannot perform well under deployment to real-time conditions. This is because in cascaded form the driver adder cells do not provide a proper input signal level to the driven cells. The circuit malfunctions under low supply voltages due to the cumulative degradation in signal level, which leads to faulty output. A practical simulation environment is set up to analyze the performance of the PRO-FA when it is actually used in VLSI applications. The low power (LP) process is ideal for low standby power applications such as cellular baseband. The 28 LP process boasts a $20 \%$ speed improvement over the 40 LP process at the same leakage/gate. The schematics and layout are designed using the Cadence Virtuoso tool with TSMC 28 nm LP CMOS technology.

The simulation environment is set up such that buffers are added at the inputs and outputs. The adder cell inputs are fed through the buffers to incorporate the effect of input capacitance and the outputs are also loaded with buffers to ensure proper loading condition. The 1-bit PRO-FA is simulated using several test bench setups. These test benches have the common prototype of 3 buffers at the input and 2 buffers at the output. The only difference is the number of adder cell stages used in between the input and output of the simulation setup. The number of stages starts from 2 and is increased gradually. The supply voltage is varied from 1.2 V to 0.7 V for all the different adder structures. The transistor sizes are equally maintained initially and then varied to test the efficiency of the output signals generated for all the full adder circuits. The maximum frequency given to the inputs is 2 GHz for simulations. The delays reported are the worst-case sum and carry delays observed in every adder, and the static and dynamic power consumed is obtained for all the possible input combinations. The PDP is computed for a voltage scaling from 1.2 V to 0.7 V. From the results shown in Tables 1-5 and Figures 1-11, one can state the following:

- Considering the power consumption of the whole test bench, the proposed structure shows savings of up to $76 \%$, obtained from the overall reduction of dynamic and leakage power components. The PDP and ADP obtained are less compared to other full adders as shown in Table 1.

Table 1. Performance factors for various adders (listed in Section 2).

| Adder design | No. of transistors | Area ($\mu\mathrm{m}^{2}$) | Carry delay (ps) | Sum delay (ps) | Static power ($\mu$W) | Dynamic power ($\mu$W) | PDP ($10^{-18}$ Ws) | % Gain of PDP | ADP ($\mu\mathrm{m}^{2}$ ps) | % Gain of ADP |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Static | 28 | 6.14 | 71 | 94 | 2.56 | 15.27 | 2941.95 | - | 1013.1 | - |
| SR-CPL | 26 | 5.7 | 62.5 | 76 | 1.39 | 12.48 | 1921 | 34.7 | 789.45 | 22 |
| TG | 24 | 5.26 | 48 | 53 | 1.53 | 11.68 | 1334.21 | 54.6 | 531.26 | 47.5 |
| MBA $^{26}$ | 20 | 4.38 | 90 | 115 | 2.75 | 17.97 | 4247.6 | 30.7 | 897.9 | 11.3 |
| D3L$_{pg}$ | 42 | 9.21 | 55 | 68 | 1.28 | 14.25 | 969 | 67 | 1132.8 | 10.5 |
| sp-D3L$_{pg}$ | 48 | 10.53 | 38 | 49 | 1.57 | 26.43 | 1910.19 | 35 | 916.11 | 11.3 |
| SERF | 10 | 2.2 | 30 | 42 | 1.32 | 9.34 | 767.52 | 74 | 158.4 | 84.3 |
| 14T | 14 | 3.07 | 28 | 34 | 1.87 | 10.98 | 796.7 | 71 | 190.34 | 81.2 |
| HPSC | 24 | 5.26 | 60 | 78 | 1.88 | 14.78 | 2299.08 | 21.9 | 725.88 | 28.3 |
| PRO-FA | 13 | 2.85 | 25 | 20 | 1.1 | 9.11 | 459.45 | 84.3 | 128.25 | 87.3 |
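The PDP and ADP columns of Table 1 can be reproduced from the other columns. The relations used below are inferred from the tabulated numbers rather than stated explicitly in the text: the Static and PRO-FA rows are consistent with PDP = (static power + dynamic power) × (carry delay + sum delay) and ADP = area × (carry delay + sum delay), with the gain taken relative to the static CMOS adder. The sketch checks this for those two rows.

```python
# (area um^2, carry delay ps, sum delay ps, static power uW, dynamic power uW),
# copied from the Static and PRO-FA rows of Table 1.
TABLE1 = {
    "Static": (6.14, 71, 94, 2.56, 15.27),
    "PRO-FA": (2.85, 25, 20, 1.1, 9.11),
}


def pdp(row) -> float:
    """Power delay product in 1e-18 Ws (uW * ps), assuming the inferred relation."""
    _, t_carry, t_sum, p_static, p_dynamic = row
    return (p_static + p_dynamic) * (t_carry + t_sum)


def adp(row) -> float:
    """Area delay product in um^2 * ps, assuming the inferred relation."""
    area, t_carry, t_sum, _, _ = row
    return area * (t_carry + t_sum)


def gain_percent(reference: float, value: float) -> float:
    return 100.0 * (reference - value) / reference


for name, row in TABLE1.items():
    print(name, round(pdp(row), 2), round(adp(row), 2))

ref, pro = TABLE1["Static"], TABLE1["PRO-FA"]
print(round(gain_percent(pdp(ref), pdp(pro)), 1), round(gain_percent(adp(ref), adp(pro)), 1))
```

The computed products match the tabulated 2941.95 and 459.45 (PDP) and 1013.1 and 128.25 (ADP); the resulting PDP gain is about 84.4%, which the table lists as 84.3.
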

- From the layout shown in Figure 6, it is evident that the PRO-FA requires the smallest area of $2.85 \mu \mathrm{~m}^{2}$, as this is one of the factors for lower delay and power. This implies that the size of transistors in the proposed structure is minimal. The output waveform of PRO-FA for all input combinations obtained is shown in Figure 7.


Figure 6. Layout of 1-bit PRO-FA.

- Separate simulations are carried out to determine the lowest supply voltage that each full adder can tolerate while maintaining logic functionality. The proposed full adder circuit operates properly with a supply voltage as low as 0.6 V. From Figure 9, it can be observed that the MUX adder fails to operate at voltages below 0.9 V.


Figure 7. Simulation of 1-bit PRO-FA.


- The layout of the 16-bit square-root CSLAS is shown in Figure 8. The overall delay for the proposed CSLAS, both linear and square-root, is reduced to about $70 \%$, as shown in Figure 10. The number of transistors for the proposed CSLAS is reduced even for basic gates, which enhances the efficiency of the circuit as shown in Table 2.


Figure 10. Delay variation for CSLAS.

Table 2. Analysis of different bits of CSLAS.

|  | 16-bit CSLAS |  |  |  |  | 32-bit CSLAS |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | Gate <br> count | No. of <br> trans. | Avg <br> Power ($\mu$W) | Total <br> Delay (ps) | PDP <br> ($10^{-18}$ Ws) | Gate <br> count | No. of <br> trans. | Avg <br> Power ($\mu$W) | Total <br> Delay (ps) | PDP <br> ($10^{-18}$ Ws) |
| Conventional linear | 105 | 762 | 490 | 756 | 370,440 | 210 | 1524 | 980 | 1512 | 1,481,760 |
| Proposed linear | 77 | 384 | 298 | 632 | 188,336 | 154 | 768 | 303 | 1164 | 352,692 |
| \% Gain |  |  | 39.1 | 16.4 | 71.5 |  |  | 44.8 | 23 | 76.2 |
| Conventional sqrt | 119 | 682 | 532 | 572 | 304,304 | 238 | 1364 | 1064 | 1144 | 1,217,216 |
| Proposed sqrt | 91 | 559 | 219 | 336 | 73,584 | 182 | 1118 | 538 | 573 | 308,274 |
| \% Gain |  |  | 54.8 | 41.2 | 83.8 | 23.5 | 18 | 59.8 | 51.1 | 87.1 |
|  | 64-bit CSLAS |  |  |  |  |  |  |  |  |  |
|  | Gate <br> count | No. of <br> trans. | Avg <br> Power ($\mu$W) | Total <br> Delay (ps) | PDP <br> ($10^{-18}$ Ws) |  |  |  |  |  |
| Conventional linear | 420 | 3048 | 1060 | 2004 | 2,124,240 |  |  |  |  |  |
| Proposed linear | 308 | 1536 | 558 | 1228 | 685,224 |  |  |  |  |  |
| \% Gain |  |  | 47.3 | 26.3 | 87.7 |  |  |  |  |  |
| Conventional sqrt | 476 | 2728 | 1128 | 2288 | 2,580,864 |  |  |  |  |  |
| Proposed sqrt | 364 | 2236 | 598 | 1346 | 804,908 |  |  |  |  |  |
| \% Gain |  |  | 65.8 | 54.7 | 92.5 |  |  |  |  |  |

- The minimum and maximum levels of PRO-FA's output voltage for different supply voltages are given in Table 3.

Table 3. PRO-FA's output voltage levels.

| Supply <br> voltage <br> $(\mathrm{V})$ | Min level <br> for high <br> output (V) | Max level <br> for low <br> output (V) |
| :--- | :--- | :--- |
| 1.2 | 1.113 | 0.00568 |
| 1.1 | 1.034 | 0.00452 |
| 1.0 | 0.963 | 0.00382 |
| 0.9 | 0.867 | 0.00213 |
| 0.8 | 0.721 | 0.00194 |
| 0.7 | 0.639 | 0.00017 |
| 0.6 | 0.521 | 0.00011 |

- From Figure 11, it is observed that the power consumption of the proposed CSLAS is reduced to about $76 \%$ in comparison to the conventional linear and square-root CSLAS.
- Three different multiplier structures are implemented using the proposed architectures, and the results obtained from the simulations are shown in Table 4. The Wallace tree structure shows reduced delay and average power compared to other reported designs.
- The Filter Design and Analysis (FDA) tool from MATLAB is used to generate the filter coefficients and to analyze the filter characteristics. The 4-tap low pass FIR filter is designed for a pass band of 0 to 20 MHz and a stop band of 21 MHz to 26 MHz. The ripple magnitude is 4.2 dB in the pass band and -41.2 dB in the stop band.


Figure 11. Power variation for CSLAS.

Table 4. Multiplier performance for different adder structures.

| Multiplier design $(8 \times 8)$ | Existing |  | Proposed |  |
| :--- | :--- | :--- | :--- | :--- |
|  | Worst <br> Delay <br> (ps) | Avg <br> Power <br> ($\mu$W) | Worst <br> Delay <br> (ps) | Avg <br> Power <br> ($\mu$W) |
| Array multiplier | 1476 | 623.56 | 1345 | 503.12 |
| Wallace Tree | 1149 | 278.29 | 1112 | 152.43 |
| Modified Booth | 1396 | 492.67 | 1279 | 376.36 |

- The 4-tap transposed FIR filter is implemented with 3 different multiplier structures designed using the PRO-FA and analyzed. The sampling frequency of the 4-tap FIR filter using radix-2 MB is 341.3 MHz and the results are given in Table 5. The number of transistors used for the filter design is reduced in comparison to the existing structures.

Table 5. FIR filter performance for different multiplier and adder architectures.

| 4-Tap FIR filter design using | Existing |  | Proposed |  |
| :---: | :---: | :---: | :---: | :---: |
|  | Worst Delay (ns) | Avg Power (mW) | Worst Delay (ns) | Avg Power (mW) |
| Array multiplier | 3.45 | 0.53 | 2.89 | 0.48 |
| Wallace Tree | 2.52 | 0.42 | 2.14 | 0.36 |
| Modified Booth | 3.20 | 0.23 | 2.93 | 0.16 |
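The sampling frequency quoted for the radix-2 MB filter can be related to Table 5 through Eq. (14). The sketch below assumes that the worst delay in the table is the critical path that bounds the sampling period, so that the maximum sampling frequency is its reciprocal; under that assumption the 2.93 ns entry gives the reported 341.3 MHz. The frequencies printed for the other two rows are derived the same way and are not figures reported in the text.

```python
def sampling_frequency_mhz(worst_delay_ns: float) -> float:
    """Upper bound on the sampling frequency, taking the worst delay as the period."""
    return 1e3 / worst_delay_ns


# Proposed-design worst delays (ns) from Table 5.
PROPOSED_DELAY_NS = {
    "Array multiplier": 2.89,
    "Wallace Tree": 2.14,
    "Modified Booth": 2.93,
}

for design, delay_ns in PROPOSED_DELAY_NS.items():
    print(f"{design:17s} {sampling_frequency_mhz(delay_ns):6.1f} MHz")
```
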

## 5. Conclusion

An embedded logic structure for designing a full adder is introduced in this paper. The proposed 1-bit full adder (PRO-FA) is built using both transmission gates and pass transistors. The simulations are performed using the Cadence Virtuoso tool with TSMC 28 nm LP CMOS technology. This proposed adder is compared to other standard full adders and it shows an $84.3 \%$ improvement in PDP compared with the earlier best reported. The PRO-FA is extended to 16, 32, and 64 bits and is further used as a basic block for designing a linear and square-root carry select adder/subtractor (CSLAS). The performance of the proposed CSLAS is compared in terms of speed, power, and area with the conventional ones. The simulations for the proposed CSLAS showed about $76 \%$ power savings and $70 \%$ optimization of PDP. The proposed CSLAS is an area-efficient, low-power architecture compared to the earlier reported ones. An $8 \times 8$ radix-2 modified Booth multiplier is designed to test the efficiency of the PRO-FA. Different multiplier architectures are implemented using the proposed adder designs and analyzed. Finally, to understand the realistic behavior of the proposed structure, its efficiency is tested in a 4-tap FIR filter. In this paper an attempt has been made to enhance the performance of different architectures used for computations in DSP applications.

## References

[1] Aguirre-Hernandez M, Linares-Aranda M. CMOS full adders for energy efficient arithmetic applications. IEEE T VLSI Syst 2011; 19: 718-721.
[2] Jiang Y, Al-Sheraidah A, Wang Y, Sha E, Chung JG. A novel multiplexer based low power full adder. IEEE T Circuits-II 2004; 51: 345-348.
[3] Purohit S, Margala M. Investigating the impact of logic and circuit implementation on full adder performance. IEEE T VLSI Syst 2012; 20: 1327-1331.
[4] Shalem R, John E, John LK. A novel low power energy recovery full adder cell. In: IEEE 1999 Proceedings Ninth Great Lakes Symposium on VLSI; 4-6 March 1999; Michigan, USA: IEEE. pp. 380-383.
[5] Hassoune I, Flandre D, O'Connor I, Legat JD. ULPFA: a new efficient design of a power aware full adder. IEEE T Circuits-I 2010; 57: 2066-2074.
[6] Vesterbacka M. A 14-transistor CMOS full adder with full voltage swing nodes. In: IEEE 1999 Workshop on Signal Processing Systems; 20-22 October 1999; Taipei, Taiwan: IEEE. pp. 713-722.
[7] Zhang M, Gu J, Chang CH. A novel hybrid pass logic with static CMOS output drive full adder cell. In: IEEE 2003 International Symposium on Circuits and Systems; 25-28 May 2003; Bangkok, Thailand: IEEE. pp. 317-320.
[8] Rabaey JM, Chandrakasan A, Nikolic B. Digital Integrated Circuits - A Design Perspective. 2nd ed. Upper Saddle River, NJ, USA: Prentice Hall, 2001.
[9] Ramkumar B, Harish MK. Low power and area efficient carry select adder. IEEE T VLSI Syst 2012; 20: 371-375.
[10] Parhi K. VLSI Digital Signal Processing Systems - Design and Implementation. New York, NY, USA: John Wiley \& Sons, 1999.


[^0]:    *Correspondence: kala.b.anna@gmail.com

