Sunday, March 26, 2017

An FPGA SDR HF Transceiver, Part 3 -- Interpolation and Decimation Filters

In this third blog post in my FPGA SDR Transceiver series, I will discuss the FPGA implementation of the Receiver's Decimation Filter and the Transmitter's Interpolation Filter.

(Part 2 of this series is here:  Part 2)

But before I begin, let me again acknowledge Dick Benson, W1QG.  Dick is the father of this design, and although I've made some modifications to the FPGA logic, the underlying architecture and the vast majority of the Simulink implementation is Dick's.

Receiver Decimation Filters:

The receiver must reduce its incoming RF sample rate of 80 MHz down to the final, Audio, sample rate of 9765.625 Hz -- a decimation factor of 8192.

The Decimation Filter stage provides the anti-aliasing filters (and the sub-samplers) required to perform this down-sampling without aliasing artifacts.

Within the receiver architecture, the Decimation Filters lie just after the first set of mixers that down convert the RF signal to the Weaver Demodulator's IF frequency (in SSB mode this IF frequency would typically be between 1000 and 2000 Hz, depending upon the bandwidth of the information filter).

(Click on image to enlarge)

Note that this down-converted data into the Decimation Filter block is still being clocked at an 80 MHz rate. 

Within the FPGA Simulink Model's Top Level, the Receiver's Decimation Filters are the subsystem block circled in yellow, below:

(Click on image to enlarge)

Opening this subsystem block, we see three subsystems, consisting of two different filters and an Overload Detect circuit:
  1. CIC Filter (Cascaded-Integrator-Comb Filter)
  2. Decimating Filter
  3. Overload Detect
(Click on image to enlarge)

The filtering process begins with the CIC stage and finishes with the Decimating Filter stage (note that the CIC stage is also a decimating filter, despite the name assigned to the second stage of this filtering).

Let's look at the CIC Filter (named "CIC D64" and circled in yellow, below).

Receive CIC Filter:

(Click on image to enlarge)

The CIC filter reduces the sample rate from 80 MHz to 1.25 MHz, a factor of 64.  Although the CIC filter is an efficient, straightforward architecture, it is not used for the entire decimation process (from RF to the Audio sample rate) for two reasons: the filter's response is not flat -- passband attenuation increases with frequency -- that is, the response "droops" with frequency, and also there are aliasing components due to "spectrum foldover" from the decimation process (a good article: Understanding Cascaded-Integrator-Comb Filters, and other articles are listed at the end of this blog post.

Thus, the CIC filter's decimation is kept to 64.

Opening the "CIC D64" subsystem:

(Click on image to enlarge)

On the left side we have a CIC Decimation filter in the I channel and a CIC Decimation filter in the Q channel.  At the filter's output, Dick has added 4 different levels of digital gain (0, 6, 12, and 18 dB) that can be selected by the user.

And finally, there is an overload detector (on the I channel only) to detect if CIC gain is too large.

The CIC Decimation filters are implemented using Xilinx CIC blocks.  Looking "under the mask" of one of the CIC Decimation filters, we see its implementation:

(Click on image to enlarge)

Note that the CIC Filter subsystem only reduced the sample rate by a factor of 64 (you can see that this rate-change occurs in the middle of the CIC implementation, between the Integrator and the Comb stages).  We still need to reduce the sample rate by an additional factor of 128 to achieve our final Audio sample rate.

This additional filtering is accomplished by the next subsystem block we will look at.

Decimating Filter:

Dick has given this filter a mouthful of a name -- "Decimating Filter DC Optimized in Accumulator Plus Pipelined Stage Control. Improved dynamic  range."  For ease of typing, I will simplify its name to "Bit-Serial Decimating Filter".  It is outlined below, in yellow:

(Click on image to enlarge)

Typically in a decimating system, a CIC filtering stage is followed by a Compensation Filter which will compensate for the CIC's passband droop and also improve frequency rejection.  In this receiver, passband-droop is minimized by limiting the amount of decimation performed by the CIC filter (a factor of 64, in this case).

Thus, because the CIC Filter's output sample rate is still quite high (1.25 MHz) compared to our final Audio sample rate at 9765.625 Hz, the CIC Filter's droop in the 0-4883 Hz range of frequencies (representing the final maximum Audio output bandwidth) is negligible and thus no droop-compensation is necessary.

But further low-pass filtering and down-sampling is still required to reduce the sample rate from 1.25 MHz to 9765.625 Hz.

This filtering could be accomplished with off-the-shelf Xilinx FIR Intellectual Property, but there is a drawback in doing so -- these filters can suck up quite a bit of the FPGA logic fabric.

And that is the beauty of Dick's "Bit-Serial Decimating Filter" design -- it is a very hardware-efficient architecture, so it uses much less FPGA logic fabric than usual Decimating Filter architectures.

But it isn't just the filter's hardware efficiency that makes it such an elegant design.  An additional benefit is that this single filter block can decimate its input signal by any power of two.  In this receiver, this "Bit-serial Decimating Filter" block decimates its input by a factor of 128.

In a nutshell, of all the logic blocks that define this transceiver, it is this filter's structure that most of all allowed Dick to squeeze in TX and RX functionality into the low-cost Xilinx XC3S500E FPGA.

(Click on image to enlarge)

Let's look more closely at this filter.  The heart of Dick's design is a clever IIR low-pass filter implemented using what is known as the Peled-Liu Filter architecture.

Peled-Liu Filter Architecture:

Abraham Peled and Bede Liu described a filter architecture that required no hardware multipliers in their paper, "A New Hardware Realization of Digital Filters," Peled and Liu, IEEE Transactions on Acoustics, Speech, and Signal Processing, December, 1974.

They demonstrate an application of their architecture with an IIR filter using 3 input (x) samples and 2 feedback (y) samples.  Let's start our analysis by looking at their example and the mathematical foundation underpinning their architecture.

The following equations are quoted from the Beled-Liu paper.

The first equation in the image below (equation 5) represents the digital-filter example that Bede and Liu will implement, with x (input samples) and y (output samples) represented by two's-complement words, each "B" bits in length, and whose maximum values can be no larger than +/-1 (per equation 6).  And finally, equation 7 is simply equation 5 rewritten, in terms of equation 6, showing the filter equation in terms of the actual bits, themselves.

(Click on image to enlarge)

The example-filter's equation (equation 5, above) uses 5 input-variables, x(n), x(n-1), x(n-2), y(n-1), and y(n-2).  Beled-Liu define a function φ that also has 5 arguments, each argument a single-bit value (equation 8, below).
And a note regarding the form of equation 5.  Peled and Liu label the "feed-forward" coefficients ak and the "feed-backward" coefficients bk.  Modern usage typically reverses these labels, so that now the "feed-forward" coefficients are labeled bk and the "feed-backward" coefficients are labeled ak.
And given equation 8, equation 9 represents our original equation (5), but now in terms of  φ and the binary values of each sample bit position:

(Click on image to enlarge)

What do these last two equations tell us?

Let's look at equation 8.  First we note that, because φ has five 1-bit arguments, φ itself has only 32 possible values.  And then, given that these arguments can each be only 1 or 0, the maximum positive φ value is...

φ(max) = a0*1 + a1*1 + a2*1 - b1*0 - b2*0

and so: 

φ(max) = a0 + a1 + a2

Equation 9 tells us that we can calculate y(n) by summing a total of 'B-1' phi values (where B is the number of bits in an input or output sample word) and then, for the 'B'th operation, subtracting one final phi value -- the phi value for the samples' sign bits.

Each of the terms to be summed represents a phi value calculated with either a 1 or 0 per the bit value in a specific bit location within the samples, and that this phi value is then shifted-right (multiplied by 2^-j), depending upon its bit location, j (note that the LSB is shifted the furthest right).

Peled and Liu used a ROM to look up their 32 phi values:

(Click on image to enlarge)

Let's look more closely at the Peled and Liu filter, shown above.

SR1, SR2, SR3, and SR4 are shift registers.  They shift out the bits for their appropriate samples starting with the samples' LSB and ending with the samples' sign bits.

Register 6 is an accumulator that is first cleared prior to any calculations.

The calculations start by applying the LSB bits of the five samples to the phi table.  Depending upon the state of these address bits, thr ROM generates one of 32 possible values.

This value, clocked into register 5, is then summed with the value in the accumulator (R6), and then, before this sum is clocked back into the register, it is first right-shifted by 1 (i.e. multiplied by 2^-1).  And then this right-shifted sum is clocked into the accumulator.  (Note that the right-shift is not shown in the drawing, above, but it is described in the paper's text).

The shift registers then shift right by 1 bit and the next-highest bit of each of the five samples form the address of the ROM.  The resultant phi value is again added to the accumulator, the sum right-shifted, and then clocked into the accumulator.

This process continues through each bit, using successively higher and higher bit locations, until we reach the MSB -- the samples' sign bit.

The five sign bits address the ROM, but this time the resultant phi term is not added to the accumulator, but subtracted from it (per the subtraction sign for the right-hand term in equation 9).

And that's it!  Essentially, sample bits are shifted serially into the filter calculator, and with each shift a partial sum is calculated (and shifted) until finally, at the end, the sign-bit term is subtracted and we have finished our calculation.

W1QG Peled-Liu Implementation:

The foundation of Dick's decimating filter is a decimate-by-2 IIR filter using 9 input samples (x(n) through (x(n-8)) and two previous output samples (y(n-2) and y(n-4))

Dick has described his design in two papers:
R. A. Benson, "A Multichannel Input Subsystem,"  IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1983.
R. A. Benson, "An Efficient Hardware Implementation for Interpolating and Decimating Filters," Asilomar Conference on Signals, Systems, and Computers (ACSSC), 2009.
This filter is represented by the following equation:

y[n] = (28/2048)*(x[n] + x[n-8]) +
           (172/2048)*(x[n-1] + x[n-7]) +
           (503/2048)*(x[n-2] + x[n-6]) +
           (906/2048)*(x[n-3] + x[n-5]) +
          - (196/256)*y[n-2] - (87/256)*y[n-4]

This filter has the following characteristics:
  1. Its DC gain is exactly unity.
  2. There is very little passband ripple.  As Dick says in one of his papers, "The passband ripple for a cascade of six of these stages is less than 0.06 dB peak to peak."
  3. Stopband ripple is down at least 80 dB.
Pole-zero and frequency response graphs for a single decimate-by-two (D=2) stage:

(Click on image to enlarge)

Note that in the frequency response graph, the fold-over (Nyquist) frequency would be at '500' -- the far right hand side of the x-axis.

Here is a simplified block diagram of his filter:

(Click on image to enlarge)

Recall from the equation above that each pair of four pairs of samples use the identical filter coefficients.  This fact allows us to first add each pair together (in the Serial Adder blocks, above) and then to use their sum to address the phi-table ROM.

This sum, which is the output of each Serial Adder in the block diagram above, is a single bit.  The adder's carry-out bit is stored internally to be summed with the next-highest sample bits, when they are added together.  This allows us to reduce the ROM's size:  because only a single bit from each Serial Adder addresses the ROM table, we only need 7 bits, total, to form the ROM address from the sample bits, rather than the 11 bits that would have been required if bits from all eleven samples formed the ROM address.

Note that an eighth bit has been added to the ROM table address -- this bit negates the phi value, which results in a subtract operation for the sign-bit phi value.

To generate the contents of the ROM Lookup table, we can express the filter coefficients as integers if we first multiply all terms by 2048:

2048* y[n] = 28*(x[n] + x[n-8]) + 172*(x[n-1] + x[n-7]) +
                     503*(x[n-2] + x[n-6]) + 906*(x[n-3] + x[n-5]) +
                     1094*x[n-4] -
                     (196*8)*y[n-2] - (87*8)*y[n-4]

This will allow us to create the table in terms of integer values.  Note that, because the Lookup Table's x inputs and y inputs are each a single bit wide and thus either 1 or 0, the maximum positive output from the table will be 2703 ( = 28 + 172 + 503 + 906 + 1094, with the y values set to 0 so that they do not subtract).  Thus all possible positive values within the table can be represented with a 12 bit word width, and, by adding a 13th bit to the output word length, we can then represent all negative values, too.

Here is Dick's Matlab code for generating the ROM Lookup table of phi values:

 function phee=da_rom()

p = 28;        % Filter coefficient for   x[n] + x[n-8]
q = 172;       % Filter coefficient for x[n-1] + x[n-7]
r = 503;       % Filter coefficient for x[n-2] + x[n-6]
s = 906;       % Filter coefficient for x[n-3] + x[n-5]
t = 1094;      % Filter coefficient for x[n-4]
u = -(8*196);  % Filter coefficient for y[n-2]
v = -(8*87);   % Filter coefficient for x[n-4]
%   b6   b5   b4   b3   b2   b1 b0
X = [v,  u,   t,   s,   r,  q,  p];

phee = zeros(128,1);

for k = 0:127
    s=dec2bin(k,7);  % binary representation of k
    for kk=1:7
        if s(kk)=='1'
           phee(k+1)= phee(k+1)+ X(kk); 

Note that the above code expresses the filter's equation as:

2048*y[n] = p*(x[n] + x[n-8]) + q*(x[n-1] + x[n-7]) +
                    r*(x[n-2] + x[n-6]) + s*(x[n-3] + x[n-5]) +
                    t*x[n-4] +
                    u*y[n-2] + v*y[n-4]

This decimate-by-two filter forms the heart of a recursive decimating-filter architecture that will decimate our receive data stream by an additional factor of 128.  Let's look at this architecture, next...

Recursive Decimate-by-2 Structure:

As I mentioned earlier in this post, a large part of the cleverness in Dick's design is that this filter doesn't perform just one divide-by-two decimation, but that it can perform any number of them.

If we examine our earlier block diagram, we see that the inputs into the filter are two samples, x(n) and x(n-1), and that, as feedback terms, we use y(n-2) and y(n-4).

(Click on image to enlarge)

In other words, we wait until we've gathered two input samples at our 1.25 MHz sample rate and then, performing the sequential bit-serial arithmetic processing on these two samples at the 80 MHz system clock rate, we calculate an output sample.  So we are performing a filter-and-decimate calculation upon the arrival of every other incoming 1.25 MHz sample, not every sample.

This means the calculator is idle every other sample period.  We can make use of this idleness to further decimate by a factor of two the decimated-by-two output samples that we've just calculated, for a total decimation of the input sample stream by four.

For this second stage of decimation we again wait until we have two samples from the previous stage of decimation, but because these two samples are now at half the data rate (that is, they've already been decimated by two), the output sample from this decimation stage is calculated only once every four samples of our original sampling rate.

And if we were to decimate again by another factor of two (for a total decimation factor of 8), we would wait until we had two samples from the second stage of decimation before starting the new stage of decimation, which means we would only need to perform this third stage of decimation calculation once very eight input samples.

In other words, the fraction of "sample-slots" being utilized for our decimate-by-two calculations can be represented by the sequence:

1/2 + 1/4 + 1/8 + ...

where the 1/2 is the fraction of "input sample slots" required to calculate the first stage's decimate-by-two output, the 1/4 is the amount of "input sample slots" required for the second decimate-by-two stage's calculations, the 1/8 is the amount for the third decimate-by-two stage's calculations, and so on.

And so, as long as we decimate by a finite power of two, this sum will always total to a value less than 1, and thus there will always be "sample slots" available for additional decimation.  All we need is enough scratchpad memory space in which to store our intermediary values.  It is very clever design!

The timing diagram below (captured from the Xilinx model) shows the interleaving of the stages of decimation:

(Click on image to enlarge)

How much time do we have to perform each output calculation?

The FPGA system clock is 80 MHz but the sample-rate into our Peled-Liu decimation filter is at 1.25 MHz (the CIC filter has already decimated the input 80 MHz sample rate by a factor of 64).  Thus, a "sample slot" period is 800 ns.

Within this sample slot period we have 64 80 MHz clock cycles in which to calculate the output sample.  The data path width is 16 bits, therefore the entire bit-serial calculation for one output sample will require 16 (plus a few extra) 80-MHz clocks.

But note!  Because data at this point is complex, we actually need to calculate two values - the real and the imaginary components of the output sample.  So, given that 16 "plus a few" clock cycles are required to calculate, say, the real component, then to calculate both real and imaginary components we will need somewhere in the vicinity of 40 (out of 64) of our 80-MHz clock cycles.  Plenty of margin!

OK.  I've explained the principles behind the recursive bit-serial Decimating Filter, let's look at how it is implemented...

Decimating Filter, Xilinx Simulink Model:

Opening up the "Bit-Serial Decimating Filter" subsystem in the Xilinx Simulink model, we see it is hierarchical, and thus constructed with a number of other Simulink subsystems:

(Click on image to enlarge)

Within this subsystem, the Peled-Liu filtering math is performed within the "D = 2 bit Serial Filter Core", outlined below in yellow.  Let's look at that subsystem, first:

(Click on image to enlarge)

Opening it up, we see its structure is not unlike the block diagram I presented above:

(Click on image to enlarge)

Let's begin with what I believe is the most important block in the "D = 2 bit Serial Filter Core" subsystem, the ROM/Accumulator (the subsystem at the right-hand side).  Opening it:

(Click on image to enlarge)

Notes on the ROM/Accumulator:
  1. The ROM's phi-table contents have been discussed above.
  2. The "shift-left-by-5" stage ensures that at the end of all of the shift-and-accumulate cycles, the output sample's LSB is indeed in the LSB position (i.e. it compensates for the fact that the filter's coefficients have been multiplied by 2048).

 Here is the "D = 2 bit Serial Filter Core" subsystem's "Timing/Control1" subsystem...

(Click on image to enlarge)

...and, for completeness, its S-R (Set-Reset) Flip Flop:

(Click on image to enlarge)

While we are on the topic of the Timing/Control logic, let's look at a Timing Diagram:

(Click on image to enlarge)

Notes on Timing:
  1. The clock rate is 80 MHz.
  2. The "Run" signal is actually the "SR Enable" signal generated in the Timing/Control1 block.
  3. Because the sign-bits for the four pairs of x samples are added within the Serial Adders, which only have 1 bit out, the sign bit is extended for an additional clock to include any Carryout bits of the Serial Adders that might have been set when adding sign bits.  Why is this important?  Suppose we added the signs of two negative numbers -- the one-bit result would be 0, which also represents the addition of two positive numbers.  So to disambiguate this result, we need to also look and see if an adder's Carryout bit has been set (which it would be if both numbers were negative instead of positive).
  4. Rounding adds 1 to the data bit into the accumulator's register one clock prior to the sample-calculation finishing.  Thus, effectively adding 1 to the bit that is one bit to the right of the bit that will become the output's LSB.

Continuing on...

The sample shift-register blocks (SRA, SRB, and SRC) in the "D = 2 bit Serial Filter Core" subsystem are identical, and look like:

(Click on image to enlarge)

Here is the "D = 2 bit Serial Filter Core" subsystem's RAM-based Shift Register (SR):

(Click on image to enlarge)

Note that the Track/Hang subsystem (the right-hand block, above) serves two purposes:
  1. It holds the sign bits for one extra clock, essentially sign-extending the samples for one clock so that any Serial Adder 'Carryout' bits that might have been set can be used in  the phi-table lookup for the sign-bits' phi term.
  2. It adds an extra pipeline stage to breakup a "long" data path and ease the routing chore for the Xilinx ISE tools.

Here's the Track/Hang subsystem:

(Click on image to enlarge)

And here's the TH (Track and Hold) block within the above subsystem:

(Click on image to enlarge)

The "D = 2 bit Serial Filter Core" subsystem's SAD_1, SAD_2, SAD_3, and SAD_4 blocks are all identical, and take the following form:

The Serial Adder's model:

(Click on image to enlarge)

That completes the Bit-Serial Decimating Filter's "D = 2 bit Serial Filter Core" subsystem.

Now let's look at the "Stage Control with Pipe" subsystem, shown below, outlined in yellow.

(Click on image to enlarge)

Recall from my earlier discussion, above, the first stage calculates its decimated-by-two output samples every other sample-clock period.  The second stage calculates its decimated-by-two sample every fourth clock period, the next stage is every eighth clock period, etc.  And it is this subsystem that appropriately interleaves which stage calculation is being performed for any 800 nanosecond sample-clock period.

Opening this subsystem...

(Click on image to enlarge)

Opening "Stage_Control" -- this logic selects (via its Priority Encoder) which "stage of decimation" is being calculated:

(Click on image to enlarge)

...its "Priority Encoder":

(Click on image to enlarge)

The second subsystem that we will examine in the "Bit-Serial Decimating Filter" is the "Interstage Interface" subsystem, outlined below in yellow:

(Click on image to enlarge)

This stage serves several purposes:
  1. It contains a sample-delay (running at 1.25 MHz) so that the x(n-1) input sample will be presented to the decimate-by-two filter alongside the x(n) input sample.
  2. Within a 64-clock sample period, it selects whether the Real or the Imaginary component of a sample is being calculated.
  3. It stores, in a scratchpad RAM memory, the decimated outputs from other stages.  Recall that an output from one decimation stage will be used as one of the input samples for the next stage of decimation -- that is, it (and the sample that preceded it) will become the x(n) and x(n-1) input samples for the next stage of decimation.
  4. The first stage of decimation is stage 0.
Opening this subsystem:

(Click on image to enlarge)

The final subsystem in the "Bit-Serial Decimating Filter" is the "Real_Imag_Demux", outlined in yellow, below:

(Click on image to enlarge)

And here is that subsystem's logic:

(Click on image to enlarge)

This block simply demultiplexes the Real and Imaginary components of the final decimated sample (components arrive in sequence, first the Real component of the sample, followed by the Imaginary component) and presents them as two parallel samples for the next subsystem in the receiver.

Finally, back to the original Receiver Decimation Filter.  Let's examine the final subsystem, the "DA_Ovl_Det" subsystem:

(Click on image to enlarge)

This subsystem detects if the "Bit-Serial Decimation Filter" stage is overloading.  Here is its logic:

(Click on image to enlarge)

This block simply looks at the top four bits of the Real component of the output sample (the Imaginary component is ignored, to save a bit of logic fabric).  If these four bits are either 0x7 (positive peak) or 0x8 (negative peak), the sample is deemed to be too close to clipping and the Overload bit is set (to then be reset by the free-running counter).

That's it for the Receiver's Decimation Filters.  Now on to the Transmitter side!

Transmitter Interpolation Filters:

The Transmitter begins with an audio signal sampled at 9.876.625 Hz, and it ends with an RF signal whose sample rate is 80 MHz -- an increase in sample rate by a factor of 8192.

Most of the transmitter's signal processing is done at the audio sample rate.  It is only near the end of this processing (just prior to the final stage of frequency up-conversion) that the data sample rate is converted from Audio to RF.

To ensure that images of the audio signal don't appear throughout the 80 MHz spectrum, interpolation is used to create a smooth sequence of up-sampled data between the original samples arriving at the audio sample rate.

The image below identifies the Interpolation Filters in the Transmitter block diagram:

(Click on image to enlarge)

Let us now look at these filters in the actual Simulink Xilinx model.  The Transmitter Interpolation Filters can be found in the Top Level of the FPGA's Simulink Model, in the subsystem block circled in yellow, below:

(Click on image to enlarge)

Opening up this subsystem, we see:

(Click on image to enlarge)

There are two different interpolation filters in this subsystem.  The one on the left is a "bit-serial" interpolate-by-two filter based upon the low-pass IIR Peled-Liu filter architecture (described above, for the receiver), through which data that has been up-sampled by a factor of two is passed.

This Peled-Liu interpolate-by-two filter is the backbone of a recursive interpolating filter that interpolates by a factor of 128 -- similar in concept to the decimate-by-two Peled Liu filter used recursively in the receiver to achieve a decimation factor of 128.

The other interpolating filter in this subsystem is the CIC interpolating filter, shown on the right.  It completes the interpolating process by interpolating by a factor of 64.

(Click on image to enlarge)

Let's begin by looking at the left-hand recursive filter in the image above, that is based upon the Peled Liu filter architecture.

The heart of the recursive Peled-Liu Interpolation Filter is a low-pass IIR filter whose response is given by the equation (it is exactly the same response as was used in the receiver decimation filter):

y[n] = (28/2048)*(x[n] + x[n-8]) + (172/2048)*(x[n-1] + x[n-7]) +
          (503/2048)*(x[n-2] + x[n-6]) + (906/2048)*(x[n-3] + x[n-5]) +
          (1094/2048)*x[n-4] -
          (196/256)*y[n-2] - (87/256)*y[n-4]

But the actual implementation is not quite the same as it was in the receiver -- this time we are creating two samples from one, rather than decimating to one sample from two.

To get a better understanding of how this filter is used for interpolation, below is a Simulink model (used by Dick for algorithm development) that shows the Filter's basic underlying mathematical structure (leaving out the Peled-Liu implementation).

(Click on image to enlarge)

As an interpolator, for each x sample arriving at this filter's input, two new samples must be generated -- essentially, one sample which replaces the incoming x sample, and second "interpolated" sample that lies between x samples.

This interpolation filter accomplishes this by upping the sample-rate by two and stuffing a 0 sample between each of the original x samples (so that we now have twice as many samples) and then passing this new stream through a low-pass filter.

(Click on image to enlarge)

In the image above, the original stream of samples, x, has been up-sampled by a factor of two and zeroes stuffed between each x sample.  I've called this new, upsampled, data stream x'.

This upsampled stream x' is passed through the low-pass filter to create y and z (interpolated between y) samples.

So, to create y:

y[n] = (28/2048)*(x'[n] + x'[n-8]) +
          (172/2048)*(x'[n-1] + x'[n-7]) +
          (503/2048)*(x'[n-2] + x'[n-6]) +
          (906/2048)*(x'[n-3] + x'[n-5]) +
          - (196/256)*y[n-1] - (87/256)*y[n-2]

(Note that here I've changed the y-sample nomenclature so that y[n-1] and y[n-2] represent y[n-2] and y[n-4] in our original equation.  This is make the equation correspond to Dick's signal annotations in his Simulink models.  You will note that these samples are still 2 and 4 sample clocks (at the up-sampled 2x clock rate) from y[n] and z[n].)

Because every other x' sample is 0, this equation simplifies to:

y[n] = (28/2048)*(x'[n] + x'[n-8]) +
          (503/2048)*(x'[n-2] + x'[n-6]) +
          - (196/256)*y[n-1] - (87/256)*y[n-2]

which we can relate back to our original data stream, x:

y[n] = (28/2048)*(x[n] + x[n-4]) +
          (503/2048)*(x[n-1] + x[n-3]) +
          - (196/256)*y[n-1] - (87/256)*y[n-2]

Note that because of zero-stuffing, only five of the incoming x samples are used.

In a similar fashion, the calculation for the interpolated sample-stream, z, can be shown to be:

z[n] = (172/2048)*(x[n] + x[n-3]) +
          (906/2048)*(x[n-1] + x[n-2])
          - (196/256)*z[n-1] - (87/256)*z[n-2]

And this time, because of the zero-stuffing, only four of the incoming x samples need be used.

These two filters (one for y and one for z) are implemented within the FPGA with the Peled-Liu bit-serial arithmetic architecture.  Let's look at this implementation.

The filters are contained within the block highlighted in yellow, below:

(Click on image to enlarge)

Opening it, we find our familiar Peled-Liu architecture:

(Click on image to enlarge)

In the above diagram, let's first examine the ROM/Accumulator subsystem, where the new y and z samples are calculated:

(Click on image to enlarge)

You can see that there are two ROM/Accumulator calculators.  One calculates y, the other calculates z.

The phi tables are different for each (refer back to the equations, above, for y and z based upon the incoming stream, x).

The y sample's ROM is the "phee_1" ROM, and the z sample's ROM is the "phee_2" ROM.  Below is the Matlab code to generate the phee_1 and the phee_2 ROMs:

function [phee_1,phee_2]=da_i2_rom()
%%  Filter coefficients
p = 28*2;   % 2x factor to force gain to 1.0 in interpolate by 2 filter
q = 172*2;
r = 503*2;
s = 906*2;
t = 1094*2;
u = -(8*196);
v = -(8*87);

%% ROM 1
%      b4   b3   b2   b1   b0
X1 = [  v,  u,   t,  r,    p];
phee_1 = zeros(2^5,1);
for k = 0:31
    ss=dec2bin(k,5);  % binary representation of k
    for kk=1:5
        if ss(kk)=='1'
           phee_1(k+1)= phee_1(k+1)+ X1(kk); 

%%  ROM 2

%     b3   b2   b1   b0
X2 = [v,   u,   s,    q];
phee_2 = zeros(2^4,1);
for k = 0:15
    ss=dec2bin(k,4);  % binary representation of k
    for kk=1:4
        if ss(kk)=='1'
           phee_2(k+1)= phee_2(k+1)+ X2(kk); 

Two notes about these two tables:
  1. The filter coefficients have been converted to integers (from the fractions in the original equations) by multiplying them by 2048.
  2. The 'x' coefficients have been further multiplied by a factor of 2 so that the overall gain of the filter is 1.0.  
Apart from these changes in the ROM/Accumulator subsystem, the circuitry is the essentially the same as that of the receiver's decimation filter.  It is shown below, with additional notes highlighting any small differences from the receiver's implementation.

The Peled-Liu arithmetic engine's Timing Control:

(Click on image to enlarge)

An extra control line has been added to the Timing/Control circuit:  Extend_Sign.  The sign-extension logic in this filter is slightly different from the circuit used in the receiver, but the function is identical: extend the sign bit for one extra clock to utilize any carry-out bits that might have been set in the Serial Adders.

The 16-bit shift registers which create serial data streams for the Peled-Liu bit-serial arithmetic from the parallel data.  Note that the LSB shifts out first.

(Click on image to enlarge)

The RAM-based Shift Register in which previous x, y, and z samples are stored is shown below.  Note that, unlike the receiver, we only need to store four previous x samples.  Another difference is that there is now a single y sample and z sample stored.

(Click on image to enlarge)

The RAM-based Shift Register simply extracts the sample bits from the stored word.  Each word can be thought of as a bit-wide slice of all of the samples -- e.g. bit 5 of all of the samples.  These are extracted from the RAM word (via the circuit, below).  They are then written back into the RAM, but shifted by one, so that, for example, bit 5 of sample x[n] now becomes bit 5 of sample x[n-1], etc.

And here's the Bit-Extraction block:

(Click on image to enlarge)

And below, the final subsystem in the "RAM-based SR", the "Track and Hold k6jca" block.   This block extends the samples' sign bits by 1 clock and also contains a pipeline register to "ease" the timing requirements.

(Click on image to enlarge)

The Serial Adders are identical to those used in the Receiver's Decimation Filter.  Note that the Carry-out bit is used internally and so only a single bit from each adder (rather than two) is sent on.

(Click on image to enlarge)

To achieve an interpolation factor of 128, the "interpolate-by-two" architecture is accessed recursively 7 times.  The block highlighted below controls the timing of each stage of the interpolation process.  It is identical to the similar function in the receiver's decimation filter:

(Click on image to enlarge)

The circuitry within highlighted block:

(Click on image to enlarge)

(Click on image to enlarge)

And the block below stores the intermediate values used for each stage of the "Interpolate by 128" interpolation process:

(Click on image to enlarge)

Either the input (x) sample is selected, or a previous "intermediary" y-sample (that has been stored in the RAM):

(Click on image to enlarge) 

Now let's look at the "Commutator" block"

(Click on image to enlarge)

The Commutator sequentially receives the final interpolated real and imaginary sample data from the Interp_Core block at a 2.5 MHz word-rate.  That is, the imaginary-component of a final interpolated sample arrives 32 80-MHz clocks after the real-component.

The Commutator simply demultiplexes this serial stream into two parallel paths: the real-component and the imaginary-component.

(Click on image to enlarge)

And finally, we arrive at the last stage of the interpolation process, the CIC interpolator!

(Click on image to enlarge)

The CIC interpolators (one for the real component and one for the imaginary component of the complex data) has been implemented using a Xilinx CIC block, whose parameters are shown, below:

(Click on image to enlarge)

Whew! That's it for this post!

Background Notes:

SDR Notes:  Weaver Modulation and Demodulation
SDR Notes:  The Mixer Mathematics of Digital Down Conversion

Posts in this Series:

Part 1: Overview
Part 2: FPGA Modulation and Demodulation
Part 3: Interpolation and Decimation Filters
Part 5: Control Interface, Etc.


CIC Filter Introduction, Donadio
Understanding CIC Compensation Filters, Altera AN455
Understanding Cascaded-Integrated-Comb Filters, Lyons
CIC Compiler V4.0, Xilinx

Paper, "A New Hardware Realization of Digital Filters," Abraham Peled and Bede Liu, IEEE Transactions on Acoustics, Speech, and Signal Processing, December, 1974.

Paper, "A Multichannel Input Subsystem," Richard A. Benson, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1983.

Paper, "An Efficient Hardware Implementation for Interpolating and Decimating Filters," Richard A. Benson, Asilomar Conference on Signals, Systems, and Computers (ACSSC), 2009.

Book, Stewart, Barlee, Atkinson, Crockett, Software Defined Radio Using Matlab & Simulink and the RTL-SDR, Strathclyde Academic Media, 2015. (A good resource!  And this book can be downloaded via

Standard Caveat:

I or Dick might have made a mistake in our designs, equations, schematics, models, etc.  If anything looks confusing or wrong to you, please feel free to comment below or send me an email.

Also, I will note:

This design and any associated information is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.