AXIAL TOMOGRAHIC RECONSTRUCTION
Computed axial tomographic reconstruction using back-projection
is presented as a challenging real-world application illustrating
the performance obtainable with Alacron's FastSeries architecture
compared to standard cluster architectures. We will calculate the
performance of the latest microprocessors from Analog Devices, Intel,
Philips Semiconductors, and Texas Instruments.The important properties
of the selected processors are
presented
in the table below. Cluster mode designs reach bus saturation very
quickly to the point where adding additional processors will not
improve the rate at which calculations can be performed. The table
assumes that the processors are isolated from each other when they
are performing the calculations. This assumption is valid because
most multiprocessor co-processor boards have isolated or local memory
designs due to scalability issues. The only non-local memory design
is the "native" PIII-450 design, added for comparison purposes.
| SPECIFICATION | Intel PIII-450 | Philips TM1300 | TI C6701 | ADI 21160 |
| Architecture | CISC | VLIW | VLIW | VLIW |
| FPU | Yes | Yes | Yes | Yes |
| MFLOPs (Peak) | 250 | 720 | 1000 | 600 |
| MIPS (Peak) | 450 | 900 | 1336 | 100 |
| MOPS (Peak) | 900 | 4860 | 1336 | 800 |
|
Memory
Bus Bandwidth (MB/s)
|
800
|
570
|
332
|
400
|
|
1K
FP cfft (µsec)
|
300
|
106
|
108
|
90
|
|
1K
16 bit cfft (µsec)
|
191
|
63
|
108
|
90
|
|
1F
FP dot product (µsec)
|
7.38
|
2.84
|
3.07
|
5.12
|
|
16
x 16 MACs (MMAC/s)
|
2.32
|
1.42
|
3.07
|
5.32
|
|
8
x 8 MACs (MMAC/s)
|
9.34
|
6.55
|
7.11
|
11.80
|
|
512x512x8
bit Conv 3x3 (msec)
|
5.34
|
2.62
|
7.11
|
11.80
|
|
512x512x512x
8 bit Erosion/Dilation (msec)
|
7.34
|
1.42
|
3.62
|
3.93
|
|
"Glue"
Logic Cost ($/CPU)
|
$150
|
$3
|
$65
|
$39
|
|
CPU
Price ($)
|
$300
|
$100
|
$150
|
$100
|
Description of the Application
The reconstruction region is a square array of 1024 x 1024 pixels centered on the axis of rotation of the x-ray head (Figure 1). Data is collected from the x-ray head as radial scans, with 4 channels of 4096 values for each of 360 angles. The data collected occupies 4*4096*360*2 = 11,796,480 bytes. The data is moved into memory at an average rate of 5 megabytes per second (Figure 2).

Figure 1: Relation of the region of reconstruction to the x-ray source and detector array.

Figure 2: Data collection.
The following steps reconstruct the pixel densities:
- Calculation of the density estimate from the output of the four detectors at the end of each beam path.
- Back-projection. Conversion of the fan shaped polar form of the data to the square array of pixels in the reconstruction region is accomplished by summing the density along each ray into each pixel the ray encounters. Linear interpolation is used when the ray passes between two pixels (Figure 3).

Figure 3: Back Projection.
Four channels of data are used to estimate the actual density along the beam path to compensate for the normal beam hardening that occurs as x-rays pass through an object.
The value summed into each pixel is:

where Dbeam is the density measured for the current beam being processed and alpha is determined from the geometry of the beam.
- Application of a 2D FFT to the image array.
- Application of a weighting factor of the distance of each pixel from the axis of rotation. This compensates for the greater number of rays passing through pixels closer to the axis of rotation.
- Application of an inverse 2D FFT to the whole array
Steps 2 through 5 effectively apply a convolution to the data to correct for the weighting error generated by the beam geometry, giving a good approximation to the density distribution of the object positioned in the reconstruction area 1.
Algorithmic Complexity
The number of operations performed in each step can be counted as follows. These are summarized in Table 1.
Density estimation from detector data: A fifth order polynomial is applied to the output of each the four detectors at the end of each beam path, and the result of these four polynomials are summed to give the density estimate. This step requires 4*(5 Multiples + 5 Adds) plus three adds per beam path, or 81,920 MACs (multiply accumulates) and 12,288 adds. For 360 angles, the total number of MACs is 29,491,200, and the total number of adds is 4,423,680. Additionally, each detector has an array of 24 coefficients and 4 values that must be read from memory, followed by a write of the density estimate. This results in 41,287,680 reads from memory and 1,474,560 writes to memory to process all 360 angles.
Back-projection: For each angle, one third of the beam path pass into the top and out from the bottom of the reconstruction region, encountering 2 pixels for each of the 1024 lines it crosses. The weights for each beam path is pre-computed, and so each of the 2048 pixels on these 1366 beams requires one MAC, or a total of 2,797,568 MACs. The other two thirds of the beam paths access from 3 to 2048 pixels depending on how far up the left or right side of the reconstruction region the ray path lies. In total, these beam paths access 2,300,323 pixels, requiring 2,300,323 MACs. For all beams then, 5,097,891 MACs are needed. Each MAC requires a pixel value to be read from memory, a table value to be read from memory, and a final write of the new pixel value back to memory. For all of the 360 angles, the back-projection step required 1,835,240,760 MACs, 3,670,481,520 reads from memory, and 1,835,240,760 writes to memory.
2D FFT: A 1024 x 1024 2D real FFT is composed of 2048 real 1024 point FFTs, 1024 along the rows and 1024 along the columns. Two real signals can be processed concurrently with one complex 1024 point FFT plus 2,044 additional additions. Two real FFTs can be performed by this method in 460 microseconds on a single 40 MHz SHARC 2106x processor, averaging to 230 microseconds for a 1024 point real FFT. The whole 2D FFT thus requires 0.470 seconds plus, plus 0.013 seconds required to read and write the whole array twice from memory, or 0.483 seconds on a single processor.
Pixel weighting by radial position: A pre-computed table of radius values is read with the image data, multiplied and written back to memory. This step requires 2,097,152 reads from memory, 1,048,576 multiplies, and 1,048,576 writes to memory.
Inverse 2D FFT: The 2D inverse FFT requires the same number of operations as required by the 2D FFT in step 3.
| Step | MACs | Adds | Reads | Writes |
| 1 | 29,491,200 | 4,423,000 | 41,287,000 | 1,474,000 |
| 2 | 1,835,240,000 | 0 | 3,670,481,000 | 1,835,240,000 |
| 3 | 28,800,000* | 2,097,000 | 2,097,000 | |
| 4 | 1,048,000 | 2,097,000 | 1,048,000 | |
| 5 | 28,800,000* | 2,097,000 | 2,097,000 | |
| Total | 1,923,380,000 | 4,423,000 | 3,718,060,000 | 1,841,958,000 |
Table 1: Operations required.
*Based on averaged 230 µs FFT time.
Parallelization of the Algorithm
In order to improve performance, the computation will be divided between processors by splitting the reconstruction region into horizontal stripes. The first processor will get the first N rows (N = 1024 / number of processors), and each processor will get succeeding rows. The algorithm is inherently linear allowing this natural division between processors.
Memory Requirements
This algorithm uses several pre-computed tables. Each detector has a table of 24 calibration coefficients for a total of 98,304 four-byte floating point numbers. The pixel radius value table (distance from the rotation axis) requires 131,072 floats, taking advantage of symmetries. Again taking advantage of symmetries the ray table requires 57,351,273 floats. The reconstructed image requires 1,048,576 floats, and the input data requires 11,796,480 bytes. The total memory requirement is therefore under 256 megabytes.
Total (Elapsed) is the time the routine takes to run on one processor with overlapping I/O and compute.
| Processor | Total | Time (Mac) | Time (Add) | Time (read) | Time (write) |
| TM1300-180 | 38.88 | 2.67 | 0.006 | 26.00 | 12.88 |
| TI-C6701 | 85.18 | 5.79 | 0.013 | 57.27 | 27.91 |
| ADSP21160 | 55.60 | 9.62 | 0.022 | 37.18 | 18.42 |
| PIII-450-MMX | 27.78 | 8.55 | 0.009 | 18.58 | 9.20 |
Processor Cost vs. Reconstruction Time
The illustration below shows the cost/performance analysis for the
four processors.

Click to Register to Download the application note in pdf format
