

### **Arithmetic for Computers**

- Operations on integers
  - Addition and subtraction
  - Multiplication and division
  - Dealing with overflow
- Floating-point (real numbers)
  - Representation and operations



- Overflow if result out of range
  - Adding +ve and –ve operands, no overflow
  - Adding two +ve operands
    - Overflow if result sign is 1
  - Adding two –ve operands
    - Overflow if result sign is 0

#### **Integer Subtraction**

- Add negation of second operand
- Example: 7 6 = 7 + (–6)
  - +7: 0000 0000 ... 0000 0111
  - <u>-6: 1111 1111 ... 1111 1010</u>
  - +1: 0000 0000 ... 0000 0001
- Overflow if result out of range
  - Subtracting two +ve or two –ve operands, no overflow
  - Subtracting +ve from –ve operand
    - Overflow if result sign is 0
  - Subtracting –ve from +ve operand
    - Overflow if result sign is 1



### **Dealing with Overflow**

- Some languages (e.g., C) ignore overflow
  - Use MIPS addu, addui, subu instructions
- Other languages (e.g., Ada, Fortran) require raising an exception
  - Use MIPS add, addi, sub instructions
  - On overflow, invoke exception handler
    - Save PC in exception program counter (EPC) register
    - Jump to predefined handler address
    - mfc0 (move from coprocessor reg) instruction can retrieve EPC value, to return after corrective action



### **Arithmetic for Multimedia**

- Graphics and media processing operates on vectors of 8-bit and 16-bit data
  - Use 64-bit adder, with partitioned carry chain
    - Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors
  - SIMD (single-instruction, multiple-data)
- Saturating operations
  - On overflow, result is largest representable value
    - c.f. 2s-complement modulo arithmetic
  - E.g., clipping in audio, saturation in video





#### **Multiplication Hardware**



#### **Optimized Multiplier** Perform 32 add/shift steps with 1 register **Multiplicand** 32 bits 32-bit ALU Shift right Control Product Write test

64 bits

1 cycle per partial-product (add+shift), i.e.

- HI(Product) = HI(Product) + Multiplicand
- Product = srl(Product) srl is unconditional!

#### **A Faster Multiplier**

How many adders, how much faster?
Must consider cost/performance tradeoff



#### Can be "pipelined"

How many multiplications in parallel?

#### **MIPS Multiplication**

- Use two 32-bit registers to store the product
  - HI: most-significant 32 bits
  - LO: least-significant 32-bits
- Instructions
  - mult rs, rt / multu rs, rt
    - 64-bit product now stored in HI/LO
  - mfhi rd / mflo rd
    - Move from HI/LO to rd
    - Can test HI value to see if product overflows 32 bits
  - mul rd, rs, rt
    - Least-significant 32 bits of the product goes into rd



### Division



*n*-bit operands yield *n*-bit quotient and remainder

- Check for 0 divisor
- Long division (LEFT to RIGHT)
  - If divisor ≤ dividend
    - 1 bit in quotient, then subtract
  - Otherwise
    - 0 bit in quotient, bring down next dividend bit
- Restoring division
  - Do the subtract, and if remainder goes < 0, add divisor back</li>
- Signed division
  - Divide using absolute values
  - Adjust sign of quotient and remainder as required

#### **Division Hardware** Start Initially divisor Subtract the Divisor register from the in left half Remainder register and place the result in the Remainder register Divisor Remainder $\geq 0$ Remainder < 0 Shift right Test Remainder 64 bits 2a. Shift the Quotient register to the left, 2b. Restore the original value by adding Quotient setting the new rightmost bit to 1 the Divisor register to the Remainder 64-bit ALU register and placing the sum in the Shift left Remainder register. Also shift the 32 bits Quotient register to the left, setting the new least significant bit to 0 Remainder Control Write test 64 bits 3. Shift the Divisor register right 1 bit No: < 33 repetitions 33rd repetition? Initially dividend Yes: 33 repetitions Done Chapter 3 — Arithmetic for Computers — 13

### **Optimized Divider**

#### Perform 33 sub/srl/sll steps with 1 register



1 cycle per partial-remainder (sub+srl+sll)

- **Exercise:** Trace it like the optimized multiplier.
  - Same hardware can be used for both!

#### **Faster Division**

- Can't use parallel hardware as in multiplier
  - Subtraction is conditional on sign of remainder
- Faster dividers (e.g. SRT division) generate multiple quotient bits per step
  - Interesting historical tidbit:
  - http://en.wikipedia.org/wiki/Division\_(digital)
  - Many proposed algorithms exist.



#### **MIPS** Division

- Use HI/LO registers for result
  - HI: 32-bit remainder
  - LO: 32-bit quotient
- Instructions
  - div rs, rt / divu rs, rt
  - No overflow or divide-by-0 checking
    - Software must perform checks if required
  - Use mfhi, mfl o to access result



## **Floating Point**

- Representation for non-integral numbers
  - Including very small and very large numbers
- Like scientific notation



- In binary
  - $\pm 1.xxxxxx_2 \times 2^{yyyy}$
  - Types fl oat and doubl e in C

### **Floating Point Standard**

- Defined by IEEE Std 754-1985
- Developed in response to divergence of representations
  - Portability issues for scientific code
- Now almost universally adopted
- Two representations
  - Single precision (32-bit)
  - Double precision (64-bit)



### **IEEE Floating-Point Format**

|   | single: 8 bits<br>double: 11 bit | single: 23 bits<br>s double: 52 bits |
|---|----------------------------------|--------------------------------------|
| S | Exponent                         | Fraction                             |

 $x = (-1)^{S} \times (1 + Fraction) \times 2^{(Exponent-Bias)}$ 

- S: sign bit ( $0 \Rightarrow$  non-negative,  $1 \Rightarrow$  negative)
- Normalize significand:  $1.0 \le |significand| < 2.0$ 
  - Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit)
  - Significand is Fraction with the "1." restored
- Exponent: using excess representation
  - actual exponent = Exponent Bias
  - Ensures Exponent is unsigned (i.e. for very small numbers)
  - Single: Bias =  $127_{10}$ ; Double: Bias =  $1023_{10}$



### **Single-Precision Range**

- Exponents 0000000 and 11111111 reserved
- Smallest value
  - Exponent: 00000001
    - $\Rightarrow$  actual exponent = 1 127 = –126
  - Fraction:  $000...00 \Rightarrow$  significand = 1.0
  - $\pm 1.0 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$
- Largest value
  - exponent: 11111110  $\Rightarrow$  actual exponent = 254 - 127 = +127
  - Fraction:  $111...11 \Rightarrow$  significand  $\approx 2.0$
  - $\pm 2.0 \times 2^{+127} \approx \pm 3.4 \times 10^{+38}$



#### **Double-Precision Range**

- Exponents 0000...00 and 1111...11 reserved
- Smallest value
  - Exponent: 0000000001 ⇒ actual exponent = 1 – 1023 = –1022
  - Fraction:  $000...00 \Rightarrow$  significand = 1.0
  - $\pm 1.0 \times 2^{-1022} \approx \pm 2.2 \times 10^{-308}$
- Largest value
  - Exponent: 1111111110
     ⇒ actual exponent = 2046 1023 = +1023
  - Fraction:  $111...11 \Rightarrow$  significand  $\approx 2.0$
  - $\pm 2.0 \times 2^{+1023} \approx \pm 1.8 \times 10^{+308}$



#### **Floating-Point Precision**

- Relative precision
  - all fraction bits are significant
  - Single: approx 2<sup>-23</sup>
    - Equivalent to 23 × log<sub>10</sub>2 ≈ 23 × 0.3 ≈ 6 decimal digits of precision
  - Double: approx 2<sup>-52</sup>
    - Equivalent to 52 × log<sub>10</sub>2 ≈ 52 × 0.3 ≈ 16 decimal digits of precision



#### **Floating-Point Example**

Represent and store –0.75

- $-0.75 = (-1)^1 \times 1.1_2 \times 2^{-1}$
- S = 1
- Fraction (stored) =  $1000...00_2$
- Exponent (stored) = -1 + Bias
  - Single: -1 + 127 = 126 = 01111110<sub>2</sub>

Double:  $-1 + 1023 = 1022 = 01111111110_2$ 

- Single: 1011111101000...00 (in mem)
- Double: 101111111101000...00 (in mem)

#### **Floating-Point Example**

- What number is represented by the singleprecision float
  - 1100000101000...00 (stored in memory)
  - S = 1

- Fraction = 01000...00<sub>2</sub>
- Exponent = 10000001<sub>2</sub> = 129

• 
$$\mathbf{x} = (-1)^1 \times (1 + 01_2) \times 2^{(129 - 127)}$$
  
=  $(-1) \times 1.25 \times 2^2$   
=  $-5.0$ 

# • Exponent = 000...1; hidden bit = 0 $x = (-1)^{S} \times (0 + Fraction) \times 2^{1-Bias}$

- Gradual underflow (smaller gap to zero)
  - Smallest +ve SP norm: 1.0000... 0 × 2<sup>-126</sup>
  - Smallest +ve SP denorm: 0.0000...  $1 \times 2^{-126}$



#### **±Infinities and NaNs**

- Exponent = 111...1, Fraction = 000...0
  - Cover  $\pm \infty$
  - Can be used in subsequent calculations, avoiding need for overflow checks
- Exponent = 111...1, Fraction ≠ 000...0
  - Not-a-Number (NaN)
  - Indicates illegal or undefined results
    - e.g., 0.0 / 0.0
  - Can be used in subsequent calculations



### **Floating-Point Addition**

- Consider a 4-digit <u>decimal</u> example
  - $9.999 \times 10^{1} + 1.610 \times 10^{-1}$
- 1. Align decimal points
  - Shift number with smaller exponent
  - $9.999 \times 10^1 + 0.016 \times 10^1$
- 2. Add the significands
  - $9.999 \times 10^1 + 0.016 \times 10^1 = 10.015 \times 10^1$
- 3. Normalize result & check for over/underflow
  - 1.0015 × 10<sup>2</sup>
- 4. Round and renormalize if necessary
  - 1.002 × 10<sup>2</sup>



#### **Floating-Point Addition**

- Now consider a 4-digit <u>binary</u> example
  - $1.000_2 \times 2^{-1} + -1.110_2 \times 2^{-2} (0.5 + -0.4375)$
- 1. Align binary points
  - Shift number with smaller exponent
  - $1.000_2 \times 2^{-1} + -0.111_2 \times 2^{-1}$
- 2. Add the significands
  - $\bullet 1.000_2 \times 2^{-1} + -0.111_2 \times 2^{-1} = 0.001_2 \times 2^{-1}$
- 3. Normalize result & check for over/underflow
  - $1.000_2 \times 2^{-4}$ , with no over/underflow
- 4. Round and renormalize if necessary
  - $1.000_2 \times 2^{-4}$  (no change) = 0.0625

#### **FP Adder Hardware**

- Much more complex than integer adder
- Doing it in one clock cycle would take too long (i.e. cycle time will be long)
  - Much longer than integer operations
  - Slower clock would penalize all instructions
- FP adder usually takes several cycles

But it can be pipelined



#### **FP Adder Hardware**



### **Floating-Point Multiplication**

- Consider a 4-digit <u>decimal</u> example
  - $1.110 \times 10^{10} \times 9.200 \times 10^{-5}$
- 1. Add exponents
  - For biased exponents, subtract bias from sum
  - New exponent = 10 + -5 = 5
- 2. Multiply the significands
  - $1.110 \times 9.200 = 10.212 \implies 10.212 \times 10^5$
- 3. Normalize result & check for over/underflow
  - 1.0212 × 10<sup>6</sup>
- 4. Round and renormalize if necessary
  - 1.021 × 10<sup>6</sup>
- 5. Determine sign of result from signs of operands
  - +1.021 × 10<sup>6</sup>



### **Floating-Point Multiplication**

- Now consider a 4-digit <u>binary</u> example
  - $1.000_2 \times 2^{-1} \times -1.110_2 \times 2^{-2}$  (0.5 × -0.4375)
- 1. Add exponents
  - Unbiased: -1 + -2 = -3
  - Biased: (-1 + 127) + (-2 + 127) = -3 + 254 127 = -3 + 127
- 2. Multiply the significands
  - $1.000_2 \times 1.110_2 = 1.1102 \implies 1.110_2 \times 2^{-3}$
- 3. Normalize result & check for over/underflow
  - $1.110_2 \times 2^{-3}$  (no change) with no over/underflow
- 4. Round and renormalize if necessary
  - $1.110_2 \times 2^{-3}$  (no change)
- 5. Determine sign: +ve  $\times$  –ve  $\Rightarrow$  –ve
  - $-1.110_2 \times 2^{-3} = -0.21875$

#### **FP** Arithmetic Hardware

- FP multiplier is of similar complexity to FP adder
  - But uses a <u>multiplier</u> for significands instead of an adder
- FP arithmetic hardware usually does
  - Addition, subtraction, multiplication, division, reciprocal, and square-root
  - FP  $\leftrightarrow$  integer conversion (type cast?)
- Operations usually takes several cycles
  - Can be pipelined like FP adder



#### **FP Instructions in MIPS**

- FP hardware is coprocessor 1
  - Adjunct processor that extends the ISA
- Separate FP registers
  - 32 single-precision: \$f0, \$f1, ... \$f31
  - Paired for double-precision: \$f0/\$f1, \$f2/\$f3, ...
    - Release 2 of MIPs ISA supports 32 × 64-bit FP reg's
- FP instructions operate only on FP registers
  - Programs generally don't do integer ops on FP data, or vice versa
  - More registers with minimal code-size impact
- FP load and store instructions
  - Iwc1, Idc1, swc1, sdc1
    - e.g., I dc1 \$f8, 32(\$sp)



#### **FP Instructions in MIPS**

- Single-precision arithmetic
  - add. s, sub. s, mul. s, div.s
    - e.g., add. s \$f0, \$f1, \$f6
- Double-precision arithmetic
  - add. d, sub. d, mul. d, di v. d
    - e.g., mul.d \$f4, \$f4, \$f6
- Single- and double-precision comparison
  - c. xx. s, c. xx. d (xx can be eq, l t, l e, ...)
  - Sets or clears FP condition-code bit
    - e.g. c.lt.s \$f3, \$f4
- Branch on FP condition code true or false
  - bc1t, bc1f
    - e.g., bc1t TargetLabel



### FP Example: °F to °C

C code:

```
float f2c (float fahr) {
    return ((5.0/9.0)*(fahr - 32.0));
}
```

- fahr in \$f12, result in \$f0, literals in global memory space
- Compiled MIPS code:
  - f2c: lwc1 \$f16, const5(\$gp) # 5.0
    lwc2 \$f18, const9(\$gp) # 9.0
    div.s \$f16, \$f16, \$f18
    lwc1 \$f18, const32(\$gp) # 32.0
    sub.s \$f18, \$f12, \$f18
    mul.s \$f0, \$f16, \$f18
    jr \$ra



#### **FP Example: Array Multiplication**

$$X = X + Y \times Z$$

All 32 × 32 matrices, 64-bit double-precision elements

#### C code:

i, j, k in \$s0, \$s1, \$s2

#### **FP Example: Array Multiplication**

#### MIPS code:

|     | li   | \$t1,  | 32     |      | # | \$t1  | = 32 (row size/loop end)             |
|-----|------|--------|--------|------|---|-------|--------------------------------------|
|     | l i  | \$s0,  | 0      |      | # | i =   | O; initialize 1st for loop           |
| L1: | l i  | \$s1,  | 0      |      | # | j =   | 0; restart 2nd for loop              |
| L2: | l i  | \$s2,  | 0      |      | # | k =   | 0; restart 3rd for loop              |
|     | sH   | \$t2,  | \$s0,  | 5    | # | \$t2  | = i * 32 (size of row of x)          |
|     | addu | \$t2,  | \$t2,  | \$s1 | # | \$t2  | = i * size(row) + j                  |
|     | sH   | \$t2,  | \$t2,  | 3    | # | \$t2  | <pre>= byte offset of [i][j]</pre>   |
|     | addu | \$t2,  | \$a0,  | \$t2 | # | \$t2  | <pre>= byte address of x[i][j]</pre> |
|     | I.d  | \$f4,  | 0(\$t2 | 2)   | # | \$f4  | = 8 bytes of x[i][j]                 |
| L3: | sH   | \$t0,  | \$s2,  | 5    | # | \$t0  | = k * 32 (size of row of z)          |
|     | addu | \$t0,  | \$t0,  | \$s1 | # | \$t0  | = k * size(row) + j                  |
|     | sH   | \$t0,  | \$t0,  | 3    | # | \$t0  | <pre>= byte offset of [k][j]</pre>   |
|     | addu | \$t0,  | \$a2,  | \$t0 | # | \$t0  | <pre>= byte address of z[k][j]</pre> |
|     | I.d  | \$f16, | 0(\$1  | t0)  | # | \$f16 | 6 = 8 bytes of z[k][j]               |

...

#### **FP Example: Array Multiplication**

. . .

| •••    |                |         |          |   |                                            |
|--------|----------------|---------|----------|---|--------------------------------------------|
| sll \$ | \$tO, \$       | \$s0, 5 | 5        | # | <pre>\$t0 = i *32 (size of row of y)</pre> |
| addu   | \$t0,          | \$t0,   | \$s2     | # | \$t0 = i *size(row) + k                    |
| sH     | \$t0,          | \$t0,   | 3        | # | <pre>\$t0 = byte offset of [i][k]</pre>    |
| addu   | \$t0,          | \$a1,   | \$t0     | # | <pre>\$t0 = byte address of y[i][k]</pre>  |
| I.d    | \$f18,         | 0(\$1   | t0)      | # | <pre>\$f18 = 8 bytes of y[i][k]</pre>      |
| mul.d  | \$f16,         | \$f18   | 3, \$f16 | # | <pre>\$f16 = y[i][k] * z[k][j]</pre>       |
| add. d | \$f4,          | \$f4,   | \$f16    | # | f4=x[i][j] + y[i][k]*z[k][j]               |
| addi u | \$s2,          | \$s2,   | 1        | # | \$k k + 1                                  |
| bne    | \$s2,          | \$t1,   | L3       | # | if (k != 32) go to L3                      |
| s.d    | \$ <b>f</b> 4, | 0(\$t2  | 2)       | # | x[i][j] = \$f4                             |
| addi u | \$s1,          | \$s1,   | 1        | # | \$j = j + 1                                |
| bne    | \$s1,          | \$t1,   | L2       | # | if (j != 32) go to L2                      |
| addi u | \$s0,          | \$s0,   | 1        | # | \$i = i + 1                                |
| bne    | \$s0,          | \$t1,   | L1       | # | if (i != 32) go to L1                      |

#### **Accurate Arithmetic**

- IEEE Std 754 specifies additional rounding control
  - Extra bits of precision (guard, round, sticky)
  - Choice of rounding modes
  - Allows programmer to fine-tune numerical behavior of a computation
- Not all FP units implement all options
  - Most programming languages and FP libraries just use defaults
- Trade-off between hardware complexity, performance, and market requirements



#### **Interpretation of Data**

#### **The BIG Picture**

Bits have no inherent meaning

 Interpretation depends on the instructions applied (i.e. how it is encoded)

#### Computer representations of numbers

- Finite range and precision
- Need to account for this in user programs



#### Associativity

- Parallel programs may interleave operations in unexpected orders
  - Assumptions of associativity may fail

|   | Initial   | (x+y)+z  | x+(y+z)   |
|---|-----------|----------|-----------|
| X | -1.50E+38 |          | -1.50E+38 |
| у | 1.50E+38  | 0.00E+00 |           |
| Z | 1.0       | 1.0      | 1.50E+38  |
|   |           | 1.00E+00 | 0.00E+00  |

 Need to validate parallel programs under varying degrees of parallelism (y >> z)



#### **x86 FP Architecture**

- Originally based on 8087 FP coprocessor
  - 8 × 80-bit extended-precision registers
  - Used as a push-down stack
  - Registers indexed from TOS: ST(0), ST(1), …
- FP values are 32-bit or 64 in memory
  - Converted on load/store of memory operand
  - Integer operands can also be converted on load/store
- Very difficult to generate and optimize code
  - Result: poor FP performance



### **x86 FP Instructions**

| Data transfer                                              | Arithmetic                                                                                      | Compare                           | Transcendental                                      |
|------------------------------------------------------------|-------------------------------------------------------------------------------------------------|-----------------------------------|-----------------------------------------------------|
| FILD mem/ST(i)<br>FISTP mem/ST(i)<br>FLDPI<br>FLD1<br>FLDZ | FIADDP mem/ST(i)<br>FISUBRP mem/ST(i)<br>FIMULP mem/ST(i)<br>FIDIVRP mem/ST(i)<br>FSQRT<br>FABS | FICOMP<br>FIUCOMP<br>FSTSW AX/mem | FPATAN<br>F2XMI<br>FCOS<br>FPTAN<br>FPREM<br>FPSI N |
|                                                            | FRNDI NT                                                                                        |                                   | FYL2X                                               |

- Optional variations
  - I : integer operand
  - P: pop operand from stack
  - R: reverse operand order
  - But not all combinations allowed



#### **Streaming SIMD Extension 2 (SSE2)**

- Adds 4 × 128-bit registers
  - Extended to 8 registers in AMD64/EM64T
- Can be used for multiple FP operands
  - 2 × 64-bit double precision
  - 4 × 32-bit double precision
  - Instructions operate on them simultaneously

<u>Single-Instruction Multiple-Data</u>



### **Right Shift and Division**

- Left shift by *i* places multiplies an integer by 2<sup>i</sup>
- Right shift divides by 2<sup>i</sup>?
  - Only for unsigned integers
- For signed integers
  - Arithmetic right shift: replicate the sign bit
  - e.g., -5 / 4
    - $11111011_2 >> 2 = 11111110_2 = -2$
    - Rounds toward —∞
  - c.f. 11111011<sub>2</sub> >>> 2 = 00111110<sub>2</sub> = +62



#### Who Cares About FP Accuracy?

- Important for scientific code
  - But for everyday consumer use?
    - "My bank balance is out by 0.0002¢!" ⊗
- The Intel Pentium FDIV bug
  - The market expects accuracy
  - See Colwell, The Pentium Chronicles

### **Concluding Remarks**

- ISAs support arithmetic
  - Signed and unsigned integers
  - Floating-point approximation to reals
- Bounded range and precision
  - Operations can overflow and underflow
- MIPS ISA
  - Core instructions: 54 most frequently used
     100% of SPECINT, 97% of SPECFP
  - Other instructions: less frequent

