Nvidia#

Papers#

4-bit floating point formats have emerged as a response to the rising cost and deployment challenges of large language models. The S1E2M1 format is part of the Open Compute Project (OCP) standard.

As a result, onnx==1.18.0 introduced a new data type, supported by a limited set of operators, to enable computation with float4.

  • FLOAT4E2M1: 1 bit for the sign, 2 bits for the exponent, and 1 bit for the mantissa. No NaN or infinities.

E2M1#

$S$ stands for the sign bit. A subscript 2, as in $10_2$, denotes a number written in base 2.

Float4 type#

| | E2M1 |
|---|---|
| Exponent bias | 1 |
| Infinities | none |
| NaN | none |
| Zeros | \(S.00.0_2\) |
| Max | \(S.11.1_2\) |
| Min | \(S.00.1_2 = 2^{-1}\) |

Let’s denote the bit representation as $S.b_2 b_1 b_0$. The float value is defined by the following expressions:

Float4 type values#

| | E2M1 |
|---|---|
| exponent \(\neq\) 0 | \((-1)^S 2^{\sum_{i=1}^2 b_i 2^{i-1} - 1} \left( 1 + b_0 2^{-1} \right)\) |
| exponent \(=\) 0 | \((-1)^S b_0 2^{-1}\) |

The following table lists all the representable values by float4 E2M1, ignoring the sign bit:

Float4 type values#

| bits (ignoring sign bit) | E2M1 |
|---|---|
| 000 | 0 |
| 001 | 0.5 |
| 010 | 1 |
| 011 | 1.5 |
| 100 | 2 |
| 101 | 3 |
| 110 | 4 |
| 111 | 6 |
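As a cross-check of the two expressions above, the following is a minimal Python sketch that decodes a 4-bit pattern into a float. The function name `decode_e2m1` is hypothetical and is not part of the onnx API; it only assumes the bit layout $S.b_2 b_1 b_0$ described in this section.

```python
def decode_e2m1(nibble: int) -> float:
    """Decode a 4-bit E2M1 pattern (S.b2 b1.b0) into a Python float.

    Hypothetical helper for illustration, not an onnx function.
    """
    if not 0 <= nibble <= 0xF:
        raise ValueError("expected a 4-bit value in [0, 15]")
    sign = -1.0 if nibble & 0b1000 else 1.0
    exponent = (nibble >> 1) & 0b11  # bits b2 b1
    mantissa = nibble & 0b1          # bit b0
    if exponent == 0:
        # Subnormal case: no implicit leading one, value is b0 * 2^-1.
        return sign * mantissa * 0.5
    # Normal case: exponent bias is 1, implicit leading one.
    return sign * 2.0 ** (exponent - 1) * (1.0 + mantissa * 0.5)


# Enumerate the eight non-negative patterns 000..111.
print([decode_e2m1(b) for b in range(8)])
# [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Setting the sign bit negates each value, so pattern `1111` decodes to -6 and pattern `1000` decodes to negative zero.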

Cast#

Upcasting from float4 to float32, float16, bfloat16, and float8 is exact. The behavior for downcasting to float4 is summarized below:

| x | E2M1 |
|---|---|
| -6<=x<=6 | E2M1 converted value of x. Round to nearest even. |
| x=+/-0 | +/-0 |
| x>6 | 6 |
| x<-6 | -6 |
| +Inf | 6 |
| -Inf | -6 |
| NaN | 6 |
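The downcasting rules can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the onnx implementation: the helper name `round_to_e2m1` is hypothetical, "even" is taken to mean the candidate whose mantissa bit is 0, and an input that rounds to zero keeps its sign.

```python
import math

# Non-negative E2M1 values, indexed by their 3-bit pattern.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_e2m1(x: float) -> int:
    """Return the 4-bit E2M1 pattern for x (hypothetical helper)."""
    if math.isnan(x):
        return 0b0111  # NaN maps to 6, as in the table
    sign = 0b1000 if math.copysign(1.0, x) < 0 else 0
    mag = abs(x)
    if mag >= 6.0:
        return sign | 0b0111  # saturate; also covers +/-Inf
    best = 0
    for code in range(1, 8):
        d_best = abs(mag - E2M1_VALUES[best])
        d_code = abs(mag - E2M1_VALUES[code])
        # On a tie, prefer the pattern with mantissa bit 0 (even code).
        if d_code < d_best or (d_code == d_best and code % 2 == 0):
            best = code
    return sign | best


print(format(round_to_e2m1(2.5), "04b"))   # 0100, i.e. 2 (tie between 2 and 3)
print(format(round_to_e2m1(-7.0), "04b"))  # 1111, i.e. -6 (saturated)
```

A linear scan over the eight magnitudes keeps the tie-breaking rule visible; a production implementation would more likely use fixed thresholds or bit manipulation.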

Packing and Unpacking#

Float4 values are stored two per byte (2 x 4 bits). The first element is stored in the 4 least significant bits (LSB) and the second element in the 4 most significant bits (MSB), i.e. for elements x and y that are consecutive elements in the array:

pack(x,y): y << 4 | x & 0x0F
unpack(z): x = z & 0x0F, y = z >> 4

If the total number of elements is odd, 4 bits of padding are appended. The storage size of a 4-bit tensor with N elements is ceil(N/2) bytes.
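The packing scheme can be sketched in Python as shown below. The function names `pack_nibbles` and `unpack_nibbles` are hypothetical, and using zero for the padding nibble is an assumption, since the text above only states that 4 bits of padding are appended.

```python
def pack_nibbles(codes):
    """Pack 4-bit codes into bytes, first element in the low nibble."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)  # padding nibble; the value 0 is an assumption
    out = bytearray()
    for x, y in zip(codes[0::2], codes[1::2]):
        out.append(((y & 0x0F) << 4) | (x & 0x0F))
    return bytes(out)


def unpack_nibbles(data, count):
    """Unpack `count` 4-bit codes from bytes produced by pack_nibbles."""
    codes = []
    for z in data:
        codes.append(z & 0x0F)  # first element: 4 LSB
        codes.append(z >> 4)    # second element: 4 MSB
    return codes[:count]        # drop the padding nibble, if any


packed = pack_nibbles([0x1, 0x2, 0x3])
print(packed.hex())               # 2103
print(unpack_nibbles(packed, 3))  # [1, 2, 3]
```

Three elements occupy ceil(3/2) = 2 bytes here. The caller has to pass the element count when unpacking, because the packed bytes alone do not say whether the last high nibble is data or padding.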