Commit 6561f0e

Mehta, Hitarth (quic-hitameht) authored and committed

Add initial draft for LLM quantization recipes

Signed-off-by: Hitarth Mehta <quic_hitameht@quicinc.com>
Co-authored-by: Hitarth Mehta <quic_hitameht@quicinc.com>
1 parent 577f804 commit 6561f0e

File tree

13 files changed: +1597 -2 lines changed

Docs/tutorials/index.rst

Lines changed: 2 additions & 1 deletion
@@ -6,7 +6,7 @@ Tutorials

   This section walks through tutorials to get you started on quantizing models.

 - AIMET is packed with out-of-the-box quantization techniques to studing detailed quantization impact of each layer.
 + AIMET is packed with out-of-the-box quantization techniques for studying the detailed quantization impact of each layer.

   This section walks you through how to use these out-of-the-box techniques to get a model with best-in-class accuracy, and
   how to take this further with advanced techniques depending on your use case.

@@ -17,6 +17,7 @@ how you take this further ahead with advanced techniques depending on your use c

    Quantization Workflow <quantization_workflow>
    Quantization Simulation <quantsim>
 +  Quantization Recipes for LLMs <quantization_recipe>
    Example Notebooks <notebooks>
    Running Quantized Models on-device <on_target_inference>
    Debugging Guide <debugging_guidelines>
Lines changed: 63 additions & 0 deletions
meta-llama/Llama-3.2-1B-Instruct
================================

Precision settings:

- Weights: INT4, except for:

  - ``LM Head``: INT8

- Activations: INT16, except for:

  - ``KV Cache``: INT8

Hyperparameters:

- AdaScale: ``num_batches=128``, ``num_iterations=2048``
- SequentialMSE: ``num_batches=20``
- Calibration: ``num_batches=20``

.. list-table::
   :widths: 50 18 18 3 3 5 3
   :header-rows: 1

   * - Technique
     - Quantized With
     - Evaluated On
     - PPL
     - MMLU
     - Time (hh:mm:ss)
     - CUDA (GB)
   * - FP32
     - N/A
     - Both
     - 12.14
     - 46.06
     - 00:00:14
     - 6.34
   * - PCQ + SpinQuant + AdaScale
     - ``aimet-torch``
     - ``aimet-onnx``
     - 13.67
     - 42.25
     - 02:31:06
     - 20.89
   * - PCQ + SpinQuant + AdaScale
     - ``aimet-onnx``
     - ``aimet-onnx``
     - 13.68
     - 41.82
     - 01:53:17
     - 46.38
   * - LPBQ + SequentialMSE
     - ``aimet-torch``
     - ``aimet-onnx``
     - 14.07
     - 43.09
     - 00:44:38
     - 28.52
   * - LPBQ + SequentialMSE
     - ``aimet-onnx``
     - ``aimet-onnx``
     - 13.84
     - 43.53
     - 00:20:44
     - 34.79
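The PCQ rows above use per-channel quantization: each output channel of a weight matrix receives its own scale, so a single large channel does not inflate the rounding error of the others. A minimal pure-Python sketch of that idea (an illustration of the concept only, not AIMET's implementation):

```python
def quantize_per_channel_int4(weights):
    """Quantize each row (output channel) to INT4 [-8, 7] with its own
    symmetric scale, so one large channel does not hurt the others."""
    qmin, qmax = -8, 7
    quantized, scales = [], []
    for row in weights:
        scale = max(abs(v) for v in row) / qmax or 1.0
        quantized.append([min(qmax, max(qmin, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales

def dequantize(quantized, scales):
    """Map INT4 codes back to floats using each row's scale."""
    return [[v * s for v in row] for row, s in zip(quantized, scales)]

w = [[0.1, -0.05, 0.07],   # small-magnitude channel
     [2.0, -1.5, 0.5]]     # large-magnitude channel
q, scales = quantize_per_channel_int4(w)
w_hat = dequantize(q, scales)
```

With a single per-tensor scale, the small-magnitude channel would collapse to one or two codes; per-channel scales keep its rounding error proportional to its own range.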
Lines changed: 63 additions & 0 deletions
meta-llama/Llama-3.2-3B-Instruct
================================

Precision settings:

- Weights: INT4, except for:

  - ``LM Head``: INT8

- Activations: INT16, except for:

  - ``KV Cache``: INT8

Hyperparameters:

- AdaScale: ``num_batches=128``, ``num_iterations=1024``
- SequentialMSE: ``num_batches=20``
- Calibration: ``num_batches=20``

.. list-table::
   :widths: 50 18 18 3 3 5 3
   :header-rows: 1

   * - Technique
     - Quantized With
     - Evaluated On
     - PPL
     - MMLU
     - Time (hh:mm:ss)
     - CUDA (GB)
   * - FP32
     - N/A
     - Both
     - 10.13
     - 60.74
     - 00:00:10
     - 13.90
   * - PCQ + SpinQuant + AdaScale
     - ``aimet-torch``
     - ``aimet-onnx``
     - 11.01
     - 58.09
     - 06:35:22
     - 41.24
   * - PCQ + AdaScale
     - ``aimet-onnx``
     - ``aimet-onnx``
     - 11.14
     - 56.79
     - 04:49:36
     - 47.35
   * - LPBQ + SequentialMSE
     - ``aimet-torch``
     - ``aimet-onnx``
     - 10.69
     - 59.08
     - 02:41:44
     - 51.11
   * - LPBQ + SequentialMSE
     - ``aimet-onnx``
     - ``aimet-onnx``
     - 10.55
     - 59.29
     - 01:13:12
     - 59.41
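The PPL column is perplexity, where lower is better: the exponential of the average per-token negative log-likelihood. A self-contained sketch of the metric itself (the evaluation harness behind these numbers is not shown in this commit):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token).
    A model that guesses uniformly over V tokens scores PPL = V."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model assigning every token probability 1/8 scores PPL = 8.
uniform_ppl = perplexity([math.log(1.0 / 8.0)] * 10)
```

So the jump from 10.13 (FP32) to 11.01 after quantization means the quantized model is, on average, about as uncertain as choosing among one extra equally likely token.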
Lines changed: 63 additions & 0 deletions
microsoft/Phi-3.5-mini-instruct
===============================

Precision settings:

- Weights: INT4, except for:

  - ``LM Head``: INT8

- Activations: INT16, except for:

  - ``KV Cache``: INT8

Hyperparameters:

- AdaScale: ``num_batches=128``, ``num_iterations=256``
- SequentialMSE: ``num_batches=20``
- Calibration: ``num_batches=20``

.. list-table::
   :widths: 50 18 18 3 3 5 3
   :header-rows: 1

   * - Technique
     - Quantized With
     - Evaluated On
     - PPL
     - MMLU
     - Time (hh:mm:ss)
     - CUDA (GB)
   * - FP32
     - N/A
     - Both
     - 5.77
     - 68.89
     - 00:00:08
     - 16.17
   * - PCQ + SpinQuant + AdaScale
     - ``aimet-torch``
     - ``aimet-onnx``
     - 6.58
     - 62.62
     - 04:16:53
     - 48.03
   * - PCQ + SpinQuant + AdaScale
     - ``aimet-onnx``
     - ``aimet-onnx``
     - 6.50
     - 62.51
     - 01:51:43
     - 61.85
   * - LPBQ + SequentialMSE
     - ``aimet-torch``
     - ``aimet-onnx``
     - 6.45
     - 64.63
     - 02:03:41
     - 37.64
   * - LPBQ + SequentialMSE
     - ``aimet-onnx``
     - ``aimet-onnx``
     - 6.41
     - 63.90
     - 01:32:36
     - 75.62
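The LPBQ rows rest on blockwise weight quantization: each block within a channel gets its own scale, which shrinks rounding error when magnitudes vary along the channel (the real LPBQ scheme additionally constrains block scales to low-bit integer multiples of a per-channel scale; the sketch below shows only the plain blockwise idea and is not AIMET's code):

```python
def quantize_blockwise(row, block_size=4, qmax=7):
    """Split a weight row into blocks and give each block its own
    symmetric INT4 scale. Returns a list of (codes, scale) pairs."""
    blocks = []
    for i in range(0, len(row), block_size):
        block = row[i:i + block_size]
        scale = max(abs(v) for v in block) / qmax or 1.0
        codes = [min(qmax, max(-qmax - 1, round(v / scale))) for v in block]
        blocks.append((codes, scale))
    return blocks

def dequantize_blockwise(blocks):
    """Concatenate the dequantized blocks back into one row."""
    return [c * s for codes, s in blocks for c in codes]

row = [0.01, -0.02, 0.015, 0.005,   # small-magnitude block
       1.0, -0.8, 0.6, 0.9]         # large-magnitude block
row_hat = dequantize_blockwise(quantize_blockwise(row))
```

The small-magnitude block receives a much finer scale than the large one, which a single per-channel scale could not provide.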
Lines changed: 62 additions & 0 deletions
Qwen/Qwen2.5-0.5B-Instruct
==========================

Precision settings:

- Weights: INT4, except for:

  - ``LM Head``: INT8

- Activations: INT16

Hyperparameters:

- AdaScale: ``num_batches=128``, ``num_iterations=2048``
- SequentialMSE: ``num_batches=20``
- Calibration: ``num_batches=20``

.. list-table::
   :widths: 50 18 18 3 3 5 3
   :header-rows: 1

   * - Technique
     - Quantized With
     - Evaluated On
     - PPL
     - MMLU
     - Time (hh:mm:ss)
     - CUDA (GB)
   * - FP32
     - N/A
     - Both
     - 13.14
     - 46.30
     - 00:00:13
     - 3.68
   * - PCQ + SpinQuant + AdaScale
     - ``aimet-torch``
     - ``aimet-onnx``
     - 13.89
     - 44.19
     - 03:19:37
     - 13.37
   * - PCQ + SpinQuant + AdaScale
     - ``aimet-onnx``
     - ``aimet-onnx``
     - 13.82
     - 42.65
     - 01:16:54
     - 34.01
   * - LPBQ + SequentialMSE
     - ``aimet-torch``
     - ``aimet-onnx``
     - 15.32
     - 42.33
     - 00:22:39
     - 14.25
   * - LPBQ + SequentialMSE
     - ``aimet-onnx``
     - ``aimet-onnx``
     - 15.30
     - 43.26
     - 00:11:33
     - 20.43
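The ``Calibration: num_batches=20`` setting means the range (encoding) of each INT16 activation quantizer is derived from statistics collected over 20 batches. A minimal min-max sketch of what such a calibrator computes (illustrative only; AIMET's calibration schemes are more sophisticated than raw min-max):

```python
def calibrate_minmax_int16(batches):
    """Derive an asymmetric INT16 (scale, offset) encoding from the
    min/max values observed across calibration batches."""
    lo = min(v for batch in batches for v in batch)
    hi = max(v for batch in batches for v in batch)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # keep zero exactly representable
    scale = (hi - lo) / 65535 or 1.0      # 2**16 - 1 steps
    offset = round(-lo / scale)
    return scale, offset

def quantize_act(x, scale, offset):
    """Map a float activation to an unsigned INT16 code."""
    return max(0, min(65535, round(x / scale) + offset))

scale, offset = calibrate_minmax_int16([[0.0, 1.0, 0.25], [-1.0, 0.5]])
```

Too few calibration batches risks missing the true activation range; too many mostly adds runtime, which is one reason the recipe fixes this at 20 batches.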
Lines changed: 62 additions & 0 deletions
Qwen/Qwen2.5-1.5B-Instruct
==========================

Precision settings:

- Weights: INT4, except for:

  - ``LM Head``: INT8

- Activations: INT16

Hyperparameters:

- AdaScale: ``num_batches=128``, ``num_iterations=1024``
- SequentialMSE: ``num_batches=20``
- Calibration: ``num_batches=20``

.. list-table::
   :widths: 50 18 18 3 3 5 3
   :header-rows: 1

   * - Technique
     - Quantized With
     - Evaluated On
     - PPL
     - MMLU
     - Time (hh:mm:ss)
     - CUDA (GB)
   * - FP32
     - N/A
     - Both
     - 12.41
     - 54.65
     - 00:00:10
     - 7.78
   * - PCQ + SpinQuant + AdaScale
     - ``aimet-torch``
     - ``aimet-onnx``
     - 13.57
     - 49.81
     - 03:03:17
     - 22.62
   * - PCQ + SpinQuant + AdaScale
     - ``aimet-onnx``
     - ``aimet-onnx``
     - 13.35
     - 50.27
     - 02:13:33
     - 42.97
   * - LPBQ + SequentialMSE
     - ``aimet-torch``
     - ``aimet-onnx``
     - 14.86
     - 49.25
     - 01:07:43
     - 26.01
   * - LPBQ + SequentialMSE
     - ``aimet-onnx``
     - ``aimet-onnx``
     - 14.33
     - 49.97
     - 00:37:52
     - 34.40
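SpinQuant, which appears in the PCQ rows, folds orthogonal rotations into the weights so outliers are spread across channels before quantization while the network's function is unchanged: for an orthogonal R, (W2 R^T)(R W1) = W2 W1. A toy 2x2 demonstration of that invariance (a sketch of the principle, not AIMET's implementation, which learns the rotations):

```python
import math

def matmul(a, b):
    """Naive matrix multiply, sufficient for the 2x2 toy example."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# A 2x2 rotation by 45 degrees and its transpose; R is orthogonal.
t = math.pi / 4
r = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]
rt = [[r[0][0], r[1][0]], [r[0][1], r[1][1]]]

w1 = [[1.0, 2.0], [3.0, 4.0]]      # first linear layer
w2 = [[0.5, -1.0], [2.0, 0.25]]    # second linear layer

# Folding R into W1 and R^T into W2 leaves the composed map unchanged,
# but the rotated weights can have a flatter value distribution that
# quantizes with less error.
rotated = matmul(matmul(w2, rt), matmul(r, w1))
plain = matmul(w2, w1)
```

Because the composed map is identical, the rotation is "free" at inference time once folded into the weights; only the quantization error changes.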
Lines changed: 63 additions & 0 deletions
Qwen/Qwen3-4B
=============

Precision settings:

- Weights: INT4, except for:

  - ``LM Head``: INT8

- Activations: INT16, except for:

  - ``KV Cache``: INT8

Hyperparameters:

- AdaScale: ``num_batches=128``, ``num_iterations=512``
- SequentialMSE: ``num_batches=20``
- Calibration: ``num_batches=20``

.. list-table::
   :widths: 50 18 18 3 3 5 3
   :header-rows: 1

   * - Technique
     - Quantized With
     - Evaluated On
     - PPL
     - MMLU
     - Time (hh:mm:ss)
     - CUDA (GB)
   * - FP32
     - N/A
     - Both
     - 12.41
     - 70.06
     - 00:00:10
     - 17.02
   * - PCQ + SpinQuant + AdaScale
     - ``aimet-torch``
     - ``aimet-onnx``
     - 13.85
     - 65.07
     - 06:41:32
     - 47.71
   * - PCQ + AdaScale
     - ``aimet-onnx``
     - ``aimet-onnx``
     - 13.79
     - 62.33
     - 04:34:22
     - 71.30
   * - LPBQ + SequentialMSE
     - ``aimet-torch``
     - ``aimet-onnx``
     - 13.10
     - 65.66
     - 02:41:48
     - 39.42
   * - LPBQ + SequentialMSE
     - ``aimet-onnx``
     - ``aimet-onnx``
     - 12.77
     - 65.36
     - 01:35:29
     - 63.61
