Author: Somoy Barua ([email protected])

Personal Website: somoy.me

Ethics and Disclosure

This article, including the methodology described in the code and the content of this web page, contains material that can enable users to generate harmful content from some public LLMs. The techniques presented here are straightforward to implement, have appeared in similar forms in the literature, and would ultimately be discoverable by any dedicated team intent on leveraging language models to generate harmful content.

The purpose of this article is purely educational, with the intent to encourage further research in this domain and make AI more trustworthy and safe. The author, his affiliations, and Bluedot Impact aim to facilitate AI alignment and safety through this article and are not responsible for any harm the techniques described here may cause.

[The article encourages SDG-9 and SDG-12 of the Sustainable Development Goals]

Summary

TL;DR: Recent research shows that popular fine-tuned and quantized LLMs are more vulnerable than their base versions. This is concerning because locally deployed LLMs are often considered safer. In this post, I experiment with how even DeepSeek, one of the latest and most popular AI models, is not safe from these vulnerabilities in its quantized and distilled variants. I also discuss and motivate AI alignment-aware quantization processes, especially the potential to combine model-explanation attributes with quantization to help preserve a model's critical decision pathways.

Disclaimer: Due to severe resource and time constraints, I limited many of my choices and opted to make this post more of an idea discussion than solid research.

Introduction

The rapid advancement of Large Language Models (LLMs) has led to their widespread deployment across various domains. However, these models often come with significant computational and memory requirements, making deployment challenging in resource-constrained environments.

Model quantization has emerged as a crucial technique for addressing these challenges by reducing the precision of weights and activations while maintaining acceptable performance. At the same time, aligning AI systems with human values and intentions fundamentally requires understanding model behaviour. While pruning and quantization are widely used to compress models and accelerate inference, they typically disregard the model's alignment, potentially obscuring or altering critical decision pathways.
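To make the precision-reduction step concrete, here is a minimal sketch of symmetric per-tensor int8 post-training quantization in NumPy. This is an illustrative toy, not the scheme used by any particular LLM runtime; the function names `quantize_int8` and `dequantize` are my own. It shows how each float weight is snapped to one of 255 levels, and how the round-trip error is exactly the kind of small perturbation that can nudge decision pathways.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: map float weights to int8."""
    # One scale for the whole tensor, chosen so the largest weight maps to 127.
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.03, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# The per-weight rounding error is bounded by half a quantization step.
err = np.max(np.abs(w - w_hat))
assert err <= scale / 2 + 1e-6
```

Real LLM quantization schemes (e.g. per-channel scales, group-wise 4-bit formats) are more elaborate, but the core trade-off is the same: every weight absorbs a small rounding error, and nothing in the procedure checks whether safety-relevant weights are among those perturbed.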

In recent research by EnkryptAI, "Increased LLM Vulnerabilities from Fine-tuning and Quantization" (Link), the authors showed that fine-tuned and quantized LLMs are more susceptible to vulnerabilities such as jailbreaks and malicious prompts. Some of the key result tables from the paper are given below.

| Model | Derived From | Fine-tuned | Jailbreak (%) |
|---|---|---|---|
| Llama2-7B | - | - | 6 |
| CodeLlama-7B | Llama2-7B | Yes | 32 |
| SQLCoder-2 | CodeLlama-7B | Yes | 82 |
| Mistral-7B-v0.1 | - | - | 85.3 |
| dolphin-2.2.1-Mistral-7B-v0.1 | Mistral-7B-v0.1 | Yes | 99 |
| MPT-7B | - | - | 93 |
| IntelNeuralChat-7B | MPT-7B | Yes | 94 |

Table 1: Effect of Fine-Tuning on LLM Vulnerability