Author: Somoy Barua ([email protected])

Personal Website: somoy.me

Ethics and Disclosure

This article, including the methodology described in the code and the content of this web page, contains material that can enable users to generate harmful content from some public LLMs. The techniques presented here are straightforward to implement, have appeared in similar forms in the literature, and would ultimately be discoverable by any dedicated team intent on leveraging language models to generate harmful content.

The purpose of this article is purely educational, with the intent to encourage further research in this domain and make AI more trustworthy and safe. The author, his affiliations, and Bluedot Impact aim to facilitate AI alignment and safety through this article and are not responsible for any harm the techniques described here may cause.

[The article encourages SDG-9 and SDG-12 of the Sustainable Development Goals]

Summary

TL;DR: Recent research shows that popular fine-tuned and quantized LLMs are more vulnerable than their base versions. This is concerning because locally deployed LLMs are often considered safer. In this post, I experiment with how even DeepSeek, one of the latest and most popular AI models, is not safe from these vulnerabilities in its quantized and distilled variants. I also discuss and motivate AI alignment-aware quantization processes, especially the potential to combine model-explanation attributes with quantization to help preserve a model's critical decision pathways.

Disclaimer: Due to severe resource and time constraints, I limited many of my choices and opted to make this post more of an idea discussion than solid research.

Introduction

The rapid advancement of Large Language Models (LLMs) has led to their widespread deployment across various domains. However, these models often come with significant computational and memory requirements, making deployment challenging in resource-constrained environments.

Model quantization has emerged as a crucial technique for addressing these challenges by reducing the precision of weights and activations while maintaining acceptable performance. At the same time, aligning AI systems with human values and intentions fundamentally requires understanding model behaviour. While pruning and quantization are widely used to compress models and accelerate inference, they typically disregard the model's alignment, potentially obscuring or altering critical decision pathways.
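To make the precision-reduction step concrete, here is a minimal sketch of symmetric per-tensor int8 post-training quantization in NumPy. This is an illustrative toy, not the scheme used by any particular LLM runtime; the function names `quantize_int8` and `dequantize` are my own. It shows how each float weight is snapped to one of 255 levels, and how the round-trip error is exactly the kind of small perturbation that can nudge decision pathways.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: map float weights to int8."""
    # One scale for the whole tensor, chosen so the largest weight maps to 127.
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.03, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# The per-weight rounding error is bounded by half a quantization step.
err = np.max(np.abs(w - w_hat))
assert err <= scale / 2 + 1e-6
```

Real LLM quantization schemes (e.g. per-channel scales, group-wise 4-bit formats) are more elaborate, but the core trade-off is the same: every weight absorbs a small rounding error, and nothing in the procedure checks whether safety-relevant weights are among those perturbed.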

In recent research by EnkryptAI, "Increased LLM Vulnerabilities from Fine-tuning and Quantization" (Link), the authors showed that fine-tuned and quantized LLMs are more susceptible to vulnerabilities such as jailbreaks and malicious prompts. Some of the key result tables from the paper are given below.

| Model | Derived From | Fine-tuned | Jailbreak (%) |
|---|---|---|---|
| Llama2-7B | - | - | 6 |
| CodeLlama-7B | Llama2-7B | Yes | 32 |
| SQLCoder-2 | CodeLlama-7B | Yes | 82 |
| Mistral-7B-v0.1 | - | - | 85.3 |
| dolphin-2.2.1-Mistral-7B-v0.1 | Mistral-7B-v0.1 | Yes | 99 |
| MPT-7B | - | - | 93 |
| IntelNeuralChat-7B | MPT-7B | Yes | 94 |

Table 1: Effect of Fine-Tuning on LLM Vulnerability