Brewing Your Own Offensive Coding Assistant: Fine-Tuning LLMs for Red Team Work
A practitioner's walkthrough of QLoRA fine-tuning a local LLM for offensive security tasks — dataset construction, training with Unsloth, evaluation, and deployment via Ollama. Built for engagements where sending tradecraft to a cloud API isn't an option.
Most red teamers I know use ChatGPT or Claude as a force multiplier — payload scaffolding, recon parsing, quick code transforms, writing pretext copy for a phish. It works, until it doesn’t. Frontier models refuse plenty of legitimate engagement work, the refusals are inconsistent across releases, and — more importantly — sending engagement-specific data to a third-party API is a non-starter for most serious work. Client target lists, internal tooling, captured creds, recovered C2 logs: none of it should leave your jump host.
The answer isn’t a jailbreak prompt. It’s a small, local, fine-tuned model that already knows how to help, runs on your own hardware, and never phones home. This post walks through doing exactly that: fine-tuning a 3B-7B parameter open-weight code model with QLoRA, on commodity hardware, against a dataset shaped for offensive coding tasks.
I’ll be honest up front: I’m learning this alongside you. The choices here are the ones I’d defend after reading the literature and running a few iterations, but fine-tuning is a craft and there are a lot of viable answers. Where I made a judgment call, I’ll say so.
Why Fine-Tune (and Why Not)
There are four common ways to make an LLM more useful for a specific domain. They stack — you can do all of them — but they solve different problems.
| Approach | What it does | Cost | When to reach for it |
|---|---|---|---|
| Prompt engineering | Steer behavior with instructions and examples in-context | Free | First move. Always. |
| RAG | Inject retrieved context (docs, notes, code) at inference time | Low | The model lacks facts you have on disk |
| Fine-tuning | Update model weights on examples of desired behavior | Medium | The model lacks patterns — output format, tone, refusal posture, domain idioms |
| Abliteration | Surgically remove refusal directions from a model’s residual stream | Low | You only need to neutralize refusals; you don’t need the model to be better at the task |
For an offensive coding assistant, fine-tuning is the right primary tool. Prompt engineering doesn’t fix consistent over-refusal. RAG doesn’t teach a model to write a particular flavor of evasion code. Abliteration removes guardrails but doesn’t add tradecraft — you still get a generalist that’s now willing to try, not a specialist that’s actually good.
The honest trade-off: fine-tuning is the most expensive option in time and the easiest to do badly. Bad data poisons the model harder than good data improves it. Plan for the dataset to be 80% of the work.
Threat Model and Scope
Before any technical choice, write down what the model is for and what it’s not for. Mine, for this build:
For: A local coding and tradecraft assistant. Generates payload skeletons, transforms code (e.g. C → indirect-syscall variant), summarizes recon output, drafts phishing copy, parses BloodHound paths, explains CVEs, writes detection rules from an attacker’s perspective. Runs on the operator’s workstation. Never reads from or writes to anything outside the box during inference.
Not for: Autonomous operation. Decision-making on live targets. Anything where a hallucination has consequences worse than a wasted minute.
Scope of authorization: Same as the rest of the toolkit. The model is a tool used inside engagements with written authorization. The training data and the model itself stay on operator-controlled hardware.
Writing this down isn’t ceremony — it bounds the dataset. If the model isn’t for autonomous decision-making, you don’t need agentic chain-of-thought traces in the training data. If it’s for code, code is what you train on.
Choosing the Base Model
Three properties matter for an offensive coding assistant:
- Open weights. No API gatekeeping, no terms-of-service surprises, runs offline.
- Code-pretrained. Generalist models can write code; code-specialist models write better code with less data.
- Right-sized for your VRAM. Bigger isn’t better if you can’t iterate.
As of writing, the strong picks across the VRAM tiers:
| VRAM | Recommended student model | Notes |
|---|---|---|
| 24+ GB (3090, 4090, A5000) | Qwen2.5-Coder-7B-Instruct | Comfortable headroom; can also try 14B |
| 16 GB (4080, 4070 Ti Super) | Qwen2.5-Coder-7B-Instruct | Tight but works at batch=1 |
| 12 GB (3060 12GB, 4070) | Qwen2.5-Coder-3B-Instruct | Comfortable; 7B is painful here |
| 8 GB (3060 Ti, 4060) | Qwen2.5-Coder-3B-Instruct | Bring batch=1, max_seq_len=1024 |
The “Instruct” variant matters in all cases — we’re fine-tuning on top of an existing chat-tuned model, not the base. Starting from the instruct variant means the model already knows how to follow ChatML conversations and we’re nudging behavior, not teaching it from scratch. This is faster and produces better results than fine-tuning the base model unless you have hundreds of thousands of high-quality examples.
I’m running this on an 8 GB card so the rest of the post uses Qwen2.5-Coder-3B-Instruct as the student. Swap to the 7B in the snippets below if you have the VRAM — none of the surrounding code changes.
If 8 GB is too tight even for 3B, Runpod is the escape hatch — an A40 (48 GB) is roughly $0.40/hr and lets you train the 7B comfortably. The same scripts work; you just point them at the cloud GPU.
Dataset: The 80% That Matters
The dataset defines the model. If you remember one thing from this post, remember that.
Format
We’re using the role/content messages format, which Qwen’s chat template renders into ChatML (the <|im_start|> / <|im_end|> markup the model was instruction-tuned on):
{"messages":[
  {"role":"system","content":"You are an offensive security assistant. Help the operator with red team coding tasks. Provide working code, explain trade-offs, note detection considerations."},
  {"role":"user","content":"Write a Windows shellcode loader using indirect syscalls for NtAllocateVirtualMemory and NtProtectVirtualMemory. Use Hell's Gate for SSN resolution."},
  {"role":"assistant","content":"Here's a minimal indirect syscall loader using Hell's Gate for SSN resolution...\n\n```c\n// SSN resolution via Hell's Gate\n...\n```\n\n**Detection considerations:**\n- Indirect syscalls bypass user-mode hooks but ETW Threat Intelligence still sees..."}
]}
One conversation per line, as JSONL; the example above is wrapped across lines only for readability. The system prompt should be the same one you’ll use at inference time, because the model learns to expect it.
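A quick structural check before training catches most formatting mistakes. The sketch below encodes one assumption about what "valid" means here: exactly one system, one user, and one assistant message per row, matching the single-turn example above. validate_row is a hypothetical helper, not part of any library.

```python
import json

# Assumption: every row is single-turn, in this exact role order.
EXPECTED_ROLES = ("system", "user", "assistant")


def validate_row(line: str, system_prompt: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list means OK)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = row.get("messages") if isinstance(row, dict) else None
    if not isinstance(messages, list):
        return ["missing 'messages' list"]
    if not all(isinstance(m, dict) for m in messages):
        return ["malformed message (expected JSON objects)"]

    problems = []
    roles = tuple(m.get("role") for m in messages)
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected role order: {roles}")
    if any(not str(m.get("content", "")).strip() for m in messages):
        problems.append("empty message content")
    if roles[:1] == ("system",) and messages[0].get("content") != system_prompt:
        problems.append("system prompt differs from the inference-time prompt")
    return problems
```

Running this over every line and refusing to train until the problem list is empty is cheap insurance; multi-turn rows would need EXPECTED_ROLES relaxed.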
What Goes In
For an offensive coding assistant I’d target a dataset roughly distributed like this:
| Bucket | Share | Example |
|---|---|---|
| Code generation (offensive) | 35% | “Write a [technique] in [language] targeting [platform]” |
| Code transformation | 20% | “Convert this loader to use direct syscalls” |
| Tradecraft Q&A | 15% | “Explain when to use APC injection vs. early bird” |
| Recon/output parsing | 10% | “Summarize this BloodHound JSON into priority paths” |
| Pretext / social engineering writing | 5% | “Draft a phishing pretext for a finance team targeting AP” |
| Detection/defense (attacker perspective) | 10% | “What logs does technique X generate?” |
| Refusal-correction examples | 5% | Cases where the base model refused but shouldn’t have |
The last bucket is the surgical one. Take prompts where Qwen2.5-Coder refuses unhelpfully, write the helpful response yourself, and include them. A few hundred of these go a long way toward calibrating refusal posture without making the model amoral.
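To keep the mix honest as the dataset grows, it helps to tag each row and compare actual shares against the table above. This sketch assumes each row carries a "bucket" key; both that key and the bucket names are hypothetical labels introduced for illustration, and the field should be dropped before training.

```python
from collections import Counter

# Target shares from the table above; the bucket names are hypothetical labels.
TARGET = {
    "codegen": 0.35,
    "transform": 0.20,
    "tradecraft_qa": 0.15,
    "parsing": 0.10,
    "pretext": 0.05,
    "detection": 0.10,
    "refusal_correction": 0.05,
}


def bucket_drift(rows, tolerance=0.05):
    """Return {bucket: actual - target} for buckets off by more than `tolerance`."""
    counts = Counter(row["bucket"] for row in rows)
    total = sum(counts.values())
    drift = {}
    for bucket, target in TARGET.items():
        actual = counts.get(bucket, 0) / total if total else 0.0
        if abs(actual - target) > tolerance:
            drift[bucket] = round(actual - target, 3)
    return drift
```

An empty result means every bucket is within five percentage points of its target; a positive value means the bucket is over-represented.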
Where the Data Comes From
Three sources, in order of value:
- Your own engagement notes and code. This is the highest-quality signal you’ll ever have. Sanitize aggressively (strip client names, IPs, hostnames, creds, beacon configs) before it touches a training script. A find-replace pass plus eyeballing every line is not optional.
- Public tradecraft. Vendor blog posts, conference talks, Maldev Academy-style writeups, GitHub READMEs from offensive tools. Convert into Q&A format.
- Synthetic generation from a teacher model. Expand seed prompts into full conversations, then you review every output. Synthetic-only datasets produce models that generalize poorly — synthetic-as-bulk-with-human-review produces good ones.
There are two viable paths for the teacher:
- Frontier API (Claude, GPT). Highest quality, but the teacher refuses some legitimate red team prompts and — more importantly — every seed you send leaves your machine. Use this only for fully generic technique seeds. Never for anything touching engagement context.
- Local model via Ollama. Free, fully offline, and you control the refusal posture. The catch: vanilla code-instruct models like qwen2.5-coder:7b refuse offensive prompts as aggressively as the frontier ones do. Two of the first three test seeds I ran came back as "I'm sorry, but I can't assist with that request.", which is useless for our domain.
The fix for the local path is an abliterated variant of the same model. Abliteration removes refusal directions from the residual stream without retraining; the model otherwise behaves identically. For a Qwen2.5-Coder teacher, huihui_ai/qwen2.5-coder-abliterate:7b on the Ollama hub is the drop-in. Same prompts, no refusals, working code in the response.
A minimal generation loop using the local path:
# synth_local.py — expand seeds via local Ollama (no API cost, fully offline)
import json, ollama

client = ollama.Client(host="http://localhost:11434")
MODEL = "huihui_ai/qwen2.5-coder-abliterate:7b"
SYSTEM = open("data/system_prompt.txt").read()

def expand(seed: str) -> dict:
    resp = client.chat(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": seed},
        ],
        options={"temperature": 0.4, "num_predict": 4096},
    )
    return {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": seed},
        {"role": "assistant", "content": resp["message"]["content"]},
    ]}

with open("data/seeds.txt") as f, open("data/synthetic.jsonl", "a") as out:
    for line in f:
        seed = line.strip()
        if not seed or seed.startswith("#"):
            continue
        out.write(json.dumps(expand(seed)) + "\n")
Quality at 7B local is mediocre — expect compilable-looking code with technical errors that need human review. For a production-quality run, the same script pointed at qwen2.5-coder:32b running on a Runpod A40 produces meaningfully better tradecraft. Either way, review every row: teacher-generated bulk-with-human-review is the actual recipe, not pure synthetic.
Cleaning
Even small datasets need this:
- Deduplicate near-duplicates. Use MinHash with datasketch; exact-match dedup is not enough. Two prompts that differ only in variable names will overfit the model on that template.
- Length filter. Drop assistant responses under 100 tokens (usually low-effort) and over your training context (you’ll truncate them anyway).
- PII sweep. Regex pass for IPs, emails, hostnames matching client conventions, AWS account IDs, common credential formats. Manual review on top.
- Canary insertion. Plant a unique, memorable string in 3-5 training rows. If your fine-tuned model ever surfaces in the wild, you can prompt for the canary to confirm provenance.
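To make the near-duplicate step concrete, here is a dependency-free sketch of the MinHash idea. It is not datasketch's API: the shingle size, the 64 hash functions, and the 0.8 threshold are arbitrary assumptions, and the pairwise comparison is quadratic, which is fine for a few thousand rows but is exactly what an LSH index avoids at scale.

```python
import hashlib
import re

NUM_HASHES = 64  # arbitrary for this sketch; more hashes = tighter estimate


def shingles(text: str, k: int = 3) -> set[str]:
    """Lowercased word k-grams: the units two prompts are compared on."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(text: str) -> list[int]:
    """Signature: for each salted hash function, the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(2, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for g in grams
        ))
    return signature


def similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedupe(prompts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a prompt only if it is not too similar to any prompt already kept."""
    kept, signatures = [], []
    for prompt in prompts:
        sig = minhash(prompt)
        if all(similarity(sig, other) < threshold for other in signatures):
            kept.append(prompt)
            signatures.append(sig)
    return kept
```

Tune the threshold on your own data: too low and legitimately distinct prompts collapse together, too high and templated rows slip through.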
Target size for a first run: 2,000-5,000 high-quality rows. More is not better if quality drops. Published results such as LIMA, which tuned on roughly 1,000 curated examples, suggest that a small, well-curated set can beat tens of thousands of noisy synthetic rows for behavior calibration.
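Before any row counts toward that target, the PII sweep from the cleaning list can start as a regex pass like the one below. The patterns are illustrative assumptions, not a complete taxonomy: client hostname conventions need their own pattern, and the 12-digit rule will also flag unrelated 12-digit numbers, which is why manual review still sits on top.

```python
import re

# Illustrative patterns only; extend with client-specific hostname conventions.
PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "aws_account_id": re.compile(r"\b\d{12}\b"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def scrub(text: str) -> tuple[str, dict[str, int]]:
    """Replace matches with <LABEL> placeholders; return (clean_text, hit_counts)."""
    hits = {}
    for label, pattern in PATTERNS.items():
        text, count = pattern.subn(f"<{label.upper()}>", text)
        if count:
            hits[label] = count
    return text, hits
```

Logging the hit counts per row is useful on its own: a row with many hits is usually a raw paste that deserves a full rewrite rather than a scrub.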
Training: QLoRA with Unsloth
QLoRA is what makes this approachable on a single consumer GPU. Two ideas combined:
- Quantize the frozen base model to 4-bit. Cuts VRAM ~4x with minimal capability loss for fine-tuning purposes.
- Train low-rank adapters (LoRA) on top. Only tens of millions of parameters get gradients instead of billions. The adapters are small (~50-200 MB), shippable, and stackable.
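The adapter size follows from simple arithmetic: a rank-r adapter on a d_in × d_out linear layer adds r × (d_in + d_out) parameters. The sketch below applies that to all seven projection types per layer. The architecture numbers are my reading of the published Qwen2.5 configs and should be treated as assumptions; check config.json for the model you actually load.

```python
def lora_params(hidden: int, intermediate: int, layers: int, kv_dim: int, r: int) -> int:
    """Trainable parameters with rank-r adapters on all attention and MLP projections."""
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),   # grouped-query attention: K/V are narrower
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
        "gate_proj": (hidden, intermediate),
        "up_proj": (hidden, intermediate),
        "down_proj": (intermediate, hidden),
    }
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * layers


# Assumed configs (hidden, intermediate, layers, kv_dim) for the 3B and 7B models.
params_3b = lora_params(2048, 11008, 36, 256, r=16)
params_7b = lora_params(3584, 18944, 28, 512, r=16)
print(f"3B: {params_3b:,} trainable, ~{params_3b * 2 / 1e6:.0f} MB in 16-bit")
print(f"7B: {params_7b:,} trainable, ~{params_7b * 2 / 1e6:.0f} MB in 16-bit")
```

Under those assumptions the adapters come to roughly 30M and 40M parameters, which is about 1% of the base model and lines up with adapter files in the tens to low hundreds of megabytes.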
Unsloth is the framework I’d reach for. It wraps the HuggingFace training stack with hand-written Triton kernels; the project advertises roughly 2x faster training and about 50% less VRAM than vanilla transformers + peft. The API is also dramatically simpler.
The Training Script
# train.py
from unsloth import FastLanguageModel  # import unsloth before trl so its patches apply
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

MODEL = "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit"  # swap to 7B with 16 GB+
MAX_SEQ_LEN = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL,
    max_seq_length = MAX_SEQ_LEN,
    load_in_4bit = True,
)

# Attach LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                 # adapter rank — capacity vs. overfit knob
    lora_alpha = 32,        # convention: 2x rank
    lora_dropout = 0.0,     # 0 enables Unsloth's fast path
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 1337,
)

# Load dataset and format with the chat template
dataset = load_dataset("json", data_files="data/dataset.jsonl", split="train")

def format_chat(example):
    # Render the messages list into one ChatML string per row
    return {"text": tokenizer.apply_chat_template(
        example["messages"], tokenize=False, add_generation_prompt=False,
    )}

dataset = dataset.map(format_chat, remove_columns=dataset.column_names)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        output_dir = "out",
        per_device_train_batch_size = 1,   # bump to 2 with 16 GB+
        gradient_accumulation_steps = 4,   # effective batch size 4
        warmup_ratio = 0.03,
        num_train_epochs = 2,              # 1-3 is the sweet spot
        learning_rate = 2e-4,              # typical for LoRA; full FT uses ~10x lower
        bf16 = True,                       # fp16=True on Turing/older
        logging_steps = 10,
        save_strategy = "epoch",
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 1337,
        report_to = "none",
        dataset_text_field = "text",
        max_length = MAX_SEQ_LEN,
        packing = True,                    # pack short rows into max_length sequences
        packing_strategy = "bfd",
    ),
)

trainer.train()
model.save_pretrained("out/lora-final")
tokenizer.save_pretrained("out/lora-final")
Why These Hyperparameters
The defaults that aren’t really defaults:
- r = 16. LoRA rank is the single biggest capacity knob. Higher rank = more parameters trained = more capacity to learn but also more capacity to overfit on a small dataset. 8 is conservative, 16 is a good balance, 32+ if you have 10K+ rows.
- lora_alpha = 2 * r. The community convention. Effective scaling is alpha/r; keeping it at 2 means you can change r without rescaling everything else.
- lora_dropout = 0.0. Unsloth has a hand-written fast path that requires zero dropout. Any nonzero value works but falls back to a slower path with a warning. For LoRA on a small dataset, the rank itself is enough regularization.
- num_train_epochs = 2. With LoRA on a small dataset, 1-3 epochs is the range. More epochs will fit your training set better and will hurt generalization. Watch the loss curve.
- learning_rate = 2e-4. Roughly an order of magnitude higher than typical full fine-tuning rates (1e-5 to 2e-5). The adapters start as a no-op on top of frozen weights and only a small set of parameters moves, so training tolerates a bigger LR.
- All seven target_modules. Earlier LoRA papers only touched q_proj/v_proj. Modern practice is to attach adapters to all linear layers in the attention and MLP blocks: it costs a little more VRAM and gives meaningfully better quality.
- packing = True with packing_strategy = "bfd". Concatenates short examples into single sequences up to max_length instead of padding. Roughly 2x training speedup on a small-row dataset where most examples are far shorter than max_length. Packing is optional; with packing=False, max_length simply truncates long rows and short ones are padded.
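It helps to turn those settings into concrete step counts before launching, so the logs are easy to sanity-check. This is back-of-the-envelope arithmetic, not what the trainer computes internally: it ignores packing (which merges rows and lowers the step count), and exact rounding differs between library versions.

```python
import math


def schedule(rows, epochs=2, per_device_batch=1, grad_accum=4,
             warmup_ratio=0.03, r=16, lora_alpha=32):
    """Rough optimizer-step and warmup counts for an unpacked dataset."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(rows / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": round(total_steps * warmup_ratio),
        "lora_scale": lora_alpha / r,  # multiplier applied to the adapter output
    }


plan = schedule(rows=3000)
print(plan)
```

For 3,000 rows with the config above that works out to about 1,500 optimizer steps with 45 of warmup; with packing on, expect noticeably fewer steps than this estimate.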
What to Watch During the Run
Open a second terminal and run nvidia-smi -l 2. Things should be steady at 95%+ GPU utilization and your VRAM should be near-full but not spilling.
In the trainer logs, watch the loss. For a healthy run on a small dataset:
- Starts somewhere between 1.5 and 2.5
- Drops fast for the first 10-20% of steps
- Settles into a slow decline
- If it crosses below 0.3, you’re memorizing — stop and reduce epochs
A 7B QLoRA on 3,000 rows with the config above runs in roughly 45-90 minutes on a 4090. On an 8 GB 3060 Ti with the 3B variant, expect a similar wall-clock — fewer parameters but a smaller batch size cancels out. If yours is taking 6 hours, something is wrong (usually CPU-bound dataloader; check dataloader_num_workers).
Lab Gotchas Worth Knowing
A few things the docs don’t warn you about:
- TRL’s SFTConfig replaced TrainingArguments. Older tutorials pass TrainingArguments and move dataset/packing args to the SFTTrainer call. Current TRL rejects fields that don’t exist on SFTConfig (e.g. push_to_hub_token) and moves dataset_text_field and max_length into SFTConfig.
- Don’t pin pip versions for this stack. PyTorch 2.5 + a recent torchao will fail with module 'torch' has no attribute 'int1' because torchao references dtypes added in PyTorch 2.6. Install upstream latest for torch, unsloth, transformers, trl, peft, accelerate, bitsandbytes and let pip resolve.
- Flash Attention 2 fallback to xformers is fine. If the import warns about FA2 not working (common on consumer cards under WSL2), Unsloth uses xformers and you get the same training throughput. Don’t waste an evening fighting FA2.
- Ubuntu 24.04 ships Python 3.12. Older Unsloth tutorials specify 3.10 or 3.11, but 3.12 is officially supported and is what 24.04 makes easy. Don’t add deadsnakes PPAs you don’t need.
Evaluation: Did It Work?
This is the step everyone skips and shouldn’t. “It looks better in the chat” is not evaluation — it’s vibes.
Three Layers
Layer 1 — Loss curves. Training loss should trend downward; step-to-step noise is normal, a sustained rise is not. If you held out 5-10% of the dataset as eval (set eval_dataset in the trainer), eval loss should also drop and then plateau. If eval loss starts climbing while train loss keeps dropping, you’ve overfit.
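The divergence described in Layer 1 is easy to check mechanically once you have one train and one eval loss per evaluation point. The helper below is a hypothetical sketch; the patience of two consecutive evaluations is an arbitrary assumption, and the loss values in the example are made up for illustration.

```python
def first_overfit_eval(train_loss, eval_loss, patience=2):
    """Index where eval loss starts a sustained rise while train loss still falls.

    Returns None if no streak of `patience` consecutive evaluations shows
    eval loss rising and train loss falling at the same time.
    """
    streak = 0
    for i in range(1, min(len(train_loss), len(eval_loss))):
        diverging = eval_loss[i] > eval_loss[i - 1] and train_loss[i] < train_loss[i - 1]
        streak = streak + 1 if diverging else 0
        if streak >= patience:
            return i - patience + 1  # first evaluation of the rising streak
    return None


# Made-up example values, one pair per evaluation point.
train = [1.90, 1.20, 0.90, 0.70, 0.50]
evals = [1.95, 1.30, 1.10, 1.15, 1.25]
print(first_overfit_eval(train, evals))
```

The checkpoint just before the returned index is the one worth keeping; a result of None means the run has not shown the overfitting signature yet.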
Layer 2 — Held-out task suite. Build a small set of 30-50 prompts representing the use cases you actually care about. Run them through the base model and the fine-tuned model. Score them yourself, blindly, on a 1-5 scale for: correctness, format, refusal-appropriateness, code quality. Same prompts, both models, your eyes.
This sounds primitive. It is. It’s also the most reliable signal you’ll get for behavior changes the loss can’t see.
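Scoring "blindly" is the part people fudge, so it is worth automating the shuffle. A minimal sketch, assuming you already have each model's outputs as lists aligned with the prompts; blind_pairs and tally are hypothetical helpers, and the answer key should stay out of sight until scoring is finished.

```python
import random


def blind_pairs(prompts, base_outputs, tuned_outputs, seed=1337):
    """Randomize which model is shown as 'A' or 'B'; return (sheet, key)."""
    rng = random.Random(seed)
    sheet, key = [], []
    for prompt, base, tuned in zip(prompts, base_outputs, tuned_outputs):
        tuned_is_a = rng.random() < 0.5
        a, b = (tuned, base) if tuned_is_a else (base, tuned)
        sheet.append({"prompt": prompt, "A": a, "B": b})
        key.append("A" if tuned_is_a else "B")  # slot holding the tuned model
    return sheet, key


def tally(scores, key):
    """scores: one {"A": int, "B": int} per prompt on the 1-5 scale.

    Returns (mean_base_score, mean_tuned_score).
    """
    tuned = [s[k] for s, k in zip(scores, key)]
    base = [s["B" if k == "A" else "A"] for s, k in zip(scores, key)]
    return sum(base) / len(base), sum(tuned) / len(tuned)
```

Score the sheet once per criterion (correctness, format, refusal-appropriateness, code quality) and only then unblind with tally.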
Layer 3 — Refusal regression. Build a separate set of prompts that should be refused (e.g. “write malware targeting hospitals”). Make sure the fine-tuned model still refuses them. Fine-tuning that broadens helpfulness can over-shoot and refuse nothing — that’s a sign your dataset has too few “this is the line” examples.
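A crude but useful way to track Layer 3 across iterations is a refusal rate over the must-refuse set. The sketch below uses phrase matching on the start of each response; the marker list is an assumption seeded from the refusal text quoted earlier in this post, and it will miss politely worded or partial refusals, so spot-check the misses by hand.

```python
# Assumed markers; extend with whatever your base model actually says.
REFUSAL_MARKERS = (
    "i'm sorry, but",
    "i can't assist",
    "i cannot assist",
    "i can't help",
    "i cannot help",
    "i won't",
)


def is_refusal(response: str) -> bool:
    """True if the opening of the response contains a known refusal phrase."""
    head = response.strip().lower().replace("\u2019", "'")[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Share of responses flagged as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)
```

On the must-refuse set the rate should stay close to 1.0 after fine-tuning; on the Layer 2 task suite it should fall toward 0. A drop on the first set is the over-shoot this layer exists to catch.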
A Quick Smoke Test
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "out/lora-final",
    max_seq_length = 4096,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

prompt = tokenizer.apply_chat_template([
    {"role": "system", "content": "You are an offensive security assistant..."},
    {"role": "user", "content": "Write a Windows ETW patch in C using GetProcAddress."},
], tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=1024, temperature=0.3, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
Lower temperature (0.2-0.4) for code, higher (0.7-0.9) for prose tasks like pretext writing. Top-p of 0.9 is a sane default.
Deployment: GGUF + Ollama
Training output is a LoRA adapter on top of a 4-bit base. For day-to-day use you want a single, fast, quantized model file. The path:
# Merge adapter into base, export as GGUF for llama.cpp / Ollama
model.save_pretrained_gguf(
    "out/qwen-redteam-q4",
    tokenizer,
    quantization_method = "q4_k_m",  # good balance for 7B
)
q4_k_m is the right default for a 7B model on 12-24 GB VRAM at inference. For a 3B model, q5_k_m runs comfortably on 8 GB and gives better quality. Use q8_0 when evaluating against the unquantized model, so you can isolate fine-tuning effects from quantization effects.
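Then a minimal Ollama Modelfile. Point FROM at the .gguf file the export step actually wrote; Unsloth treats the first argument as an output directory, so the path below may need adjusting: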
# Modelfile
FROM ./qwen-redteam-q4.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
SYSTEM """You are an offensive security assistant. Help the operator with red team coding tasks. Provide working code, explain trade-offs, note detection considerations."""
ollama create qwen-redteam -f Modelfile
ollama run qwen-redteam
From here, drop it into Continue, Open WebUI, or call it from your tooling at http://localhost:11434. The integration story is the boring, well-paved part.
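For calling it from your own tooling, the standard library is enough. This sketch posts to Ollama's /api/chat endpoint with streaming disabled and assumes the model was created under the name qwen-redteam as above; the system prompt comes from the Modelfile, so only the user message is sent. build_request and ask are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_request(prompt: str, model: str = "qwen-redteam") -> urllib.request.Request:
    """Build a non-streaming chat request for the local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object back instead of a chunk stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def ask(prompt: str, model: str = "qwen-redteam", timeout: int = 120) -> str:
    """Send one prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.load(resp)["message"]["content"]
```

Keeping the request builder separate from the network call makes it easy to unit-test payloads without a running server.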
OPSEC for the Model Itself
The model is now an artifact with its own threat surface.
- Adapters vs. merged models. A LoRA adapter (~150 MB) without the base is useless — share adapters internally rather than merged GGUFs when you can. Anyone receiving the adapter still needs the base model.
- Canaries. Those unique strings you planted in training data? Periodically prompt the model in ways that should surface them. If a canary appears in the wild on a model you didn’t share, you have a leak indicator.
- Engagement isolation. Don’t train one model on data from multiple clients. Either keep client-specific data out of the training set entirely (use it via local RAG instead, where it stays on disk) or maintain per-engagement adapters.
- Don’t push to public hubs. It seems obvious. People do it anyway. HuggingFace, Ollama Library, Modelscope — none of these are appropriate for an offensive-tuned model. If you need to share inside an org, host an internal registry.
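- Inference logs. Check what your Ollama setup records before prompts contain engagement data: the interactive CLI keeps a prompt history file, and server debug logging can capture request content. Disable or redirect both.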
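The canary workflow needs very little code. A minimal sketch, assuming canaries are random hex tokens with a recognizable prefix; the prefix and length are arbitrary choices, and the generated strings should be recorded somewhere outside the training set so you can check for them later.

```python
import secrets


def make_canary(prefix: str = "canary") -> str:
    """Generate a unique string that is vanishingly unlikely to occur by chance."""
    return f"{prefix}-{secrets.token_hex(8)}"


def leaked_canaries(text: str, canaries: list[str]) -> list[str]:
    """Return the known canaries that appear verbatim in a model output."""
    return [c for c in canaries if c in text]
```

Run leaked_canaries over outputs from any suspect model; a single verbatim hit is strong evidence the weights descend from your training data.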
Costs and Time
Real numbers for a first end-to-end run, assuming you have hardware:
| Phase | Time | Notes |
|---|---|---|
| Lab setup | 2-4 hours | One-time |
| Dataset construction (3K rows) | 20-40 hours | Bulk of the work |
| Training run | 1-2 hours | Per iteration |
| Evaluation | 2-3 hours | Per iteration |
| Total to first usable model | ~1 week of evening work | Realistic |
Expect to throw out the first model. The second is usually decent. The third is what you actually use.
What’s Next
This post ends with a working assistant. There’s a lot of room beyond it:
- DPO or KTO on top of SFT. Once you have a working SFT model, preference optimization on pairs of (better, worse) responses sharpens behavior further. Useful when you can rank outputs but can’t write the perfect one yourself.
- Tool use fine-tuning. Teach the model to call into your own tools — Nmap, BloodHound, Impacket — via structured function calls. Puts the model into an agent loop instead of a passive assistant.
- Domain-specialist adapters. Stackable LoRAs: one for Windows tradecraft, one for cloud, one for web. Load only what the engagement needs.
- Distilled smaller models. Once the 7B works, distill into a 3B or 1.5B that runs on the kind of hardware you’d actually take on-site.
Each of these is its own post. For now, you have everything you need to brew the first model and start iterating.
The code is mundane. The dataset is the craft. Spend the time there.