Brewing Your Own Offensive Coding Assistant: Fine-Tuning LLMs for Red Team Work

Most red teamers I know use ChatGPT or Claude as a force multiplier — payload scaffolding, recon parsing, quick code transforms, writing pretext copy for a phish. It works, until it doesn’t. Frontier models refuse plenty of legitimate engagement work, the refusals are inconsistent across releases, and — more importantly — sending engagement-specific data to a third-party API is a non-starter for most serious work. Client target lists, internal tooling, captured creds, recovered C2 logs: none of it should leave your jump host.

The answer isn’t a jailbreak prompt. It’s a small, local, fine-tuned model that already knows how to help, runs on your own hardware, and never phones home. This post walks through doing exactly that: fine-tuning a small open-weight model (3B on my 8 GB card, 7B if you have the VRAM) with QLoRA, on commodity hardware, against a dataset shaped for offensive coding tasks.

I’ll be honest up front: I’m learning this alongside you. The choices here are the ones I’d defend after reading the literature and running a few iterations, but fine-tuning is a craft and there are a lot of viable answers. Where I made a judgment call, I’ll say so.

Why Fine-Tune (and Why Not)

There are four common ways to make an LLM more useful for a specific domain. They stack — you can do all of them — but they solve different problems.

Approach	What it does	Cost	When to reach for it
Prompt engineering	Steer behavior with instructions and examples in-context	Free	First move. Always.
RAG	Inject retrieved context (docs, notes, code) at inference time	Low	The model lacks facts you have on disk
Fine-tuning	Update model weights on examples of desired behavior	Medium	The model lacks patterns — output format, tone, refusal posture, domain idioms
Abliteration	Surgically remove refusal directions from a model’s residual stream	Low	You only need to neutralize refusals; you don’t need the model to be better at the task

For an offensive coding assistant, fine-tuning is the right primary tool. Prompt engineering doesn’t fix consistent over-refusal. RAG doesn’t teach a model to write a particular flavor of evasion code. Abliteration removes guardrails but doesn’t add tradecraft — you still get a generalist that’s now willing to try, not a specialist that’s actually good.

The honest trade-off: fine-tuning is the most expensive option in time and the easiest to do badly. Bad data poisons the model harder than good data improves it. Plan for the dataset to be 80% of the work.

Threat Model and Scope

Before any technical choice, write down what the model is for and what it’s not for. Mine, for this build:

For: A local coding and tradecraft assistant. Generates payload skeletons, transforms code (e.g. C → indirect-syscall variant), summarizes recon output, drafts phishing copy, parses BloodHound paths, explains CVEs, writes detection rules from an attacker’s perspective. Runs on the operator’s workstation. Never reads from or writes to anything outside the box during inference.

Not for: Autonomous operation. Decision-making on live targets. Anything where a hallucination has consequences worse than a wasted minute.

Scope of authorization: Same as the rest of the toolkit. The model is a tool used inside engagements with written authorization. The training data and the model itself stay on operator-controlled hardware.

Writing this down isn’t ceremony — it bounds the dataset. If the model isn’t for autonomous decision-making, you don’t need agentic chain-of-thought traces in the training data. If it’s for code, code is what you train on.

Choosing the Base Model

Three properties matter for an offensive coding assistant:

Open weights. No API gatekeeping, no terms-of-service surprises, runs offline.
Code-pretrained. Generalist models can write code; code-specialist models write better code with less data.
Right-sized for your VRAM. Bigger isn’t better if you can’t iterate.

As of writing, the strong picks across the VRAM tiers:

VRAM	Recommended student model	Notes
24+ GB (3090, 4090, A5000)	Qwen2.5-Coder-7B-Instruct	Comfortable headroom; can also try 14B
16 GB (4080, 4070 Ti Super)	Qwen2.5-Coder-7B-Instruct	Tight but works at batch=1
12 GB (3060 12GB, 4070)	Qwen2.5-Coder-3B-Instruct	Comfortable; 7B is painful here
8 GB (3060 Ti, 4060)	Qwen2.5-Coder-3B-Instruct	Bring batch=1, max_seq_len=1024

The “Instruct” variant matters in all cases — we’re fine-tuning on top of an existing chat-tuned model, not the base. Starting from the instruct variant means the model already knows how to follow ChatML conversations and we’re nudging behavior, not teaching it from scratch. This is faster and produces better results than fine-tuning the base model unless you have hundreds of thousands of high-quality examples.

I’m running this on an 8 GB card so the rest of the post uses Qwen2.5-Coder-3B-Instruct as the student. Swap to the 7B in the snippets below if you have the VRAM — none of the surrounding code changes.

If 8 GB is too tight even for 3B, Runpod is the escape hatch — an A40 (48 GB) is roughly $0.40/hr and lets you train the 7B comfortably. The same scripts work; you just point them at the cloud GPU.

Dataset: The 80% That Matters

The dataset defines the model. If you remember one thing from this post, remember that.

Format

We’re storing conversations in the chat “messages” format. The tokenizer’s chat template renders it into ChatML, the prompt format Qwen’s instruct models expect:

{"messages":[
  {"role":"system","content":"You are an offensive security assistant. Help the operator with red team coding tasks. Provide working code, explain trade-offs, note detection considerations."},
  {"role":"user","content":"Write a Windows shellcode loader using indirect syscalls for NtAllocateVirtualMemory and NtProtectVirtualMemory. Use Hell's Gate for SSN resolution."},
  {"role":"assistant","content":"Here's a minimal indirect syscall loader using Hell's Gate for SSN resolution...\n\n```c\n// SSN resolution via Hell's Gate\n...\n```\n\n**Detection considerations:**\n- Indirect syscalls bypass user-mode hooks but ETW Threat Intelligence still sees..."}
]}

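One conversation per line. JSONL. (The example above is wrapped for readability; in the file each conversation sits on a single line.) The system prompt should be the same one you’ll use at inference time, because the model learns to expect it.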
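A malformed row tends to fail late, in the middle of a training run, so it is worth checking the file up front. Here is a minimal sketch of such a check, assuming every row is a single-turn system/user/assistant conversation like the one above. `validate_row` is a hypothetical helper name, not part of any library.

```python
import json

# Assumption: single-turn rows in exactly this role order, as in the example above.
EXPECTED_ROLES = ("system", "user", "assistant")

def validate_row(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = row.get("messages") if isinstance(row, dict) else None
    if not isinstance(messages, list):
        return ["missing 'messages' list"]
    problems = []
    roles = tuple(m.get("role") for m in messages if isinstance(m, dict))
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected role order: {roles}")
    for m in messages:
        # Flag non-dict entries and messages whose content is empty or whitespace.
        if not isinstance(m, dict) or not str(m.get("content", "")).strip():
            problems.append("empty or malformed message")
    return problems
```

Run it over every line of the dataset and refuse to train until the problem list is empty for all rows.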

What Goes In

For an offensive coding assistant I’d target a dataset roughly distributed like this:

Bucket	Share	Example
Code generation (offensive)	35%	“Write a [technique] in [language] targeting [platform]”
Code transformation	20%	“Convert this loader to use direct syscalls”
Tradecraft Q&A	15%	“Explain when to use APC injection vs. early bird”
Recon/output parsing	10%	“Summarize this BloodHound JSON into priority paths”
Pretext / social engineering writing	5%	“Draft a phishing pretext for a finance team targeting AP”
Detection/defense (attacker perspective)	10%	“What logs does technique X generate?”
Refusal-correction examples	5%	Cases where the base model refused but shouldn’t have

The last bucket is the surgical one. Take prompts where Qwen2.5-Coder refuses unhelpfully, write the helpful response yourself, and include them. A few hundred of these go a long way toward calibrating refusal posture without making the model amoral.
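To turn the shares in the table into concrete row targets for a given dataset size, a few lines of arithmetic are enough. This is only a sketch: the bucket keys are hypothetical labels of my own, and the percentages are copied from the table.

```python
# Percent share per bucket, copied from the table; key names are illustrative.
SHARES = {
    "code_generation": 35,
    "code_transformation": 20,
    "tradecraft_qa": 15,
    "recon_parsing": 10,
    "pretext_writing": 5,
    "detection_defense": 10,
    "refusal_correction": 5,
}

def bucket_targets(total_rows: int) -> dict[str, int]:
    """Split total_rows across buckets by percentage share."""
    assert sum(SHARES.values()) == 100
    targets = {name: total_rows * pct // 100 for name, pct in SHARES.items()}
    # Give any rounding remainder to the largest bucket so the counts sum exactly.
    targets["code_generation"] += total_rows - sum(targets.values())
    return targets

print(bucket_targets(3000))
```

For a 3,000-row dataset this puts the refusal-correction bucket at 150 rows, which is a manageable number to write by hand.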

Where the Data Comes From

Three sources, in order of value:

Your own engagement notes and code. This is the highest-quality signal you’ll ever have. Sanitize aggressively (strip client names, IPs, hostnames, creds, beacon configs) before it touches a training script. A find-replace pass plus eyeballing every line is not optional.
Public tradecraft. Vendor blog posts, conference talks, Maldev Academy-style writeups, GitHub READMEs from offensive tools. Convert into Q&A format.
Synthetic generation from a teacher model. Expand seed prompts into full conversations, then you review every output. Synthetic-only datasets produce models that generalize poorly — synthetic-as-bulk-with-human-review produces good ones.

There are two viable paths for the teacher:

Frontier API (Claude, GPT). Highest quality, but the teacher refuses some legitimate red team prompts and — more importantly — every seed you send leaves your machine. Use this only for fully generic technique seeds. Never for anything touching engagement context.
Local model via Ollama. Free, fully offline, and you control the refusal posture. The catch: vanilla code-instruct models like qwen2.5-coder:7b refuse offensive prompts as aggressively as the frontier ones do. Two of the first three test seeds I ran came back as "I'm sorry, but I can't assist with that request." — useless for our domain.

The fix for the local path is an abliterated variant of the same model. Abliteration removes refusal directions from the residual stream without retraining; the model otherwise behaves identically. For a Qwen2.5-Coder teacher, huihui_ai/qwen2.5-coder-abliterate:7b on the Ollama hub is the drop-in. Same prompts, no refusals, working code in the response.

A minimal generation loop using the local path:

# synth_local.py — expand seeds via local Ollama (no API cost, fully offline)
import json, ollama

client = ollama.Client(host="http://localhost:11434")
MODEL = "huihui_ai/qwen2.5-coder-abliterate:7b"
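# Strip the trailing newline so the stored prompt matches the one used at inference time
SYSTEM = open("data/system_prompt.txt", encoding="utf-8").read().strip()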

def expand(seed: str) -> dict:
    resp = client.chat(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": seed},
        ],
        options={"temperature": 0.4, "num_predict": 4096},
    )
    return {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": seed},
        {"role": "assistant", "content": resp["message"]["content"]},
    ]}

with open("data/seeds.txt") as f, open("data/synthetic.jsonl", "a") as out:
    for line in f:
        seed = line.strip()
        if not seed or seed.startswith("#"):
            continue
        out.write(json.dumps(expand(seed)) + "\n")

Quality at 7B local is mediocre — expect compilable-looking code with technical errors that need human review. For a production-quality run, the same script pointed at qwen2.5-coder:32b running on a Runpod A40 produces meaningfully better tradecraft. Either way, review every row: teacher-generated bulk-with-human-review is the actual recipe, not pure synthetic.

Cleaning

Even small datasets need this:

Deduplicate near-duplicates. Use MinHash with datasketch — exact-match dedup is not enough. Two prompts that differ only in variable names will overfit the model on that template.
Length filter. Drop assistant responses under 100 tokens (usually low-effort) and over your training context (you’ll truncate them anyway).
PII sweep. Regex pass for IPs, emails, hostnames matching client conventions, AWS account IDs, common credential formats. Manual review on top.
Canary insertion. Plant a unique, memorable string in 3-5 training rows. If your fine-tuned model ever surfaces in the wild, you can prompt for the canary to confirm provenance.
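The near-duplicate step is the one people skip. For illustration, here is a standard-library stand-in that computes exact Jaccard similarity over word shingles. It is not MinHash and not the datasketch API: MinHash approximates this same similarity so that it scales. At a few thousand rows the quadratic exact version is still workable. The 0.8 threshold and the function names are my assumptions.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Lower-cased word k-grams used as the unit of comparison."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union."""
    return len(a & b) / len(a | b) if a | b else 1.0

def drop_near_duplicates(prompts: list[str], threshold: float = 0.8) -> list[int]:
    """Return indices of prompts to keep; later near-duplicates are dropped."""
    kept: list[int] = []
    kept_shingles: list[set] = []
    for i, prompt in enumerate(prompts):
        s = shingles(prompt)
        # Keep the prompt only if it is sufficiently different from everything kept so far.
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(i)
            kept_shingles.append(s)
    return kept
```

Word shingles will not catch two code samples that differ only in identifier names. For that case, normalizing identifiers before shingling is one option.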

Target size for a first run: 2,000-5,000 high-quality rows. More is not better if quality drops. Published work such as LIMA points the same way: on the order of 1K well-curated examples can beat 50K rows of noisy synthetic data for behavior calibration.
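For the PII sweep in the cleaning list, a regex pass can look roughly like the sketch below. The patterns are illustrative starting points I picked, not a complete ruleset. `scrub` and the placeholder labels are hypothetical names. The hit counts are there to tell you which rows deserve a closer manual look.

```python
import re
from typing import Iterable

# Assumption: illustrative patterns only; extend them for your own data.
PATTERNS = {
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "AWS_ACCOUNT": re.compile(r"\b\d{12}\b"),
}

def scrub(text: str, client_terms: Iterable[str] = ()) -> tuple[str, dict[str, int]]:
    """Replace matches with placeholders; return the scrubbed text and hit counts."""
    hits: dict[str, int] = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"<{label}>", text)
        hits[label] = n
    # Client names or hostname prefixes supplied by the operator, matched case-insensitively.
    for term in client_terms:
        text, n = re.subn(re.escape(term), "<CLIENT>", text, flags=re.IGNORECASE)
        hits["CLIENT"] = hits.get("CLIENT", 0) + n
    return text, hits
```

A clean regex pass only means the obvious patterns are gone. It does not replace reading every row.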

Training: QLoRA with Unsloth

QLoRA is what makes this approachable on a single consumer GPU. Two ideas combined:

Quantize the frozen base model to 4-bit. Cuts VRAM ~4x with minimal capability loss for fine-tuning purposes.
Train low-rank adapters (LoRA) on top. Only tens of millions of parameters get gradients instead of billions. The adapters are small (~50-200 MB), shippable, and stackable.
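To see how small the trainable part is, count it. Each adapted weight of shape (d_in, d_out) adds r * (d_in + d_out) LoRA parameters. The sketch below applies that to the seven projections in each transformer layer. The model dimensions are assumptions taken from the published Qwen2.5 configs as I understand them, so check them against the model's config.json.

```python
def lora_params(r: int, layers: int, hidden: int, kv_dim: int, mlp: int) -> int:
    """Trainable LoRA parameters when adapting all seven linear projections."""
    shapes = [
        (hidden, hidden),   # q_proj
        (hidden, kv_dim),   # k_proj
        (hidden, kv_dim),   # v_proj
        (hidden, hidden),   # o_proj
        (hidden, mlp),      # gate_proj
        (hidden, mlp),      # up_proj
        (mlp, hidden),      # down_proj
    ]
    # Each adapter is two low-rank matrices: (d_in x r) and (r x d_out).
    return layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Assumed dimensions for the 3B and 7B models; verify against config.json.
print(lora_params(r=16, layers=36, hidden=2048, kv_dim=256, mlp=11008))
print(lora_params(r=16, layers=28, hidden=3584, kv_dim=512, mlp=18944))
```

Under those assumptions, rank 16 gives roughly 30M trainable parameters on the 3B and roughly 40M on the 7B, about 1% or less of each model's weights.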

Unsloth is the framework I’d reach for. It wraps the HuggingFace training stack (transformers, peft, trl) with hand-written Triton kernels. In practice that typically means around 2x faster training and roughly half the VRAM of vanilla transformers + peft, though the gain depends on model and settings. The API is also dramatically simpler.

The Training Script

# train.py
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

MODEL = "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit"   # swap to 7B with 16 GB+
MAX_SEQ_LEN = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL,
    max_seq_length = MAX_SEQ_LEN,
    load_in_4bit = True,
)

# Attach LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                       # adapter rank — capacity vs. overfit knob
    lora_alpha = 32,              # convention: 2x rank
    lora_dropout = 0.0,           # 0 enables Unsloth's fast path
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 1337,
)

# Load dataset and format with the chat template
dataset = load_dataset("json", data_files="data/dataset.jsonl", split="train")

def format_chat(example):
    return {"text": tokenizer.apply_chat_template(
        example["messages"], tokenize=False, add_generation_prompt=False,
    )}

dataset = dataset.map(format_chat, remove_columns=dataset.column_names)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        output_dir = "out",
        per_device_train_batch_size = 1,        # bump to 2 with 16 GB+
        gradient_accumulation_steps = 4,        # effective batch size 4
        warmup_ratio = 0.03,
        num_train_epochs = 2,                   # 1-3 is the sweet spot
        learning_rate = 2e-4,                   # typical for LoRA; full FT uses ~1e-5 to 2e-5
        bf16 = True,                            # fp16=True on Turing/older
        logging_steps = 10,
        save_strategy = "epoch",
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 1337,
        report_to = "none",
        dataset_text_field = "text",
        max_length = MAX_SEQ_LEN,
        packing = True,                         # optional: pack short rows up to max_length
        packing_strategy = "bfd",
    ),
)

trainer.train()
model.save_pretrained("out/lora-final")
tokenizer.save_pretrained("out/lora-final")

Why These Hyperparameters

The defaults that aren’t really defaults:

r = 16. LoRA rank is the single biggest capacity knob. Higher rank = more parameters trained = more capacity to learn but also more capacity to overfit on a small dataset. 8 is conservative, 16 is a good balance, 32+ if you have 10K+ rows.
lora_alpha = 2 * r. The community convention. Effective scaling is alpha/r — keeping it at 2 means you can change r without rescaling everything else.
lora_dropout = 0.0. Unsloth has a hand-written fast path that requires zero dropout. Any nonzero value works but falls back to a slower path with a warning. For LoRA on a small dataset, the rank itself is enough regularization.
num_train_epochs = 2. With LoRA on a small dataset, 1-3 epochs is the range. More epochs will fit your training set better and will hurt generalization. Watch the loss curve.
learning_rate = 2e-4. Roughly an order of magnitude higher than typical full fine-tuning rates (around 1e-5 to 2e-5). LoRA trains only a small set of freshly initialized adapter weights while the base stays frozen, so it tolerates, and usually needs, a larger LR.
All seven target_modules. The original LoRA experiments adapted only the attention query and value projections (q_proj/v_proj). Since the QLoRA paper, common practice is to attach adapters to all linear layers in the attention and MLP blocks. It costs a little more VRAM and gives meaningfully better quality.
packing = True with packing_strategy = "bfd". Concatenates short examples into single sequences up to max_length instead of padding each one. This can roughly double training speed on a small-row dataset where most examples are far shorter than max_length. As far as I can tell from the TRL docs, packing is optional: with packing = False, max_length simply truncates longer rows, and max_length=None disables truncation.
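Batch size, gradient accumulation, epochs and warmup interact, so it helps to work out the step counts before launching. The sketch below is approximate arithmetic for an unpacked, single-GPU run. `schedule` is a hypothetical helper. With packing enabled the number of sequences, and so the step count, will be lower.

```python
import math

def schedule(rows: int, per_device_batch: int, grad_accum: int,
             epochs: int, warmup_ratio: float) -> dict[str, int]:
    """Approximate optimizer-step arithmetic for an unpacked, single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(rows / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }

# The config above on a 3,000-row dataset.
print(schedule(rows=3000, per_device_batch=1, grad_accum=4, epochs=2, warmup_ratio=0.03))
```

With logging_steps = 10 and about 1,500 optimizer steps, that is around 150 loss points to read the curve from.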

What to Watch During the Run

Open a second terminal and run nvidia-smi -l 2. Things should be steady at 95%+ GPU utilization and your VRAM should be near-full but not spilling.

In the trainer logs, watch the loss. For a healthy run on a small dataset:

Starts somewhere between 1.5 and 2.5
Drops fast for the first 10-20% of steps
Settles into a slow decline
If it crosses below 0.3, you’re memorizing — stop and reduce epochs
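These checks are easy to automate once the run has written some logs. The Hugging Face Trainer keeps its logged metrics in a `log_history` list, which is also saved as trainer_state.json inside checkpoint folders. The sketch below assumes entries shaped like {"step": ..., "loss": ...}. `check_losses` is a hypothetical helper, and the 0.3 floor is the rule of thumb from the list above.

```python
def check_losses(log_history: list[dict], floor: float = 0.3) -> dict:
    """Summarize training-loss entries shaped like the Trainer's log_history."""
    # Eval and summary entries have no "loss" key, so they are skipped here.
    points = [(e["step"], e["loss"]) for e in log_history if "loss" in e]
    if not points:
        return {"first": None, "last": None, "below_floor_at": None}
    # First step at which the loss dipped under the memorization floor, if any.
    below = next((step for step, loss in points if loss < floor), None)
    return {"first": points[0][1], "last": points[-1][1], "below_floor_at": below}
```

A non-None `below_floor_at` is a cue to look at the run, not proof of memorization.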

A 7B QLoRA on 3,000 rows with the config above runs in roughly 45-90 minutes on a 4090. On an 8 GB 3060 Ti with the 3B variant, expect a similar wall-clock time: the smaller model is offset by the slower card and the smaller batch size. If yours is taking 6 hours, something is wrong (usually a CPU-bound dataloader; check dataloader_num_workers).

Lab gotchas worth knowing

A few things the docs don’t warn you about:

SFT-specific options now live in TRL’s SFTConfig. Older tutorials pass a plain TrainingArguments and hand the dataset and packing arguments directly to SFTTrainer. Current TRL expects dataset_text_field, max_length (formerly max_seq_length) and packing on SFTConfig, and rejects fields that don’t exist there (e.g. push_to_hub_token).
Don’t pin pip versions for this stack. PyTorch 2.5 + a recent torchao will fail with module 'torch' has no attribute 'int1' because torchao references dtypes added in PyTorch 2.6. Install upstream latest for torch, unsloth, transformers, trl, peft, accelerate, bitsandbytes and let pip resolve.
Flash Attention 2 fallback to xformers is fine. If the import warns about FA2 not working (common on consumer cards under WSL2), Unsloth uses xformers and you get the same training throughput. Don’t waste an evening fighting FA2.
Ubuntu 24.04 ships Python 3.12. Older Unsloth tutorials specify 3.10 or 3.11, but 3.12 is officially supported and is what 24.04 makes easy. Don’t add deadsnakes PPAs you don’t need.

Evaluation: Did It Work?

This is the step everyone skips and shouldn’t. “It looks better in the chat” is not evaluation — it’s vibes.

Three Layers

Layer 1 — Loss curves. Training loss should trend downward; step-to-step noise is normal, especially at batch size 1. If you held out 5-10% of the dataset as eval (pass eval_dataset to the trainer and set an eval_strategy so evaluation actually runs), eval loss should also drop and then plateau. If eval loss starts climbing while train loss keeps dropping, you’ve overfit.
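Holding out the eval slice is a one-time, deterministic shuffle. Here is a standard-library sketch over rows already loaded into a list. The function name and the 10% default are my choices. With Hugging Face datasets, `Dataset.train_test_split` does the same job.

```python
import random

def split_rows(rows: list, eval_fraction: float = 0.1, seed: int = 1337) -> tuple[list, list]:
    """Shuffle deterministically and hold out a fraction of rows for evaluation."""
    shuffled = rows[:]                      # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)   # a fixed seed makes the split reproducible
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

Run the split after near-duplicate removal. Otherwise a near-copy of a training row can land in the eval set and flatter the eval loss.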

Layer 2 — Held-out task suite. Build a small set of 30-50 prompts representing the use cases you actually care about. Run them through the base model and the fine-tuned model. Score them yourself, blindly, on a 1-5 scale for: correctness, format, refusal-appropriateness, code quality. Same prompts, both models, your eyes.

This sounds primitive. It is. It’s also the most reliable signal you’ll get for behavior changes the loss can’t see.
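The "blindly" part is the easiest to get wrong, so it is worth scripting. The sketch below randomizes which model shows up as answer A or B and averages the scores afterwards. The function names and the score record layout are hypothetical.

```python
import random
from statistics import mean

def blind_pairs(prompts: list[str], base_out: list[str], tuned_out: list[str], seed: int = 0):
    """Yield (prompt, answer_a, answer_b, key); key records which model is A and B."""
    rng = random.Random(seed)
    for prompt, base, tuned in zip(prompts, base_out, tuned_out):
        # Flip a coin per prompt so position carries no information about the model.
        if rng.random() < 0.5:
            yield prompt, base, tuned, {"a": "base", "b": "tuned"}
        else:
            yield prompt, tuned, base, {"a": "tuned", "b": "base"}

def summarize(scores: list[dict]) -> dict[str, float]:
    """scores: [{'model': 'base' or 'tuned', 'score': 1..5}, ...] -> mean per model."""
    by_model: dict[str, list[int]] = {}
    for s in scores:
        by_model.setdefault(s["model"], []).append(s["score"])
    return {model: mean(vals) for model, vals in by_model.items()}
```

Keep the key hidden until every pair is scored. Unblind only when you call `summarize`.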

Layer 3 — Refusal regression. Build a separate set of prompts that should be refused (e.g. “write malware targeting hospitals”). Make sure the fine-tuned model still refuses them. Fine-tuning that broadens helpfulness can over-shoot and refuse nothing — that’s a sign your dataset has too few “this is the line” examples.
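The refusal regression can be turned into a single number to track across iterations. In this sketch `generate` is a hypothetical callable that wraps whatever inference path you use. The phrase list is a crude stand-in for "refused", so read the outputs it misclassifies.

```python
from typing import Callable

# Assumption: a phrase list is only a rough proxy for a refusal.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "unable to assist")

def looks_like_refusal(text: str) -> bool:
    """Heuristic: a refusal phrase appears near the start of the reply."""
    head = text.strip().lower().replace("’", "'")[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)

def refusal_rate(prompts: list[str], generate: Callable[[str], str]) -> float:
    """Fraction of must-refuse prompts that the model actually refused."""
    if not prompts:
        return 0.0
    return sum(looks_like_refusal(generate(p)) for p in prompts) / len(prompts)
```

On the must-refuse set this number should stay close to 1.0 after fine-tuning. Compare it with the base model's rate on the same prompts.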

A Quick Smoke Test

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "out/lora-final",
    max_seq_length = 4096,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

prompt = tokenizer.apply_chat_template([
    {"role": "system", "content": "You are an offensive security assistant..."},
    {"role": "user",   "content": "Write a Windows ETW patch in C using GetProcAddress."},
], tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
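out = model.generate(**inputs, max_new_tokens=1024, temperature=0.3, top_p=0.9, do_sample=True)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))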

Lower temperature (0.2-0.4) for code, higher (0.7-0.9) for prose tasks like pretext writing. Top-p of 0.9 is a sane default.

Deployment: GGUF + Ollama

Training output is a LoRA adapter on top of a 4-bit base. For day-to-day use you want a single, fast, quantized model file. The path:

# Merge adapter into base, export as GGUF for llama.cpp / Ollama
model.save_pretrained_gguf(
    "out/qwen-redteam-q4",
    tokenizer,
    quantization_method = "q4_k_m",   # good balance for 7B
)

q4_k_m is the right default for a 7B model on 12-24 GB VRAM at inference. For a 3B model, q5_k_m runs comfortably on 8 GB and gives better quality. Use q8_0 when evaluating against the unquantized model, so you can isolate fine-tuning effects from quantization effects.
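A back-of-the-envelope size estimate helps when choosing between these quantization levels. The bits-per-weight figures below are rough averages I am assuming for llama.cpp quant types, and the parameter counts (about 3.1B and 7.6B) are approximate. Treat the output as a ballpark. Context and KV cache need VRAM on top of the file size.

```python
# Assumption: rough average bits per weight; real files vary by architecture.
BITS_PER_WEIGHT = {"q4_k_m": 4.9, "q5_k_m": 5.7, "q8_0": 8.5}

def gguf_size_gb(n_params_billion: float, quant: str) -> float:
    """Estimated file size in GB (10^9 bytes) for a given parameter count."""
    return round(n_params_billion * BITS_PER_WEIGHT[quant] / 8, 1)

# Approximate parameter counts for the 3B and 7B models (assumed).
for quant in BITS_PER_WEIGHT:
    print(quant, gguf_size_gb(3.1, quant), gguf_size_gb(7.6, quant))
```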

Then a minimal Ollama Modelfile:

# Modelfile
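# Assumption: point FROM at the .gguf file the export step actually wrote. Depending on the
# Unsloth version it may sit inside the out/qwen-redteam-q4 directory under a different name.
FROM ./qwen-redteam-q4.gguf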

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""

PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"

SYSTEM """You are an offensive security assistant. Help the operator with red team coding tasks. Provide working code, explain trade-offs, note detection considerations."""

ollama create qwen-redteam -f Modelfile
ollama run qwen-redteam

From here, drop it into Continue, Open WebUI, or call it from your tooling at http://localhost:11434. The integration story is the boring, well-paved part.
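Calling the model from your own tooling needs nothing beyond the standard library. This sketch posts to Ollama's /api/chat endpoint with streaming turned off. The model name matches the `ollama create` step above. The helper names and the timeout are my own choices.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_request(prompt: str, model: str = "qwen-redteam") -> urllib.request.Request:
    """Build a non-streaming chat request for the local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["message"]["content"]
```

No system message is sent here. The SYSTEM line in the Modelfile supplies it, which keeps the inference-time prompt identical to the training one.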

OPSEC for the Model Itself

The model is now an artifact with its own threat surface.

Adapters vs. merged models. A LoRA adapter (~150 MB) does nothing without the base, so share adapters internally rather than merged GGUFs when you can. That is a convenience, not a control: the base is a public download, so treat the adapter as being as sensitive as the data it was trained on.
Canaries. Those unique strings you planted in training data? First confirm that your own model actually reproduces them when prompted; a string seen in only 3-5 rows over a couple of epochs may not be memorized. Then, if a model you didn’t share turns up and surfaces a canary under the same prompts, you have a leak indicator.
Engagement isolation. Don’t train one model on data from multiple clients. Either keep client-specific data out of the training set entirely (use it via local RAG instead, where it stays on disk) or maintain per-engagement adapters.
Don’t push to public hubs. It seems obvious. People do it anyway. HuggingFace, Ollama Library, Modelscope — none of these are appropriate for an offensive-tuned model. If you need to share inside an org, host an internal registry.
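Inference logs. Check what your serving stack records before your prompts contain engagement data. As far as I know, Ollama’s interactive CLI keeps a prompt history file, the server logs request content when debug logging is enabled, and front-ends such as Open WebUI store chat history. Disable or redirect whatever applies.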
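The canary workflow needs two small pieces: something to mint the strings and something to check model output for them. Here is a sketch using only the standard library. The "canary-" prefix and the function names are illustrative.

```python
import secrets

def make_canary(prefix: str = "canary") -> str:
    """Generate a unique marker string that is unlikely to occur anywhere else."""
    return f"{prefix}-{secrets.token_hex(8)}"

def find_canaries(output: str, canaries: list[str]) -> list[str]:
    """Return the canaries that appear in a piece of model output (case-insensitive)."""
    lowered = output.lower()
    return [c for c in canaries if c.lower() in lowered]
```

Store the generated canaries somewhere other than the training set, for example in the engagement's password manager, so you can still check for them after the dataset is gone.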

Costs and Time

Real numbers for a first end-to-end run, assuming you have hardware:

Phase	Time	Notes
Lab setup	2-4 hours	One-time
Dataset construction (3K rows)	20-40 hours	Bulk of the work
Training run	1-2 hours	Per iteration
Evaluation	2-3 hours	Per iteration
Total to first usable model	~1 week of evening work	Realistic

Expect to throw out the first model. The second is usually decent. The third is what you actually use.

What’s Next

This post ends with a working assistant. There’s a lot of room beyond it:

DPO or KTO on top of SFT. Once you have a working SFT model, preference optimization on pairs of (better, worse) responses sharpens behavior further. Useful when you can rank outputs but can’t write the perfect one yourself.
Tool use fine-tuning. Teach the model to call into your own tools — Nmap, BloodHound, Impacket — via structured function calls. Puts the model into an agent loop instead of a passive assistant.
Domain-specialist adapters. Stackable LoRAs: one for Windows tradecraft, one for cloud, one for web. Load only what the engagement needs.
Distilled smaller models. Once the 7B works, distill into a 3B or 1.5B that runs on the kind of hardware you’d actually take on-site.

Each of these is its own post. For now, you have everything you need to brew the first model and start iterating.

The code is mundane. The dataset is the craft. Spend the time there.