I will test your llm chatbot for jailbreaks, data leaks and unsafe behavior


About this gig
LLM Behavioral & Safety Testing by a QA Lead
I'm a QA Lead (6+ yrs) applying systematic test design to AI. I build test sets that surface where your LLM-powered bot behaves unsafely or breaks its own rules jailbreaks, prompt injection, prompt leaks, hallucinations, refusal failures, and data-access risks.
How it works:
- You share your system prompt + how the bot is used
- I map the risk zones specific to your use-case
- I build the test cases (input expected behavior + severity + rationale)
- You get JSONL + CSV + a readable report ready for your eval harness
Premium: I also run the tests against your model and deliver a findings report each failure with input, expected vs actual, and severity.
What I don't do: I don't judge factual or domain accuracy (legal, medical, etc.) that needs a subject-matter expert. I test behavior, safety & instruction-following.
Need a large or ongoing set? Message me for a custom quote. Written-first, GMT+7. Message me before ordering.
Get to know Vladislav Boev
Senior QA Lead and Test Architect
- FromVietnam
- Member sinceJun 2026
- Avg. response time1 hour
Languages
Russian, English
FAQ
Do you check if my bot's answers are factually correct?
No — I test behavior, safety and instruction-following (does it break rules, leak data, get jailbroken). Judging factual or domain accuracy (legal, medical, etc.) needs a subject-matter expert. I'll tell you upfront if your case needs that.
What do you need from me to start?
Your system prompt (the instructions you give the model) and a short description of how the bot is used. For Premium runs: API access to your model, or you run my test cases and send back the outputs.
Which models do you support?
Any text-based LLM or chatbot (GPT, Claude, Gemini, Llama, open-source, fine-tuned). I test behavior at the prompt level, so the underlying model doesn't matter.
Can you test legal, medical or financial bots?
I can test their safety and rule-following behavior (e.g. that they refuse advice they shouldn't give), but not whether their domain answers are correct. For high-risk domains I keep scope to behavior/safety and say so clearly.
I need a large or recurring test set — can you do that?
Yes. The packages cover focused sets; for large volumes or ongoing testing, message me before ordering and I'll send a custom quote.

