I will build a domain specific sft dataset for llm finetuning

Vietnam

I speak English, Vietnamese

LLM FineTuning Data and AI Automation

I'm an AI Engineer with a Computer Science background, specializing in LLM fine-tuning data and AI automation systems. I build production-ready SFT datasets, custom AI pipelines, and document-aware kn...
About this Gig

Fine-tuning a language model starts with the data. Vague responses, duplicate samples, or wrong formats will hurt your model regardless of how good your training setup is.


I build domain-specific SFT datasets through a 5-stage pipeline: generation, validation, deduplication, LLM-as-judge scoring, and human quality review. Every sample that reaches your training loop has passed all five stages.


WHAT YOU RECEIVE

  • train.jsonl + val.jsonl (90/10 split)
  • data_card.md (dataset documentation)


FORMATS

  • Alpaca single-turn, all packages
  • ShareGPT multi-turn, Standard and Premium


COMPATIBLE WITH

  • Axolotl, LLaMA-Factory, Unsloth, OpenAI Fine-tune API, Together AI


DOMAINS

E-commerce, healthcare Q&A, legal summarization, coding assistant, SaaS support, finance, HR, EdTech, multilingual support, and more. Message me if yours isn't listed.


Not sure which package fits your use case? Send me a message before ordering.

Programming Language:

Python

Pytorch

AI Model Frameworks & Tools:

Hugging Face Transformers

Data Type:

Text

AI Engine:

GPT

Gemini

DeepSeek

Llama

Grok

My Portfolio

Related tags