Synthetic Data Generation Using CTGAN with Agentic Workflows and Retrieval-Augmented Generation
DOI:
https://doi.org/10.34190/icair.5.1.4280
Keywords:
Artificial Intelligence, Machine Learning, Dataset Generation, Generative AI, RAG systems, Multi-Agent, LLM Agents
Abstract
Real-world data in domains such as finance and fraud detection can be rare, imbalanced, or inaccessible, which makes synthetic data a crucial alternative. Gathering and using real-world data in such domains faces significant challenges, including privacy concerns, legal restrictions, the high cost of annotation, and restricted access due to proprietary ownership. Synthetic data generation offers a meaningful alternative to real data gathering, reducing both privacy risk and computational cost while allowing for the construction of flexible, scalable datasets. This paper presents a new paradigm for tabular data synthesis that integrates CTGAN (Conditional Tabular GAN) with agentic workflows and retrieval-augmented generation (RAG). The proposed system accepts partial data samples and column constraints as inputs through a user-friendly chatbot interface and augments the dataset through an AI-agent-based generation pipeline. These AI agents automate preprocessing, the interpretation of column semantics, and the enforcement of constraints that users specify in natural language, considerably reducing manual intervention. The framework further includes ChromaDB to enable semantic retrieval of relevant past datasets. With this semantic memory, the model can improve generation quality, enforce schema-level consistency, and even synthesize new datasets from column names or metadata alone. This allows for context-aware, structurally sound, and domain-conformant data generation without the need to access sensitive or complete datasets. The study uses statistical measures such as the mean, the variance, and the Kolmogorov–Smirnov (KS) test to confirm the fidelity of the generated data. The approach achieves a mean difference of just 0.16% and a KS statistic of 0.0020, which indicates close statistical agreement with the original data distributions.
Preliminary results show significant improvements in data realism, diversity, and variability without sacrificing domain coherence. The proposed system is particularly well suited to financial datasets, such as those used in credit card fraud detection, and offers a scalable, privacy-aware method of synthetic data generation in sensitive or data-scarce environments.
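The fidelity measures named in the abstract (mean, variance, and the KS statistic) can be illustrated for a single numeric column with a minimal sketch. The abstract does not state how the mean difference is computed, so the relative-percentage definition below is an assumption, as are the function names and the randomly generated stand-in data; this is not the authors' evaluation code.

```python
import random
from bisect import bisect_right
from statistics import mean, pvariance


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # occurs at one of the observed values.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def relative_mean_difference(real, synthetic):
    """Absolute difference of means as a percentage of the real mean.

    Assumed definition; the abstract does not specify the formula.
    """
    real_mean = mean(real)
    return abs(mean(synthetic) - real_mean) / abs(real_mean) * 100.0


def fidelity_report(real, synthetic):
    """Collect the per-column fidelity measures in one dictionary."""
    return {
        "mean_diff_pct": relative_mean_difference(real, synthetic),
        "variance_real": pvariance(real),
        "variance_synthetic": pvariance(synthetic),
        "ks_statistic": ks_statistic(real, synthetic),
    }


if __name__ == "__main__":
    # Hypothetical stand-in data: two samples drawn from the same
    # log-normal distribution, loosely resembling transaction amounts.
    rng = random.Random(0)
    real_amounts = [rng.lognormvariate(3.0, 1.0) for _ in range(2000)]
    synthetic_amounts = [rng.lognormvariate(3.0, 1.0) for _ in range(2000)]
    print(fidelity_report(real_amounts, synthetic_amounts))
```

A KS statistic near 0 means the empirical distributions of the real and synthetic column nearly coincide, while a value of 1 means they do not overlap at all. In practice such a check would be run per column, and a library routine such as SciPy's `ks_2samp` could replace the hand-written function.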