Synthetic Data Generation Using CTGAN with Agentic Workflows and Retrieval-Augmented Generation

Sinchana K C; Maria George Anthraper; Kusuma Sanjaykumar; Shruti Kumari; Uma D

doi:10.34190/icair.5.1.4280

Authors

Sinchana K C, PES University
Maria George Anthraper, PES University
Kusuma Sanjaykumar, PES University
Shruti Kumari, PES University
Uma D, PES University

Keywords:

Artificial Intelligence, Machine Learning, Dataset Generation, Generative AI, RAG systems, Multi-Agent, LLM Agents

Abstract

Real-world data in domains such as finance and fraud detection can be rare, imbalanced, or inaccessible, which makes synthetic data a crucial alternative. Gathering and using real-world data in such domains faces significant challenges, including privacy concerns, legal restrictions, high annotation costs, and restricted access due to proprietary ownership. Synthetic data generation offers a practical alternative to real data collection, reducing both privacy risk and computational cost while allowing the construction of flexible, scalable datasets. This paper presents a new approach to tabular data synthesis that integrates CTGAN (Conditional Tabular GAN) with agentic workflows and retrieval-augmented generation (RAG). The proposed system accepts partial data samples and column constraints through a user-friendly chatbot interface and augments the dataset through an AI-agent-based generation pipeline. These AI agents automate preprocessing, interpret column semantics, and enforce user constraints specified in natural language, considerably reducing manual intervention. The framework also uses ChromaDB to enable semantic retrieval of relevant past datasets. With this semantic memory, the system can improve generation quality, enforce schema-level consistency, and even synthesize new datasets from column names or metadata alone. This allows context-aware, structurally sound, and domain-conformant data generation without access to sensitive or complete datasets. Statistical measures such as the mean, the variance, and the Kolmogorov–Smirnov (KS) test are used to assess the fidelity of the generated data. The approach achieves a mean difference of just 0.16% and a KS statistic of 0.0020, indicating close statistical agreement with the original data distributions.
Preliminary results show significant improvements in data realism, diversity, and variability without sacrificing domain coherence. The system is particularly well suited to financial datasets, such as those used in credit card fraud detection, and offers a scalable, privacy-aware method of synthetic data generation for sensitive or data-scarce environments.
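
The fidelity measures named in the abstract (mean, variance, and the KS statistic) can be illustrated for a single numeric column with the minimal sketch below. It is not the authors' evaluation code. The function names (ks_statistic, mean_difference_pct, column_report) are hypothetical, and the sketch assumes that the "mean difference" is the absolute difference of the two means expressed as a percentage of the real mean, which the abstract does not spell out.

```python
from bisect import bisect_right
from statistics import fmean, pvariance


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    # Empirical CDFs only change at observed values, so comparing them
    # at every point of the pooled sample is sufficient.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def mean_difference_pct(real, synthetic):
    """Absolute difference of the means as a percentage of the real mean.

    Assumption: this relative definition is illustrative; the paper may
    define its reported mean difference differently.
    """
    real_mean = fmean(real)
    return abs(fmean(synthetic) - real_mean) / abs(real_mean) * 100.0


def column_report(real, synthetic):
    """Collect the per-column fidelity measures into one dictionary."""
    return {
        "mean_diff_pct": mean_difference_pct(real, synthetic),
        "variance_real": pvariance(real),
        "variance_synthetic": pvariance(synthetic),
        "ks_statistic": ks_statistic(real, synthetic),
    }
```

A KS statistic of 0 means the two empirical distributions coincide and 1 means they do not overlap at all, so values near 0 indicate that the synthetic column tracks the real one closely. In practice a library routine such as scipy.stats.ks_2samp would typically be used instead, since it also reports a p-value.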

Author Biographies

Sinchana K C, PES University

Ms. Sinchana K C (sinchanakc51103@gmail.com) is an undergraduate student at PES University pursuing a Bachelor of Technology in Computer Science with a specialization in Machine Intelligence and Data Science. Her interests include Big Data, Data Analytics, and Machine Learning, with a focus on Deep Learning and Generative AI.

Maria George Anthraper, PES University

Ms. Maria George Anthraper (anthraper.maria@gmail.com) is an undergraduate student at PES University pursuing a Bachelor of Technology in Computer Science with a specialization in Machine Intelligence and Data Science. Her interests include Machine Learning research, with a focus on Deep Learning and Generative AI, as well as Data Analytics.

Kusuma Sanjaykumar, PES University

Ms. Kusuma Sanjaykumar (kusuma.sanjaykumar123@gmail.com) is an undergraduate student at PES University pursuing a Bachelor of Technology degree specializing in Machine Learning and Data Science. Her interests focus on AI-driven systems, especially in Generative and Multimodal Learning, Data-driven Machine Learning, and Data Analytics.

Shruti Kumari, PES University

Ms. Shruti Kumari (shrutikumarijsr2002@gmail.com) is an undergraduate Computer Science and Engineering student at PES University. Her interests lie in artificial intelligence, machine learning, and human-computer interaction. She enjoys building creative and intelligent systems, especially voice assistants and AI-driven applications that combine innovation with real-world impact.

Uma D, PES University

Uma is a Professor in the Department of Computer Science Engineering at PES University with 27 years of teaching experience. She completed her undergraduate and postgraduate studies at Bharathiar University and holds a Ph.D. in Applied Mathematics from VTU. Her research interests include Data Science, Computer Vision, Machine Intelligence, Deep Learning, and Natural Language Processing.
