Programmable Structured Synthetic Data Generation for Privacy-Preserving Data Sharing

Venkata Abhishek NALAM    
Assistant Professor

Tram TRUONG HUU    
Associate Professor

This project proposes a programmable framework for generating synthetic data to enable privacy-preserving data sharing.

Using hybrid NLP models and generative AI, the framework aims to generate structured synthetic datasets that statistically resemble real-world datasets containing personally identifiable information (PII) while preserving privacy. The project primarily addresses fairness in AI model training.

Project Deliverables/Outcomes/Impact:
  • A hybrid NLP module for accurate identification of PII in structured datasets.
  • A GAN-based synthetic data generator integrated with differential privacy (DP) for enhanced privacy.
  • A benchmarking tool to assess utility, fairness, and privacy risk of synthetic datasets.
     
A linear flowchart illustrating a machine learning pipeline for synthetic data generation, starting from structured tabular data, passing through a Hybrid NLP Module and a GAN with Differential Privacy (DP), and ending with a Data Quality Assessment.
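
To make the pipeline stages concrete, the following is a minimal sketch in plain Python, not the project's implementation. It rests on several assumptions that are not in the project description: the regular expressions stand in for only the rule-based half of a hybrid PII detector (no NLP model is included), a Laplace-noised histogram is swapped in for the GAN with DP so the example stays short, and total variation distance is just one possible utility measure. All function names, patterns, and parameters are hypothetical.

```python
import random
import re
from collections import Counter

# Hypothetical rule set: only the rule-based half of a "hybrid" PII detector.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s-]{6,14}\d$"),
}


def detect_pii_columns(rows, threshold=0.8):
    """Return {column: pii_type} for columns whose values mostly match a pattern."""
    found = {}
    for col in rows[0]:
        values = [str(row[col]) for row in rows]
        for pii_type, pattern in PII_PATTERNS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= threshold:
                found[col] = pii_type
                break
    return found


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_synthetic_column(values, domain, epsilon, n, rng):
    """Sample n synthetic values from a Laplace-noised histogram.

    Assumptions: `domain` and `n` are public, and neighbouring datasets
    differ by adding or removing one record, so each count has
    sensitivity 1 and the noise scale is 1 / epsilon. Clamping and
    sampling are post-processing of the noisy counts.
    """
    counts = Counter(values)
    noisy = [max(counts.get(v, 0) + laplace_noise(1 / epsilon, rng), 0.0)
             for v in domain]
    if sum(noisy) == 0:
        noisy = [1.0] * len(domain)  # fall back to uniform sampling
    return rng.choices(domain, weights=noisy, k=n)


def total_variation_distance(real, synthetic):
    """Utility proxy: 0 means identical marginals, 1 means disjoint ones."""
    p, q = Counter(real), Counter(synthetic)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p[k] / len(real) - q[k] / len(synthetic)) for k in keys)


if __name__ == "__main__":
    # Made-up records for illustration only.
    rows = [
        {"email": "a@example.com", "phone": "+65 6123 4567", "plan": "basic"},
        {"email": "b@example.com", "phone": "+65 6234 5678", "plan": "premium"},
        {"email": "c@example.com", "phone": "+65 6345 6789", "plan": "basic"},
    ]
    print(detect_pii_columns(rows))
    rng = random.Random(0)
    plans = [row["plan"] for row in rows]
    synthetic = dp_synthetic_column(plans, ["basic", "premium"], 1.0, 50, rng)
    print(total_variation_distance(plans, synthetic))
```

A smaller epsilon adds more noise to the histogram, which tends to raise the total variation distance. This is the privacy and utility trade-off that a benchmarking tool would need to quantify, alongside fairness measures that this sketch does not cover.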