Programmable Structured Synthetic Data Generation for Privacy-Preserving Data Sharing

This project proposes the development of a programmable synthetic data generation framework to enable privacy-preserving data sharing.

By combining hybrid NLP models with generative AI, the framework aims to generate structured synthetic datasets that statistically resemble real-world data containing personally identifiable information (PII) while preserving the privacy of the individuals in that data. The project primarily addresses the issue of fairness in AI model training.

Project Deliverables/Outcomes/Impact:

A hybrid NLP module for accurate identification of PII in structured datasets.
A synthetic data generator based on a generative adversarial network (GAN), integrated with differential privacy (DP) for enhanced privacy.
A benchmarking tool to assess utility, fairness, and privacy risk of synthetic datasets.
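
To make the first deliverable more concrete, the sketch below shows one way the rule-based half of a PII identification step could flag columns in a structured dataset. The project description does not say what "hybrid" combines, so this assumes a mix of pattern rules and a learned model and illustrates only the pattern rules. The names `PII_PATTERNS` and `detect_pii_columns`, the two regular expressions, and the 0.8 threshold are all hypothetical choices made for illustration, not part of the project.

```python
import re

# Hypothetical, deliberately simple patterns; a real module would need
# many more PII types and locale-aware rules.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s\-()]{6,}\d$"),
}


def detect_pii_columns(rows, threshold=0.8):
    """Return {column: pii_type} for columns whose non-empty values
    mostly match one of the patterns above.

    `rows` is a list of dicts sharing the same keys (one dict per record).
    `threshold` is the assumed minimum fraction of matching values.
    """
    if not rows:
        return {}
    flagged = {}
    for column in rows[0]:
        # Ignore missing and empty cells when computing the match rate.
        values = [
            str(row[column]).strip()
            for row in rows
            if row.get(column) not in (None, "")
        ]
        if not values:
            continue
        for pii_type, pattern in PII_PATTERNS.items():
            hits = sum(1 for value in values if pattern.match(value))
            if hits / len(values) >= threshold:
                flagged[column] = pii_type
                break  # first matching type wins
    return flagged


records = [
    {"contact": "alice@example.com", "phone": "+65 6123 4567", "age": 34},
    {"contact": "bob@example.org", "phone": "555-010-9999", "age": 41},
]
print(detect_pii_columns(records))
```

Rules of this kind only catch PII with a regular surface form. Free-text fields such as names or addresses are where a learned NLP component would presumably take over.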

Figure: A linear flowchart of the machine learning pipeline for synthetic data generation. It starts from structured tabular data, passes through a Hybrid NLP Module and a GAN with Differential Privacy (DP), and ends with a Data Quality Assessment.
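
The four stages of this pipeline can be wired together in a few lines. The sketch below is a toy stand-in under stated assumptions, not the project's method. The PII columns are passed in as if a PII identification step had already found them. The GAN is replaced by a much simpler differentially private generator that adds Laplace noise to the category counts of a single column and samples from the noisy histogram. The quality assessment is reduced to one utility measure, the total variation distance between real and synthetic category frequencies. All function names and parameters are hypothetical.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram_generator(values, domain, epsilon, n_samples, rng):
    """Stand-in for the GAN: sample from a Laplace-noised histogram.

    Assumes `domain` is fixed and public, and that neighbouring datasets
    differ by adding or removing one record (L1 sensitivity 1).
    """
    counts = Counter(values)
    noisy = [
        max(0.0, counts.get(v, 0) + laplace_noise(1.0 / epsilon, rng))
        for v in domain
    ]
    if sum(noisy) == 0:
        noisy = [1.0] * len(domain)  # fall back to uniform sampling
    return rng.choices(domain, weights=noisy, k=n_samples)


def total_variation_distance(real, synthetic):
    """Half the L1 distance between the two empirical distributions."""
    p, q = Counter(real), Counter(synthetic)
    keys = set(p) | set(q)
    return 0.5 * sum(
        abs(p[k] / len(real) - q[k] / len(synthetic)) for k in keys
    )


def run_pipeline(rows, pii_columns, column, domain, epsilon, n_samples, seed=0):
    """Structured data -> drop PII -> DP generator -> quality score."""
    if column in pii_columns:
        raise ValueError("cannot synthesise a column flagged as PII")
    rng = random.Random(seed)
    # Stage 2 (assumed done upstream): remove the columns flagged as PII.
    cleaned = [
        {k: v for k, v in row.items() if k not in pii_columns} for row in rows
    ]
    real = [row[column] for row in cleaned]
    # Stage 3: differentially private generation (histogram, not a GAN).
    synthetic = dp_histogram_generator(real, domain, epsilon, n_samples, rng)
    # Stage 4: a single utility measure in [0, 1]; lower is closer.
    return synthetic, total_variation_distance(real, synthetic)


rows = [
    {"name": "A", "plan": "basic"},
    {"name": "B", "plan": "pro"},
    {"name": "C", "plan": "basic"},
]
synthetic, tvd = run_pipeline(
    rows, {"name"}, "plan", ["basic", "pro"], epsilon=1.0, n_samples=50
)
print(len(synthetic), round(tvd, 3))
```

A smaller `epsilon` adds more noise and gives a stronger privacy guarantee at the cost of utility, which is the trade-off a benchmarking tool would need to quantify. A fuller assessment would also cover fairness and privacy risk, which this sketch omits.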

