Geometry-Based Differentially Private Synthetic Tabular Data Generation via K-Means Clustering with Bounded and Discrete Feature Constraints

Robby Robby, Agus Sukmana, Erwinna Chendra

Abstract


Most clustering-based methods for differentially private synthetic data generation assume unconstrained continuous feature spaces. They offer no mechanism for enforcing hard feature bounds or handling discrete-valued attributes, which limits their applicability to real-world tabular data, where such constraints are common. This paper proposes a geometry-based mechanism that generates synthetic tabular data by applying Laplace noise jointly to K-means cluster centroids and within-cluster radial distances, with the noise calibrated via a data-dependent sensitivity approximation. Three components distinguish the approach from prior work: coordinate-wise centroid reflection, which enforces hard feature bounds after perturbation; coordinate-wise clipping, which enforces bounds on reconstructed synthetic points; and randomized rounding of discrete features as a post-processing step. A utility-driven calibration strategy selects the privacy budget ε to meet a user-specified target Adjusted Rand Index (ARI), making the privacy–utility trade-off directly interpretable. In baseline comparisons on a two-dimensional illustrative example, the proposed mechanism achieves ARI = 0.666 at ε ≈ 1.60, substantially outperforming direct coordinate-wise noise addition at the same budget (ARI = 0.199) and performing on par with the non-private synthesis baseline (ARI = 0.624). Across 30 independent runs, the mechanism achieves a mean ARI of 0.629 ± 0.108, which confirms that the calibration target is reliably met under stochastic variation.
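The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it makes no claim about a formal privacy guarantee. The function names, the even split of ε between centroids and radii, the caller-supplied `sensitivity` value (standing in for the data-dependent approximation, which the abstract does not specify), the uniformly random directions, and the absolute value applied to noisy radii are all assumptions made for illustration. Bounds are assumed to satisfy upper > lower, with integer bounds on discrete columns.

```python
import numpy as np


def reflect_into_bounds(x, lower, upper):
    """Coordinate-wise reflection of x into [lower, upper] (assumes upper > lower)."""
    x = np.asarray(x, dtype=float)
    lower = np.asarray(lower, dtype=float)
    width = np.asarray(upper, dtype=float) - lower
    # Fold the offset into one period of length 2*width, then mirror the far half.
    offset = np.mod(x - lower, 2.0 * width)
    return lower + np.where(offset > width, 2.0 * width - offset, offset)


def randomized_round(x, rng):
    """Round each value up with probability equal to its fractional part."""
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))


def synthesize(X, labels, centroids, epsilon, lower, upper,
               discrete_cols, sensitivity, rng):
    """Hypothetical sketch: noisy centroids plus noisy radial distances."""
    # Assumption: the budget is split evenly between centroids and radii.
    scale = sensitivity / (epsilon / 2.0)
    out = np.empty_like(X, dtype=float)
    for k, c in enumerate(centroids):
        idx = np.flatnonzero(labels == k)
        # Perturb the centroid, then reflect it back inside the feature bounds.
        noisy_c = reflect_into_bounds(
            c + rng.laplace(0.0, scale, c.shape), lower, upper)
        # Perturb each point's distance to its (original) centroid.
        radius = np.linalg.norm(X[idx] - c, axis=1)
        noisy_r = np.abs(radius + rng.laplace(0.0, scale, radius.shape))
        # Assumption: place each synthetic point in a uniformly random direction.
        direction = rng.normal(size=(idx.size, X.shape[1]))
        direction /= np.linalg.norm(direction, axis=1, keepdims=True)
        out[idx] = noisy_c + noisy_r[:, None] * direction
    # Clip reconstructed points to the bounds, then round discrete features.
    out = np.clip(out, lower, upper)
    out[:, discrete_cols] = randomized_round(out[:, discrete_cols], rng)
    return out
```

With integer bounds on the discrete columns, rounding after clipping cannot push a value outside the bounds, so both constraints hold on the returned array. An ARI-guided calibration in the spirit of the abstract would wrap `synthesize` in a search over `epsilon`, which is not shown here.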

Keywords


ARI-guided Calibration; Bounded Features; Differential Privacy; K-means Clustering; Synthetic Tabular Data.





DOI: https://doi.org/10.18860/cauchy.v11i2.41403



Copyright (c) 2026 Robby Robby

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


CAUCHY: Jurnal Matematika Murni dan Aplikasi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.