Generative Synthetic Data Pipelines for Bias-Free BI Training

Abstract

Business intelligence (BI) systems typically build predictive or rule-based models from large volumes of data. Real-world data, however, often carries biases that propagate into these systems and distort the resulting insights and decisions. This paper presents a generative synthetic data pipeline that produces bias-free data for training BI models. In the first stage, biases in existing datasets are identified and corrected; in the second, conditional and adversarial generative models produce synthetic datasets that resemble the real datasets while removing sensitive biases. A validation module checks statistical parity, demographic fairness, and outcome consistency to ensure that the synthetic data remains useful for analysis and to reduce the risk of reproducing historical biases. A feedback loop continually refines the models, so the pipeline adapts at runtime to changes in its input data or organizational needs. Experiments were conducted in several BI application areas, including sales analytics, customer churn prediction, and efficiency analysis. The results show that models trained on the generated synthetic datasets perform comparably to or better than models trained on the original datasets on standard performance metrics, while showing a marked decrease in bias across protected attributes. The framework is designed to scale within enterprise BI environments and to support ethical business decision-making.
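As an illustration of the kind of check a validation module could run, the following minimal sketch computes a statistical parity gap, meaning the largest difference in positive-outcome rate between groups of a protected attribute, for a real and a synthetic dataset. It is not the paper's implementation: the function names, the toy data, and the choice of a max-minus-min gap are assumptions made for this example.

```python
from collections import defaultdict


def positive_rates(groups, outcomes):
    """Return the share of positive outcomes (1) for each group value."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for g, y in zip(groups, outcomes):
        totals[g] += 1
        positives[g] += int(y == 1)
    return {g: positives[g] / totals[g] for g in totals}


def statistical_parity_difference(groups, outcomes):
    """Largest gap in positive-outcome rate between any two groups."""
    rates = positive_rates(groups, outcomes)
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: a protected attribute and a binary outcome
# (for example, "customer flagged as likely to churn").
real_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
real_outcomes = [1, 1, 1, 0, 1, 0, 0, 0]
synth_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
synth_outcomes = [1, 1, 0, 0, 1, 1, 0, 0]

real_gap = statistical_parity_difference(real_groups, real_outcomes)
synth_gap = statistical_parity_difference(synth_groups, synth_outcomes)
print(f"real gap: {real_gap:.2f}, synthetic gap: {synth_gap:.2f}")
```

A gap near zero suggests that positive outcomes are distributed evenly across groups. In practice such a check would be paired with a utility metric, since a synthetic dataset can reach parity while losing the signal the BI model needs.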
