State of the Field
Recent work on synthetic data generation targets sectors where real data is scarce or restricted by privacy concerns. In anti-money laundering research, new tools generate customizable datasets that capture both the structural and temporal characteristics of illicit transactions, which supports more effective model training. In remote sensing, emerging frameworks use vision and language models to make synthetic data more interpretable and useful, and report that augmented datasets can outperform those built solely from real images. Similarly, multimodal large language models are being used to generate synthetic defect images for power line inspection, significantly improving classification accuracy in data-scarce settings. Other frameworks aim to ensure fairness in synthetic financial data generation, addressing biases that can skew automated decision-making. Together, these efforts point to a shift toward practical, scalable methods that improve model performance while mitigating ethical concerns in data usage.
Papers
Grounding Synthetic Data Generation With Vision and Language Models
Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typ...
Tide: A Customisable Dataset Generator for Anti-Money Laundering Research
The lack of accessible transactional data significantly hinders machine learning research for Anti-Money Laundering (AML). Privacy and legal concerns prevent the sharing of real financial data, while ...
PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents
Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However,...
Evaluating Synthetic Data for Baggage Trolley Detection in Airport Logistics
Efficient luggage trolley management is critical for reducing congestion and ensuring asset availability in modern airports. Automated detection systems face two main challenges. First, strict securit...
FairFinGAN: Fairness-aware Synthetic Financial Data Generation
Financial datasets often suffer from bias that can lead to unfair decision-making in automated systems. In this work, we propose FairFinGAN, a WGAN-based framework designed to generate synthetic finan...
Synthetic Defect Image Generation for Power Line Insulator Inspection Using Multimodal Large Language Models
Utility companies increasingly rely on drone imagery for post-event and routine inspection, but training accurate defect-type classifiers remains difficult because defect examples are rare and inspect...
Improving TabPFN's Synthetic Data Generation by Integrating Causal Structure
Synthetic tabular data generation addresses data scarcity and privacy constraints in a variety of domains. Tabular Prior-Data Fitted Network (TabPFN), a recent foundation model for tabular data, has b...