
Introduction to

What is Synthetic Data?

Data is critical for training AI. Most computer vision users rely on real data captured by physical sensors, but real data can have issues including bias, cost, and inaccurate labeling.

Synthetic data is artificial or engineered content that AI interprets as if it were real data. It is used to train and validate artificial intelligence (AI) and machine learning (ML) systems and workflows. Engineered content may be derived from real world datasets using sampling or bootstrapping techniques, or it may be generated by simulating real world scenarios in high resolution 3D.
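As a rough illustration of the sampling route, the sketch below bootstraps a new dataset by drawing records with replacement from a small real one. The record structure and the `bootstrap_sample` name are assumptions made for this example, not part of any particular product.

```python
import random

def bootstrap_sample(records, n, seed=0):
    """Draw n records with replacement from an existing dataset."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return [rng.choice(records) for _ in range(n)]

# Hypothetical "real" dataset: three labeled records.
real_records = [
    {"id": 1, "label": "car"},
    {"id": 2, "label": "truck"},
    {"id": 3, "label": "bicycle"},
]

# The resampled dataset can be larger than its source.
resampled = bootstrap_sample(real_records, n=10)
print([r["label"] for r in resampled])
```

Bootstrapping can only reuse what the source data already contains, which is one reason simulation is used when new scenarios or rare objects are needed.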

For years, the entertainment and computer graphics industries have built synthetic 3D environments for movies, games, training simulators, and educational materials. Using engineered data to train and validate AI and ML systems extends that idea from simulating a single environment or scenario to simulating many, producing large datasets of thousands of images or data points for use in AI and ML workflows.

Synthetic datasets can be built with configurable pipelines that offer an effectively unlimited range of scenarios or environments. These pipelines generate data with a known diversity and distribution of labeled assets, which is a critical part of using synthetic data for training AI.
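To make the idea of a configurable pipeline more concrete, here is a minimal sketch in which a configuration fixes the class distribution before any data is produced. The configuration keys, class names, and the `plan_scenes` function are all hypothetical, and a real pipeline would render images rather than just plan their contents.

```python
import random
from collections import Counter

# Hypothetical configuration: every key and value here is illustrative.
CONFIG = {
    "num_images": 1000,
    "class_weights": {"car": 0.5, "truck": 0.3, "ambulance": 0.2},
    "objects_per_image": (1, 5),  # inclusive min and max
    "seed": 42,
}

def plan_scenes(config):
    """Decide which labeled assets appear in each image."""
    rng = random.Random(config["seed"])
    classes = list(config["class_weights"])
    weights = list(config["class_weights"].values())
    low, high = config["objects_per_image"]
    scenes = []
    for image_id in range(config["num_images"]):
        count = rng.randint(low, high)
        labels = rng.choices(classes, weights=weights, k=count)
        scenes.append({"image_id": image_id, "labels": labels})
    return scenes

scenes = plan_scenes(CONFIG)
totals = Counter(label for scene in scenes for label in scene["labels"])
print(totals)  # observed counts follow the configured weights
```

Because the distribution is set in the configuration, it is known in advance and can be changed by editing a few numbers rather than collecting new data.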

Synthetic microscopy, x-ray, and aerial images generated by the Platform

Existing datasets have limits

Data is an essential ingredient in training and testing AI and ML systems. The quality, distribution, and completeness of data directly affect how well these systems perform in real world scenarios. AI often fails when the real data it encounters differs from the limited data it was trained on. Issues with a particular AI system, such as bias and poor performance (often reflected in Average Precision (AP) scores), are directly driven by the quality of the data used to train and test it.
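For readers unfamiliar with AP, the sketch below computes one simple, non-interpolated variant: the mean of the precision values at each rank where a correct prediction occurs. Benchmarks define AP in several different ways, so this is only an illustration. It assumes every ground-truth positive appears somewhere in the ranked list, and the function name and inputs are chosen for this example.

```python
def average_precision(scores, is_positive):
    """Non-interpolated AP over predictions ranked by confidence score."""
    ranked = sorted(zip(scores, is_positive), key=lambda p: p[0], reverse=True)
    total_positives = sum(1 for _, positive in ranked if positive)
    if total_positives == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / total_positives

# Correct predictions at ranks 1 and 3: (1/1 + 2/3) / 2
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
print(round(ap, 4))  # 0.8333
```

A model that ranks incorrect predictions above correct ones, as often happens with rare classes, receives a lower score.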

Typical problems that AI-focused organizations encounter are:

  • Bias and low precision: Datasets from real world scenarios and sensors can only capture assets or events at the frequency with which they occur in reality. Rare objects and unusual events that are difficult to capture in real datasets often cause algorithms to be biased or to have high error (low precision) when classifying particular entities.

  • Expense of data labeling: Real world datasets must be labeled or annotated, meaning the source content is paired with information describing the assets, entities, or events it contains and their characteristics or types. Labeling existing datasets is an expensive and error-prone process that may also lead to bias and precision issues when done poorly.

  • Unavailable data: When building models for sensors or scenarios that don’t yet exist or that are hard to access, it may simply not be possible to obtain datasets to train AI.

  • High risk data: Using some datasets may incur risks that an organization is unwilling to accept, such as datasets that are restricted because they contain personally identifiable information (PII) or for security reasons.

Benefits of adding synthetic data to AI workflows

Synthetic data is one of the tools that organizations are using to overcome the limitations and costs of real world datasets. Synthetic data is controlled by the author or data engineer, can be designed to model the characteristics of physical sensors, can account for statistical distributions, and costs less because it is simulated.

Some of the opportunities of using synthetic data include:

  • Expanding and controlling the distribution of datasets: Engineered or synthetic data can be algorithmically controlled to produce datasets that match real world metadata but with distributions of entities and events that are designed to both overcome and test for bias and precision issues in AI and ML systems.

  • Reducing labeling and data acquisition costs: Synthetic data is produced using techniques that enable data labels and image masks to be precisely determined for every piece of data without requiring any post-processing or human labeling effort.

  • Exploring and simulating new scenarios and sensors: In domains such as urban planning, future conditions do not yet exist to be captured with photogrammetry or lidar, but they can be simulated, presenting an opportunity to create and use synthetic datasets. Similarly, a hardware vendor planning new sensors or sensor platforms can use digital models of the proposed equipment to create synthetic data representative of the content that equipment is expected to produce.

  • Eliminating privacy and security concerns: Synthetic data can be produced using anonymized location and human models, removing security risks and any PII while still providing plausible content that can be processed in AI workflows.
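The labeling benefit above follows from the fact that the generator places every object itself, so the label and mask are by-products of rendering. The toy sketch below draws one rectangle on a blank grid and returns the image, the pixel mask, and the bounding box together. The `render_scene` function and its output format are assumptions for illustration only, and a real simulator would render 3D scenes rather than rectangles.

```python
import random

def render_scene(width, height, seed):
    """Render one rectangle and return the image, its mask, and its label."""
    rng = random.Random(seed)
    w = rng.randint(2, width // 2)
    h = rng.randint(2, height // 2)
    x0 = rng.randint(0, width - w)
    y0 = rng.randint(0, height - h)

    image = [[0] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            image[y][x] = 255  # pixel value of the object
            mask[y][x] = 1     # the same pixel marked in the mask

    # The label is known exactly because the generator chose the placement.
    label = {"class": "rectangle", "bbox": (x0, y0, w, h)}
    return image, mask, label

image, mask, label = render_scene(16, 12, seed=7)
print(label)
```

No human annotation or post-processing step is involved: the mask covers exactly the pixels the generator drew.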

Read next

We recommend that you next read about the platform and why we call it ‘a platform.’

The Platform
