Creating Synthetic Data

Creating Synthetic Image Data

The image depicts a neoclassical building from inside

This image was generated synthetically with Stable Diffusion XL, using the code below. Note that I did not let the image converge fully. Doing so takes considerably longer, and the model this synthetic data was generated for still performed well, even though its test data consisted of real-world data. Other models also perform reasonably well, such as Stable Diffusion XL Turbo. Although the image quality was higher with the Turbo model, the variance was quite low: most images looked very similar to one another.

from diffusers import AutoPipelineForText2Image, DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")

# I am on a Mac, so I am using Metal to accelerate the image generation
pipe.to("mps")

# I set a couple properties that I don't want to be present in my image
negative_prompt = "blurry, low quality, low resolution, low fidelity, low definition, pixelated, bad lighting, bad quality, bad resolution, bad fidelity, bad definition"

# These are the properties I want my image to have
default = "architecture building from the corner, photorealistic, high quality"

# I want pictures of buildings from different epochs, from inside as well as outside.
neoclassical = "neoclassical"
greek_roman = "greek roman"
gothic = "gothic"

style = [neoclassical, greek_roman, gothic]

from_outside = "from outside"
from_inside = "from inside"

view = [from_outside, from_inside]

# This will show you one sample image per style and view:
import matplotlib.pyplot as plt

for s in style:
    for v in view:
        text = f"{s} architecture building {v}, {default}"
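        # Note: 2 steps with guidance_scale=0.0 are settings typically used with SDXL Turbo.
        # With the base model loaded above, more steps (as in the loop further down) give more usable previews.
        image = pipe(prompt=text, negative_prompt=negative_prompt, num_inference_steps=2, timestep_spacing='trailing', height=1024, width=1024, guidance_scale=0.0).images[0]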
        plt.title(f"{s} {v}")
        plt.axis('off')
        plt.imshow(image)
        plt.show()

# Now I am generating 100 images of each configuration:
import os

num_images = 100

# create the images folder if it does not exist
os.makedirs("images", exist_ok=True)

for num in range(num_images):
    for s in style:
        for v in view:
            text = f"{s} architecture building {v}, {default}"
            image = pipe(prompt=text, negative_prompt=negative_prompt, num_inference_steps=10, timestep_spacing='trailing', height=1024, width=1024).images[0]
            # alternative settings for faster, lower-resolution generation:
            # image = pipe(prompt=text, negative_prompt=negative_prompt, num_inference_steps=4, timestep_spacing='trailing', height=512, width=512, guidance_scale=0.0).images[0]
            # one subfolder per style/view combination
            os.makedirs(f"images/{s}_{v}", exist_ok=True)
            image.save(f"images/{s}_{v}/{s}_{v}_{num}.png")
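
The nested loops above enumerate every style/view combination for each image index. As a rough sketch of the same idea, the prompts and output paths can be planned up front, which makes it easy to check how many images will be produced before starting a long run. The helper name `build_jobs` and the per-image `seed` are my own assumptions for illustration and are not part of the original code.

```python
from itertools import product
from pathlib import Path


def build_jobs(styles, views, default, num_images, root="images"):
    """Return a list of (prompt, seed, output_path) tuples, one per image."""
    jobs = []
    for style, view in product(styles, views):
        prompt = f"{style} architecture building {view}, {default}"
        folder = Path(root) / f"{style}_{view}"
        for index in range(num_images):
            # Assumption: use the image index as a seed so each run is repeatable.
            seed = index
            jobs.append((prompt, seed, folder / f"{style}_{view}_{index}.png"))
    return jobs


jobs = build_jobs(
    styles=["neoclassical", "greek roman", "gothic"],
    views=["from outside", "from inside"],
    default="architecture building from the corner, photorealistic, high quality",
    num_images=100,
)
print(len(jobs))  # 3 styles * 2 views * 100 images
```

Each seed could then be handed to the pipeline through its `generator` argument (for example `torch.Generator().manual_seed(seed)`), so that a given file can be regenerated later with the same settings.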

Creating Synthetic Audio Data

In other scenarios you might want to generate audio data. While it is also possible to run audio generation models locally, I relied on the TTS API to generate my synthetic data. The process is straightforward: you provide a reference audio file and some text, and the TTS model handles the rest. This is all the code that's necessary:

from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# generate speech by cloning a voice using default settings
tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.", file_path="./output/output1.wav", speaker_wav="./input/audio.wav", language="en")

The audio quality is quite good. The only issue I encountered was with text that is deliberately misspelled, such as "gooood". For my keyword spotting algorithm, I wanted different emphasis on some characters to make the model more robust. In this scenario, the audio sometimes contained noise.
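
To illustrate the kind of emphasis variants described above, here is a minimal sketch that stretches the vowels of a keyword. The function `elongate` and its `max_repeat` parameter are hypothetical names I chose for this example; the original post does not show how the variants were produced.

```python
def elongate(word, max_repeat=3):
    """Return variants of `word` in which one vowel is repeated 2..max_repeat times."""
    vowels = "aeiou"
    variants = []
    for i, ch in enumerate(word):
        if ch.lower() in vowels:
            for n in range(2, max_repeat + 1):
                variants.append(word[:i] + ch * n + word[i + 1:])
    # Remove duplicates (e.g. from adjacent identical vowels) while keeping order.
    return list(dict.fromkeys(variants))


print(elongate("good"))  # ['goood', 'gooood']
```

Each variant could then be passed as the `text` argument of `tts.tts_to_file`, writing one output file per variant. Given the noise mentioned above, it is probably worth listening to a few samples before using them for training.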

Creating Synthetic Tabular Data

Generating tabular data on your local machine is also quite straightforward. I used the sdv library, which offers several methods for generating data. I did not use a neural network to generate the data. While neural networks such as variational autoencoders also work well for synthetic data generation, they take much longer to train, especially when you are dealing with high-dimensional data.

from sdv.single_table import GaussianCopulaSynthesizer, CTGANSynthesizer, CopulaGANSynthesizer
from sdv.metadata import SingleTableMetadata
from sdv.datasets.demo import download_demo
import pandas as pd
import numpy as np
from scipy.stats import entropy
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# `data` is assumed to be a pandas DataFrame holding the real table, loaded beforehand
# Initialize metadata and detect the column types from the data
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)

print("Starting to fit GaussianCopulaSynthesizer")
# Initialize and fit the GaussianCopulaSynthesizer
gc_model = GaussianCopulaSynthesizer(metadata)
gc_model.fit(data)
gc_synthetic_data = gc_model.sample(1000)
# convert metadata to a dict, the format the quality report expects
# (note: this overwrites the SingleTableMetadata object)
metadata = metadata.to_dict()
# save model
gc_model.save('gc_model.pkl')
# load model
gc_model_loaded = GaussianCopulaSynthesizer.load('gc_model.pkl')

# Compare the real data with a fresh sample from the reloaded model
from sdmetrics.reports.single_table import QualityReport

real_data = data
synthetic_data = gc_model_loaded.sample(1000)

my_report = QualityReport()
my_report.generate(real_data, synthetic_data, metadata)
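
To give a feeling for what such a quality score measures, here is a small standard-library sketch of a KS-complement score for a single numerical column: one minus the two-sample Kolmogorov-Smirnov statistic. As far as I know, sdmetrics uses a metric of this kind for numerical columns, but the function below is my own simplified illustration and not the library's implementation.

```python
from bisect import bisect_right


def ks_complement(real, synthetic):
    """Return 1 - (two-sample KS statistic); 1.0 means identical empirical distributions."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The largest gap between two empirical CDFs always occurs at a data point.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


print(ks_complement([1, 2, 3], [1, 2, 3]))     # 1.0
print(ks_complement([1, 2, 3], [10, 11, 12]))  # 0.0
```

A score close to 1.0 means the synthetic column follows nearly the same distribution as the real one, while a score close to 0.0 means the two barely overlap.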