In this project, we explore the usage of diffusion models for the purposes of image generation
and editing.
Part A: The Power of Diffusion Models!
As the first part of the project, we use the existing, pretrained DeepFloyd IF diffusion model
and precomputed text embeddings to create sampling loops and perform image editing tasks.
Part 1: Sampling Loops
The main idea behind our diffusion models is to train a neural net that can iteratively reverse
the noising process of an image. This way, we have a model that incrementally takes a noisy image
and moves it towards the desired image manifold, allowing us to generate new images.
1.1: Implementing the Forward Process
In order to do this, we first need to be able to add noise to images at a desired noise level.
The equation that achieves this is:
$$x_t = \sqrt{\bar{\alpha_t}}x_0 + \sqrt{1 - \bar{\alpha_t}}\epsilon$$
where \(\epsilon \sim N(0, I)\) is sampled at random. This is implemented in forward(im, t).
Our \(\bar{\alpha_t}\) value is taken from the array alphas_cumprod, which sets the noise
magnitude at each timestep. Below are the results of applying this forward method on an image
of the Campanile for \(t \in \{250, 500, 750\}\):
[Figures: image of the Campanile; noised at t=250; noised at t=500; noised at t=750.]
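The forward process above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the linear beta schedule in make_alphas_cumprod is a hypothetical stand-in for the model's real alphas_cumprod array, and the extra alphas_cumprod and rng arguments (and the returned noise) are additions for self-containment, not the signature used in the project.

```python
import numpy as np

def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear-beta schedule standing in for the model's alphas_cumprod."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward(im, t, alphas_cumprod, rng=None):
    """Noise a clean image x_0 to timestep t.

    Implements x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I).
    Returns both x_t and the sampled eps (the latter only for inspection).
    """
    rng = np.random.default_rng() if rng is None else rng
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(im.shape)
    x_t = np.sqrt(abar_t) * im + np.sqrt(1.0 - abar_t) * eps
    return x_t, eps
```

Because \(\bar{\alpha_t}\) shrinks as t grows, the signal term fades and the noise term dominates at larger timesteps, which matches the progression from t=250 to t=750.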
1.2: Classical Denoising
Before we use our pretrained neural nets to denoise the image, we first demonstrate the results
of classical denoising via a low-pass filter (a Gaussian blur). Below are the results for the
previously noised images:
[Figures: noised at t=250, t=500, t=750; Gaussian blur denoise at t=250, t=500, t=750.]
As we can see, this method isn't very effective, especially at high noise levels. Note that a
kernel size of 15 and a \(\sigma\) of 2 were used for the Gaussian blurs.
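For reference, a Gaussian blur with these parameters can be sketched as a separable convolution in NumPy. This is a minimal stand-in, not the routine used to produce the figures: the function names and the reflect padding at the borders are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel1d(kernel_size=15, sigma=2.0):
    """Normalized 1D Gaussian kernel centered on the middle tap."""
    x = np.arange(kernel_size) - (kernel_size - 1) / 2.0
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(im, kernel_size=15, sigma=2.0):
    """Blur an (H, W) or (H, W, C) image by filtering rows and then columns.

    Borders are handled with reflect padding (an assumption of this sketch).
    """
    k = gaussian_kernel1d(kernel_size, sigma)
    pad = kernel_size // 2
    out = im.astype(np.float64)
    for axis in (0, 1):
        pad_width = [(0, 0)] * out.ndim
        pad_width[axis] = (pad, pad)
        padded = np.pad(out, pad_width, mode="reflect")
        # "valid" convolution on the padded signal restores the original length.
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded
        )
    return out
```

The blur averages away high-frequency noise, but it removes high-frequency image detail just as readily, which is why the results degrade so quickly as the noise level rises.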
1.3: One-Step Denoising
Now we can actually try using our neural net to denoise the image. For now, we will try to
estimate the original image in one step. Our neural net gives us a noise estimate \(\epsilon\) given the
noised image and the timestep \(t\), and we can then recover the estimated original image by
solving for \(x_0\) in the forward process equation from part 1.1, giving:
$$x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha_t}}\epsilon}{\sqrt{\bar{\alpha_t}}}$$
Applying this formula, we get the following results:
[Figures: noised at t=250, t=500, t=750; one-step denoise at t=250, t=500, t=750.]
As we can see, these results are much better than those obtained using classical denoising in
part 1.2.
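The one-step estimate is just the forward equation inverted, which a short NumPy sketch makes concrete. The function name estimate_x0 and the value of abar_t are hypothetical, and the true noise is used in place of a network prediction so that the round trip can be checked exactly.

```python
import numpy as np

def estimate_x0(x_t, eps_hat, abar_t):
    """Solve x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps for x_0."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # stand-in "clean image"
eps = rng.standard_normal(x0.shape)           # true noise, used in place of a model output
abar_t = 0.3                                  # hypothetical value of alphas_cumprod[t]

x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x0_hat = estimate_x0(x_t, eps, abar_t)        # recovers x0 when the noise estimate is exact
```

With a real model the noise estimate is imperfect, and the error is amplified by the division by \(\sqrt{\bar{\alpha_t}}\), so the reconstruction drifts further from the original at higher noise levels.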
1.4: Iterative Denoising
We can further improve this denoising process by iteratively denoising the image, instead of
simply trying to do it in one pass. Each step can be thought of as a linear interpolation between
our current state and the estimated clean image from part 1.3. Additionally, instead
of strictly iterating step by step down the values of t, we can take strided steps. For these examples,
we take strides of 30 from 990 down to 0. The iterative step is defined below:
$$x_{t'} = \frac{\sqrt{\bar{\alpha_{t'}}}\beta_t}{1 - \bar{\alpha_t}}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha_{t'}})}{1-\bar{\alpha_t}}x_t + v_\sigma$$
where \(x_t\) is our current image, \(x_{t'}\) is the less noisy image at the next timestep \(t' < t\),
\(x_0\) is the estimate of the original image detailed in part 1.3, \(\bar{\alpha_t}\) is as defined
before from alphas_cumprod, \(\alpha_t = \frac{\bar\alpha_t}{\bar\alpha_{t'}}\),
\(\beta_t = 1 - \alpha_t\), and \(v_\sigma\) is a noise term that is also predicted by DeepFloyd.
Then, with t_start = 10 (that is, starting from t = 690), iterating these steps along our
strided_timesteps gives us the following results (with a comparison to our previous methods):
[Figures: denoised at t=90, t=240, t=390, t=540, t=690.]
[Figures: original; iteratively denoised; one-step denoise; Gaussian blurred denoise.]
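A single update of the iterative step can be sketched as follows. The function name denoise_step and its arguments are hypothetical, and v_sigma defaults to zero here as a simplification, whereas the project uses the noise term supplied by DeepFloyd.

```python
import numpy as np

def denoise_step(x_t, x0_hat, abar_t, abar_prev, v_sigma=0.0):
    """One strided update from timestep t to the less noisy timestep t'.

    abar_t and abar_prev are alphas_cumprod[t] and alphas_cumprod[t'].
    """
    alpha_t = abar_t / abar_prev
    beta_t = 1.0 - alpha_t
    coef_x0 = np.sqrt(abar_prev) * beta_t / (1.0 - abar_t)          # weight on the clean estimate
    coef_xt = np.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)  # weight on the current image
    return coef_x0 * x0_hat + coef_xt * x_t + v_sigma

# Strided schedule described in the text: 990, 960, ..., 30, 0.
strided_timesteps = list(range(990, -1, -30))
```

Two properties are easy to check by hand. When \(\bar\alpha_{t'} = 1\) (a fully clean target) the update returns the clean estimate itself, and index 10 of the strided schedule is t = 690, the noisiest panel in the figures above.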
1.5: Diffusion Model Sampling
We can also use our iterative_denoise function from part 1.4 to generate new images. This is done
by setting t_start = 0 and passing in random noise as our image. Five images sampled using this
method are shown below:
[Figure: 5 sampled images using iterative denoising.]
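To illustrate how sampling from pure noise works, here is a minimal NumPy sketch of the loop. Everything model-specific is an assumption: the linear beta schedule stands in for the real alphas_cumprod, the signature of iterative_denoise is hypothetical, the \(v_\sigma\) term is omitted, and toy_predict_noise is an oracle that "knows" a fixed target image, used only so the loop can run without the DeepFloyd network.

```python
import numpy as np

# Hypothetical linear-beta schedule standing in for the model's alphas_cumprod.
alphas_cumprod = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
strided_timesteps = list(range(990, -1, -30))

def iterative_denoise(x, t_start, predict_noise, timesteps=strided_timesteps):
    """Apply the strided update from timesteps[t_start] down to the final timestep.

    predict_noise(x_t, t) is any callable returning a noise estimate.
    The v_sigma noise term is left out of this sketch.
    """
    for t, t_prev in zip(timesteps[t_start:-1], timesteps[t_start + 1:]):
        abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = abar_t / abar_prev
        beta_t = 1.0 - alpha_t
        eps_hat = predict_noise(x, t)
        # One-step clean-image estimate, then interpolate toward it.
        x0_hat = (x - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
        x = (np.sqrt(abar_prev) * beta_t / (1.0 - abar_t)) * x0_hat \
            + (np.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)) * x
    return x

rng = np.random.default_rng(0)
target = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # toy "image" the oracle steers toward

def toy_predict_noise(x_t, t):
    """Oracle noise estimate for a known target; a real run would call the network here."""
    abar_t = alphas_cumprod[t]
    return (x_t - np.sqrt(abar_t) * target) / np.sqrt(1.0 - abar_t)

# Sampling: start from pure Gaussian noise at the first (noisiest) timestep.
sample = iterative_denoise(rng.standard_normal(target.shape), t_start=0,
                           predict_noise=toy_predict_noise)
```

With the oracle, the loop contracts the initial noise at every step and lands very close to the target. With a trained network there is no fixed target, so each random starting noise is carried to a different point on the image manifold.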