Original Author: Steven Wang
Introduction
Two brothers, N.Coder and D.Coder, run an art gallery. One weekend, they held a very strange exhibition, as it had only one wall and no physical artworks. When they received a new painting, N.Coder chose a point on the wall to represent it and discarded the original artwork. When customers requested to view the painting, D.Coder attempted to recreate the artwork using only the coordinates of the relevant marks on the wall.
The exhibition wall is shown in the following figure, with each black dot representing a mark placed by N.Coder for one painting. For example, the original artwork marked at coordinate [-3.5, 1] on the wall was reconstructed by D.Coder from that single point.
The following figure shows more examples: the top row shows the original artworks, the middle row the coordinates at which N.Coder marked each image on the wall, and the bottom row the works recreated by D.Coder from those coordinates alone.
The question is: how does N.Coder choose the coordinates on the exhibition wall for each painting so that D.Coder can rebuild the original artwork using only those coordinates? It turns out that, while placing marks and reconstructing artworks, the two brothers carefully tracked the income they lost when customers demanded refunds because of poor reconstruction quality. After years of this "training," they gradually became "proficient" at placing marks and reconstructing artworks in a way that minimized this loss of income. The comparison between the originals and the reconstructions in the figure above shows that the brothers cooperate quite well: very few visitors complain that the artworks D.Coder recreates differ noticeably from the originals they came to see.
One day, N.Coder looked at the exhibition wall and had a bold idea: what kind of artwork would appear if D.Coder reconstructed from the parts of the wall that had no marks at all? If it worked, the brothers could hold a 100% original art exhibition of their own. Excited by the thought, D.Coder picked some previously unmarked coordinates at random (the red dots) and reconstructed from them; the results are shown in the image below.
As you can see, the reconstruction is not very successful, and some images can't even be recognized as numbers. So what went wrong, and how can the Coder brothers improve their plan?
1. Autoencoder
The story at the beginning is an analogy for an autoencoder. N.Coder, the encoder, is responsible for converting images into coordinates, while D.Coder, the decoder, is responsible for reconstructing images from those coordinates. The income loss that the two brothers monitored is the loss function used during model training.
Now, let's look at a more rigorous description of the autoencoder. It is essentially a neural network that consists of:
An encoder: Used to compress high-dimensional data into a low-dimensional representation vector.
A decoder: Used to reconstruct the low-dimensional representation vector back into high-dimensional data.
The process is illustrated in the diagram below. The original input is the image data, which contains many pixels and is therefore high-dimensional. The representation vector is low-dimensional; in this example it is the two-dimensional vector [-2.0, -0.5].
This network is trained to find the weights of the encoder and decoder that minimize the loss between the original input and the reconstruction after passing through the encoder and decoder. The latent vector is a compressed representation of the original image in a lower-dimensional latent space. By selecting any point in the latent space, we should be able to generate a new image by passing that point to the decoder, as the decoder has learned how to transform points in the latent space into visible images.
In the introductory story, N.Coder and D.Coder use vectors in a two-dimensional latent space (the wall) to encode each image. Two dimensions were chosen so that the latent space can be visualized; in practice the latent space usually has more dimensions, giving it more freedom to capture subtle differences between images.
2. Dissecting the Model
2.1 First Encounter
In general, it is preferable to define the model as a class in a separate file, such as the Autoencoder class outlined below, so that other projects can easily reuse it. The code first lays out the framework of Autoencoder: __init__() is the constructor, _build() creates the model, compile() sets the optimizer, save() saves the model, load_weights() reloads previously saved weights, and train() trains the model.
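The original listing is not reproduced here, so the following is only a minimal sketch of what such a class skeleton might look like in Keras; the parameter names and the two default flags (use_batch_norm, use_dropout) are assumptions rather than taken from the source.

```python
class Autoencoder:
    def __init__(self,
                 input_dim,                   # image shape, e.g. (28, 28, 1)
                 encoder_conv_filters,        # filters of each encoder layer
                 encoder_conv_kernel_size,    # kernel size of each encoder layer
                 encoder_conv_strides,        # stride of each encoder layer
                 decoder_conv_t_filters,      # filters of each decoder layer
                 decoder_conv_t_kernel_size,  # kernel size of each decoder layer
                 decoder_conv_t_strides,      # stride of each decoder layer
                 z_dim,                       # dimension of the latent space
                 use_batch_norm=False,        # assumed default parameter
                 use_dropout=False):          # assumed default parameter
        self.input_dim = input_dim
        self.encoder_conv_filters = encoder_conv_filters
        self.encoder_conv_kernel_size = encoder_conv_kernel_size
        self.encoder_conv_strides = encoder_conv_strides
        self.decoder_conv_t_filters = decoder_conv_t_filters
        self.decoder_conv_t_kernel_size = decoder_conv_t_kernel_size
        self.decoder_conv_t_strides = decoder_conv_t_strides
        self.z_dim = z_dim
        self.use_batch_norm = use_batch_norm
        self.use_dropout = use_dropout
        self._build()                         # build encoder, decoder and full model

    def _build(self):
        ...  # construct the encoder, the decoder, and connect them (Sections 2.2-2.4)

    def compile(self, learning_rate):
        ...  # set the optimizer and the loss function (Section 2.5)

    def train(self, x_train, batch_size, epochs):
        ...  # fit the model on the training images (Section 2.5)

    def save(self, folder):
        ...  # save the model architecture and weights

    def load_weights(self, filepath):
        ...  # reload previously saved weights
```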
The constructor has 8 required parameters and 2 default parameters. input_dim is the dimension of the image, z_dim is the dimension of the latent space, and the remaining 6 required parameters specify, for the encoder and the decoder respectively, the number of filters, the kernel size, and the stride of each layer.
Use the constructor to create an autoencoder named AE, as in the example below. The input data are black-and-white images with shape (28, 28, 1). The latent space is a 2D plane, so z_dim = 2. In addition, the six remaining parameters are lists of length 4, so both the encoding model and the decoding model have 4 layers.
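A hedged example of such a call; the exact filter, kernel, and stride values are not given in the text, and the lists below are merely illustrative choices consistent with the (7, 7, 64) shape mentioned in Section 2.2.

```python
# illustrative values; only input_dim, z_dim and the list lengths are fixed by the text
AE = Autoencoder(
    input_dim=(28, 28, 1),                   # black-and-white 28x28 images
    encoder_conv_filters=[32, 64, 64, 64],   # 4 encoder layers
    encoder_conv_kernel_size=[3, 3, 3, 3],
    encoder_conv_strides=[1, 2, 2, 1],       # 28 -> 28 -> 14 -> 7 -> 7
    decoder_conv_t_filters=[64, 64, 32, 1],  # 4 decoder layers, mirroring the encoder
    decoder_conv_t_kernel_size=[3, 3, 3, 3],
    decoder_conv_t_strides=[1, 2, 2, 1],     # 7 -> 7 -> 14 -> 28 -> 28
    z_dim=2,                                 # the latent space is a 2D plane
)
```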
Inside the Autoencoder class, define the _build() function to construct the encoder and the decoder and connect them, as in the code framework below (the next three sections analyze it step by step):
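A minimal sketch of that framework, with comments as placeholders for the three parts analyzed next:

```python
def _build(self):
    # --- encoding model (Section 2.2): image -> point in the latent space ---
    #     convolutional layers -> flatten -> dense layer of size z_dim

    # --- decoding model (Section 2.3): point in the latent space -> image ---
    #     dense layer -> reshape -> convolutional transpose layers -> sigmoid output

    # --- connect the two (Section 2.4): model_input -> encoder -> decoder -> model_output ---
    pass
```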
In the next two sections, we will analyze the encoding model and decoding model in the autoencoder.
2.2 Encoding Model
The task of the encoder is to convert the input image into a point in the latent space. The implementation of the encoding model in the _build() function is as follows:
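Since the original listing is shown as an image in the source, here is a minimal Keras sketch of this part of _build(); the LeakyReLU activations are an assumption made to mirror the decoder described in Section 2.3.

```python
from tensorflow.keras.layers import Input, Conv2D, LeakyReLU, Flatten, Dense
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K

# the image is the input of the encoder
encoder_input = Input(shape=self.input_dim, name='encoder_input')
x = encoder_input

# stack the convolutional layers in order
for i in range(len(self.encoder_conv_filters)):
    x = Conv2D(filters=self.encoder_conv_filters[i],
               kernel_size=self.encoder_conv_kernel_size[i],
               strides=self.encoder_conv_strides[i],
               padding='same',
               name='encoder_conv_' + str(i))(x)
    x = LeakyReLU()(x)                     # activation assumed, mirroring the decoder

# record the shape of x: K.int_shape returns (None, 7, 7, 64), whose 0th element is
# the sample size, so [1:] keeps only the data shape (7, 7, 64)
shape_before_flattening = K.int_shape(x)[1:]

# flatten the last convolutional output into a 1D vector
x = Flatten()(x)

# a dense layer converts this vector into a vector of size z_dim
encoder_output = Dense(self.z_dim, name='encoder_output')(x)

# build the encoder model from encoder_input and encoder_output
self.encoder = Model(encoder_input, encoder_output)
```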
Code explanation:
The image is defined as the input of the encoder.
The convolutional layers are stacked in order.
Before flattening, the shape of x is recorded: K.int_shape returns the tuple (None, 7, 7, 64), whose 0th element is the sample size, so [1:] keeps only the data shape (7, 7, 64).
The output of the last convolutional layer is flattened into a 1D vector.
A dense layer converts this vector into another 1D vector of size z_dim.
Finally, the encoder model is built by passing encoder_input and encoder_output to the Model() function.
Use the summary() function to print a description of the encoding model: the name and type (Layer (type)), output shape (Output Shape), and number of parameters (Param #) of each layer.
2.3 Decoding Model
The decoder is a mirror image of the encoder, except that instead of using convolutional layers, it uses convolutional transpose layers to construct the model. When the stride is set to 2, the convolutional layer halves the height and width of the image each time, while the convolutional transpose layer doubles the height and width of the image. The specific operation is shown in the following image.
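A quick numerical check of this halving/doubling behaviour (the filter count of 16 is arbitrary):

```python
import numpy as np
from tensorflow.keras.layers import Conv2D, Conv2DTranspose

x = np.zeros((1, 28, 28, 1), dtype='float32')               # one 28x28 one-channel image

y = Conv2D(16, kernel_size=3, strides=2, padding='same')(x)
print(y.shape)   # (1, 14, 14, 16): a stride-2 convolution halves height and width

z = Conv2DTranspose(16, kernel_size=3, strides=2, padding='same')(y)
print(z.shape)   # (1, 28, 28, 16): a stride-2 transposed convolution doubles them
```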
The implementation of the decoder in the _build() function is as follows:
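Again, a minimal Keras sketch rather than the original listing; the dropout rate and the reuse of the shape_before_flattening variable recorded in the encoder part are assumptions.

```python
from tensorflow.keras.layers import (Input, Dense, Reshape, Conv2DTranspose,
                                     LeakyReLU, BatchNormalization, Dropout, Activation)
from tensorflow.keras.models import Model
import numpy as np

# the input of the decoder is a point in the latent space (the encoder's output)
decoder_input = Input(shape=(self.z_dim,), name='decoder_input')

# a dense layer plus a reshape turn the 1D latent vector back into a (7, 7, 64) tensor
x = Dense(int(np.prod(shape_before_flattening)))(decoder_input)
x = Reshape(shape_before_flattening)(x)

# stack the convolutional transpose layers in order
n_layers = len(self.decoder_conv_t_filters)
for i in range(n_layers):
    x = Conv2DTranspose(filters=self.decoder_conv_t_filters[i],
                        kernel_size=self.decoder_conv_t_kernel_size[i],
                        strides=self.decoder_conv_t_strides[i],
                        padding='same',
                        name='decoder_conv_t_' + str(i))(x)
    if i < n_layers - 1:
        # not the last layer: leaky ReLU, plus batch normalization and dropout
        x = LeakyReLU()(x)
        if self.use_batch_norm:
            x = BatchNormalization()(x)
        if self.use_dropout:
            x = Dropout(rate=0.25)(x)      # rate assumed
    else:
        # last layer: sigmoid squashes every pixel into the range 0-1
        x = Activation('sigmoid')(x)

decoder_output = x

# build the decoder model from decoder_input and decoder_output
self.decoder = Model(decoder_input, decoder_output)
```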
The code is explained as follows:
The input of the decoder is defined first; it takes the output of the encoder, i.e. a point in the latent space.
A dense layer and a reshape turn that 1D latent vector back into a tensor with shape (7, 7, 64).
The convolutional transpose layers are then stacked sequentially. For each layer:
If it is the last layer, a sigmoid activation is applied, so every output value lies between 0 and 1 and can be read as a pixel.
If it is not the last layer, a leaky ReLU activation is applied, and batch normalization and dropout are added.
Finally, the decoder model is constructed by passing decoder_input and decoder_output to the Model() function. The former is the output of the encoder, i.e. the point in the latent space, and the latter is the reconstructed image.
Print the information of the decoding model using the summary() function.
2.4 Connecting Together
In order to train both the encoder and decoder together, we need to connect them.
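A minimal sketch of this connection step, reusing encoder_input and encoder_output from the encoder sketch above:

```python
# the input of the encoder is the input of the whole model
model_input = encoder_input

# feeding the encoder's output through the decoder gives the whole model's output
model_output = self.decoder(encoder_output)

# build the autoencoder model from model_input and model_output
self.model = Model(model_input, model_output)
```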
The code is explained as follows:
encoder_input serves as the input model_input of the whole model (the intermediate product encoder_output is the output of the encoder).
The output of the decoder serves as the output model_output of the whole model (the input of the decoder is the output of the encoder).
The autoencoder model is then constructed by passing model_input and model_output to the Model() function.
A picture is worth a thousand words.
2.5 Training Model
After building the model, you only need to define the loss function and compile the model with an optimizer. The loss function is usually chosen as the root mean square error (RMSE) between the original and reconstructed pixels. The implementation of the compile() function, using the Adam optimizer with a learning rate of 0.0005, is as follows:
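A minimal sketch of such a compile() method; the rmse_loss helper is a hypothetical name, written so that the loss matches the RMSE mentioned above.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import backend as K


def compile(self, learning_rate=0.0005):
    optimizer = Adam(learning_rate=learning_rate)

    # per-sample root mean square error between original and reconstructed pixels
    def rmse_loss(y_true, y_pred):
        return K.sqrt(K.mean(K.square(y_true - y_pred), axis=[1, 2, 3]))

    self.model.compile(optimizer=optimizer, loss=rmse_loss)
```

Calling AE.compile(learning_rate=0.0005) then prepares the model for training.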
The fit() function is used to train the model, with a batch size of 32 and 200 epochs. The code is as follows:
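A minimal sketch of such a train() method; x_train is assumed to hold the normalized training images.

```python
def train(self, x_train, batch_size=32, epochs=200):
    # an autoencoder reconstructs its own input, so x_train is both input and target
    self.model.fit(x_train, x_train,
                   batch_size=batch_size,
                   epochs=epochs,
                   shuffle=True)


# usage, assuming x_train holds the normalized training images:
# AE.train(x_train, batch_size=32, epochs=200)
```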
Let's take a look at the reconstructions of 10 images randomly selected from the test set:
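A sketch of how such a comparison could be produced, assuming x_test holds the normalized test images and AE is the trained autoencoder from above:

```python
import numpy as np
import matplotlib.pyplot as plt

# pick 10 images at random from the test set
n = 10
idx = np.random.choice(len(x_test), n, replace=False)
originals = x_test[idx]

# encode the images into 2D latent points, then decode them back into images
z_points = AE.encoder.predict(originals)
reconstructions = AE.decoder.predict(z_points)

# originals on the top row, reconstructions on the bottom row
fig, axes = plt.subplots(2, n, figsize=(2 * n, 4))
for i in range(n):
    axes[0, i].imshow(originals[i].squeeze(), cmap='gray')
    axes[1, i].imshow(reconstructions[i].squeeze(), cmap='gray')
    axes[0, i].axis('off')
    axes[1, i].axis('off')
plt.show()
```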
Only 4 out of 10 images have decent reconstruction results.
3. Three Major Defects
After training, we can visualize how the images are distributed in the latent space: use the encoder to generate coordinates for the test set and display them on a 2D scatter plot.
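A sketch of how such a scatter plot could be drawn, assuming x_test and y_test hold the test images and their digit labels:

```python
import matplotlib.pyplot as plt

# map every test image to its 2D latent coordinates
z_points = AE.encoder.predict(x_test)

# one dot per image, colored by its digit label
plt.figure(figsize=(8, 8))
plt.scatter(z_points[:, 0], z_points[:, 1], c=y_test, cmap='rainbow', s=3, alpha=0.5)
plt.colorbar()
plt.xlabel('z[0]')
plt.ylabel('z[1]')
plt.show()
```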
There are three phenomena worth noting in the image:
Some digits occupy a small area, such as the red 9, while others occupy a large area, such as the purple 0.
The points are not distributed symmetrically around (0, 0); for example, there are far more points with negative x than with positive x, and some reach as far as x = -15.
There are large gaps between the colored clusters that contain very few points, as in the upper-left corner of the image.
These three defects make sampling from the latent space very difficult:
For defect 1, since the purple 0 occupies a much larger area than the red 9, a randomly sampled point is far more likely to decode into a 0 than a 9, so random sampling is biased toward digits that spread over large areas.
For defect 2, we can technically sample any point on the plane, but we do not know in advance how the encoded digits are distributed; when the distribution is unbounded and asymmetric around the origin, choosing a sensible region to sample from becomes difficult.
For defect 3, as the figure below shows, points taken from the blank regions of the latent space cannot be reconstructed into well-formed digits.
It is understandable that the blank space in defect 3 cannot be reconstructed into digits, but the reconstructions marked by the two red lines in the figure below are more worrying: these two points are not in the blank space, yet they still cannot be decoded into recognizable digits. The root cause is that the autoencoder does not enforce continuity in the latent space it learns. For example, even if (2, -2) decodes into a satisfactory digit 4, the model has no mechanism to ensure that the nearby point (2.1, -2.1) also decodes into a satisfactory 4.
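A sketch of how such random sampling could be reproduced; the sampling bounds below are arbitrary and only meant to cover roughly the region seen in the scatter plot:

```python
import numpy as np
import matplotlib.pyplot as plt

# sample points uniformly from a box that roughly covers the scatter plot above
n = 10
random_points = np.random.uniform(low=[-15.0, -10.0], high=[10.0, 10.0], size=(n, 2))

# decode them; points in sparse or empty regions tend to give unrecognizable digits
generated = AE.decoder.predict(random_points)

fig, axes = plt.subplots(1, n, figsize=(2 * n, 2))
for i in range(n):
    axes[i].imshow(generated[i].squeeze(), cmap='gray')
    axes[i].axis('off')
plt.show()
```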
Summary
An autoencoder needs only features, not labels; it is an unsupervised learning model used for data reconstruction. It can also be used as a generative model, but as the previous section showed, it does not generate convincing results even for relatively low-dimensional grayscale digit images, so its performance on high-dimensional color images such as faces would be even worse.
The autoencoder framework itself is sound, so how can we fix these three defects and build a more powerful autoencoder? That is the topic of the next section: the Variational Autoencoder (VAE).
