Extreme Style Machines:
Using Random Neural Networks to Generate Textures

Wait, what! Generating high-quality images based on completely random neural networks? That’s the unreasonable effectiveness of deep representations…
(This research project was possible thanks to the nucl.ai Conference [1].)

Synthesizing high-quality images with deep learning currently relies on neural networks trained extensively for classification on millions of images. It takes days! Most neural networks used for classification are very big too, which makes the generative algorithms even slower. It’s a huge problem for adoption since it takes a lot of resources to train these models, so scaling quality up and down is difficult.

While trying to train a more efficient neural network for Neural Doodle using current best practices, I stumbled on a random discovery… I found I could use completely random neural networks as feature detectors and still get compelling results! Think of this as a form of reservoir computing, similar to Extreme Learning Machines, which have known limitations but could help out here.

After investigating this, I identified that two big architectural decisions were required for this to work at all. The details follow below, but here’s the punchline:

  1. Exponential Linear Units as activation functions.
  2. Strided Convolution as down-sampling strategy.

The underlying image generation algorithm is from a paper I call Neural Patches — which is basically brute force nearest neighbor matching of 3×3 patterns — but using the neural network’s post-activation outputs (conv3_1 and conv4_1) rather than doing operations in image space. This tends to improve the quality of the results significantly, and you’ll see both good and bad examples below.
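To make that patch matching step concrete, here’s a minimal NumPy sketch of brute force nearest neighbor matching over 3×3 patches of a feature map. The function names and the cosine-similarity scoring are my own illustration, not the exact code from the repository.

import numpy as np

def extract_patches(features, size=3):
    # Collect every overlapping size x size patch from a (channels, height, width) feature map.
    C, H, W = features.shape
    patches = []
    for y in range(H - size + 1):
        for x in range(W - size + 1):
            patches.append(features[:, y:y+size, x:x+size].ravel())
    return np.asarray(patches)

def nearest_style_patches(output_features, style_features, size=3):
    # For each patch of the synthesized image's features, find the most similar
    # style patch by normalized cross-correlation (cosine similarity).
    out = extract_patches(output_features, size)
    sty = extract_patches(style_features, size)
    out_n = out / (np.linalg.norm(out, axis=1, keepdims=True) + 1e-8)
    sty_n = sty / (np.linalg.norm(sty, axis=1, keepdims=True) + 1e-8)
    scores = out_n.dot(sty_n.T)
    best = scores.argmax(axis=1)
    return sty[best]  # these patches become the matching targets for the loss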

WARNING: This research was done in less than 24h. I’m writing this blog post already because I Tweeted about early results and everyone is excited — so I can’t keep this contained ;-) If you’d like to collaborate on further research and writing a full paper, let me know!

Experiment Setup

The main focus of this report is on example-based texture generation. The neural network is given an input photograph (grass, dirt, brick) and must re-synthesize a new output image from random noise. This is a great application because it’s the easiest problem in image synthesis, and it only involves a single loss component, which keeps things nice and simple! Other types of image synthesis don’t work quite as well with these Extreme Style Machines (yet?).

48 units,  3x3 shape              # conv1_1
48 units,  3x3 shape              # conv1_2
80 units,  2x2 shape, stride 2x2  # conv2_1
80 units,  3x3 shape              # conv2_2
112 units, 2x2 shape, stride 2x2  # conv3_1
112 units, 3x3 shape              # conv3_2
112 units, 3x3 shape              # conv3_3
176 units, 2x2 shape, stride 2x2  # conv4_1
176 units, 3x3 shape              # conv4_2
176 units, 3x3 shape              # conv4_3
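As a rough sketch, here is how such an untrained stack could be assembled with Lasagne. It assumes lasagne.nonlinearities.elu is available (it is in recent versions) and is not the exact code from the repository.

from lasagne.layers import InputLayer, Conv2DLayer
from lasagne.nonlinearities import elu
from lasagne.init import GlorotUniform

def build_random_network(input_var=None):
    # Untrained convolutional stack mirroring the listing above: ELU activations,
    # 2x2 strided convolutions for down-sampling, and random GlorotUniform weights.
    net = InputLayer((None, 3, None, None), input_var)
    spec = [(48, 3, 1, 1), (48, 3, 1, 1),                     # conv1_*
            (80, 2, 2, 0), (80, 3, 1, 1),                     # conv2_*
            (112, 2, 2, 0), (112, 3, 1, 1), (112, 3, 1, 1),   # conv3_*
            (176, 2, 2, 0), (176, 3, 1, 1), (176, 3, 1, 1)]   # conv4_*
    for filters, size, stride, pad in spec:
        net = Conv2DLayer(net, num_filters=filters, filter_size=size,
                          stride=stride, pad=pad,
                          W=GlorotUniform(), nonlinearity=elu)
    return net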

The other parameters were as follows: a total of four scales were processed using --phases=4, each with ten optimization steps via --iterations=10. The style weight was set higher than usual with --style-weight=100.0, the patch variety was kept small for visual fidelity with --variety=0.1, and total variation smoothing was set to a very small value with --smoothness=0.1.

Weight Initialization

The weights are initialized with the Lasagne library’s default, which is GlorotUniform. See the source code for more details, or read the original paper on the subject.

No experiments have been performed on the type of weight initialization yet, though any approach that adds more diversity to the weight matrices should improve the overall quality. This points towards orthogonal initialization strategies as a great option!
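As a quick illustration of the two options, here are plain NumPy versions of Glorot uniform and orthogonal initialization; the helper names are my own and the orthogonal variant follows the usual SVD-based recipe.

import numpy as np

def glorot_uniform(fan_in, fan_out, rng=np.random):
    # Glorot/Xavier uniform: sample from U(-limit, +limit).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def orthogonal(fan_in, fan_out, gain=1.0, rng=np.random):
    # Orthogonal initialization: the weight matrix has orthonormal rows/columns,
    # which maximizes the diversity between the random filters.
    a = rng.normal(0.0, 1.0, size=(fan_in, fan_out))
    u, _, v = np.linalg.svd(a, full_matrices=False)
    q = u if u.shape == (fan_in, fan_out) else v
    return gain * q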

Source Code, Images, Scripts

You can find the code in the random branch of the Neural Doodle repository. Here are the details for downloading the images and running the script.

python3 doodle.py --style GrassPhoto.jpg --output GrassTest.png \
        --iterations=10 --phases=4 \
        --style-weight=100.0 --variety=0.1 --smoothness=0.1

Activation Function

As each image is processed by the neural network, the values accumulated in each neuron are passed through a non-linear function called an activation. This section compares the effect of different activation functions: Rectified Linear (the standard approach that VGG uses), Leaky Rectifier (used in recent adversarial architectures, here a very leaky version), and the more recent Exponential Linear Units. I was experimenting with ELU because of its beautiful function shape, which looks well suited to image synthesis, and the fact that it could reduce the need for batch normalization in deeper networks.
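For reference, here is what the three activations compute, written as plain NumPy functions; the leakiness of 1/3 follows Lasagne’s very leaky rectifier, and alpha=1 is the standard ELU setting.

import numpy as np

def rectify(x):
    # Standard ReLU: negative inputs are clamped to zero.
    return np.maximum(x, 0.0)

def very_leaky_rectify(x, leakiness=1.0 / 3.0):
    # Leaky ReLU with a large negative slope, so no information is fully discarded.
    return np.where(x > 0.0, x, leakiness * x)

def elu(x, alpha=1.0):
    # Exponential Linear Unit: identity for positive inputs,
    # smooth saturation towards -alpha for negative inputs.
    return np.where(x > 0.0, x, alpha * (np.exp(x) - 1.0))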

Rectified Linear

(Image results for Average Pool, Maximum Pool, and Strided Convolution.)

Very Leaky Rectifier

(Image results for Average Pool, Maximum Pool, and Strided Convolution.)

Exponential Linear

(Image results for Average Pool, Maximum Pool, and Strided Convolution.)

Speculative Explanation

It seems ReLU units discard too much information by setting the output to zero if the input is negative. LReLU, specifically the very leaky variant, does much better because no information is lost. However, ELU seems to do even better because the distribution of the output is more balanced by the time it reaches layers conv3_1 and conv4_1, as mentioned in the original paper.

This generative experiment is a good way to visualize just how good ELU is at doing its job! It’s possible batch normalization would work similarly; however, in the generative setting there are no batches, and it seems significantly easier to use the right activation function in the first place…
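As a rough sanity check of the “more balanced distribution” claim, you can push standard normal noise through each activation (using the functions from the sketch above) and compare the resulting means:

import numpy as np

x = np.random.normal(0.0, 1.0, size=1000000)
for name, fn in [('rectify', rectify),
                 ('very_leaky_rectify', very_leaky_rectify),
                 ('elu', elu)]:
    y = fn(x)
    print('%-20s mean=%+.3f std=%.3f' % (name, y.mean(), y.std()))
# ELU keeps the mean of the outputs much closer to zero than plain ReLU does,
# which is the balanced distribution effect described above.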

Down-Sampling Strategy

When images are processed by neural networks, the internal representation gets smaller and smaller as it propagates through the network. This section compares the different ways of down-sampling the activations at regular intervals in the deep network. The options are average pooling (often used in generative architectures), max pooling (often used in classification), and strided convolutions (a recent favorite to avoid pooling altogether).
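As a sketch of the three options on a (channels, height, width) feature map, here are minimal NumPy versions of 2×2 average pooling, 2×2 max pooling, and a 2×2 strided convolution; the helper names are my own.

import numpy as np

def average_pool_2x2(features):
    # Average pooling: each 2x2 window is replaced by its mean.
    C, H, W = features.shape
    return features[:, :H//2*2, :W//2*2].reshape(C, H//2, 2, W//2, 2).mean(axis=(2, 4))

def max_pool_2x2(features):
    # Max pooling: each 2x2 window is replaced by its maximum.
    C, H, W = features.shape
    return features[:, :H//2*2, :W//2*2].reshape(C, H//2, 2, W//2, 2).max(axis=(2, 4))

def strided_conv_2x2(features, weights):
    # Strided convolution: a (random) 2x2 filter bank applied with stride 2,
    # so down-sampling and channel mixing happen in one step.
    # `weights` has shape (out_channels, in_channels, 2, 2).
    C, H, W = features.shape
    out = np.zeros((weights.shape[0], H // 2, W // 2))
    for y in range(H // 2):
        for x in range(W // 2):
            window = features[:, 2*y:2*y+2, 2*x:2*x+2]
            out[:, y, x] = np.tensordot(weights, window, axes=([1, 2, 3], [0, 1, 2]))
    return out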

Average Pooling

(Image results for Rectified Linear, Very Leaky Rectifier, and Exponential Linear.)

Maximum Pooling

(Image results for Rectified Linear, Very Leaky Rectifier, and Exponential Linear.)

Strided Convolution

(Image results for Rectified Linear, Very Leaky Rectifier, and Exponential Linear.)

Speculative Explanation

Averaging the activations causes the output to become blurry, particularly as the network gets deeper. The combination of input and random weights likely causes the activation values to converge to a constant value (0.0) as more depth is added. Max pooling makes the output crisper, but the diversity in the activations is probably also lost, which reduces the quality of the patch matching. Strided convolutions work well because the connections in their weights are also random.

Unit Numbers

As you can see from these trial runs, the number of neurons affects quality incrementally. With fewer neurons, the results are less predictable: certain patterns are not correctly “understood” by the algorithm and become patchy or noisy.

Full Units

Quarter Units

Half Units

Network Depth

The following experiments were run with half the units to see if depth improves or degrades the quality. Overall, from this informal analysis, it seems like additional layers degrade the quality.

Double Layers (Half Units)

Triple Layers (Half Units)

Technical Summary

The idea of deep random feature detectors is appealing: it avoids a lot of intensive training and helps adapt existing models to new constraints quickly, as well as scale up and down in quality on demand. Taken further, it could certainly help apply example-based neural style transfer techniques to domains where classifier models are not as strong as in the image domain.

Of course, there are known theoretical limitations to the “Extreme Learning Machines” model (see this reply by Soumith). Adding depth doesn’t help much when the first layers already have high-frequency coverage thanks to randomness, compared to deep learning with gradient descent, which typically works better with depth. However, there’s likely a balance to be found between using trained models (time consuming, significant investment) and relying on random networks (fast setup, low cost).

Short term, here are specific things to take away from this experiment:

  • Strided Convolution has become more popular for a variety of reasons, and it helps significantly here.
  • Exponential Linear Units are well placed to become the default activation for generative models.
  • Neural Architectures used for image processing have a very strong and useful prior built-in!

Random neural networks don’t seem to work very well for more advanced image synthesis operations like style transfer or even high-quality Neural Doodles; more research is required. However, for operations like example-based super-resolution of images, this approach could also work very effectively.

Alex J. Champandard

Addendum

This particular project has been fascinating! It’s been less than 24h since I randomly stumbled on a surprising result, and after Tweeting about it (with follow up), I was almost arm-wrestled into doing more experiments and posting the results. It’s not yet clear if and how more complex style transfer operations could work reliably for images or other media, but it’s a fascinating line of research nonetheless.

The process itself has also been interesting for me personally: while I specifically set out to investigate Exponential Linear Units using best practices like strided convolution, I never expected this to come out of the process. It feels appropriate to finish with a quote from Louis Pasteur:

“In the field of observation, chance favours only the prepared mind.”
— Louis Pasteur