In this article I discuss about a different way to "train" an artificial neural network. This technique is based on the translation of a neural network into into a constraint problem that can be solved by a constraint solver. I have worked on this method during my free time this year, just for fun because I was interested about the outcomes.
I've worked by the past on a binary analysis technique named concolic execution which implied to use a constraint solver. It solves constraint problems such as sudoku. My idea was to extend the use of constraint solver to another field, here artificial neural network.
Artificial neural networks are usually trained by making them converge to the "good answer". By make it guess then correct them. In order to do it a bunch of training data is required. The ANN will train faster depending on its shape (number of layers, number of nodes in each layer, bonds between nodes etc.) and decide its shape can sometime be a pretty though job.
Let's define a really simple neural network, two nodes in input layer, three in hidden layer and one in output. Basically the goal would be to solve binary xor calculus, here each node of a layer is connected to each node of the next layer.
Each node output is defined by the sum of its inputs weighted by the connection strength between the two nodes. Pretty straightforward here, the output of 2 is from then on:
In the same way, we can define the output formula of each node. Training an artificial neural network is basically finding a good combination of weight which is solving our problem.
The main idea behind this article is that we can easily define a whole neural network with a bunch of n equations, were n is the number of output nodes. The weights of bonds are our unknowns. The goal is to find them.
The previously described neural network shape is fully defined by this equation:
As usual, training data are mandatory in order to define what should be the output for each input set. With the set of training data, we can define the neural network with a new "dimension" of equation set. One equation for each output node for each training set. Where the value of the output and input node are from now on know.
These equations can be defined as a constraint problem (For the given formula, the given inputs and the given outputs let's find a combination of weights that fits all these constraints).
Let's come back to our example. For a Xor we got four different training set:
x0 x1 y5 0 ^ 0 = 0 0 ^ 1 = 1 1 ^ 0 = 1 1 ^ 1 = 0
From then on, we can describe the neural network with a system of four equations (four for each output node). Each x and y terms are replaced by the values of the training set.
The goal is to find a good combination of value for all the w terms that satisfies the four equalities.
Regardless of the different ways to solve equations, I decided to start with the easier one, use a constraint solver. It avoids a lot of issues such as isolation of terms. Moreover the complexity of these equations grows up exponentially with the size of the network and with the size of training data set. Using a constraint solver was in my mind the quickest way to experiment and get results.
I used Z3, a constraint solver developed my Microsoft, with which I have already worked on a previous project. It is fairly easy to use and has a C++ API.
However, one major issue with Z3 is that it doesn't have floats so I had to multiplied every value by 10^6 or something. Moreover, Z3 doesn't have complex functions neither, such as exp(). Wich force me to use a simple step function as activation function (if x > threshold then 1 else -1)
.
For each equation, a symbol for each node is created. Here I have chosen this synthax
(nodetype)(nodeId)_(trainingSetId) What gives us for instance for the hidden nodes: h0_1 h1_1 h2_1 h0_2 h1_2 h2_2 h0_3 h1_3 h2_3 h0_4 h1_4 h2_4
Then a symbol is created for each link with the syntax:
w(nodeFrom)_(nodeTo) w0_2 w0_3 w0_4 w1_2 w1_3 w1_4 w2_5 w3_5 w4_5
Notice that the weight symbols do not depend on the training set number, there are common to all the training set because they have to fit all the conditions.
The activation function is simply a step function. Z3 has ite()
, a function which implement branches. In our example it gives us something like:
condition then else ite(weigthedSum > 0, -1, 1);
Ofcourse the weightedSum
term is a z3::expr
made of the basic equation of a perceptron:
In the end we only need to add all these z3::expr
in the solver, check if the equation can be solved, then get the values of all the wx_y
terms.
Once the problem is solved by z3 the weights values are given by z3:
w0_2 = (/ 1.0 8.0) w1_2 = (/ 1.0 2.0) w0_3 = (- (/ 1.0 2.0)) w1_3 = (- (/ 1.0 2.0)) w0_4 = 1.0 w1_4 = 0.0 w2_5 = (- (/ 1.0 8.0)) w3_5 = (- 1.0) w4_5 = (- (/ 7.0 8.0))
I have made a tool which does all the job described before. It also contains a backend for C language which produces a basic function with the prototype void propagate(float *input, float *output)
.
This tool also evaluate the expression given by z3 (w0_3 = (- (/ 1.0 2.0))
-> -0.5
) and is built to receive different frontend in order to get various training data files as input.
A basic main function for our example looks like this (no frontend is used in this example):
int main() { std::vector<trainingData> data; data.resize(4); data[0].input.resize(2); data[1].input.resize(2); data[2].input.resize(2); data[3].input.resize(2); data[0].output.resize(1); data[1].output.resize(1); data[2].output.resize(1); data[3].output.resize(1); // Training data for Xor data[0].input[0] = -1; data[0].input[1] = -1; data[0].output[0] = -1; data[1].input[0] = 1; data[1].input[1] = -1; data[1].output[0] = 1; data[2].input[0] = -1; data[2].input[1] = 1; data[2].output[0] = 1; data[3].input[0] = 1; data[3].input[1] = 1; data[3].output[0] = -1; // Create a ANN with 2 Inputs, 3 Hiddens, 1 Output, -1 as min value, 1 as max value z3ann ann(2, 3, 1, -1, 1); // Remove all links ann.setAllLink(false); // Input0 ann.setLink(0, 2, true); ann.setLink(0, 3, true); ann.setLink(0, 4, true); // Input1 ann.setLink(1, 2, true); ann.setLink(1, 3, true); ann.setLink(1, 4, true); // hidden to output layer ann.setLink(2, 5, true); ann.setLink(3, 5, true); ann.setLink(4, 5, true); ann.loadTrainingDataSet(data); ann.solve(); std::cout << "========= C Backend Output ==========" << std::endl; BackendC output(ann); std::cout << output.getOutput(); return 0; }
I didn't succeed to use this method on a big enough neural network and so I can't precisely compare this method to the others. The conclusion is, therefore, still a little to much hypothetical. But I have noticed some points:
It ensures the accuracy of the neural network. We know for sure that at least for all the input data the network will answer the correct answer. An usual training can produce the same result as well if we consider to not stop the training until we get 100% of guess succeed.
It also lets us to know if the current shape of the neural network can solve our problem. If not, the constraint is unsatisfiable. The shape of the network can then automatically be mutated, until the equation is solvable.
Z3 is probably not the better solution in order to solve this kind of equation. In fact, experimentation shows that it consumes way to much memory because of the high complexity of problems to solve for a reasonable neural network.
# ./solver +-------------------------------+ | XOR example | +-------------------------------+ | Set 1: Input{-1, -1} | | Set 2: Input{ 1, -1} | | Set 3: Input{-1, 1} | | Set 4: Input{ 1, 1} | +-------------------------------+ | 0 -> -1 | activation : | | 1 -> 1 | -> Sum(Ii* Wij) > 0 | | | then 1 else -1 | +---------+---------------------+ | o2 | | / \ | | Input 1 -> o0 \ | | \ \ | | 3o---o5 -> Output| | / / | | Input 2 -> o1 / | | \ / | | o4 | +-------------------------------+ ===================================== ============== SOLUTION ============= =========== (on std error) ========== w0_2 = (/ 1.0 8.0) = 0.125 w1_2 = (/ 1.0 2.0) = 0.5 w0_3 = (- (/ 1.0 2.0)) = -0.5 w1_3 = (- (/ 1.0 2.0)) = -0.5 w0_4 = 1.0 = 1 w1_4 = 0.0 = 0 w2_5 = (- (/ 1.0 8.0)) = -0.125 w3_5 = (- 1.0) = -1 w4_5 = (- (/ 7.0 8.0)) = -0.875 ========= C Backend Output ========== void propagate(float *input, float *output) { output[0] = ((((((input[0]) > 0.f) ? 1.f : -1.f) * 0.125 + (((input[1]) > 0.f) ? 1.f : -1.f) * 0.5) > 0.f) ? 1.f : -1.f) * -0.125 + ((((((input[0]) > 0.f) ? 1.f : -1.f) * -0.5 + (((input[1]) > 0.f) ? 1.f : -1.f) * -0.5) > 0.f) ? 1.f : -1.f) * -1 + ((((((input[0]) > 0.f) ? 1.f : -1.f) * 1 + (((input[1]) > 0.f) ? 1.f : -1.f) * 0) > 0.f) ? 1.f : -1.f) * -0.875;}
Then the test of created C backend's function:
void propagate(float *input, float *output) { output[0] = ((((((input[0]) > 0.f) ? 1.f : -1.f) * 0.125 + (((input[1]) > 0.f) ? 1.f : -1.f) * 0.5) > 0.f) ? 1.f : -1.f) * -0.125 + ((((((input[0]) > 0.f) ? 1.f : -1.f) * -0.5 + (((input[1]) > 0.f) ? 1.f : -1.f) * -0.5) > 0.f) ? 1.f : -1.f) * -1 + ((((((input[0]) > 0.f) ? 1.f : -1.f) * 1 + (((input[1]) > 0.f) ? 1.f : -1.f) * 0) > 0.f) ? 1.f : -1.f) * -0.875; } int main() { float input[2]; float output; input[0] = 0; input[1] = 0; propagate(input, &output); printf("%f ^ %f = %f\n", input[0], input[1], output); input[0] = 0; input[1] = 1; propagate(input, &output); printf("%f ^ %f = %f\n", input[0], input[1], output); input[0] = 1; input[1] = 0; propagate(input, &output); printf("%f ^ %f = %f\n", input[0], input[1], output); input[0] = 1; input[1] = 1; propagate(input, &output); printf("%f ^ %f = %f\n", input[0], input[1], output); return 0; }
This program gives us:
# gcc main.c # ./a.out 0.000000 ^ 0.000000 = 0.000000 // <= 0 -> 0 [Ok] 0.000000 ^ 1.000000 = 1.750000 // > 0 -> 1 [Ok] 1.000000 ^ 0.000000 = 0.250000 // > 0 -> 1 [Ok] 1.000000 ^ 1.000000 = 0.000000 // <= 0 -> 0 [Ok] #