MACHINE LEARNING: How Black is This Black Box

Software is much better now than it was a few decades ago, and machine learning is no exception. The machine learning libraries are now mature and well tested, and the technology that surrounds them has improved too. As an engineer I became used to working with ever more high-level and abstract libraries, then frameworks that obscured the underlying operations, and finally complete black boxes. By black boxes I mean tools where I have little concern about what is going on inside.

I am still happy to use the calculator on my phone to find the square root of a number, or to use a built-in math function of whatever programming language I am using at the time. All of these use different approximations to solve the square root problem. The truth of the matter, however, is that for most purposes I am happy as long as the method I use comes up with a close estimate. Should we put our faith in black boxes?

Artificial neural networks generate rules by learning from examples. To learn, they adjust their weights and biases until the output matches a known result, creating their own example-based rules on the fly. The way they operate does not reveal how they evaluate and adjust those weights and biases.

[Figure: neural network layers]

\Delta W = -\lambda \frac{\partial E}{\partial W}
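
As a rough sketch of that update rule, assume a single linear neuron with a squared error and numpy; the neuron, data and learning rate here are toy assumptions for illustration, not anything from a real library.

import numpy as np

def update_weights(W, x, t, lam=0.1):
    # One gradient step: Delta W = -lam * dE/dW, with E = 0.5 * (y - t)**2
    y = W @ x                      # neuron output for input x
    error = y - t                  # dE/dy for squared error
    grad_E_wrt_W = error * x       # chain rule: dE/dW = (y - t) * x
    return W - lam * grad_E_wrt_W  # move the weights against the gradient

W = np.array([0.5, -0.2])
x = np.array([1.0, 2.0])
print(update_weights(W, x, t=1.0))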

Neural networks are systems that arrive at complex answers, and to do so they generate their own questions and rules. The lack of transparency about how those rules come about is seen by some as an issue. They argue that this lack of transparency, the black box problem, could impede the networks' usefulness. ANNs need to be able to explain themselves and tell us how they arrived at their answers.

Shining light into the boxes.

One way of doing this is to probe the network with test inputs and measure the impact of each input variable on the outputs. It is a bit like the interrogation technique of asking a subject a question to which you already know the answer in order to test their veracity.
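
A minimal sketch of such a probe, assuming numpy and a toy stand-in function in place of a real trained network, nudges one input variable at a time and records how far the output moves.

import numpy as np

def input_sensitivity(predict, x, delta=0.01):
    baseline = predict(x)
    impacts = []
    for i in range(len(x)):
        probe = x.copy()
        probe[i] += delta                           # nudge one input variable
        impacts.append(abs(predict(probe) - baseline) / delta)
    return np.array(impacts)                        # larger value = more influential input

def toy_predict(x):                                 # stands in for a trained network
    return np.tanh(x @ np.array([0.8, -0.3, 0.0]))

print(input_sensitivity(toy_predict, np.array([0.5, 0.2, 0.9])))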

With ANNs that use backpropagation this could be done by tracking the error terms during the backpropagation step and measuring how much each input affects the output.
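
A hedged sketch of that idea, assuming a tiny one-hidden-layer tanh network with random weights, carries the backward pass all the way down to the inputs so the resulting gradient shows how strongly each input drives the output.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output

def forward_and_input_grad(x):
    h = np.tanh(x @ W1 + b1)                    # hidden activations
    y = h @ W2 + b2                             # network output
    # Backpropagate from the output down to the inputs:
    grad_h = W2[:, 0]                           # dy/dh
    grad_x = W1 @ ((1 - h**2) * grad_h)         # chain rule through tanh: dy/dx
    return y, grad_x

y, saliency = forward_and_input_grad(np.array([0.2, -1.0, 0.5]))
print(y, saliency)    # larger |gradient| = input with more impact on the output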

Another way to learn to trust network-derived knowledge is to extract the rules that the network uses. The weights arise from transformations along a non-linear path between inputs and outputs, so the values that make up the knowledge are just series of numbers. One way of extracting rules would therefore be to translate those numerical values into symbolic form.

The most obvious method of extracting rules would be to turn the learning process on itself: not examining the weight matrices directly, but rendering the rules symbolically by comparing the input/output mappings and forming decision trees, classifiers that search for a series of rules which intelligently organise a given dataset.

A neural network takes inputs and outputs knowledge, and there may be a concern that we do not understand how it did this, nor can we see the rules it used. But the inputs the network uses and the outputs it creates together form a new dataset, so there is a possibility of forming both a symbolic and a visual representation of the process. Passing the inputs and the new outputs to a decision tree classifier would disclose how the network derived its knowledge, as well as giving us a way of visualising the rules it used.
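
A possible sketch of this, assuming scikit-learn is available and using a toy stand-in for the trained network's predictions, pairs the inputs with the network's outputs and fits a decision tree whose rules can then be printed.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))

def network_predict(X):                       # toy stand-in for the trained network
    return (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

surrogate = DecisionTreeClassifier(max_depth=3)
surrogate.fit(X, network_predict(X))          # learn rules that mimic the network
print(export_text(surrogate, feature_names=["x0", "x1", "x2"]))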

Why look for the way the network builds its rules at all? Is it redundant effort if the network has been trained properly in the first place? Yet the ability to extract symbolic knowledge has some potential advantages. The knowledge obtained from the network can lead to new insights into patterns and dependencies within the data, and from symbolic knowledge it is easier to see which features of the data are important to the results.

It is not too difficult to program a machine learning model to represent what it is doing as another model. The trouble is that the result tends to exist at a mundane or trivial level: it is time consuming and not really suitable for understanding large, complex structures. There is also the problem that the derived model may not provide a clearer understanding. If it reduced the rules the target network created, it might be too terse or give an incomplete description; the suspected danger is that it would instead amplify the rules and further obfuscate the understanding we sought.

Getting boxes to keep telling us what we know

The other way of dealing with black boxes is to make the network even more dense by adding more layers. This is entering the place where deep networks reside: the land of varied architectures including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Recursive Neural Networks, and the newer models of Decoupled Neural Interfaces (DNIs) and Differentiable Neural Computers (DNCs). Deep networks are very different from their traditional ANN relatives. One way to visualise how different they are is that deep networks, by their very nature, overfit to the data they learn from, the opposite of more traditional ANNs, which tend to aim for a more closely fitting curve.

Obtaining a good fit works when there are not too many parameters. Deep learning can take multiple inputs and pass them through multiple layers. Even with random weights, the hidden layers of a deep network are very powerful and able to represent highly nonlinear functions.

Most deep learning algorithms employ some form of optimisation to minimise or maximise a function f(x) by adjusting the value of x. The function to be adjusted is known as the objective, loss, error or cost function. Deep learning employs these functions to measure, and then reduce, how far the results (predictions, classifications, etc.) are from the target outputs. The aim is to minimise the loss function.

y = f(x)

The derivative of the function

 \frac { dy }{ dx }

gives the slope of f(x) at a given point, which specifies how a small change in the input scales into a corresponding change in the output.
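
A small worked sketch, using an assumed toy loss f(x) = (x - 3)^2, shows the derivative being used to step x downhill towards the minimum.

def f(x):
    return (x - 3) ** 2

def dfdx(x):
    return 2 * (x - 3)          # the derivative gives the slope at x

x, eta = 0.0, 0.1
for _ in range(50):
    x -= eta * dfdx(x)          # small change in x, scaled by the slope
print(x, f(x))                  # x approaches 3, the minimum of the loss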

Simple approaches to reducing the loss function work well when a large number of repetitions and layers are available. One recurring problem is that large training sets are required to provide reliable generalisations, and they in turn require a lot of compute resource.

W_i = W_i + \eta (t - f(z)) X_i

Here X is the set of input values X_i and W is the set of importance factors (weights), one for every value X_i. A positive weight means that the corresponding factor increases the probability of the outcome, while a negative weight means that it decreases the probability of that outcome. t is the target output value and η is the learning rate, whose role is to control how much the weights are modified at every iteration. f(z) is the output generated by the function that maps a large input domain to a small set of output values; in this case f(z) is the logistic function:

f(z) = \frac{1}{1 + e^{-z}}

z = x_0 w_0 + x_1 w_1 + x_2 w_2 + \dots + x_k w_k
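
A minimal sketch of this update rule, assuming numpy and toy values for the inputs and the target, might look like the following.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def update(W, X, t, eta=0.5):
    z = X @ W                               # z = x0*w0 + x1*w1 + ... + xk*wk
    return W + eta * (t - logistic(z)) * X  # weights move in proportion to the error

W = np.zeros(3)
X = np.array([1.0, 0.5, -0.2])              # one training example (toy values)
print(update(W, X, t=1.0))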

The Stochastic Gradient Descent (SGD) method is used for many machine learning models and is the prime algorithm used on ANNs; it also provides a simple way to produce accurate results in deep networks. Stochastic gradient descent can give an approximate estimate of the loss using a small set of samples, lessening the amount of compute required.

In the objective (loss) function
Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w)

we are looking to estimate the parameter w which minimises Q(w).

The gradient descent update is

w := w - \eta \nabla Q(w) = w - \frac{\eta}{n} \sum_{i=1}^{n} \nabla Q_i(w)

where η is the learning rate.

Stochastic gradient descent approximates this at any single sample as

w := w - \eta \nabla Q_i(w)

So the initial vector w is iterated, at learning rate η, over a number of passes with random shuffles of the training data. By making the connections random, faster training times are possible, as there is no need to train the weights of the hidden layers.
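
A hedged sketch of stochastic gradient descent, with an assumed toy linear model and a squared per-example loss, shuffles the data on each pass and updates w one sample at a time.

import numpy as np

rng = np.random.default_rng(2)
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = X @ true_w + 0.01 * rng.normal(size=200)

w, eta = np.zeros(2), 0.05
for epoch in range(20):
    for i in rng.permutation(len(X)):         # random shuffle of the training data
        grad_i = (X[i] @ w - y[i]) * X[i]     # gradient of Q_i at the current w
        w -= eta * grad_i                     # single-sample update
print(w)                                      # close to the true weights [2, -1]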

While our understanding of how convergence occurs, or how sound the derived generalisations are, remains vague, deep learning is best applied to tasks where we know what answers to expect but it is tiresome to find them. Common examples of such tasks (speech recognition, image classification, anomaly detection and character recognition) are not dealing with unknowns.

If a network recognises a character as an 'a' or a drawing as a fish, we can make an immediate assessment of its potential correctness. If the network recognises a random grey texture as a parrot, we can quickly see it has been fooled and is foolish. We do not need to know how it was fooled, because we use empiricism to test the network's accuracy. Knowing that it happens allows us to look at ways to add such features as weight regularisation.
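
As a small illustration of weight regularisation (a sketch with assumed toy values, not any particular library's API), an L2 penalty can be added to the loss so that large weights are discouraged.

import numpy as np

def loss_with_l2(pred, target, W, alpha=0.01):
    data_loss = 0.5 * np.mean((pred - target) ** 2)
    penalty = 0.5 * alpha * np.sum(W ** 2)    # discourages large weights
    return data_loss + penalty                # the gradient gains an extra alpha * W term

W = np.array([0.5, -1.2])
print(loss_with_l2(np.array([0.9]), np.array([1.0]), W))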

The reassurance of knowing how the hidden layers are working can be replaced by another kind of knowledge. The performance and scope of deep learning models replace the need to know the innards with pragmatic evaluation. Being able to view the results and performance achieved by deep reinforcement learning gives a somewhat pragmatic way of seeing what is occurring.

The solution to overcoming this fear of the unknown is then to develop a model that has knowledge of itself: a model that can recall, recount, reason and reflect; one that is self-aware and learns from what it has done historically; a model with the ability to represent variables and data structures and to store data over long timescales. While more complex to build architecturally, by providing the model with independently readable and writable memory, DNCs would be able to reveal more about their dark parts. So the way to deal with black boxes is to make them blacker. Science has a long tradition of learning from doing, using experiments to form theory.

Consider the two ends of sub-molecular physics: Gauge Theory and the Large Hadron Collider. Both are used to seek the same enlightenment. Scientists are discovering surprising results using the LHC, and the string theorists predict surprising particles. At times the two converge, seeking and sometimes revealing joined-up insights, with one informing the other. Theoretical understanding and practical application are paths which are not always identical, yet they may lead to the same places. The architectures, algorithms and models of machine learning follow similar paths.
