Below is an extract from a previous blog, ‘Do Data Scientists Dream of Electronic Sheep?’, at the end of which I said I would revisit its topics further. So it is put before you again, reworked. I have repeated sections of it (because I can), but mainly because the application I was working on when I first wrote it is still occupying my thoughts. The same thoughts keep popping up. A recurring dream, really, so I have updated the post to reflect a few things that have altered and a few new thoughts on the subject. The first time I asked this question (… do they dream of electric sheep?) I ended by saying I would look further into “Why do Philip K. Dick, Ridley Scott and Jean Baudrillard have some appeal for Data Scientists?”. Some may say it is just spinning an old post, but really it is more of a director’s cut: Rachael Rosen approved of this extended version of the original.
Looking to learn – image analytics
The application development has now reached the point where we are implementing machine learning on image data, using TensorFlow™, an open source software library for numerical computation using data flow graphs, recently released as open source by Google. One thing this technology allows you to do is use images as data entities in their own right, so we can search a dataset of images for an image that is like another image. The main implication of this is that we are able to build a lexicon of images: images exist as units of meaning in themselves, and as a collection they take on different meanings. Programming methods of looking at the entities as part of a collection makes classifications possible and thereby allows a semiotics of images to emerge. This lexicon of images can be mapped to a number of meanings (equivalent to, or expressed as, labels), but at base each image is a unit of meaning, a container of information.
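To make the idea concrete, here is a minimal sketch of searching images by likeness rather than by name. It assumes each image has already been reduced to a feature vector (an embedding) by a model such as one built in TensorFlow; the random vectors and the `most_similar` helper below are illustrative stand-ins, not the application's actual code.

```python
import numpy as np

# Stand-in embeddings: in practice these would come from an image model
# (e.g. a TensorFlow network); here random vectors represent 100 images.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 128))  # 100 images, 128-dim features

def most_similar(query, embeddings, top_k=3):
    """Return indices of the images closest to `query` by cosine similarity."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q
    return np.argsort(scores)[::-1][:top_k]

# Searching with image 7's own embedding should return image 7 first.
hits = most_similar(embeddings[7], embeddings)
print(hits)
```

With real embeddings, the same lookup answers “find me images like this one” without ever touching a filename.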
The reason I find this exciting is that we can search for an image: not by an image name, but by what the image is and what its meaning is. Imagine ‘Get me all of the .’
The other consequence is that we could take a whole set of images and from them discover a system of meaning, or signs. The root of recorded (written) language can be found in semanto-phonetic writing systems such as hieroglyphics, Etruscan paintings or even Paleolithic cave painting. To once more look at image-based language is not so strange, even if it has been strange to be able to do it at the data level. Processing images at the data level, where they can be structurally analysed, has been difficult due to the processing requirements. Emerging technology, in both programming and physical hardware, is changing all this. At the hardware level, GPUs (Graphics Processing Units) composed of hundreds of cores can handle thousands of threads simultaneously; they are very good at processing huge batches of data and performing the same operation over and over very quickly. This makes handling image data less computationally demanding. Data held on software-defined storage via an object-based drive architecture is well suited to large-scale, data-driven applications, and particularly to applications performing machine learning and deep neural networks. With open source machine learning applications such as Spark and TensorFlow, the emergence of the Open Kinetic Project, NoSQL and graph datastores, Python and R libraries, and expressive languages such as Clojure and Go, there is a brand new ecosystem for developers of both software and hardware systems.
Some may find this trivial, but I am so excited by the idea of a corpus of images being possible as a system of meaning which, comprised of signs, can be searched, analysed and used to make predictions. Why? Well, let’s consider an application of this that we are working on at the moment. We capture images as raw data sources, we then apply ML processes to them, and this leads to discovery and predictions; the result of this application is clinical diagnostics that will lead to many lives being saved. Not a trivial aim.
So, do data scientists dream of electric sheep? Yes!
Well I do, or I certainly did last night. Wondering how to do things, lying in bed in the middle of the night, is the time when you recap the day’s work; sometimes you think of the solution you were searching for just before you drop off. Yesterday I was deciding how to gather together all the bits of unstructured data used in the application I am working on.
Living in the foothills of the Pennines you get used to seeing sheep scattered across the hills. On significant occasions they are gathered together. Those of you who have kept, or are familiar with, sheep will appreciate that gathering them together is made simpler by the innate characteristic of sheep to stick together. If they break out of an enclosure by breaching the perimeter fence, they will always use the same exit; unlike bullocks, who go through the boundaries as if at random and leave many parts to be patched up, sheep stick together and follow each other.
Now, if I could get my data to be a bit more sheep-like, it would make the part of the application I am currently designing much easier. In other words, if the data stuck together in a flock I would be able to move and organize it much more easily. If the data was gregarious like Ovis aries, I could shepherd it into the most useful places.
If the question I need to ask is “get me all captured data that deals with ‘structure’, ‘shared attributes’, ‘counting’ close to ‘attributes’, ‘different sources’, ‘flocking’, ‘characteristics’, ‘flocking’ very close to ‘characteristics’, ‘segments’, ‘text’”, then bringing back answers that behave as a flock could be very useful.
The answer may be to search through the varied segments of data looking for the structural and textual attributes that these segments share; by counting these attributes you could assign flocking characteristics.
Let me illustrate this by example. The application has captured data from different sources. Suppose we do a full text search on terms from one segment across all of the segments and then count the number of common terms. (Try Googling “The answer may be to search through the various segments of data looking for the structural and textual shared attributes by counting these attributes you could assign flocking characteristics. Let me illustrate this by example. The application has captured data from different sources.” to get a feel for this).
The result set is indexed, missing out stop words and the like, and by looking at the proximity between words and sorting the result set by the occurrence and proximity of words, it starts to join bits of data to the flock.
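As a toy illustration of that counting idea (the segments, stop-word list and scores here are all invented for the example), a “flocking” score can be nothing more than the number of non-stop-word terms two segments share:

```python
# Minimal sketch: count the terms two text segments share, ignoring stop
# words, and treat the overlap as a flocking score.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}

def terms(segment):
    """Lowercased set of words in a segment, minus stop words."""
    return {w for w in segment.lower().split() if w not in STOP_WORDS}

def shared_term_count(seg_a, seg_b):
    """Number of distinct terms the two segments have in common."""
    return len(terms(seg_a) & terms(seg_b))

segments = [
    "sheep stick together and follow each other",
    "the sheep follow the flock together",
    "bullocks break through boundaries at random",
]

# Score every segment against the first; a higher count means it
# "flocks" more strongly with it.
scores = [shared_term_count(segments[0], s) for s in segments]
print(scores)  # [6, 3, 0]
```

A real implementation would add stemming and word-proximity weighting on top of the raw counts, but the shape of the idea is the same.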
Data flocks, surrealism and mad machines.
The generation of surreal and, quite frankly, unattractive images may make ‘DeepDream’-generated output look a bit artistic; it does have a sort of mad Max Ernst ‘Robing of the Bride’ quality about it.
As attractive as Surrealism is, it lacks the beauty and poise of Dada. It certainly lacks the insights. The significance for students of ML is that while the machine can work much quicker and process much more information, it is still a very crude, dumb machine compared to the slow old human intelligence box. I suspect this will always be the case, and while I would never want to underestimate machine-based data processing, it is still a bit prehistoric and, like Surrealism, it lacks the beauty and poise of Dada.
Looking to train – Machine Learning
So now we have our data behaving like sheep. The next stage is to mark the segments as members of a flock and add attributes to describe the flock; in other words, add (or extract) metadata (classifications) to the segments. Now, when we move a segment into a particular area, the rest of the data segments flock and follow it. We have electronic sheep!
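A toy illustration of the electronic sheep: once each segment carries a flock label, moving one segment lets us gather the rest of its flock with it. The segment names, flock labels and areas below are all invented for the example.

```python
# Each segment carries a flock label (its classification) and a current area.
segments = {
    "seg1": {"flock": "scans", "area": "inbox"},
    "seg2": {"flock": "scans", "area": "inbox"},
    "seg3": {"flock": "notes", "area": "inbox"},
}

def move_with_flock(name, new_area):
    """Move a segment to a new area, and its whole flock follows."""
    flock = segments[name]["flock"]
    for seg in segments.values():
        if seg["flock"] == flock:
            seg["area"] = new_area

move_with_flock("seg1", "review")
print(segments["seg2"]["area"])  # seg2 follows seg1 into "review"
print(segments["seg3"]["area"])  # seg3 is in a different flock: "inbox"
```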
We may need a sheepdog, and such a dog will need training. And there are many dogs we can choose from. Fortunately, many gifted people have shared decades of experience in deep content analysis, Natural Language Processing, Information Retrieval, Machine Learning and Artificial Intelligence. IBM managed to get Watson to thrash humans in a quiz game. More interesting than winning game shows are the Watson and AlchemyAPI services that we have implemented, and in particular the vision APIs. There are also some very good scientific computing toolsets available through SciPy, including scikit-image and scikit-learn.
What does all this mean? The specification documents for our application describe it thus:
The initial tasks are to:
- capture and store the source data,
- extract and define visual attributes to form groups within the data set (based on image similarity),
- match classifiers or labels to these groups.
The results of these tasks together will be used to form a learning algorithm that we will build into the application to enable predictive analysis of the data.
The next step will be to apply the learning algorithm to a data set where we know the results, extract and evaluate the data groups, and predict a score (representing treatment success); if we are not happy with the outcome, we will refine the algorithm until we are. This will form our first learning model.
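That refine-until-happy loop can be sketched schematically, with a toy one-parameter model (a simple threshold on a single feature) standing in for the real learning algorithm. The data, the 95% target and the refinement step are all invented for illustration.

```python
import numpy as np

# Synthetic "known results": 200 cases where the true outcome depends on
# one feature exceeding 0.6.
rng = np.random.default_rng(1)
features = rng.uniform(0, 1, size=200)
labels = (features > 0.6).astype(int)

def evaluate(threshold):
    """Accuracy of the toy model against the known results."""
    predictions = (features > threshold).astype(int)
    return (predictions == labels).mean()

# Refine the model parameter until we are happy with the outcome.
threshold, accuracy = 0.1, 0.0
while accuracy < 0.95:
    accuracy = evaluate(threshold)
    if accuracy < 0.95:
        threshold += 0.05  # crude refinement step

print(f"threshold={threshold:.2f}, accuracy={accuracy:.2f}")
```

In the real application the "refinement" is retraining or re-parameterising the learning algorithm rather than nudging a threshold, and the loop repeats as new data is collected, which is exactly the continual improvement described above.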
We will continually repeat the learning process to improve the model as more data is collected.