The open source tools that are making a dev’s life easier
Not so long ago the difficulty in working with data stemmed from the fact that it came from different places in different forms; much of it was unstructured, or at best semi-structured. Getting this data into a shape where it could be analysed and used to provide insights was a tedious process: data cleaning and preparation can be very time consuming. I have mentioned elsewhere “the dirty little secret of big data”, being the fact “that most data analysts spend the vast majority of their time cleaning and integrating data — not actually analysing it.” The secret also covers the demands of the data analysts who continually go back to the database admins and the coders, begging them to run this SQL and then “just one more query please”. Until recently the routines needed to get inside the data were repetitive and time consuming; people were doing the very tasks that software is good at. The trouble was that the software tools we tended to use did not have high levels of flexibility. Thanks in the main part to open source applications, the landscape is very different these days.
At the infrastructure level a massive influx of data, or even a gradually growing data set, used to need physical and manual operations: commissioning a new server, adding memory or adding another disk meant John getting his screwdriver out. The cloud changed all this, and Peter could program more memory, more disk space and more processors on demand. Now it is true that none of this is that new (containers are much like the old Unix jails, really); it is the scale at which they can be used that matters. There are now new tricks that can be performed with new combinations of tools and ways of holding data. Given that times have moved on, how would we look to solve these problems now? First off, let us look at what has changed.
devOps exists in three domains: the cultural, the tooling and technologies, and the architectural.
The cultural shift was a big leap for back-end engineers. They were used to programmers following agile techniques, but servers and networks needed screws, cables and engineers to connect them. The key to this cultural shift was breaking down the demarcation between the systems engineers, system administrators, operations staff, release engineers, DBAs, network engineers and security professionals on one side and the programmers, requirements engineers, application designers, testers and UX designers on the other.
In the technology domain, tooling such as Jenkins and Codeship gave us continuous integration and continuous deployment. Configuration management tools such as Puppet and Chef opened up orchestration, making manual processes automated. Deployment has been eased by a clutch of tools, with Docker and Kubernetes at the forefront.
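The heart of what configuration management tools like Puppet and Chef automate is idempotent state enforcement: you describe the desired state, and the tool converges the machine to it, doing nothing when the state is already correct. A minimal sketch of the idea in Python (the config file name and contents here are invented for illustration):

```python
import tempfile
from pathlib import Path

def ensure_file(path: Path, content: str) -> bool:
    """Converge a file to the desired content; return True only if a change was made."""
    if path.exists() and path.read_text() == content:
        return False  # already in the desired state: do nothing (idempotence)
    path.write_text(content)
    return True

# Running the same step twice changes nothing the second time.
cfg = Path(tempfile.gettempdir()) / "demo_app.conf"  # hypothetical config file
changed_first = ensure_file(cfg, "port = 8080\n")
changed_second = ensure_file(cfg, "port = 8080\n")
```

Real tools apply the same converge-or-skip logic to packages, services and users across whole fleets, which is what makes their runs safely repeatable.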
devOps is driven by a number of motivations: to regulate, to reproduce and to roll back. It comprises:
- A desire to rationalise dependency management, to provision and configure all the constituents of the stack.
- The requirement for reproducible deployments so that the whole application can be erased and then accurately reproduced.
- The ability to create multiple instances of the same application and significantly to recover from failure at any point during provisioning.
At the database level the relational databases worked with structured data; the core component, the ‘schema’, was not easy to change once it held any amount of data, the schema being, after all, the overarching structure. There is a new generation of data stores, the noSQL and graph varieties, which handle unstructured data much better than the RDBMS databases did: the column varieties such as Cassandra and HBase; the document stores such as CouchDB and MongoDB; the key-value stores such as Couchbase, Redis and Riak; and the graph databases such as Neo4j.
The big daddy of big data cannot be missed out of this consideration either, namely Apache Hadoop, which comes with the Hadoop Distributed File System (HDFS); YARN, a job scheduler and cluster resource manager; and MapReduce, a parallel processing system designed to work with massive amounts of data. There are also a host of other packages that can sit alongside Hadoop, and a number of vendors provide cloud based Hadoop platforms, including AWS Elastic MapReduce and IBM BigInsights.
Why is noSQL such a game changer? Well, it opens up ways of working with structured, semi-structured and unstructured data. Because these stores can hold polymorphic data, the time it takes to design and provision them is dramatically reduced, and the flexible structure allows rapid deployment. Ramping up the provision to handle high velocity, quickly growing volumes and varied data is just so much easier.
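What ‘polymorphic data’ means in practice is that a document store will happily hold records of different shapes in the same collection, with no table schema to migrate. A toy sketch of the idea using plain Python dicts (the collection, field names and query helper are all invented for illustration, in the spirit of a document store’s find):

```python
# Each document carries its own structure; there is no shared schema to change.
users = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Linus", "twitter": "@linus", "languages": ["C"]},
    {"_id": 3, "name": "Grace", "ranks": {"navy": "rear admiral"}},
]

def find(collection, **criteria):
    """Minimal query-by-example: return documents whose fields match all criteria."""
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in criteria.items())]

matches = find(users, name="Ada")
```

Adding a new field to new records costs nothing here, which is exactly why design and provisioning time drops when the data is polymorphic.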
Before the adoption of noSQL data stores it was extremely difficult to get a distributed database up and running satisfactorily. Replication gave a way of distributing read only data widely and easily, but getting transactional data to sit across even two servers was a nightmare. The only ways of overcoming the monsters implicit in master-master replication were partitioning, fragmentation and sharding, none of which techniques led to flexible deployment using relational databases. Sharding is now a trivial matter, with noSQL data stores automatically writing data across an arbitrary number of servers. noSQL data stores are designed to work in the cloud, designed to be flexible and agile, and designed for modern data analytic and delivery requirements.
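The core of automatic sharding is a deterministic mapping from key to server, so that every client computes the same route without a coordinator. A simplified sketch (the shard names are invented; real stores typically use consistent hashing or range partitioning so that adding a server moves less data than this naive modulo scheme would):

```python
import hashlib

SHARDS = ["db-0", "db-1", "db-2", "db-3"]  # hypothetical server names

def shard_for(key: str) -> str:
    """Route a key to a shard using a stable hash (md5 is deterministic across runs,
    unlike Python's builtin hash() for strings)."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

# Writers and readers agree on placement purely by recomputing the hash.
placement = {key: shard_for(key) for key in ["user:1", "user:2", "order:99"]}
```

A noSQL store does this routing (and the rebalancing when servers join or leave) for you, which is why writing across an arbitrary number of servers became trivial.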
It is not just changes at the data store that mark new ways of doing things. Platforms have changed too, and significantly the advances in PaaS (Platform as a Service) offerings relieve developers from the complexities of the infrastructure, providing scalable and inexpensive cloud computing services. Database as a Service (DBaaS) offerings such as mLab and Google’s BigQuery provide similar platform and database functionality.
AWS made inexpensive, scalable, distributed cloud computing very possible, even if the interface is a bit of a muddle. In the past this encouraged the writing of custom interfaces and routines to navigate through and use it; now third party suppliers are filling this gap. For development environments there are ‘AWS made easy’ products such as Heroku and Cloud Foundry, as well as platforms for Elasticsearch and Kibana on AWS.
Google’s App Engine is a platform (which, as well as being software controlled infrastructure, provides a noSQL data store with its own query language) that supports the Spring Framework and the Django web framework.
OpenStack is in some ways aimed at platform providers in that it delivers IaaS (Infrastructure as a Service). OpenShift, on the other hand, is a true PaaS; it comes with Docker packaging and Kubernetes container cluster management, and it gives application lifecycle management functionality and operational tooling.
Platforms are pay as you go, and the tools you run on them are mainly open source. The question of which platform to choose is always answered by asking “what do you want to do with it?”. It may be cheaper for production deployments to manage your own platform directly, but in environments where instances are thrown up for development and testing the PaaS providers have some appeal. Depending on scale, an infrastructure foundation (IaaS) may be the place to build from, or a true platform foundation (PaaS) may be the right choice. For smaller scale production environments (and development) a third party managed platform could be cost effective.
Boxes of tricks
Containerisation has come of age, mainly through Docker, which is having a massive impact on architectural design, development strategies and PaaS. Google provides Google Container Engine (GKE), a cluster management and container orchestration system developed to run and manage Docker containers. It is in turn powered by Kubernetes, which is designed to deploy applications quickly, allow scaling on the fly, make it easy to roll out updates and reduce resource usage.
The regulate, reproduce and roll back capability that comes with the Docker / Kubernetes combination spans the whole ‘development to production’ box landscape. Of all the technologies mentioned on this page they are the ones that enable the architecture to make devOps possible. In this sense they complete the devOps circle.
Now that we have looked at the changes to the tools and architecture behind the operational sphere, it is time to look at how open source is enabling new functionality and capability. We have the system: what are we going to use it for? We have the building: what are we going to put in it, and how are we going to do it?
Machines for understanding
Open source Python and R libraries make statistical analysis far easier. Languages such as Clojure (with libraries such as Docjure) are well suited to processing information contained in feeds or documents. Simple API access to machine learning technology such as IBM’s Watson and Amazon’s AML is now at our disposal. Other Machine Learning as a Service (MLaaS) offerings include DataRobot, BigML, RapidMiner and Algorithmia.
Aside from API access there are other flexible ways of utilising machine learning. Algorithmia, DataRobot and BigML provide platforms where algorithm developers and application developers interface. Developers can simply incorporate open source learning algorithms in their applications. Again this leverages the power of open source fully: most well known algorithms are peer reviewed, well documented and optimised for speed and efficiency, and they are available in the most useful libraries and languages (Java, R, Python, Spark, etc.).
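As a sketch of how little code ‘incorporating a learning algorithm’ can take, here is ordinary least squares for a single feature written out in plain Python; in practice you would reach for the peer-reviewed, optimised implementations in the libraries mentioned above rather than rolling your own:

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b for a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x); intercept follows.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# These points lie exactly on y = 2x + 1, so the fit recovers a = 2, b = 1.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

The library versions add the things that matter at scale: numerical stability, regularisation, multiple features and vectorised speed.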
Lingua franca and polyglots
The OO programming languages brought flexibility: through concepts such as encapsulation, inheritance, aggregation, duck typing, late static binding and dynamic polymorphism, code became more flexible. This flexibility came at the cost of speed, because lots of objects were being individually created and destroyed. These concepts work fine in the parts of a program that present logic to the user, the view layer or human interface, but they are very inefficient when called on to work with the large amounts of data that the logic is being applied to.
A return to more expressive languages and functional programming paradigms now permits wading through large amounts of data with low computational overheads. In other words, there are languages that are just so well suited to working with big lists of lists and data, and yes, they have been around in one form or another for a long time.
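The low-overhead point can be made concrete with generator pipelines in Python, which borrow exactly this functional style: each stage is lazy, so an arbitrarily large stream flows through filter, map and reduce steps in constant memory, with no intermediate objects built up (the record stream below is invented for illustration):

```python
# A lazy pipeline: nothing is computed until a downstream consumer asks for it.
records = ({"id": i, "value": i * i} for i in range(1_000_000))  # could be a file or feed
big = (r for r in records if r["value"] > 10)        # filter stage
values = (r["value"] for r in big)                   # map stage
total = sum(v for _, v in zip(range(3), values))     # take the first three and reduce
```

Only three records are ever materialised here; the remaining million are never touched, which is the essence of working through big data with low overheads.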
The MEAN stack provides an opening for Node, as well as for the tools that come alongside it such as NPM, which, like other package managers such as Bower, encourages efficient production management.
As we now have these new ways of working with old and emerging technologies in our hands we can look at new ways of providing solutions for working with new combinations of ideas and data.
The above page is a part of the correspondence from the Musée d’Art Moderne, Département des Aigles, created by Marcel Broodthaers, the director of this fictive museum that he had opened at his home in Brussels. The sentence ‘NOUVEAUX TRUCS, NOUVELLES COMBINES’ (‘new tricks, new combinations’) repeatedly appears in the two-volume catalogue he produced for his last self-curated retrospective, L’Angelus de Daumier (The Angelus of Daumier, 1975).