The new methodology stemmed from a simpler idea: don’t try to define rules and lexical tables from scratch; instead, let the software ‘discover’ them. How?
In three steps:
- A corpus of millions of pages, already translated from one language to another, is collected from international organizations. These include documentation available online from, for example, the UN or European institutions.
- When a user submits a text for translation, the software slices it into basic elements and then searches for identical or similar ones in the same language within the corpus.
- The most likely translation of each element is then extracted from the bilingual corpus and suggested to the user (a simplified sketch of this lookup follows below).
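To make the three steps concrete, here is a minimal sketch in Python of the kind of lookup described above, assuming a toy phrase-aligned corpus. All phrase pairs and counts are invented, and real systems work at a vastly larger scale and also score similar, not only identical, segments:

```python
from collections import Counter, defaultdict

# Toy phrase-aligned corpus, standing in for the millions of translated pages
# collected from international organizations (all pairs are invented).
aligned_phrases = [
    ("the committee", "le comité"),
    ("the committee", "la commission"),
    ("the committee", "le comité"),
    ("adopted the resolution", "a adopté la résolution"),
    ("adopted the resolution", "a adopté la résolution"),
]

# 'Learning' step: count how often each target phrase renders a source phrase.
phrase_table = defaultdict(Counter)
for src, tgt in aligned_phrases:
    phrase_table[src][tgt] += 1

def translate(sentence: str) -> str:
    """Slice the sentence into known phrases (longest match first) and emit
    the most frequent translation of each; unknown words pass through."""
    words, output, i = sentence.split(), [], 0
    while i < len(words):
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in phrase_table:
                output.append(phrase_table[chunk].most_common(1)[0][0])
                i = j
                break
        else:
            output.append(words[i])   # never seen in the corpus: keep as-is
            i += 1
    return " ".join(output)

print(translate("the committee adopted the resolution"))
# -> le comité a adopté la résolution
```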
Relevant statistical patterns found in the data therefore replace translation rules. Instead of having to be painstakingly programmed, they are automatically ‘learned’ by the software. It is easy to see the cost-saving value of this approach compared to the traditional one, especially since the quality of the resulting translation is usually on a par with it.
In areas of work less complex than translation between human languages, the productivity gains are compounded by a substantial improvement in quality. Anyone who has ever specified automation processes knows how complex it can be to anticipate all the possible situations the software will have to face once it is in production, even when functional domain experts are involved. The software’s functional rules are based on assumptions that themselves rely on a limited number of observations. But reality often proves far more complex than expected, meaning that the automation ends up suboptimal or that the software requires expensive corrections.
Machine Learning, on the other hand, absorbs and builds on all the available data, however large the volume. The risk of a pattern or use case being left out of the picture is therefore limited.
Humans must remain in charge
The machine also avoids the ‘cognitive biases’ of human intelligence, which translate into imperfect selection of the available data and inappropriate decision-making.
A good example is the automated processing of loan requests received by banks. An algorithm parses the archives of previous requests, where each borrower’s key information is recorded (age, income, wealth, family status, etc.) along with reimbursement information (whether they paid the bank back or defaulted). From this, it highlights the statistical relationship between a borrower profile and the risk of default. Applied to a new loan request, the algorithm will predict, with an accuracy level deemed sufficient, whether the borrower will pay back. The risk of a bad decision triggered by prejudice or the bank operative’s mood is thereby removed.
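To make this concrete, here is a minimal sketch using pandas and scikit-learn, with an entirely invented archive of past requests. The column names, figures and the choice of logistic regression are assumptions made for illustration, not a description of any bank’s actual system:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Invented archive of past loan requests: each borrower's profile and whether
# the loan was repaid (1) or defaulted on (0).
history = pd.DataFrame({
    "age":        [25, 42, 35, 58, 30, 47, 29, 51],
    "income":     [28_000, 65_000, 40_000, 90_000, 22_000, 75_000, 31_000, 54_000],
    "wealth":     [5_000, 80_000, 20_000, 200_000, 1_000, 120_000, 8_000, 60_000],
    "dependents": [0, 2, 1, 0, 3, 2, 1, 2],
    "repaid":     [0, 1, 1, 1, 0, 1, 0, 1],
})
features = ["age", "income", "wealth", "dependents"]

# 'Learn' the relationship between a borrower profile and the default risk.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(history[features], history["repaid"])

# Apply it to a new loan request.
applicant = pd.DataFrame([{"age": 33, "income": 45_000, "wealth": 15_000, "dependents": 1}])
p_repay = model.predict_proba(applicant[features])[0, 1]   # probability of repayment
print(f"Estimated probability of repayment: {p_repay:.0%}")
```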
Nonetheless, it is crucial that humans remain the ultimate decision-makers.
First, because the software is obviously not perfect. It is governed by settings made by humans. For instance, it may have been optimized to avoid ‘false positives’ (cases where a loan is granted to a borrower who will default) and will therefore lean towards rejecting certain loan applications. It may also discard observations that don’t fit its criteria. A user must therefore check that the system’s recommendations are legitimate and, if necessary, override them. If the user grants a loan to an applicant despite the system’s recommendation and the borrower turns out to meet the payment schedule, this new observation will have to be fed back into the training data so that the algorithm accepts applications from similar profiles next time.
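Continuing the sketch above (and reusing its hypothetical `model`, `history`, `features`, `applicant` and `p_repay`), here is one way these two ideas can look in code. The threshold value and the feedback step are assumptions for illustration, not a prescribed mechanism:

```python
# The preference for avoiding false positives lives in a human-chosen
# threshold, not in anything the algorithm learned (the value is invented).
APPROVAL_THRESHOLD = 0.80

recommendation = "grant" if p_repay >= APPROVAL_THRESHOLD else "reject"
print(f"System recommendation: {recommendation} (p_repay = {p_repay:.0%})")

# Human override and feedback: the operator grants the loan anyway, the
# borrower keeps to the payment schedule, and the outcome is added to the
# archive so the next training run learns from it.
history = pd.concat([history, applicant.assign(repaid=1)], ignore_index=True)
model.fit(history[features], history["repaid"])   # retrain with the new case
```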
Another key reason is that only humans can make sure ethical standards are met, especially when a decision concerns an individual’s rights. The automated processing of non-anonymized data is already strictly regulated. And the law pertaining to this is likely to evolve further to protect citizens and consumers against the potentially harmful effects of excessive statistical generalization.
Data über alles
It is essential to choose and set up an algorithmic model that fits the process at stake and the type of data sustaining it. Gauging a company’s default risk requires a different methodology from that of identifying a face in a picture. In all cases, however, the performance of the automation will depend on meeting two imperatives:
- Data quality – Much cleansing and formatting work is required to guarantee that the rules discovered during the training phase are not based on false observations. This task is usually huge compared with the effort needed to set up the model itself.
- Training-set representativeness – The automation will be all the more efficient and accurate if ML is carried out on unbiased observations that are similar enough to the real-life cases the software will have to deal with. For instance, if I want to predict companies’ payment behavior but all the data I have is specific to companies within a given revenue range, the system may not be accurate when I apply it to companies in another range (see the sketch after this list).
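Here is a minimal sketch of both imperatives, using pandas and entirely invented figures: a quick cleansing pass, then a comparison of the revenue range the model was trained on with the range it will face in production, echoing the example in the second bullet:

```python
import pandas as pd

# Invented records: companies' revenue (in k€) and whether they paid on time.
training = pd.DataFrame({
    "revenue":      [120, 95, 110, None, 130, 105, 105],
    "paid_on_time": [1, 0, 1, 1, 1, 0, 0],
})
production = pd.DataFrame({"revenue": [900, 1500, 1100, 130]})

# Data quality: drop duplicates and rows that cannot be used for training.
training = training.drop_duplicates().dropna(subset=["revenue", "paid_on_time"])

# Representativeness: compare the revenue range seen during training with the
# range the model will actually be asked to score.
print("training revenue range:  ", training["revenue"].min(), "-", training["revenue"].max())
print("production revenue range:", production["revenue"].min(), "-", production["revenue"].max())
# Most production companies fall far outside the training range here, so the
# model's predictions for them should be treated with caution.
```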
Access to data is crucial for ML projects’ success because, ultimately, no level of algorithmic sophistication will ever make up for a poor data set.