The new methodology stemmed from a simpler idea: don’t try to define rules and lexical tables from scratch; instead, let the software ‘discover’ them. How?
In three steps:
- A corpus of millions of pages, already translated from one language to another, is collected from international organizations. These include documentation available online from, for example, the UN or European institutions.
- When a user submits a text for translation, the software slices it into basic elements and then searches for identical or similar ones in the same language within the corpus.
- The most likely translation of each element is then extracted from the bilingual corpus and suggested to the user (a simplified sketch of this lookup follows below).
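To make the three steps concrete, here is a minimal sketch in Python of the kind of lookup described above, assuming a toy phrase-aligned corpus. All phrase pairs and counts are invented, and real systems work at a vastly larger scale and also score similar, not only identical, segments:

```python
from collections import Counter, defaultdict

# Toy phrase-aligned corpus, standing in for the millions of translated pages
# collected from international organizations (all pairs are invented).
aligned_phrases = [
    ("the committee", "le comité"),
    ("the committee", "la commission"),
    ("the committee", "le comité"),
    ("adopted the resolution", "a adopté la résolution"),
    ("adopted the resolution", "a adopté la résolution"),
]

# 'Learning' step: count how often each target phrase renders a source phrase.
phrase_table = defaultdict(Counter)
for src, tgt in aligned_phrases:
    phrase_table[src][tgt] += 1

def translate(sentence: str) -> str:
    """Slice the sentence into known phrases (longest match first) and emit
    the most frequent translation of each; unknown words pass through."""
    words, output, i = sentence.split(), [], 0
    while i < len(words):
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in phrase_table:
                output.append(phrase_table[chunk].most_common(1)[0][0])
                i = j
                break
        else:
            output.append(words[i])   # never seen in the corpus: keep as-is
            i += 1
    return " ".join(output)

print(translate("the committee adopted the resolution"))
# -> le comité a adopté la résolution
```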
Relevant statistical patterns found in the data therefore replace translation rules. Instead of having to be painstakingly programmed, they are automatically ‘learned’ by the software. It is easy to see the cost-saving value of this approach compared to the traditional one, especially since the quality of the resulting translation is usually on a par with it.
In areas of work less complex than translation between human languages, the productivity gains are compounded by a substantial improvement in quality. Anyone who has ever specified automation processes knows how complex it can be to anticipate all the possible situations the software will have to face once it is in production, even when functional domain experts are involved. The software’s functional rules are based on assumptions that themselves rely on a limited number of observations. But reality often proves far more complex than expected, meaning that the automation ends up suboptimal or that the software requires expensive corrections.
Machine Learning, on the other hand, absorbs and builds on all the available data, however large the volume. The risk of a pattern or use case being left out of the picture is therefore limited.
Humans must remain in charge
The machine also avoids the ‘cognitive biases’ of human intelligence, which translate into imperfect selection of the available data and inappropriate decision-making.
A good example is the automated processing of loan requests received by banks. An algorithm parses the archives of previous requests, where each borrower’s key information is recorded (age, income, wealth, family status, etc.) along with reimbursement information (whether they paid the bank back or defaulted). From this, it highlights the statistical relationship between a borrower profile and the risk of default. Applied to a new loan request, the algorithm will predict, with an accuracy level deemed sufficient, whether the borrower will pay back. The risk of a bad decision triggered by prejudice or the bank operative’s mood is thereby removed.
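To make this concrete, here is a minimal sketch using pandas and scikit-learn, with an entirely invented archive of past requests. The column names, figures and the choice of logistic regression are assumptions made for illustration, not a description of any bank’s actual system:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Invented archive of past loan requests: each borrower's profile and whether
# the loan was repaid (1) or defaulted on (0).
history = pd.DataFrame({
    "age":        [25, 42, 35, 58, 30, 47, 29, 51],
    "income":     [28_000, 65_000, 40_000, 90_000, 22_000, 75_000, 31_000, 54_000],
    "wealth":     [5_000, 80_000, 20_000, 200_000, 1_000, 120_000, 8_000, 60_000],
    "dependents": [0, 2, 1, 0, 3, 2, 1, 2],
    "repaid":     [0, 1, 1, 1, 0, 1, 0, 1],
})
features = ["age", "income", "wealth", "dependents"]

# 'Learn' the relationship between a borrower profile and the default risk.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(history[features], history["repaid"])

# Apply it to a new loan request.
applicant = pd.DataFrame([{"age": 33, "income": 45_000, "wealth": 15_000, "dependents": 1}])
p_repay = model.predict_proba(applicant[features])[0, 1]   # probability of repayment
print(f"Estimated probability of repayment: {p_repay:.0%}")
```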
Nonetheless, it is crucial that humans remain the ultimate decision-makers.
First, because the software is obviously not perfect. It is governed by settings made by humans. For instance, it may have been optimized to avoid ‘false positives’ (cases where a loan is granted to a borrower who will default) and will therefore lean towards rejecting certain loan applications. It may also discard observations that don’t fit its criteria. A user must therefore check that the system’s recommendations are legitimate and, if necessary, override them. If the user grants a loan to an applicant despite the system’s recommendation and the borrower turns out to meet the payment schedule, this new observation will have to be fed back into the training data so that the algorithm accepts applications from similar profiles next time.
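Continuing the sketch above (and reusing its hypothetical `model`, `history`, `features`, `applicant` and `p_repay`), here is one way these two ideas can look in code. The threshold value and the feedback step are assumptions for illustration, not a prescribed mechanism:

```python
# The preference for avoiding false positives lives in a human-chosen
# threshold, not in anything the algorithm learned (the value is invented).
APPROVAL_THRESHOLD = 0.80

recommendation = "grant" if p_repay >= APPROVAL_THRESHOLD else "reject"
print(f"System recommendation: {recommendation} (p_repay = {p_repay:.0%})")

# Human override and feedback: the operator grants the loan anyway, the
# borrower keeps to the payment schedule, and the outcome is added to the
# archive so the next training run learns from it.
history = pd.concat([history, applicant.assign(repaid=1)], ignore_index=True)
model.fit(history[features], history["repaid"])   # retrain with the new case
```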
Another key reason is that only humans can make sure ethical standards are met, especially when a decision concerns an individual’s rights. The automated processing of non-anonymized data is already strictly regulated. And the law pertaining to this is likely to evolve further to protect citizens and consumers against the potentially harmful effects of excessive statistical generalization.
Data über alles
It is essential to choose and set up an algorithmic model that fits the process at stake and the type of data sustaining it. Gauging a company’s default risk requires a different methodology from that of identifying a face in a picture. In all cases, however, the performance of the automation will depend on meeting two imperatives:
- Data quality – Much cleansing and formatting work is required to guarantee that the rules discovered during the training phase are not based on false observations. This task is usually huge compared with the effort needed to set up the model itself.
- Training-set representativeness – The automation will be all the more efficient and accurate if ML is carried out on unbiased observations that are similar enough to the real-life cases the software will have to deal with. For instance, if I want to predict companies’ payment behavior but all the data I have is specific to companies within a given revenue range, the system may not be accurate when I apply it to companies in another range (see the sketch after this list).
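Here is a minimal sketch of both imperatives, using pandas and entirely invented figures: a quick cleansing pass, then a comparison of the revenue range the model was trained on with the range it will face in production, echoing the example in the second bullet:

```python
import pandas as pd

# Invented records: companies' revenue (in k€) and whether they paid on time.
training = pd.DataFrame({
    "revenue":      [120, 95, 110, None, 130, 105, 105],
    "paid_on_time": [1, 0, 1, 1, 1, 0, 0],
})
production = pd.DataFrame({"revenue": [900, 1500, 1100, 130]})

# Data quality: drop duplicates and rows that cannot be used for training.
training = training.drop_duplicates().dropna(subset=["revenue", "paid_on_time"])

# Representativeness: compare the revenue range seen during training with the
# range the model will actually be asked to score.
print("training revenue range:  ", training["revenue"].min(), "-", training["revenue"].max())
print("production revenue range:", production["revenue"].min(), "-", production["revenue"].max())
# Most production companies fall far outside the training range here, so the
# model's predictions for them should be treated with caution.
```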
Access to data is crucial for ML projects’ success because, ultimately, no level of algorithmic sophistication will ever make up for a poor data set.