
Know the limitations of your machine

11 July 2019, by Johannes Bauer, Ph.D.

More and more data is being collected at ever faster rates, currently more than 2.5 exabytes (the equivalent of roughly 5 million laptop hard drives) per day, according to recent estimates by Domo, an American software company. Alongside these ever-growing data resources, technologies and methodologies are being developed to extract "actionable insight". One successful approach is machine learning (ML), which extracts structure and relationships from data. One common setting, so-called supervised learning, establishes a mapping from independent variables, typically called features, to a dependent variable, typically called the target.
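To make the supervised-learning setting concrete, the short sketch below fits a model that maps a feature matrix to a target and evaluates it on held-out data. The synthetic data and the choice of a linear model are illustrative assumptions, not something prescribed by the article.

```python
# Minimal supervised-learning sketch: learn a mapping from features X to a target y.
# The synthetic data and the linear model are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                   # independent variables (features)
y = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=200)   # dependent variable (target)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)   # learn the feature-to-target mapping
print("held-out R^2:", model.score(X_test, y_test))
```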

There have been considerable developments in the field of ML. In 2013, for instance, speech recognition technology still delivered error rates of roughly 1 in 5 words; today it gives largely accurate results (under 5% error), as we can easily experience on our smartphones. Image classification is on a par with human performance. Tech optimists argue that ML and Artificial Intelligence (AI) will soon be solving most of our problems.

I will argue here that, while considerable progress has certainly been made in the field, certain challenges will remain. These are fundamental in nature and cannot easily be overcome.

Structuring problems and providing interpretations

How should a problem be framed in a given context? What data should be used, and which relationships should we expect? Answers to these questions are typically provided by humans with relevant domain expertise, and it is currently not clear how they could realistically be answered by ML itself. Complex machine learning models deliver accurate predictions in many contexts. This does not, however, mean that it is easy to understand and communicate how those predictions are made. For human decision makers, it is often important to understand why an algorithm produced a certain result. In some cases, such as in the credit space, it is even mandatory to provide explanations, for example for the rejection of a mortgage application. There are tools which locally approximate complex models by simpler ones, or which use other criteria to measure the impact of features on predictions. Interpretability, however, remains a challenge that is unlikely to be easily overcome.
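As a concrete illustration of the local-approximation idea mentioned above, the sketch below fits a complex model and then explains a single prediction with a simple linear surrogate trained on perturbed points around it. The data, the models, and the perturbation scheme are illustrative assumptions, not the specific tools referred to in the text.

```python
# Illustrative sketch: approximate a complex model locally with a linear surrogate.
# Data, models, and the perturbation scale are assumptions for demonstration only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

complex_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

x0 = X[0]                                                  # the prediction we want to explain
neighbours = x0 + rng.normal(scale=0.3, size=(200, 4))     # perturb around x0
surrogate = LinearRegression().fit(neighbours, complex_model.predict(neighbours))

# The surrogate's coefficients indicate which features drive the prediction near x0.
print("local feature effects:", surrogate.coef_)
```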

Separating signal from noise

How many people came to a large event? What is the true value of a company? How many aeroplanes are currently in the air? Complex questions like these often produce imprecise answers, which manifest as noisy data. Noise is therefore not restricted to imprecise sensor measurements; it is a ubiquitous feature of many data sets. In certain settings, such as financial data, noise even dominates the signal. This does not mean, however, that nothing can be done. Indeed, a small edge in understanding can lead to formidable returns, as certain hedge funds have demonstrated. In situations where the objective is to anticipate a share price one month ahead, filtering techniques and a careful selection of features can help substantially to extract signal. The challenge of separating signal from noise will nevertheless persist.
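To illustrate the kind of filtering mentioned above, the sketch below applies a simple exponentially weighted moving average to a noisy synthetic series. The series, the noise level, and the smoothing span are illustrative assumptions, not the filters used in any particular production model.

```python
# Illustrative sketch: extract a slow-moving signal from a noisy series
# with an exponentially weighted moving average. All numbers are assumed.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
t = np.arange(500)
signal = 0.01 * t + np.sin(t / 50.0)             # slow trend plus a cycle
noise = rng.normal(scale=1.0, size=t.size)       # noise of comparable magnitude
series = pd.Series(signal + noise)

smoothed = series.ewm(span=40).mean()            # simple low-pass filter
residual = series - smoothed                     # what the filter treats as noise
print(smoothed.tail())
```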

Managing low frequency or small data sets and rare events

Machine learning is a greedy animal, hungry for data. If not fed properly, it can behave in an uncontrolled fashion; this is referred to as a model with high variance, or simply overfitting. Some argue that this no longer occurs in the information age, where data is plentiful, but this is certainly not true. Sometimes data naturally arrives at low frequencies (quarterly or annually), such as economic figures, company accounts, defects/accidents or default events. In such small-data situations, ML researchers should remember that this is exactly the setting in which many statistical techniques were developed. Statisticians of the past were not blessed with large data sets and had to come up with an ingenious tool box to deal with this challenge. For instance, Bayesian approaches can incorporate prior knowledge or carefully framed assumptions, which are then updated from just a few data points.
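As a minimal illustration of such a Bayesian update, the sketch below combines a prior belief about a rare-event rate (for example, an annual default rate) with a handful of observations using a Beta-Binomial conjugate model. The prior parameters and the observed counts are purely illustrative assumptions.

```python
# Minimal Beta-Binomial sketch: update a prior belief about a rare-event rate
# using only a few observations. All numbers below are illustrative assumptions.
from scipy import stats

# Prior belief: the event rate is low, centred around ~2% (Beta(2, 98)).
prior_a, prior_b = 2.0, 98.0

# Small data set: 3 events observed in 120 trials (e.g., defaults among firms).
events, trials = 3, 120

# Conjugate update: posterior is Beta(prior_a + events, prior_b + trials - events).
post_a, post_b = prior_a + events, prior_b + (trials - events)
posterior = stats.beta(post_a, post_b)

print("posterior mean rate:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```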

Dealing with regime shifts or non-stationarity

Many machine learning applications implicitly assume static relationships between a set of features and a target. Even when time is explicitly included in the modelling process, the underlying assumption is that of recurring patterns. Regime shifts and non-stationarity are genuinely hard to deal with. There are techniques to identify and model such situations, such as latent state variables that encode regime switches; however, no generic solution exists.
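One way to encode regime switches with a latent state variable, as mentioned above, is a hidden Markov model. The sketch below fits a two-state Gaussian HMM to a synthetic series whose volatility changes halfway through; the library choice (hmmlearn) and all parameters are illustrative assumptions rather than a prescribed solution.

```python
# Illustrative sketch: recover a latent two-regime structure with a Gaussian HMM.
# The synthetic series and the use of hmmlearn are assumptions for demonstration.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(3)
calm = rng.normal(loc=0.0, scale=0.5, size=300)        # low-volatility regime
turbulent = rng.normal(loc=0.0, scale=2.0, size=300)   # high-volatility regime
series = np.concatenate([calm, turbulent]).reshape(-1, 1)

model = GaussianHMM(n_components=2, covariance_type="diag", n_iter=200, random_state=0)
model.fit(series)
states = model.predict(series)      # inferred latent regime for each observation

print("inferred state at t=100:", states[100])
print("inferred state at t=500:", states[500])
```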

As we develop and apply machine learning at IHS Markit more and more, we too face these challenges as we further expand our product portfolio, whether it's focused on forecasting dividends, predicting global vehicle prices or demand, or estimating oil production curves, just to name a few examples. Boundaries are being pushed and many exciting developments are on the way, but these fundamental issues will remain. For me personally, this is a welcome challenge and it makes working in this space more interesting, since creative thought and critical thinking, in combination with ML, can lead to very powerful applications. A machine can only be operated safely and successfully when its limitations are clear.

Posted 11 July 2019 by Johannes Bauer, Ph.D., Data Analytics Director, Advanced Analytics, IHS Markit
