Machine Learning - Is your information in formation?

More and more businesses are starting to take advantage of their big data sets and use Machine Learning to gain a strategic advantage over their competitors.

But, as our previous article (Harnessing the power of business data for Machine Learning) explains, Machine Learning is not a silver bullet and requires some careful consideration of how to use the data. Applying a Machine Learning method is only a small part of the process. The key is in the Data Science. It is an exploratory process.

To boldly go!

Businesses differ from one domain to another, and it goes without saying that this is reflected in their data too. Even if we compare one business’s data set with that of a competitor, the two are going to differ vastly in their shape, size and spread.

It is important to recognise that data exploration is a key part of the process of applying Machine Learning tasks to your data. A good data science team will spend time analysing your data first and designing a series of experiments before delving into writing any Machine Learning code.

Data Vanity

So your data is stored securely in the cloud and your business has easy access to it. It is growing vastly and is scalable. But is this enough? Are there any questions that you can ask about your data before you take that leap to Machine Learning?

Answer: What is my size and shape?


Ultimately, the more data you have, the more you can train a Machine Learning task with, and therefore the more accurate it is likely to be when exploiting a pattern in your data.

But let’s just think about what we are trying to do. We have a big data set and this data is growing on a day by day basis. We want Machine Learning to learn from historic data to make predictions about our future, unseen, data to contribute towards key daily business decisions.

So how much of this data should we feed into our Machine Learning task to learn from?

Well, the more data the more accurate the model will be, so surely the answer is all of it, right?

Part of the Machine Learning exploration is to evaluate how the training is going/has gone. So if we were to approach this by providing all of the data for training we would have to wait until tomorrow to get our hands on any unseen data. This just isn’t practical and leaves us with some thumb twiddling.

Rewind the clock, what we need is a time machine.

If we are in a position whereby we have access to years’ worth of data, we can conduct a Time Machine experiment. We can give our Machine Learning method all of the data that is more than 1 year old to train on, and we can evaluate it on the data from the most recent year. I have chosen an example of 1 year here because I think, from a business’s perspective, seeing how something performs over the course of a whole year gives us a full representation and not a seasonal representation. In some data sets it may be the case that 30 days is sufficient.
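As a rough illustration, a Time Machine split might look something like the sketch below. It assumes each record is a (date, payload) pair; the function name time_machine_split, the holdout_days parameter and the sample records are made up for this example and do not come from any particular library.

```python
from datetime import date, timedelta


def time_machine_split(records, today, holdout_days=365):
    """Split (date, payload) records into training and evaluation sets.

    Records dated before the cutoff are used for training; records dated
    on or after the cutoff are held back for evaluation.
    """
    cutoff = today - timedelta(days=holdout_days)
    train = [r for r in records if r[0] < cutoff]
    evaluate = [r for r in records if r[0] >= cutoff]
    return train, evaluate


# Hypothetical records: one issue logged on the first of each month for three years
records = [
    (date(year, month, 1), f"issue-{year}-{month:02d}")
    for year in (2021, 2022, 2023)
    for month in range(1, 13)
]

train, evaluate = time_machine_split(records, today=date(2024, 1, 1))
print(len(train), "training records,", len(evaluate), "evaluation records")
```

Setting holdout_days=30 would give the shorter evaluation window mentioned above.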

This gives us some metrics to report back to the key business stakeholders. We can answer questions like: if we had applied Machine Learning a year ago, what decisions would it have made over the past year? We also have the luxury of knowing how accurate it would have been, by comparing its decisions with the ones the business actually made. This gives us a realistic indication of how successful Machine Learning will be when applied to the business.

In this scenario we are actually in the Garden of Eden due to the volume of data that we have. With a large amount of historical data we can slice and dice the data between training and evaluation in many different ways, so we can experiment.

What about the poisonous fruit?

The important thing to bear in mind here is how pertinent that historical data is going to be when considering business decisions in the future. For example, markets may have changed considerably, and we might not want our Machine Learning to be biased towards much older data which is now irrelevant in the current markets. Therefore we might only want to train our Machine Learning task on data from the past 3 years and not the past 10.

Reducing the window of data also has the added benefit of reducing the amount of time and resources it takes for the Machine Learning task to learn.
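Restricting the training window is a small filtering step. The sketch below is only one way to do it, again assuming (date, payload) records; recent_window, years_before and the sample data are hypothetical names chosen for this example.

```python
from datetime import date


def years_before(day, years):
    """Return the same calendar day a number of years earlier."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year, so fall back to the 28th
        return day.replace(year=day.year - years, day=28)


def recent_window(records, today, max_age_years=3):
    """Keep only (date, payload) records from the last max_age_years years."""
    oldest_allowed = years_before(today, max_age_years)
    return [r for r in records if r[0] >= oldest_allowed]


# Hypothetical records: one per year over a 10 year period
records = [(date(year, 1, 1), f"issue-{year}") for year in range(2014, 2024)]

recent = recent_window(records, today=date(2024, 1, 1))
print(f"Training on {len(recent)} of {len(records)} records")
```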


If you invest some time in doing research and reading up about Machine Learning you will undoubtedly come across public data sets. These data sets are popular for experimenting with Machine Learning. For example, Kaggle has many, and even runs competitions to see who can create the most accurate results. The chosen data sets have been carefully constructed and in some cases have been pre-processed, which makes them completely amenable to Machine Learning.

Beautiful Botanical Data

Here's an example in Kaggle which contains feature data for 3 species of Iris. We have 150 examples aggregated as follows:

Iris-setosa - 50 examples

Iris-versicolor - 50 examples

Iris-virginica - 50 examples

As you can see, this is an even spread of examples across the whole data set. In the real world though, businesses’ big data sets are going to look very different and are more likely to be completely uneven and bent out of a perfect shape.

Back to the real world

Using my fictitious business scenario from our previous article, ABC Aviation happens to have 850,000 issues reported by aircraft engineers over a 10 year period. After the QA analysts have allocated them to a phase on the assembly line, the breakdown is as follows:

  1. High-pressure core assembly - 135,621 issues
  2. Low-pressure Turbine assembly - 1,298 issues
  3. Accessory Gearbox assembly - 395,234 issues
  4. Equipment and accessories assembly - 317,047 issues
  5. Total visual inspection - 800 issues

With an uneven spread like this, we have to approach the Machine Learning task in a different way and, more importantly, we have to manage the expectations of ABC Aviation.

Whilst they have lots of historical data, that data is heavily concentrated in two, arguably three, of the phases on the assembly line (3, 4 and potentially 1).
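To put numbers on that spread, a quick calculation of each phase's share is often the first thing worth doing. The sketch below uses the fictitious ABC Aviation counts from the breakdown above; the variable names are made up for this example.

```python
# Issue counts per assembly line phase, taken from the fictitious breakdown above
issues_per_phase = {
    "High-pressure core assembly": 135_621,
    "Low-pressure Turbine assembly": 1_298,
    "Accessory Gearbox assembly": 395_234,
    "Equipment and accessories assembly": 317_047,
    "Total visual inspection": 800,
}

total = sum(issues_per_phase.values())

# Share of all issues that falls into each phase
shares = {phase: count / total for phase, count in issues_per_phase.items()}

# Print the phases from most to least populated
for phase, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{phase:<36} {share:7.2%}")

# How many times larger the biggest phase is than the smallest
imbalance_ratio = max(issues_per_phase.values()) / min(issues_per_phase.values())
print(f"Largest phase is about {imbalance_ratio:.0f} times the size of the smallest")
```

The two smallest phases each come out well under 1% of the data, which is what makes the set so uneven.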

This is a real world business scenario, so we cannot source any further data from somewhere else; it is what it is. It just so happens that there have only been 800 issues with the Total visual inspection team, which is likely down to the fact that they are doing a sterling job.

So how can ABC Aviation make the ground even?

If we were to adopt an approach of making the data even, we would have to cap every phase at the size of the smallest one (800 issues). By doing this our Machine Learning task can only learn from 4,000 (800 x 5) examples. This is less than 0.5% of the original data. By evening out the data, we are discarding a lot of pertinent data, which is likely to be crucial to the business.
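This evening-out approach is commonly called undersampling. The sketch below shows the idea on a toy data set and then repeats the arithmetic for the ABC Aviation counts; the undersample function and the toy labels are illustrative assumptions, not part of any real system.

```python
import random


def undersample(examples_by_label, seed=0):
    """Randomly cut every label down to the size of the smallest label."""
    rng = random.Random(seed)
    smallest = min(len(examples) for examples in examples_by_label.values())
    return {
        label: rng.sample(examples, smallest)
        for label, examples in examples_by_label.items()
    }


# Toy stand-in for an uneven data set
toy = {"A": list(range(50)), "B": list(range(8)), "C": list(range(30))}
balanced = undersample(toy)
kept_toy = sum(len(examples) for examples in balanced.values())
print(f"Toy data: kept {kept_toy} of {sum(len(v) for v in toy.values())} examples")

# The same arithmetic for the fictitious ABC Aviation counts
counts = [135_621, 1_298, 395_234, 317_047, 800]
kept_abc = min(counts) * len(counts)
fraction_kept = kept_abc / sum(counts)
print(f"ABC Aviation: kept {kept_abc} of {sum(counts)} issues ({fraction_kept:.2%})")
```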

Fortunately for ABC Aviation, they have hired some good professional Data Scientists that understand this problem and are able to communicate back to the business to help manage their expectations.

Low-pressure Turbine assembly and Total visual inspection are skewing the data somewhat. So what we can do is remove them and greatly reduce the problem, learning only how to classify the other three phases, which together account for 847,902 of the 850,000 issues. If the Machine Learning task were to produce an 80% accurate model, that model would be able to allocate 678,321 out of 850,000 issues to the correct assembly line phase. From the business’s perspective, Machine Learning has given them a model that allocates 79.8% of all issues to the correct assembly line phase, which means that the QA analysts would only have to review and allocate the remaining 20.2%. That has huge business value, even though it means Machine Learning doesn’t consider 2 out of the 5 phases on the assembly line.
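The arithmetic behind those figures can be checked in a few lines. The 80% accuracy is the hypothetical figure used in the scenario, not a measured result, and the variable names are made up for this example.

```python
# The three phases the reduced model learns to classify (fictitious counts from above)
three_phases = {
    "High-pressure core assembly": 135_621,
    "Accessory Gearbox assembly": 395_234,
    "Equipment and accessories assembly": 317_047,
}

total_issues = 850_000
accuracy_percent = 80  # hypothetical model accuracy assumed in the scenario

# Issues that belong to one of the three phases the model knows about
covered = sum(three_phases.values())

# Issues the model would allocate correctly (integer arithmetic, rounded down)
correct = covered * accuracy_percent // 100

# Correct allocations as a share of every issue, including the two excluded phases
share_correct = correct / total_issues

print(f"{correct:,} of {total_issues:,} issues allocated correctly ({share_correct:.1%})")
print(f"Left for the QA analysts: {1 - share_correct:.1%}")
```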

The Data Scientists can explore further and look at creating another Machine Learning model that can allocate Total visual inspection issues against everything else, then compose the two models to achieve a final solution for the business.

The clear and very interesting point here is that the Data Scientists can compose different Machine Learning tasks and combine the results to form a solution that is a best-fit for the business’ data.
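One possible way to compose two such models is sketched below. The two toy functions are keyword rules standing in for trained models, and every name here (compose, toy_visual_detector, toy_main_classifier) is hypothetical. The order in which the models are consulted is also an assumption; a real team would decide that through experimentation.

```python
def compose(is_visual_inspection, classify_main_phase):
    """Build one allocator out of a one-vs-rest model and a multi-phase model."""

    def allocate(issue_text):
        # Ask the specialist model first, then fall back to the general one
        if is_visual_inspection(issue_text):
            return "Total visual inspection"
        return classify_main_phase(issue_text)

    return allocate


def toy_visual_detector(issue_text):
    """Stand-in for a trained 'Total visual inspection vs everything else' model."""
    return "scratch" in issue_text.lower()


def toy_main_classifier(issue_text):
    """Stand-in for a trained model covering the three well-populated phases."""
    lowered = issue_text.lower()
    if "gearbox" in lowered:
        return "Accessory Gearbox assembly"
    if "core" in lowered:
        return "High-pressure core assembly"
    return "Equipment and accessories assembly"


allocate = compose(toy_visual_detector, toy_main_classifier)

for issue in ("Scratch found on casing", "Gearbox seal leaking", "Loose bracket"):
    print(f"{issue!r} -> {allocate(issue)}")
```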


The size and shape of your business’s data has a huge effect on the Data Science exploration and how Machine Learning tasks can be carried out with your data. There is a clear correlation between the data and how successful Machine Learning will be within the context of your business.

To determine whether your data is fit for Machine Learning, you must first ask yourself the following:

Have I got enough historical data?

Data is key! Obviously the more data you have, the larger the exploration space for Data Science and therefore a larger scale of experiments can be run against your data set to find and exploit patterns in your data. The historical aspect is equally important. If you have a ton of data but it is only covering the past few days, Machine Learning will only be able to exploit patterns that are pertinent to the current day to day business. Think about the future and how well it is likely to perform in 3 months.

What is the current shape of my data?

Understanding how even or uneven your data set is will greatly help you to manage your expectations of how successful applying Machine Learning tasks to your business’s data will be. Machine Learning tasks perform best when the data is even, or as close to even as possible. That doesn’t mean that uneven data is not suitable though; it may be the case that several Machine Learning tasks will need to be applied to your data to form a composite solution for your business.

Oh no, my data is uneven :-( but is there enough spread?

The spread of the data is key to identifying the relationships between the data and the key business decision. If the data is completely dominated by and skewed towards one area, then the task will not be amenable to Machine Learning. If 99% of all of ABC Aviation’s issues were with one phase of the assembly line and the remaining 1% was distributed across the others, then Machine Learning would not provide any business benefit. It would make more sense to allocate all issues to that one phase of the assembly line, and have that team re-allocate any issues they are unable to manage to the correct assembly line phase. Machine Learning would simply be a waste of time and money.
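The "allocate everything to one phase" rule is often called a majority-class baseline, and it is cheap to measure before any Machine Learning is attempted. The sketch below uses made-up labels with a 99% skew; the function name and the phase labels are illustrative assumptions.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Hypothetical heavily skewed data set: 99% of issues belong to one phase
labels = (
    ["phase 3"] * 990
    + ["phase 1"] * 4
    + ["phase 2"] * 3
    + ["phase 4"] * 2
    + ["phase 5"] * 1
)

label, accuracy = majority_baseline(labels)
print(f"Always allocating to {label!r} is right {accuracy:.0%} of the time")
```

Any Machine Learning model would have to beat that figure to be worth its cost.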

Hopefully this gives your business some useful insight into the data exploration activities that are associated with Machine Learning, and has provided some prerequisite activities that can be performed before considering appointing a Data Science team and building a proof of concept Machine Learning solution.
