I want to briefly discuss the issue of Volume from a machine learning perspective and why I think that we need to be more precise when we talk about the volume of Big Data.
Let's start with a typical view of a data set from a machine learning perspective (Figure 1). The data consists of a collection of examples, where each example provides a value for a number of features and for one or more class labels. The typical task is to build a model from the examples so that, in future, we can predict the class labels from the features. Now, describing Big Data as a dataset that is terabytes or petabytes in size tells us nothing about the shape of that matrix, and knowing the shape is, I think, really important.
Figure 1: Data can be viewed as a collection of examples or instances, where each example has some value for various features and class labels.
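To make the "shape of the matrix" point concrete, here is a minimal sketch in Python (using scikit-learn, with entirely made-up dimensions): two synthetic datasets that hold the same number of values, and so roughly the same number of bytes, but have very different shapes.

```python
from sklearn.datasets import make_classification

# A "long" dataset: many examples, few features.
X_long, y_long = make_classification(n_samples=1_000_000, n_features=10,
                                     n_informative=5, random_state=0)

# A "wide" dataset: few examples, many features.
X_wide, y_wide = make_classification(n_samples=1_000, n_features=10_000,
                                     n_informative=5, random_state=0)

# Both matrices hold the same number of values (and roughly the same bytes),
# but their shapes, and the learning problems they pose, are very different.
print(X_long.shape, X_long.nbytes)   # (1000000, 10)
print(X_wide.shape, X_wide.nbytes)   # (1000, 10000)
```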
Big Data or "Lots of Data"?
A data set can be very large because it contains a large number of examples, or a large number of features, or both, and there is a significant qualitative difference between the two. As more examples are gathered, each new one becomes less valuable, whereas the value of a new feature is independent of the number of existing features.
Figure 2: Data sets can be made very large by adding lots more examples, but each new example brings little added value and the number of examples we need to train learners with remains constant.
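The diminishing value of extra examples is easy to see empirically. Below is a hedged sketch (Python and scikit-learn, with an arbitrary synthetic dataset and sample sizes chosen purely for illustration): test accuracy typically flattens out long before all the available examples have been used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic "lots of data": many examples, a modest number of features.
X, y = make_classification(n_samples=200_000, n_features=20,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Train on ever larger subsets of the training data and watch accuracy plateau.
for n in [100, 1_000, 10_000, 100_000]:
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{n:>7} training examples -> test accuracy {acc:.3f}")
```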
This point was made back in 2001 when Doug Laney first coined the 3Vs. In his original report he wrote:
"... as data volume increases, the relative value of each data point decreases proportionately resulting in a poor financial justification for merely incrementing online storage. Viable alternates/supplements to hanging new disk include:In other words, it really isn't worth storing a huge number of examples for building models. Of course companies need to store as much information as they can in order to apply the models but from a machine learning perspective most of the examples are not needed. This point was well made in a blog post by Joseph Rickert of Revolution Analytics who looked at one specific example showing that a model generated using 12 thousand examples was almost identical to one using 120 million.
- Limiting certain analytic structures to a percentage of statistically valid sample data
- Profiling data sources to identify and subsequently eliminate redundancies
- Monitoring data usage to determine cold spots of unused data that can be eliminated or offloaded to tape"
This is not new or surprising; it is what machine learning has done for a long time, well before anyone spoke of Big Data, whenever we use cross validation. The implication, though, is that if Big Data is characterised only by having lots of examples then we really don't have anything new. We have more examples to choose from, but the amount of data we want to feed into our learners remains about the same. As Rickert put it, this is "lots of data", not Big Data.
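Rickert's observation can be reproduced in miniature. The sketch below (Python and scikit-learn, synthetic data, and sizes far smaller than his 120 million rows, so only an analogy to his experiment) fits the same kind of model on a 1% random sample and on the full dataset; under these assumptions the two sets of coefficients come out nearly identical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A "full" dataset and a 1% random sample of it.
X, y = make_classification(n_samples=500_000, n_features=20,
                           n_informative=10, random_state=0)
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=5_000, replace=False)

full = LogisticRegression(max_iter=1000).fit(X, y)
sampled = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

# The coefficients learned from the 1% sample track the full-data model closely.
print(np.round(full.coef_[0][:5], 2))
print(np.round(sampled.coef_[0][:5], 2))
```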
Big Data is Wide Data
In my opinion, the real challenge posed to machine learning by Big Data comes when the data is inherently big: when the number of features (and their cardinality) grows massively. This is wide data or (if you insist on maintaining Vs) high vertex data. The key point here is that as we add more features the value of each new feature is unknown, but it does not diminish with the number of existing features. There may well be redundancy in the data set, and almost certainly the vast majority of features are not helpful for predicting a given class. But a priori there is no way to know which features are or are not important, so we must deal with them all when learning. We cannot simply take a sample of the features and be confident that our model will be good; in fact, the more features we have, the greater the chance that a random sample of them contains none of the useful ones.
Figure 3: Big Data is wide - it has more features and each feature is of a higher cardinality. The added value of each new feature is unknown and independent of the number of features.
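To put a rough number on that risk, here is a small sketch in Python (the counts are illustrative assumptions, not measurements): if only m of p features are useful and we keep a random subset of k of them, the probability of missing every useful feature is C(p-m, k) / C(p, k), which climbs towards 1 as p grows with m and k fixed.

```python
from math import comb

def prob_miss_all_useful(p, m, k):
    """Probability that a random sample of k of the p features
    contains none of the m useful ones (hypergeometric)."""
    return comb(p - m, k) / comb(p, k)

# Keep 100 features at random, when only 10 of the p features are useful.
for p in [1_000, 10_000, 100_000]:
    print(p, round(prob_miss_all_useful(p, m=10, k=100), 3))
```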
When the data contains a large number of high-cardinality features it becomes complex and time-consuming to process it all in order to build a good model, and we are in real danger of including features in the model that are not important at all. It becomes imperative to preprocess the data in some way to reduce its dimensionality, remove redundant data and combine features where necessary. In other words, the tasks of feature selection and feature extraction become essential. Fortunately for me this is the area that interests me most (OK, caveat - I probably think all this because this is the area that interests me most).
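As a rough illustration of what feature selection and extraction can look like in code, here is a minimal sketch (Python and scikit-learn, with arbitrary sizes and just two of the many possible techniques): univariate selection to keep the columns most associated with the class label, and PCA to combine correlated columns into a smaller set.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# A wide, synthetic dataset: 1,000 examples, 5,000 features, few of them informative.
X, y = make_classification(n_samples=1_000, n_features=5_000,
                           n_informative=20, random_state=0)

# Feature selection: keep the 50 features most associated with the class label.
X_selected = SelectKBest(score_func=f_classif, k=50).fit_transform(X, y)

# Feature extraction: combine the original features into 50 principal components.
X_extracted = PCA(n_components=50).fit_transform(X)

print(X.shape, X_selected.shape, X_extracted.shape)
# (1000, 5000) (1000, 50) (1000, 50)
```

Either route takes the 5,000-column matrix down to something a learner can handle, though neither removes the need to think about what information is being thrown away.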
The take-away message here, I think, is that when we talk about Big Data from a machine learning perspective we ought to focus far less on how many bytes there are and far more on how many features there are.