Introduction to Coresets

Motivation

In recent years, the term "Big Data" has become widely used across domains such as deep learning, machine learning, and robotics. The achievements of machine learning have resulted in increasingly complex models that require massive amounts of data for training. The availability of these large and diverse datasets has played a crucial role in many of the recent advancements in machine learning.

But do you know “how BIG is Big Data”?

Well, the answer to this question depends on the specific context.

Limited Hardware

In some scenarios, there is a need to implement machine learning algorithms on edge devices, like smartphones and IoT devices. These devices have limited computational resources and memory capacity, which makes it challenging to run machine learning algorithms even on small data. For edge devices, even a small dataset of 1000 images can be considered 'Big Data'.

Limited Time

Small data may be considered big data when there are time constraints on solving a problem, because the data must be processed and analysed within a limited time frame. For example, in self-driving cars, real-time decision-making requires quick processing and analysis of small amounts of data generated by multiple sensors. Although this data is small, it can be considered big data given the time constraints imposed by the need for real-time decision-making.

New computation models

Big data is not always just a concept - sometimes it is actually very big. Data sets have grown rapidly with the advent of mobile devices, IoT devices, aerial and remote sensing, software logs, cameras, microphones, RFID readers, and wireless sensor networks. The world's storage capacity for information has doubled roughly every 40 months since the 1980s, and as of 2012, 2.5 exabytes of data were generated every day. IDC (International Data Corporation) predicts that global data volume will continue to grow exponentially, reaching 163 zettabytes by 2025.

For the massive, effectively infinite amounts of data created every day, new computation models may be required to support the processing.

The Streaming Model

A streaming model is a dynamic computational model that processes data in real time as it is generated, enabling faster and more efficient analysis than offline (batch) processing. To maintain an optimal or approximate solution over an infinite data stream, a memory-limited system processes data items one by one or in batches. The system aims to optimise the solution for the entire data stream despite being able to store only a maximum of $m$ points at a time.

The Distributed Model

The distributed model involves the processing and storage of data across multiple nodes or computing devices. This approach enables the distribution of computational tasks, making it possible to process and analyse vast amounts of data in a more efficient and scalable manner. The data is partitioned across multiple nodes, with each node processing its subset of data and then communicating the results with other nodes. The goal is to maintain the optimal solution for the entire data set distributed over all nodes.

Mislabelled input data

Although modern computation models can handle vast amounts of data, the utilisation of such datasets for machine learning applications often necessitates human intervention. For instance, consider the task of training a machine learning model to differentiate between images of cats and dogs, which requires labelling the images. This process typically entails requesting humans to provide judgments on unlabelled data, such as categorising images as either cats or dogs.

However, the use of vast datasets in machine learning introduces the possibility of human errors during the labelling process, which can result in mislabelled input data and negatively impact the performance of the machine learning algorithm. Thus, data scientists must identify mislabelled data items before applying the machine learning algorithm to ensure a more accurate reflection of the real world.

Data quality

Data quality is a critical aspect of any data-driven application or analysis, and it requires ongoing attention and investment to ensure that data is accurate, complete, and consistent. Even when the data is accurately labelled, data scientists must carefully curate the dataset to extract the most important images for their application. For instance, consider the following dataset for a cow-versus-dog classification problem:

(Figure: a sample dataset of cow and dog images; the fourth image shows a cow indoors.)

We can see that the 4th image looks odd as it is unusual to see a cow indoors. Selecting an image of a cow indoors may be crucial to train the model effectively, or conversely, it may need to be discarded as an outlier that doesn't accurately represent the typical use case. The decision to keep or discard the image depends on the specific problem and the goals of the machine learning model.

Coresets

An effective solution to the challenges described above is the use of "coresets". Recently, this approach has gained attention in both practical and theoretical domains. In simple terms, a coreset is a small, weighted subset of the original dataset that yields approximately the same solution as solving the problem on the complete, larger dataset, with little to no impact on the performance metrics of interest.

Defining a problem

To precisely define and construct coresets for a machine learning problem, we first need to formally establish the fundamental components used to formulate such a problem. The components required to define an optimization problem are:

  1. The input data (denoted by $P$): This refers to a dataset, typically large, on which we aim to optimise some cost function.
  2. The query set (denoted by $Q$): This is the set of potential solutions or models, from which machine learning algorithms aim to derive an optimal model that minimises a given cost function in relation to the input dataset.
  3. The cost function $f: P \times Q \to [0, \infty)$ that we aim to optimise.

These three components serve to precisely define the optimization problem under consideration.

For example, in linear regression:

  • The input data comprises a set of points $P = \{p_1, \dots, p_n\} \subseteq \mathbb{R}^d$ and their corresponding label function $y: P \to \mathbb{R}$ (as depicted by the red points in the diagram).
  • The query set $Q$ is composed of every vector in $\mathbb{R}^d$ (the green line is an example of a query/candidate solution).
  • The cost function we aim to minimise is $f(P, x) = \sum_{p \in P} \left(p^T x - y(p)\right)^2$.
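
To make these components concrete, here is a minimal Python sketch of the linear-regression setup. The synthetic data, dimensions, and the function name `cost` are illustrative assumptions on our part, not part of the formal definition:

```python
import numpy as np

# Input data P: n points in R^d, each with a label y(p).
rng = np.random.default_rng(0)
n, d = 1000, 3
P = rng.normal(size=(n, d))                                     # the input set P (the "red points")
y = P @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)   # the label function y

# Query set Q: every vector x in R^d is a candidate solution (a "green line").
def cost(P, y, x):
    """f(P, x) = sum over p in P of (p^T x - y(p))^2."""
    return np.sum((P @ x - y) ** 2)

# Evaluate the cost of one arbitrary candidate query.
x_candidate = np.zeros(d)
print(cost(P, y, x_candidate))
```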

Defining a coreset

Given the set of components that collectively formulate an optimization problem, i.e.

  • Let $P$ be an input set.
  • Let $Q$ be the query set.
  • Let $f: P \times Q \to [0, \infty)$ be a cost function.

Then, a coreset is a pair $(C, w)$, where $C \subseteq P$ and $w: C \to [0, \infty)$ is a weight function, such that for every query $q \in Q$ (for a given approximation error $\epsilon > 0$):

  • $(1 - \epsilon) \sum_{p \in P} f(p, q) \;\le\; \sum_{c \in C} w(c)\, f(c, q) \;\le\; (1 + \epsilon) \sum_{p \in P} f(p, q)$
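
The definition translates directly into a simple numerical check: a weighted subset is a valid coreset if its weighted loss stays within a $(1 \pm \epsilon)$ factor of the full loss for every query. Below is a hedged sketch for the linear-regression cost above; the helper name `satisfies_coreset_guarantee` is ours, and a finite list of test queries can only refute, never prove, the guarantee over all of $Q$:

```python
import numpy as np

def full_loss(P, y, x):
    """f(P, x): sum of squared residuals over all input points."""
    return np.sum((P @ x - y) ** 2)

def weighted_loss(C, y_C, w, x):
    """Weighted loss of the candidate coreset (C, w)."""
    return np.sum(w * (C @ x - y_C) ** 2)

def satisfies_coreset_guarantee(P, y, C, y_C, w, queries, eps):
    """Check (1 - eps) f(P, q) <= weighted loss <= (1 + eps) f(P, q)
    on a finite set of test queries."""
    for x in queries:
        fp = full_loss(P, y, x)
        fc = weighted_loss(C, y_C, w, x)
        if not ((1 - eps) * fp <= fc <= (1 + eps) * fp):
            return False
    return True
```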

Coresets Properties

A coreset has several interesting properties beyond serving as a close approximation of the optimal solution derived from the original big data. In this blog, we shall outline three properties of a coreset.

  1. The loss of the input data $P$ can be approximated by the coreset $(C, w)$ with respect to any candidate solution, not just the optimal one, i.e., for every $q \in Q$. This property is of great significance as it supports the processing of streaming data and validates the efficacy of heuristics derived from the coreset in achieving good results for the entire data set.

  2. A union of coresets is a coreset. Suppose you are given a set $P$ that was partitioned into 5 subsets $P_1, \dots, P_5$. For each of these subsets, we compute a coreset, i.e., for every set $P_i$ we compute its own coreset $(C_i, w_i)$, which satisfies for every $q \in Q$: $\sum_{c \in C_i} w_i(c) f(c, q) \in (1 \pm \epsilon) \sum_{p \in P_i} f(p, q)$.

    Define the set $C = C_1 \cup \dots \cup C_5$ and the weight function $w$ as the union of the five weight functions; then $(C, w)$ is a coreset for $P$, i.e., for every $q \in Q$: $\sum_{c \in C} w(c) f(c, q) \in (1 \pm \epsilon) \sum_{p \in P} f(p, q)$.

  3. A coreset of a coreset is a coreset. Recall the set $(C, w)$, which is a coreset for the input data $P$. We can further compute a coreset (once more) for $C$ to obtain a new, smaller coreset $(C', w')$. Now, for every $q \in Q$:

    1. $\sum_{c' \in C'} w'(c') f(c', q) \in (1 \pm \epsilon) \sum_{c \in C} w(c) f(c, q)$, and
    2. $\sum_{c \in C} w(c) f(c, q) \in (1 \pm \epsilon) \sum_{p \in P} f(p, q)$.

Thus, $(C', w')$ is a (smaller) coreset for $P$. We note that the approximation error of $(C', w')$ is larger than that of $(C, w)$, as we computed a coreset of a coreset.
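
To see why the error compounds, chain the two guarantees above for any query $q \in Q$:

$\sum_{c' \in C'} w'(c') f(c', q) \;\le\; (1 + \epsilon) \sum_{c \in C} w(c) f(c, q) \;\le\; (1 + \epsilon)^2 \sum_{p \in P} f(p, q)$,

and symmetrically $\sum_{c' \in C'} w'(c') f(c', q) \ge (1 - \epsilon)^2 \sum_{p \in P} f(p, q)$. So $(C', w')$ approximates $P$ up to a factor of $(1 \pm \epsilon)^2 \approx 1 \pm 2\epsilon$ for small $\epsilon$ - slightly worse than the $(1 \pm \epsilon)$ factor of $(C, w)$.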

How to Compute a Coreset?

Coresets are a form of data summarization that approximate the original data and reduce the computational costs while maintaining similar performance.

Now how can we construct a coreset for a given input dataset?

To compute a coreset, there are various methods and algorithms available, each tailored to a specific machine learning problem. These techniques can broadly be categorised into two classes:

  1. Deterministic coresets: For deterministic coresets, the algorithms are designed to produce a set that is certain to provide a precise approximation of the original input data. In other words, there is a 100% guarantee that the output set will accurately represent the entirety of the input data.
  2. Sampling-based coresets: For this class, algorithms use probabilistic methods to select a subset of data that satisfies certain conditions. The computed subset of data is likely to be a coreset with a high probability, rather than being guaranteed to be a coreset as in the deterministic algorithms.

Each of these classes has its own advantages and disadvantages. For instance, the second category involves probabilistic methods that might fail on some occasions. However, these methods are typically very efficient, resulting in rapid approximations of the initial dataset. Additionally, a common importance sampling framework, widely used in recent publications, is frequently employed to compute a sampling-based coreset. The framework proceeds as follows:

  1. Given the tuple $(P, Q, f)$ which defines our ML problem at hand.
  2. For every item $p \in P$, let $s(p)$ be a number defining its importance/sensitivity - we will explain how to compute it next.

Having defined and computed the importance weight of each item in the set, the importance sampling framework proceeds to derive a coreset by applying the following steps:

  1. Set $t = \sum_{p \in P} s(p)$, i.e., $t$ is the total (sum of) importance of all points.
  2. Build the set $C$ as an i.i.d. sample of $m$ points from $P$, where each point $p$ is sampled with probability $s(p)/t$. The larger $m$ is, the better the approximation error of the final coreset.
  3. For every $c \in C$, define its weight as $w(c) = \frac{t}{m \cdot s(c)}$.

Then, with high probability, $(C, w)$ is a coreset for $P$, i.e., $C \subseteq P$ and $w: C \to [0, \infty)$, such that for every query $q \in Q$:

$\sum_{c \in C} w(c)\, f(c, q) \in (1 \pm \epsilon) \sum_{p \in P} f(p, q)$

Note. The sample (coreset) size is a function of the desired approximation error, probability of failure, total sensitivity t, and another problem-dependent measure of complexity. We will provide more theoretical detail and insights in future blog posts.
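
Below is a hedged Python sketch of this sampling step, assuming the sensitivities $s(p)$ are already available as an array `s` (computing them is discussed next). The function name `sensitivity_sample` and the use of NumPy are our own choices, not a reference implementation:

```python
import numpy as np

def sensitivity_sample(P, s, m, rng=None):
    """Importance sampling step of the framework:
    draw m i.i.d. points with probability s(p) / t and
    weight each sampled point by t / (m * s(p))."""
    rng = rng if rng is not None else np.random.default_rng()
    t = s.sum()                                    # total sensitivity t
    idx = rng.choice(len(P), size=m, replace=True, p=s / t)
    C = P[idx]                                     # the sampled points
    w = t / (m * s[idx])                           # compensating weights
    return C, w, idx

# Example (with P and sensitivities s from the surrounding sketches):
# C, w, idx = sensitivity_sample(P, s, m=200)
```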

But how do we compute the importance $s(p)$?

Informally, the significance or weight assigned to an item/point $p \in P$ is defined as its maximum impact on the loss or cost function over all potential candidate solutions, i.e., if there exists a query $q \in Q$ such that $f(p, q)$ is large with respect to $\sum_{p' \in P} f(p', q)$, then $p$ is important for the query $q$.

The formal definition is given as follows:

  • For every $p \in P$:

    $s(p) = \max_{q \in Q} \dfrac{f(p, q)}{\sum_{p' \in P} f(p', q)}$

  • To explain the definition of sensitivity/importance, let's look at the following two cases - when $s(p)$ is high and when $s(p)$ is low - and see the difference between them.

If $s(p)$ is high

There exists a solution $q \in Q$ such that $p$ contributes a large fraction of the loss ($p$ is important).

If $s(p)$ is low

For every solution $q \in Q$, $p$ barely affects the loss; hence, it is not important.

Therefore, we sample based on this criterion, since a high sensitivity reflects the existence of a solution for which our point has a significant contribution to the cost, making the point critical with respect to that solution.
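
Computing the exact maximum over all queries is itself a hard optimisation problem, and in practice each ML problem comes with its own provable upper bound on $s(p)$. As a purely illustrative, hedged sketch (not a provable bound), one can approximate the sensitivities for the linear-regression cost by maximising only over a finite sample of candidate queries:

```python
import numpy as np

def approx_sensitivities(P, y, queries):
    """Approximate s(p) = max_q f(p, q) / f(P, q), taking the max only over
    a finite list of candidate queries (an illustration, not a provable bound)."""
    s = np.zeros(len(P))
    for x in queries:
        per_point = (P @ x - y) ** 2               # f(p, q) for every p
        total = per_point.sum()                    # f(P, q)
        if total > 0:
            s = np.maximum(s, per_point / total)
    return s

# Example on synthetic linear-regression data.
rng = np.random.default_rng(1)
P = rng.normal(size=(1000, 3))
y = P @ np.array([2.0, -1.0, 0.5])
queries = rng.normal(size=(100, 3))                # random candidate solutions
s = approx_sensitivities(P, y, queries)
```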

Applications

Let us revisit the initial justifications for the application of a coreset. Subsequently, we shall demonstrate the ways in which coresets can be used to tackle the aforementioned challenges.

Limited hardware/time

One approach to tackling computational or time constraints in applying machine learning algorithms is by constructing a coreset for the data using an efficient sensitivity sampling framework. The coreset can then be used as a representative subset to apply classic machine learning algorithms, resulting in a more efficient approximation with reduced memory and energy consumption.

Coresets also “magically” support the previously discussed computational settings:

  1. The streaming setting, which operates with limited memory and only one pass over an infinite input stream.
  2. The parallel/distributed setting, which employs multiple threads, such as in GPUs, or more broadly, data distributed across a network or cloud (e.g., AWS). The approach may minimise or even eliminate communication between the machines.
  3. The combined use of (1) and (2), which allows simultaneous handling of an unbounded stream of data and distributed data.

In the streaming setting, where the data is received in an infinite stream, computing a coreset can help in maintaining a small representation of the Big Dataset. This is done by merging the newly received data chunk/batch with the pre-computed coreset, and then computing a new coreset when the size of the coreset exceeds a predefined limit. By leveraging the property that the coreset of a coreset is a coreset, we can maintain a small coreset for the entire dataset seen so far.

It is important to note that as we compute coresets of coresets again and again, the approximation error grows. To address this issue, we can use the streaming (merge-and-reduce) tree: coresets are merged and re-compressed in pairs, level by level, so each data item passes through only a logarithmic number of compressions, which keeps the overall approximation error small.
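
A hedged Python sketch of this merge-and-reduce idea follows. The `compress` helper is a uniform-subsampling placeholder standing in for a real coreset construction (e.g., the sensitivity sampling above), and the class name `StreamingCoreset` is ours:

```python
import numpy as np

def compress(C, w, m, rng):
    """Placeholder coreset construction: uniformly subsample m weighted points.
    Replace with sensitivity sampling to obtain real coreset guarantees."""
    idx = rng.choice(len(C), size=m, replace=True)
    return C[idx], w[idx] * (len(C) / m)           # rescale so the total weight stays unbiased

class StreamingCoreset:
    """Merge-and-reduce (streaming) tree: each chunk becomes a level-0 coreset;
    two coresets on the same level are merged (a union of coresets is a coreset)
    and re-compressed (a coreset of a coreset is a coreset) into the next level."""

    def __init__(self, m, rng=None):
        self.m = m
        self.rng = rng if rng is not None else np.random.default_rng()
        self.levels = {}                           # level -> (points, weights)

    def add_chunk(self, chunk):
        C, w = compress(chunk, np.ones(len(chunk)), self.m, self.rng)
        level = 0
        while level in self.levels:                # carry upwards, like binary addition
            C2, w2 = self.levels.pop(level)
            merged = np.vstack([C, C2])
            merged_w = np.concatenate([w, w2])
            C, w = compress(merged, merged_w, self.m, self.rng)
            level += 1
        self.levels[level] = (C, w)

    def coreset(self):
        """Union of the coresets kept at all levels: a summary of the whole stream so far."""
        parts = list(self.levels.values())
        return np.vstack([c for c, _ in parts]), np.concatenate([w for _, w in parts])

# Feed the stream chunk by chunk and query the maintained coreset at any time.
stream = StreamingCoreset(m=200)
rng = np.random.default_rng(2)
for _ in range(16):
    stream.add_chunk(rng.normal(size=(1000, 3)))
C, w = stream.coreset()
```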

In the distributed setting, where the data is distributed across multiple machines, we can compute a coreset for each dataset separately and merge these coresets into one, allowing for the computation of a solution for the entire dataset. This approach can be further optimised by computing a new coreset for the merged coreset and then solving the problem on the result. Additionally, the streaming tree can be utilised to improve the approximation error and obtain better results.
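
A hedged sketch of this distributed recipe; here the "machines" are just list entries processed in a loop, standing in for nodes in a real cluster, and `compress` is again a placeholder for a real coreset construction:

```python
import numpy as np

def compress(C, w, m, rng):
    """Placeholder coreset construction (see the streaming sketch above)."""
    idx = rng.choice(len(C), size=m, replace=True)
    return C[idx], w[idx] * (len(C) / m)

def distributed_coreset(partitions, m, rng):
    """Each partition (one per machine, in a real deployment) computes a local
    coreset; only these small summaries are communicated and merged, and the
    merged coreset is optionally compressed once more before solving."""
    local = [compress(P_i, np.ones(len(P_i)), m, rng) for P_i in partitions]
    C = np.vstack([c for c, _ in local])           # union of coresets is a coreset
    w = np.concatenate([wt for _, wt in local])
    return compress(C, w, m, rng)                  # coreset of the merged coreset

rng = np.random.default_rng(3)
partitions = [rng.normal(size=(5000, 3)) for _ in range(4)]   # 4 "machines"
C, w = distributed_coreset(partitions, m=300, rng=rng)
```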

Identifying Incorrectly Labelled Data and High-Quality Data

Let us revisit the concept of importance defined earlier, which pertains to the sensitivity/importance framework used for sampling points based on their relevance with respect to the loss/cost function.

$s(p) = \max_{q \in Q} \dfrac{f(p, q)}{\sum_{p' \in P} f(p', q)}$

According to the importance definition, an item is deemed significant due to its contribution to the cost/loss of a specific query, whereas other items hardly affect the same query. In other words, other items seem to conform well to the given solution, e.g., being correctly classified, while this specific item does not.

This observation naturally leads to the conclusion that the identified item (e.g., input image) might have been mislabelled at the outset since it exhibits high loss, while other items have a low one. Consequently, one could hypothesise that modifying the label of this item would result in a lower loss for the same query. To detect mislabelled items in a smart way, we can utilise an algorithm based on the importance/sensitivity of the items.
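
One simple, hedged way to operationalise this idea (our own illustration, not a specific published algorithm) is to rank items by their computed sensitivity and send the top-ranked ones for human label review:

```python
import numpy as np

def flag_suspicious_items(sensitivities, k=20):
    """Return the indices of the k items with the highest sensitivity; these are
    natural candidates for review (possibly mislabelled, or rare but important)."""
    return np.argsort(sensitivities)[::-1][:k]

# Example, reusing the sensitivities s computed in the earlier sketch:
# suspects = flag_suspicious_items(s, k=20)
```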

Alternatively, the identified point (e.g., an input image) may be of high quality: it carries a high degree of sensitivity or importance and is correctly labelled, but features in the image, such as the background, surroundings, or colour, suggest it could plausibly belong to a different label, e.g., an image of a cow indoors, which is very rare.

Therefore, by applying a smart searching algorithm based on the importance or sensitivity of data points, we can detect important high-quality images. This discovery process can either allow us to identify rare and unusual data that require more focused attention during training to classify them accurately, or reveal images that are misleading to our model and do not represent typical cases, and thus should be excluded to prevent their confounding effect on the model's overall performance.

Coresets possess a diverse range of potential applications, and we would like to direct the attention of the interested readers towards the following surveys on the subject matter: Introduction to Core-sets: an Updated Survey and Overview of accurate coresets.