Hi! Welcome to my series of blog posts, “Decoding AI Deception”, in which we take a closer look at each kind of adversarial AI attack. This post covers poisoning attacks in detail: their common types, the scenarios in which they apply, the model vulnerabilities they exploit, and remedial measures.

Poisoning Attack and Its Types

As covered in the previous post, a poisoning attack is a form of adversarial AI attack that corrupts the data used to train or retrain a model. Its most common forms are as follows:

  • Label flipping: In this type of attack, an adversary corrupts the training dataset by changing the labels of existing data points. The goal is to cause the model to misclassify similar instances during inference. For example, an attacker could flip the labels of spam and non-spam emails in a spam filter’s training dataset, leading the model to misclassify future emails (a minimal sketch follows this list).
  • Outlier injection: Outliers are data points that deviate markedly from the rest of a dataset. The attacker adds carefully crafted outlier data points to the training dataset. These points do not resemble normal instances and are designed to manipulate the model’s decision boundaries. By injecting these outliers, the attacker aims to make the model more sensitive to specific inputs, causing misclassification or reduced performance on particular instances (see the second sketch after this list).
  • Gradient ascent/descent attacks: In these attacks, the adversary computes the gradients of the target model with respect to the malicious input samples. The gradients are then used to craft poisoning samples that either maximize (gradient ascent) or minimize (gradient descent) the model’s loss function. By adding these samples to the training dataset, the attacker influences the model’s training process, causing it to learn incorrect or biased decision boundaries.
  • Targeted misclassification: The attacker seeks to manipulate the model’s behavior for specific target instances, without affecting the overall performance of the model. This is often done by injecting a small number of malicious samples into the training dataset that are designed to cause the model to misclassify the target instances. For example, an attacker might poison a facial recognition system to misclassify a specific individual as someone else, while maintaining the model’s overall accuracy for other individuals.
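To make the label-flipping idea concrete, here is a minimal sketch in Python. It assumes a generic binary classifier trained with scikit-learn; the synthetic dataset, the 20% flip rate, and the helper name `flip_labels` are illustrative choices, not part of any specific real-world attack.

```python
# A minimal sketch of label flipping on a stand-in spam/ham dataset
# (the dataset and flip rate are hypothetical, for illustration only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for a spam-filter training set: X are feature vectors, y are 0/1 labels.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def flip_labels(y, flip_fraction=0.2, rng=rng):
    """Return a copy of y with a random fraction of labels inverted."""
    y_poisoned = y.copy()
    n_flip = int(flip_fraction * len(y))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]  # invert 0 <-> 1
    return y_poisoned

clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, flip_labels(y_train))

print("clean accuracy:   ", clean_model.score(X_test, y_test))
print("poisoned accuracy:", poisoned_model.score(X_test, y_test))
```

Running this typically shows the poisoned model scoring noticeably worse on held-out data than the clean one, which is exactly the degradation the attacker is after.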
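Outlier injection can be sketched in a similar spirit: a small helper appends extreme, attacker-labelled points to the training set before the model is fitted. The function name `inject_outliers`, the magnitude, and the number of injected points are assumptions for illustration.

```python
# A minimal sketch of outlier injection, assuming a NumPy feature matrix X
# and label vector y like those in the previous sketch; values are illustrative.
import numpy as np

def inject_outliers(X, y, n_outliers=50, target_label=0, magnitude=10.0, seed=0):
    """Append extreme, adversary-chosen points labelled as `target_label`."""
    rng = np.random.default_rng(seed)
    # Points far outside the normal feature range drag the learned
    # decision boundary toward the attacker's chosen region.
    X_out = rng.normal(loc=magnitude, scale=1.0, size=(n_outliers, X.shape[1]))
    y_out = np.full(n_outliers, target_label)
    return np.vstack([X, X_out]), np.concatenate([y, y_out])
```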

Applicable Scenarios

  • Label Flipping Attack Scenario: An attacker infiltrates a facial recognition system’s training data and changes the labels of certain individuals. By doing so, the attacker can cause misidentification, potentially allowing unauthorized access to restricted areas or enabling criminal activities.
  • Outlier Injection Attack Scenario: An attacker manipulates a financial fraud detection system’s training data by injecting numerous transactions with extreme values. The model then overfits to these outliers, leading to poor detection of genuine fraud cases and an increase in false positives, impacting the system’s overall effectiveness.
  • Gradient Ascent Attack Scenario: An attacker crafts poisoned samples that maximize a spam filtering model’s loss function, making the model overly sensitive to specific email patterns. This can lead to false positives, i.e., genuine emails being classified as spam.
  • Gradient Descent Attack Scenario: An attacker crafts poisoned samples that minimize the model’s loss on spam-like inputs, blurring the boundary between spam and genuine emails. This can lead to false negatives, i.e., spam emails being classified as genuine.
  • Targeted Misclassification Attack Scenario: An attacker wants to disrupt a product recommendation system used by an e-commerce platform. They manipulate the training data to create poisoned samples that cause the AI model to consistently misclassify certain items, recommending unrelated or irrelevant products to users, which can result in a loss of revenue and user dissatisfaction.

Model Vulnerabilities

The root cause of poisoning attacks lies in the vulnerabilities of machine learning models and their reliance on training data. These vulnerabilities can be broadly categorized into three main areas:

  • Model sensitivity to training data: Machine learning models, especially deep neural networks, are highly sensitive to the data they are trained on. They can learn intricate patterns and adapt their internal parameters based on the training dataset. Attackers exploit this sensitivity by introducing malicious samples into the training data, which can alter the model’s decision boundaries and cause misclassification or unintended behavior during inference.
  • Overfitting and lack of robustness: Machine learning models can sometimes overfit to the training data, meaning they capture not only the underlying patterns but also the noise and peculiarities of the training dataset. As a result, they may not generalize well to new, unseen data. Adversarial attacks exploit this lack of robustness by crafting malicious samples that cause the model to overfit to the incorrect representations or associations, leading to misclassification or other undesirable outcomes.
  • Gradient-based optimization: Most machine learning models, particularly deep learning models, rely on gradient-based optimization methods to update their internal parameters during training. By computing the gradients of the loss function with respect to the input data or model parameters, attackers can create carefully crafted malicious samples that exploit the model’s learning process. Gradient ascent and descent attacks are prime examples of this, as they manipulate the loss function in a way that alters the model’s decision boundaries, resulting in misclassification or other unintended consequences (a minimal sketch follows this list).
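As a rough illustration of this gradient-based crafting, the sketch below nudges a sample in the direction that increases a logistic-regression model’s loss. The weights `w` and bias `b` stand for an already-trained model, and the step size and iteration count are arbitrary illustrative values; this is a sketch of the general idea, not a reproduction of any particular published attack.

```python
# A minimal sketch of crafting a poisoning point by gradient ascent on the
# binary cross-entropy loss of a (hypothetical) trained logistic-regression model.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def craft_poison(x, y, w, b, step=0.5, n_steps=20):
    """Move sample x in the direction that increases the model's loss on (x, y)."""
    x = x.astype(float).copy()
    for _ in range(n_steps):
        p = sigmoid(w @ x + b)
        grad_x = (p - y) * w   # d(loss)/dx for binary cross-entropy
        x += step * grad_x     # ascent: maximise loss (use -= for descent)
    return x
```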

Remedial Measures

Various remedial measures can be applied to defend against poisoning attacks, and the right mix depends on the specific use case, the data characteristics, and the risks associated with each attack type. The following are some basic remedial measures that apply to all kinds of poisoning attacks:

  • Access Control: Implement strong access controls and authentication mechanisms to protect training data and prevent unauthorized tampering.
  • Data Validation and Verification: Regularly validate and verify the integrity and accuracy of the data in the training dataset to identify any inconsistencies, tampering, or label manipulation.
  • Data Sanitization: Apply data preprocessing techniques, such as outlier detection and removal, to reduce the impact of anomalous or malicious data points on the model’s performance (a minimal sketch appears after this list).
  • Adversarial Training: Train the model with adversarial examples or use techniques like adversarial robustness to improve its resilience against poisoning attacks and adversarial perturbations.
  • Robust Models: Employ machine learning algorithms that are less sensitive to noisy data, mislabeled samples, or outliers, such as tree-based methods, models with regularization, or noise-tolerant algorithms.
  • Input Validation: Implement input validation or preprocessing techniques to reduce the impact of adversarial perturbations on the input features during inference.
  • Monitoring and Auditing: Continuously monitor and audit the model’s performance and the training data pipeline to detect any unusual patterns, anomalies, or performance degradation, which might indicate a poisoning attack.
  • Retraining and Updating: Periodically retrain and update the AI model with new and clean data to minimize the impact of any poisoning attacks that may have occurred during previous training sessions.
  • Defense-in-Depth: Use a combination of the aforementioned strategies to create a defense-in-depth approach, which increases the likelihood of detecting and mitigating poisoning attacks.
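As an example of the data sanitization step, the sketch below filters a training set with scikit-learn’s IsolationForest before fitting the model. The 5% contamination rate is an assumption that would need tuning for a real dataset, and the helper name `remove_outliers` is illustrative.

```python
# A minimal sketch of pre-training data sanitization with IsolationForest;
# the contamination rate is an assumed value, to be tuned per dataset.
import numpy as np
from sklearn.ensemble import IsolationForest

def remove_outliers(X, y, contamination=0.05, seed=0):
    """Drop training points flagged as anomalous before fitting the model."""
    detector = IsolationForest(contamination=contamination, random_state=seed)
    mask = detector.fit_predict(X) == 1   # +1 = inlier, -1 = outlier
    return X[mask], y[mask]
```

In practice this would run inside the data pipeline, after validation and before training, so that injected outliers never reach the optimizer.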
