Federated learning (FL) is revolutionizing how we learn from data. With its growing popularity, it is now being used in many safety-critical domains such as autonomous vehicles and healthcare. However, since thousands of participants can contribute in this collaborative setting, it is challenging to ensure the security and reliability of such systems. This highlights the need to design FL systems that are secure and robust against malicious participants’ actions while also ensuring high utility, privacy of local data, and efficiency. In this paper, we propose a novel FL framework dubbed FLShield that utilizes benign data from FL participants to validate the local models before taking them into account for generating the global model. This is in stark contrast with existing defenses that rely on the server’s access to clean datasets, an assumption that is often impractical in real-life scenarios and conflicts with the fundamentals of FL. We conduct extensive experiments to evaluate our FLShield framework in different settings and demonstrate its effectiveness in thwarting various types of poisoning and backdoor attacks, including a defense-aware one. FLShield also preserves the privacy of local data against gradient inversion attacks.
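To make the validation idea concrete, below is a minimal sketch, not FLShield’s actual algorithm, of filtering local model updates by participant-side validation scores before averaging; the toy linear validate() function, the flattened-weight representation, and the 0.5 threshold are placeholders introduced only for illustration.

import numpy as np

def validate(weights, data):
    # Placeholder: in practice, load `weights` into the model architecture and
    # return its accuracy on the validator's own benign local data.
    xs, ys = data
    preds = (xs @ weights > 0).astype(int)          # toy linear classifier
    return float((preds == ys).mean())

def filtered_average(local_models, validator_data, threshold=0.5):
    scores = []
    for w in local_models:
        # Each candidate local model is scored by validators on their benign data.
        s = np.mean([validate(w, d) for d in validator_data])
        scores.append(s)
    kept = [w for w, s in zip(local_models, scores) if s >= threshold]
    if not kept:                                    # fall back to keeping all models
        kept = local_models
    return np.mean(kept, axis=0)                    # aggregate only the surviving models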
In recent years, pre-trained language models (e.g., BERT and GPT) have shown superior capability in textual representation learning, benefiting from their large architectures and massive training corpora. The industry has also quickly embraced language models to develop various downstream NLP applications. For example, Google has already used BERT to improve its search system. However, the utility of these language embeddings also brings potential privacy risks. Prior works have revealed that an adversary can either identify whether a keyword exists or gather a set of possible candidates for each word in a sentence embedding. However, these attacks cannot recover coherent sentences, which would leak high-level semantic information from the original text. To demonstrate that the adversary can go beyond the word-level attack, we present a novel decoder-based attack, which can reconstruct coherent and fluent text from private embeddings after being pre-trained on a public dataset of the same domain. This attack is more challenging than a word-level attack due to the complexity of sentence structures. We comprehensively evaluate our attack in two domains and with different settings to show its superiority over the baseline attacks. Quantitative experimental results show that our attack can identify up to 3.5X as many keywords as the baseline attacks. Furthermore, qualitative results are also provided to help readers better understand the attack.
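As a rough illustration of what a decoder-based embedding inversion can look like, the sketch below assumes PyTorch, a frozen black-box sentence encoder producing fixed-size embeddings, and a public in-domain corpus already tokenized into integer IDs; the EmbeddingDecoder class and train_step helper are hypothetical names, not the paper’s implementation.

import torch
import torch.nn as nn

class EmbeddingDecoder(nn.Module):
    def __init__(self, emb_dim, vocab_size, hidden=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(emb_dim, hidden)            # sentence embedding -> initial GRU state
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, sent_emb, tok_ids):
        h0 = torch.tanh(self.init_h(sent_emb)).unsqueeze(0)  # (1, B, H)
        x = self.tok_emb(tok_ids)                             # teacher-forcing inputs
        y, _ = self.gru(x, h0)
        return self.out(y)                                    # logits over next tokens

def train_step(decoder, optimizer, sent_emb, tok_ids):
    # Minimize next-token cross-entropy on the public corpus; at attack time,
    # greedy decoding from a private embedding yields the reconstructed text.
    logits = decoder(sent_emb, tok_ids[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tok_ids[:, 1:].reshape(-1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()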
Steganography enables the covert transmission of digital messages; in image steganography, a secret image is hidden within a non-secret cover image. Deep learning-based methods for image steganography have recently gained popularity but are vulnerable to various attacks. An adversary with varying levels of access to the vanilla deep steganography model can train a surrogate model using another dataset and retrieve hidden images. Moreover, even when uncertain about the presence of hidden information, an adversary with access to the surrogate model can distinguish the carrier image from the unperturbed one. Our paper includes such attack demonstrations that confirm the inherent vulnerabilities present in deep learning-based steganography. Deep learning-based steganography lacks lossless transmission guarantees, rendering sophisticated image encryption techniques unsuitable. Furthermore, key concatenation-based techniques for text data steganography fall short in the case of image data. In this paper, we introduce a simple yet effective keyed shuffling approach for encrypting secret images. We employ keyed pixel shuffling, multi-level block shuffling, and a combination of key concatenation and block shuffling, embedded within the model architecture. Our findings demonstrate that block shuffling-based deep image steganography incurs negligible error overhead compared to conventional methods while providing effective security against adversaries with different levels of access to the model. We extensively evaluate our approach and compare it with existing methods in terms of human perceptibility, key sensitivity, adaptivity, cover image availability, keyspace, and robustness against steganalysis.
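The core keyed block shuffling operation can be sketched as follows with numpy; the block size, key handling, and reshape layout are illustrative choices, not the exact multi-level scheme embedded within the model architecture.

import numpy as np

def block_permutation(h_blocks, w_blocks, key):
    rng = np.random.default_rng(key)                 # the secret key seeds the permutation
    return rng.permutation(h_blocks * w_blocks)

def shuffle_blocks(img, block, key, inverse=False):
    # Assumes image height/width are divisible by `block`.
    h, w, c = img.shape
    hb, wb = h // block, w // block
    blocks = (img.reshape(hb, block, wb, block, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(hb * wb, block, block, c))
    perm = block_permutation(hb, wb, key)
    blocks = blocks[np.argsort(perm)] if inverse else blocks[perm]
    return (blocks.reshape(hb, wb, block, block, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(h, w, c))

img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
scrambled = shuffle_blocks(img, block=8, key=12345)           # fed to the stego encoder
restored = shuffle_blocks(scrambled, block=8, key=12345, inverse=True)
assert np.array_equal(img, restored)                          # shuffling is lossless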
In this paper, we study model inversion attribute inference (MIAI), a machine learning (ML) privacy attack that aims to infer sensitive information about the training data given access to the target ML model. We design a novel black-box MIAI attack that assumes the least adversary knowledge/capabilities to date while still performing similarly to the state-of-the-art attacks. Further, we extensively analyze the disparate vulnerability property of our proposed MIAI attack, i.e., the elevated vulnerability of specific groups in the training dataset (grouped by gender, race, etc.) to model inversion attacks. First, we investigate existing ML privacy defense techniques, namely (1) mutual information regularization and (2) fairness constraints, and show that none of these techniques can mitigate MIAI disparity. Second, we empirically identify possible disparity factors and discuss potential ways to mitigate disparity in MIAI attacks. Finally, we demonstrate our findings by extensively evaluating our attack in estimating binary and multi-class sensitive attributes on three different target models trained on three real datasets.
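For the disparity analysis, a simple way to quantify disparate vulnerability is to compare attack accuracy across demographic groups; the sketch below assumes a hypothetical pandas DataFrame with one row per attacked training record holding the true sensitive value, the inferred value, and a group label.

import pandas as pd

def groupwise_attack_accuracy(records: pd.DataFrame, group_col: str):
    # Fraction of records whose sensitive attribute the attacker inferred correctly,
    # computed separately for each group (e.g., gender or race).
    correct = records["inferred_sensitive"] == records["true_sensitive"]
    return correct.groupby(records[group_col]).mean()

# Example usage (hypothetical results table): a large gap between groups
# indicates disparate vulnerability to the MIAI attack.
# acc_by_gender = groupwise_attack_accuracy(results_df, "gender")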
The increasing use of machine learning (ML) technologies in privacy-sensitive domains such as medical diagnoses, lifestyle predictions, and business decisions highlights the need to better understand whether these ML technologies are leaking sensitive and proprietary training data. In this paper, we focus on model inversion attacks where the adversary knows non-sensitive attributes about records in the training data and aims to infer the value of a sensitive attribute unknown to the adversary, using only black-box access to the target classification model. We first devise a novel confidence score-based model inversion attribute inference attack that significantly outperforms the state-of-the-art. We then introduce a label-only model inversion attack that relies only on the model’s predicted labels but still matches our confidence score-based attack in terms of attack effectiveness. We also extend our attacks to the scenario where some of the other (non-sensitive) attributes of a target record are unknown to the adversary. We evaluate our attacks on two types of machine learning models, decision tree and deep neural network, trained on three real datasets. Moreover, we empirically demonstrate the disparate vulnerability of model inversion attacks, i.e., specific groups in the training dataset (grouped by gender, race, etc.) could be more vulnerable to model inversion attacks.
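The confidence score-based idea can be sketched as follows, assuming black-box access through a scikit-learn-style predict_proba interface; infer_sensitive and its scoring rule are simplified placeholders rather than the paper’s exact attack.

def infer_sensitive(model, record, sensitive_attr, candidates, true_label):
    # Try every candidate value of the unknown sensitive attribute and keep the
    # one under which the model is most confident about the record's known label.
    best_value, best_conf = None, -1.0
    for v in candidates:
        trial = dict(record)                          # record: attribute -> value
        trial[sensitive_attr] = v                     # plug in a candidate value
        proba = model.predict_proba([list(trial.values())])[0]
        conf = proba[true_label]                      # confidence on the known label
        if conf > best_conf:
            best_value, best_conf = v, conf
    return best_value

# Label-only variant (sketch): replace `conf` with a 0/1 agreement check between
# model.predict(...) and the known label, breaking ties with perturbed queries.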
Anomaly detection on data collected by devices such as sensors and IoT objects is indispensable for many critical systems; for example, an anomaly in the data of a patient’s health monitoring device may indicate a medical emergency. Because these devices are resource-constrained, the data they collect are usually off-loaded to the cloud/edge for storage and/or further analysis. However, to ensure data privacy, it is critical that the data be transferred to and managed by the cloud/edge in an encrypted form, which necessitates efficient processing of such encrypted data for real-time anomaly detection. Motivated by the simultaneous demands for data privacy and real-time data processing, in this paper, we investigate the problem of a privacy-preserving real-time anomaly detection service on sensitive, time series, streaming data. We propose a privacy-preserving framework that enables efficient anomaly detection on encrypted data by leveraging a lightweight and aggregation-optimized encryption scheme to encrypt the data before off-loading them to the edge. We demonstrate our solution for a widely used anomaly detection algorithm, the windowed Gaussian anomaly detector, and evaluate the performance of the solution in terms of the obtained model privacy, accuracy, latency, and communication cost.
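For reference, a minimal plaintext version of the windowed Gaussian detector looks like the sketch below; in the proposed framework, the same windowed statistics would be computed over data protected by the aggregation-optimized encryption scheme, which this sketch does not attempt to reproduce.

from collections import deque
import math

class WindowedGaussianDetector:
    def __init__(self, window=100, threshold=3.0):
        self.window = deque(maxlen=window)            # sliding window of recent values
        self.threshold = threshold                    # flag points beyond k std-devs

    def score(self, x):
        if len(self.window) < 2:
            self.window.append(x)
            return 0.0
        mean = sum(self.window) / len(self.window)
        var = sum((v - mean) ** 2 for v in self.window) / len(self.window)
        z = abs(x - mean) / math.sqrt(var + 1e-9)     # z-score w.r.t. the window
        self.window.append(x)
        return z

    def is_anomaly(self, x):
        return self.score(x) > self.threshold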
Ransomware has recently (re)emerged as a popular class of malware that targets a wide range of victims, from individual users to corporate ones, for monetary gain. Our key observation on the existing ransomware detection mechanisms is that they fail to provide an early warning in real time, which results in irreversible encryption of a significant number of files, while post-encryption techniques (e.g., key extraction, file restoration) suffer from several limitations. Also, the existing detection mechanisms yield high false positive rates because they are unable to determine the original intent of file changes, i.e., they fail to distinguish whether a significant change in a file is due to ransomware encryption or to a file operation by the user herself (e.g., benign encryption or compression). To address these challenges, in this paper, we introduce a ransomware detection mechanism, RWGuard, which is able to detect crypto-ransomware in real time on a user’s machine by (1) deploying decoy techniques, (2) carefully monitoring both the running processes and the file system for malicious activities, and (3) omitting benign file changes from being flagged through the learning of users’ encryption behavior. We evaluate our system against samples from the 14 most prevalent ransomware families to date. Our experiments show that RWGuard is effective in the real-time detection of ransomware with zero false negative and negligible false positive (∼0.1%) rates while incurring an overhead of only ∼1.9%.
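The decoy component (1) can be illustrated with a much-simplified polling monitor; RWGuard additionally monitors running processes and file-system activity, and the decoy paths and polling interval below are hypothetical.

import hashlib, os, time

DECOYS = ["decoy_invoice.docx", "decoy_photos.zip"]   # hypothetical decoy file paths

def fingerprint(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def monitor(decoys, interval=1.0):
    # Decoy files should never be touched by legitimate activity, so any change
    # or deletion is treated as a strong indicator of ransomware encryption.
    baseline = {p: fingerprint(p) for p in decoys if os.path.exists(p)}
    while True:
        for p, h in baseline.items():
            if not os.path.exists(p) or fingerprint(p) != h:
                print(f"ALERT: decoy {p} changed -- possible ransomware activity")
        time.sleep(interval)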
In this paper, we investigate the security and privacy of the three critical procedures of the 4G LTE protocol (i.e., attach, detach, and paging), and in the process, uncover potential design flaws of the protocol and unsafe practices employed by the stakeholders. For exposing vulnerabilities, we propose a model-based testing approach, LTEInspector, which lazily combines a symbolic model checker and a cryptographic protocol verifier in the symbolic attacker model. Using LTEInspector, we have uncovered 10 new attacks along with 9 prior attacks, categorized into three abstract classes (i.e., security, user privacy, and disruption of service), in the three procedures of 4G LTE. Notable among our findings is the authentication relay attack that enables an adversary to spoof the location of a legitimate user to the core network without possessing appropriate credentials. To ensure that the exposed attacks pose real threats and are indeed realizable in practice, we have validated 8 of the 10 new attacks and their accompanying adversarial assumptions through experimentation in a real testbed.
At present, Bluetooth Low Energy (BLE) is the dominant technology in commercially available Internet of Things (IoT) devices such as smart watches, fitness trackers, and smart appliances. Compared to classic Bluetooth, BLE has been simplified in many ways, including its connection establishment, data exchange, and encryption processes. Unfortunately, this simplification comes at a cost. For example, only a star topology is supported in BLE environments, and a peripheral (an IoT device) can communicate with only one gateway (e.g., a smartphone or a BLE hub) at a given time. When a peripheral goes out of range, it loses connectivity to a gateway and cannot connect and seamlessly communicate with another gateway without user intervention. In other words, BLE connections are not automatically migrated or handed off to another gateway. In this paper, we propose SeamBlue, which brings seamless connectivity to BLE-capable mobile IoT devices in an environment that consists of a network of gateways. Our framework ensures that unmodified, commercial off-the-shelf BLE devices seamlessly and securely connect to a nearby gateway without any user intervention.
Data-driven business processes are gaining popularity among enterprises nowadays. In many situations, multiple parties would share data towards a common goal if it were possible to simultaneously protect the privacy of the individuals and organizations described in the data. Existing solutions for multi-party analytics require parties to transfer their raw data to a trusted mediator, who then performs the desired analysis on the global data and shares the results with the parties. Unfortunately, such a solution does not fit many applications where privacy is a strong concern, such as healthcare, finance, and the Internet of Things. Motivated by the increasing demands for data privacy, in this paper, we study the problem of privacy-preserving multi-party analytics, where the goal is to enable analytics on multi-party data without compromising the data privacy of each individual party. We propose a secure gradient descent algorithm that enables analytics on data that is arbitrarily partitioned among multiple parties. The proposed algorithm is generic and applies to a wide class of machine learning problems. We demonstrate our solution for a popular use case (i.e., regression) and evaluate the performance of the proposed secure solution in terms of accuracy, latency, and communication cost. We also perform a scalability analysis to evaluate the performance of the proposed solution as the data size and the number of parties increase.
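One standard building block for this kind of secure aggregation is pairwise masking, sketched below with numpy; this is an illustrative stand-in (with the pairwise masks generated centrally for brevity) rather than the paper’s actual protocol for arbitrarily partitioned data.

import numpy as np

def masked_gradients(local_grads, seed=0):
    # Parties i and j agree on a random mask r; i adds it and j subtracts it,
    # so the masks cancel in the sum and the mediator learns only the aggregate.
    rng = np.random.default_rng(seed)
    n, d = len(local_grads), local_grads[0].shape[0]
    masked = [g.astype(float).copy() for g in local_grads]
    for i in range(n):
        for j in range(i + 1, n):
            r = rng.normal(size=d)
            masked[i] += r
            masked[j] -= r
    return masked

grads = [np.ones(4) * k for k in range(1, 4)]                 # toy per-party gradients
agg = sum(masked_gradients(grads)) / len(grads)               # mediator sees only masked vectors
assert np.allclose(agg, sum(grads) / len(grads))              # aggregate gradient is preserved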
Protecting sensitive data against malicious or compromised insiders is a challenging problem. Access control mechanisms are not always able to prevent authorized users from misusing or stealing sensitive data, as insiders often have access permissions to the data. Also, security vulnerabilities and phishing attacks make it possible for external malicious parties to compromise the identity credentials of users who have access to the data. Therefore, solutions for protection from insider threats require combining access control mechanisms and other security techniques, such as encryption, with techniques for detecting anomalies in data accesses. In this paper, we propose a novel approach to create fine-grained profiles of the users’ normal file access behaviors. Our approach is based on the key observation that even if a user’s access to a file seems legitimate, only a fine-grained analysis of the access (size of access, timestamp, etc.) can help in understanding the original intention of the user. We exploit the users’ file access information at the block level and develop a feature-extraction method to model the users’ normal file access patterns (user profiles). Such profiles are then used in the detection phase for identifying anomalous file system accesses. Finally, through performance evaluations, we demonstrate that our approach has an accuracy of 98.64% in detecting anomalies and incurs an overhead of only 2%.
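A minimal sketch of the profiling idea follows, with a hypothetical block-level log schema (num_blocks, bytes_read, hour) and a simple per-feature z-score rule standing in for the paper’s feature-extraction method.

import numpy as np

def features(access):
    # access: dict with hypothetical fields from a block-level access record
    return np.array([access["num_blocks"], access["bytes_read"], access["hour"]], float)

def build_profile(history):
    # Per-feature mean/std over a user's historical (benign) accesses
    X = np.stack([features(a) for a in history])
    return X.mean(axis=0), X.std(axis=0) + 1e-9

def is_anomalous(access, profile, k=3.0):
    mean, std = profile
    z = np.abs(features(access) - mean) / std
    return bool((z > k).any())                        # any feature far outside the profile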
Protecting sensitive data against malicious or compromised insiders is a major concern. In most cases, insiders have authorized access to file systems containing such data, which they misuse or exfiltrate for financial profit. Moreover, external parties can compromise the identity credentials of valid file system users by exploiting security vulnerabilities, launching phishing attacks, etc. Therefore, in order to protect sensitive information from such attackers, security measures such as access control and encryption are often combined with anomaly detection. Anomaly detection is based on the key observation that the access behavior of an attacker is significantly different from the regular access pattern of a benign user. However, due to the complexity of users’ interactions with a file system, modeling user profiles is a challenging problem. As a result, most of the existing anomaly detection techniques suffer from poor user profiles that contribute to high false positive and high false negative rates. In this paper, we propose an approach that, as a first step, discovers the users’ tasks (sets of file accesses that represent distinct file system activities) by applying frequent sequence mining on the access log. In the next step, our approach builds robust temporal user profiles by extensively analyzing the timestamp information of users’ file system accesses and thus precisely models the relation between the users’ tasks and their temporal properties using a multilevel temporal data structure. Finally, we evaluate the performance of our approach on a real dataset.
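A much-simplified stand-in for the two steps, mining frequent contiguous access sequences as “tasks” and recording the hours in which each task occurs, is sketched below; the n-gram mining and hour-set profile are illustrative simplifications of the paper’s frequent sequence mining and multilevel temporal data structure.

from collections import Counter, defaultdict

def mine_tasks(log, n=3, min_support=5):
    # log: list of (filename, hour) events in chronological order.
    # A "task" here is simply a contiguous n-gram of file names seen often enough.
    counts = Counter(tuple(f for f, _ in log[i:i + n]) for i in range(len(log) - n + 1))
    return {seq for seq, c in counts.items() if c >= min_support}

def temporal_profile(log, tasks, n=3):
    # For each mined task, record the hours at which it is typically observed.
    profile = defaultdict(set)
    for i in range(len(log) - n + 1):
        seq = tuple(f for f, _ in log[i:i + n])
        if seq in tasks:
            profile[seq].add(log[i][1])
    return profile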