Federated learning (FL) is a machine learning (ML) paradigm in which multiple parties jointly carry out a training task and build a global model without sharing their training data with one another. While there are several training modes, a typical setting consists of two types of computing nodes: (1) trainers and (2) an aggregator. Each trainer node processes a dataset locally to build a model, and the set of trainer nodes share their model parameters with the aggregator node. Upon receiving these model updates, the aggregator node builds a global model by aggregating the model parameters. The global model is then shared back with all the trainer nodes, and this process can be repeated for multiple rounds.
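The round-based exchange described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular framework's API: the least-squares local update and the weighting of each trainer's parameters by its local dataset size (FedAvg-style) are illustrative assumptions, and `local_train` and `aggregate` are hypothetical names.

```python
import numpy as np

def local_train(params, dataset):
    """Hypothetical local step: a trainer refines the global parameters
    on its private data (here, one gradient step on a least-squares loss)."""
    X, y = dataset
    grad = X.T @ (X @ params - y) / len(y)
    return params - 0.1 * grad

def aggregate(updates, sizes):
    """Aggregator: average the trainers' parameters,
    weighting each by its local dataset size."""
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(updates, sizes))

# Two trainers with private datasets; the raw data never leaves its owner.
rng = np.random.default_rng(0)
datasets = [(rng.normal(size=(50, 3)), rng.normal(size=50)),
            (rng.normal(size=(80, 3)), rng.normal(size=80))]

global_params = np.zeros(3)
for _ in range(5):  # repeated rounds, as described above
    updates = [local_train(global_params, d) for d in datasets]
    sizes = [len(d[1]) for d in datasets]
    global_params = aggregate(updates, sizes)
```

Note that only `updates` (model parameters) cross the trainer/aggregator boundary; the `datasets` stay local, which is the essence of the setting.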
The primary goal of federated learning is to preserve data privacy. To that end, the dataset on a trainer node is never shared with any other node; only the parameters of the locally trained model are shared, over a secure connection. Note that there is still a risk of leaking private information by reverse-engineering those model parameters. Techniques such as differential privacy and homomorphic encryption have been proposed to further strengthen privacy preservation, but we leave the discussion of these topics out of this post.
Federated learning differs from distributed learning. In distributed learning, privacy is not a main concern. Instead, a key goal is to maximize the parallelism of computation over a large dataset so that a model can be trained as quickly as possible. To that end, the dataset is typically owned by one organization and located in a centralized store, and trainer nodes fetch equal-sized subsets of it and carry out the ML training task in parallel. In contrast, in federated learning, datasets are heterogeneous by nature because they are collected and curated by different organizations. These datasets therefore tend to exhibit non-IID characteristics, unlike the datasets used in distributed learning.
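The contrast can be made concrete with a toy partition of a labeled dataset. The label-skew scheme below is one common way to simulate non-IID data; the specific class assignments per trainer are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
labels = rng.integers(0, 10, size=1000)  # a toy 10-class dataset

# Distributed learning: shuffle the centralized dataset and split it
# into equal-sized shards, so every shard is (approximately) IID.
shuffled = rng.permutation(labels)
iid_shards = np.array_split(shuffled, 4)

# Federated learning: each organization holds its own label-skewed
# subset, e.g. each trainer only ever sees a few classes (non-IID).
noniid_shards = [labels[np.isin(labels, classes)]
                 for classes in ([0, 1, 2], [3, 4], [5, 6, 7], [8, 9])]

def class_counts(shard):
    """Per-class sample counts within one shard."""
    return np.bincount(shard, minlength=10)
```

Inspecting `class_counts` on the two partitions shows roughly uniform counts for the IID shards, while each non-IID shard has zero samples for most classes, which is exactly the heterogeneity FL must cope with.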
FL is broadly applicable to ML training tasks where data movement is highly discouraged or prohibited due to data privacy or operational costs. Healthcare is a natural fit. For example, consider an ML application that detects heart disease (e.g., aortic stenosis) from a patient's electrocardiogram (ECG) signals. Training such a model accurately requires data from a broad set of patients across various hospitals, yet sharing patients' private data is not an option. FL can clearly work well under these constraints. The insurance industry can benefit from FL too: ML training for insurability or risk assessment in insurance underwriting can take place without sharing customers' data across different insurance institutions. ML training in remote areas with limited network access (e.g., fault prediction in an offshore wind turbine farm) is another good example; there, the volume of data to transfer could greatly slow down training in a centralized location, and FL remains viable.
Given its broad applicability, democratizing FL is key to its success. However, there are still open challenges and missing building blocks in several areas, such as systems support, communication cost, security, and bias. We touch upon several of them in the remainder of this post.
Ease of use: First and foremost, ease of use is often neglected in technology development, and FL is no exception. Several existing FL frameworks pay little attention to the complexity of managing the underlying heterogeneous infrastructure, especially given that it may be owned by different organizations or entities (e.g., different hospitals). Ease of use also means that an FL framework should holistically support a set of core functionalities such as model lineage tracking and training observability. While there are many isolated solutions for individual features, no solution approaches FL's requirements holistically. A holistic FL framework may require rethinking architectural designs and extensive integration with existing systems.
Incentives and trust: The next issue relates to incentives and trust. Since FL involves different parties whose interests and motivations may or may not be aligned, it is crucial to ensure that all parties participate in the FL training process honestly and in good faith. What incentives would keep participants engaged in FL training? What are the right ways to detect and discourage free riders who want to benefit from the global model while contributing little? These are questions we need to answer for a meaningful FL framework.
Data management: Data management is also a bigger issue in FL than in other ML training settings. In many cases, training datasets are private, and a model training module must not leak private data, directly or indirectly. An FL system needs to provide some assurance that data is at least not leaked in an obvious way. Also, if private data needs to be loaded (or streamed) into a trainer node from a data source over the network, the FL framework should offer secure means to access it.
Bias detection and management: Needless to say, datasets in FL are likely to be non-IID because participants can have datasets of different sizes drawn from different populations. Therefore, bias detection and management mechanisms should be incorporated into the system throughout the entire lifecycle of data management and training. It is equally important to track, via lineage tracking, how bias creeps into a given model version.
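One simple building block for such a mechanism is to compare each trainer's label distribution against the pooled distribution. The sketch below is an illustrative assumption rather than any framework's feature: `label_histogram` and `skew_score` are hypothetical names, total-variation distance is just one possible metric, and in a real deployment trainers would report only their (possibly privatized) histograms rather than exposing raw labels.

```python
import numpy as np

def label_histogram(labels, num_classes):
    """Normalized class frequencies for one trainer's labels."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / counts.sum()

def skew_score(local_labels, global_hist, num_classes):
    """Total-variation distance between a trainer's label distribution
    and the pooled distribution; 0 = identical, 1 = disjoint support."""
    local_hist = label_histogram(local_labels, num_classes)
    return 0.5 * np.abs(local_hist - global_hist).sum()

# Toy example: one balanced trainer and one heavily skewed trainer.
trainer_a = np.array([0, 1, 2, 3] * 75)    # 300 samples, balanced
trainer_b = np.array([0] * 90 + [1] * 10)  # 100 samples, skewed to class 0
pooled = np.concatenate([trainer_a, trainer_b])
global_hist = label_histogram(pooled, 4)

scores = {name: skew_score(lbls, global_hist, 4)
          for name, lbls in [("a", trainer_a), ("b", trainer_b)]}
# A framework could flag trainers whose score exceeds a chosen threshold
# and feed that signal into bias management and lineage tracking.
```

Here the skewed trainer receives a markedly higher score than the balanced one, giving the system a quantitative hook for flagging potential bias sources.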
To truly democratize FL, the systems challenges must be taken out of the equation so that data scientists can focus solely on the ML parts without worrying about systems issues. While some of these challenges are not unique to FL, they are harder there because of FL's heterogeneous nature. Building a holistic FL system is therefore a necessity if FL is to be truly at the disposal of data scientists and machine learning engineers.