Federated learning (FL) has emerged as one of the most promising approaches for training machine learning (ML) models in a privacy-preserving manner. FL makes this possible by allowing a central server to build a global model from models trained locally on participants' devices, without the participants sharing their data. The best-known practice follows a client-server architecture: a server (parameter server) distributes a global model to the clients (participants), each of which trains the model on its own dataset and returns its training results to the server; the server then aggregates the individual results. The process repeats to improve the performance of the global model.
There are several reasons for the growing interest in FL across the research community and industry. First and foremost, privacy preservation is becoming a crucial property of any technology. Regulations such as GDPR and CCPA require that personal data be protected from unauthorized or unlawful processing, accidental leakage, and the like, underscoring the importance of protecting data and privacy. ML training typically needs data samples, so these regulations apply to it. Second, enterprises face new challenges in dealing with rare events such as cyberattacks, anomalies, or system failures. They must respond quickly to those events to protect their data, trade secrets, and reputation, while also building new insights from the small amounts of data those rare events generate. Detecting or predicting such rare events motivates collaboratively building ML models across enterprises without revealing trade secrets. Third, data gravity is also a key driving force. As the mass of data at the edge grows, its gravity increases and starts to pull applications toward it. The capability to process data near its sources is therefore becoming imperative, and FL fits this trend well.
Cisco Research published a blog post on the idea of democratizing federated learning a while ago. We then embarked on a project called Flame (now an open-source project) with the mission of realizing the vision of democratized federated learning. This blog post walks through how one can use Flame to conduct FL training and to extend it for enabling various FL mechanisms and algorithms.
There are several federated learning libraries and frameworks out there. Why would the research community and industry need yet another one?
To answer that question, we need to consider the fast pace of FL's evolution. When the technology was proposed by Google researchers, its architecture and algorithm were relatively straightforward: under a conventional client-server architecture, geo-distributed clients conduct local training and share local model weights with a central server, which then applies an aggregation algorithm such as FedAvg.
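To make the baseline concrete, here is a minimal sketch of the FedAvg aggregation step: each client's model update is weighted by its local sample count and averaged. The function name and data layout are illustrative, not Flame's API.

```python
# Minimal FedAvg sketch: weight each client's model by its sample count,
# then average. Models are flat lists of floats for simplicity.
def fedavg(updates):
    """updates: list of (weights, num_samples) pairs from clients."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]

# Two clients; the second has 3x the data and thus 3x the influence.
global_w = fedavg([([1.0, 2.0], 10), ([3.0, 4.0], 30)])
# global_w == [2.5, 3.5]
```

The weighting by sample count is what distinguishes FedAvg from a plain mean of client models.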
Since FL was first introduced in 2016, there have been numerous research publications. A simple keyword search on federated learning in Google Scholar returns about 0.49 million research articles as of May 2, 2023. The results cover a wide variety of topics such as algorithms, mechanisms, topologies, systems, security, privacy, fairness, and use cases/applications. This fast pace of research activity has produced many FL variants, so a one-size-fits-all approach is not viable in federated learning.
For example, if clients are co-located across private data center infrastructures or clouds, running a vanilla FL mechanism (i.e., the client-server architecture) is sub-optimal in terms of communication. Instead, model training may converge faster with a hybrid approach in which clients in the same infrastructure form a group and conduct distributed learning (using ring-reduce, a communication-efficient aggregation method), and a leader in each group shares its per-group model with a server for global model aggregation (Fig. 1a). In other cases, an asymmetrical topology may make more sense. If some clients cannot form a group because they don't belong to the same infrastructure, enforcing a symmetrical hybrid topology would result in even slower convergence due to lower bandwidth among clients; an asymmetrical hybrid topology may then be the right configuration (Fig. 1b). Regardless of whether a hybrid topology is symmetrical, a synchronous aggregation process is susceptible to stragglers, so an asynchronous federated learning method may perform better than its synchronous counterpart for per-group model sharing (Fig. 1c).
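The ring-reduce pattern mentioned above can be sketched as a toy simulation: each node's vector is split into one chunk per node, a reduce-scatter phase circulates and accumulates chunks around the ring, and an all-gather phase distributes the fully reduced chunks. This illustrates the communication pattern only; it is not Flame's implementation.

```python
# Toy ring all-reduce over N nodes, each holding a vector of N chunks
# (one scalar per chunk for simplicity). After N-1 reduce-scatter steps
# and N-1 all-gather steps, every node holds the element-wise sum.
def ring_allreduce(vectors):
    n = len(vectors)
    assert all(len(v) == n for v in vectors), "toy: one chunk per node"
    buf = [list(v) for v in vectors]
    # Reduce-scatter: node i forwards chunk (i - step) % n to its neighbor,
    # which accumulates it. Node i ends up owning the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, buf[i][(i - step) % n]) for i in range(n)]
        for i, c, val in sends:
            buf[(i + 1) % n][c] += val
    # All-gather: circulate each fully reduced chunk around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, buf[i][(i + 1 - step) % n]) for i in range(n)]
        for i, c, val in sends:
            buf[(i + 1) % n][c] = val
    return buf

result = ring_allreduce([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
# result == [[6, 6, 6], [6, 6, 6], [6, 6, 6]]
```

Note that mid-process each node holds only partial sums, which is exactly why a single crashed client can stall this scheme, as discussed next.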
One thing to note about these variants is that no scheme is strictly better than another. The variants impose different levels of complexity and challenges. For instance, the ring-reduce approach can fail if a single client crashes, because each client holds only partial updates until the reduce process completes.
A key question is how to enable these kinds of variants quickly and easily so that new, innovative approaches can be tested and tweaked to meet users' use cases and constraints. Many libraries and frameworks provide a set of APIs that allows users to freely implement and extend algorithms and mechanisms. However, these libraries and frameworks often assume a single FL topology (horizontal) and synchronous aggregation, or do not provide a disciplined yet intuitive programming model that facilitates fast development.
Flame consists of two major components: (1) a machine learning job management service and (2) a development SDK. Flame allows users to compose an FL job and deploy it to distributed computing infrastructures such as Kubernetes. Users can leverage Flame in two ways. The management service automates the deployment of an FL job; this setup suits those interested in running large-scale experiments in an automated fashion. The other way is to use the SDK alone, which is appropriate for lightweight, small-scale tests.
Fig 2 shows the key components that constitute Flame. A user composes an FL job with the SDK, which supports major ML frameworks such as PyTorch and TensorFlow and offers two communication backends: MQTT and gRPC. Using a CLI tool called flamectl, a user interacts with Flame's management service to handle various tasks of an FL job; for example, the user can design a topology (e.g., a two-tier client-server topology or a hierarchical topology), configure the job's specification (e.g., hyperparameters, aggregation algorithms), manage datasets (e.g., dataset registration), start/stop a job, and so forth. The management service, among many other things, schedules and manages FL jobs. A job built with the SDK is deployed through the management service. A job consists of several tasks, such as an aggregation task and training tasks; these tasks log metrics as well as artifacts to a model registry, for which Flame uses MLflow.
Flame enables fast development through its abstractions (role and channel) and its programming model. In what follows, we discuss these foundations, which are essential to building an FL job in Flame.
Flame abstracts an FL job's design as a graph comprising roles and channels. We propose these abstractions because low-level primitives such as server and client are ill-suited for expressing and implementing the tasks carried out by participants in various FL approaches.
A role, a vertex in the graph, is an abstraction associated with a task. For example, a conventional FL job consists of an aggregator (server) and trainers (clients). In Flame, aggregator is a role that carries out the task of distributing a global model to trainers and aggregating their local model updates. A channel abstracts the underlying communication mechanisms that connect roles. In Flame, a role is not mapped to either server or client, because server and client are communication-oriented terms, not task-oriented ones. The channel abstraction also eliminates the need to maintain the notion of server and client.
Fig 3 shows a visual representation of role and channel. With role and channel, Flame allows users to express different FL topologies flexibly and easily.
Next, we walk through examples that describe how roles and channels express topologies. Fig 4 presents an example to aid understanding of the role and channel abstractions: the configuration for a conventional two-tier FL topology.
The example is interpreted as follows: there is a channel called “param-channel” that connects two roles, aggregator and trainer. The channel has one group, called “red”. Each role mentioned in the channel has its own configuration. The role “aggregator” is not a data consumer and joins the channel “param-channel” with group “red”. The role “trainer”, on the other hand, is a data consumer and joins the same channel with the same group.
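A configuration matching this description might look roughly like the JSON sketch below. The field names (channels, pair, groupBy, isDataConsumer, groupAssociation) are our approximation of what Fig 4 conveys, not Flame's authoritative schema; consult the Flame repository for the exact format.

```json
{
  "channels": [
    {
      "name": "param-channel",
      "pair": ["aggregator", "trainer"],
      "groupBy": { "type": "tag", "value": ["red"] }
    }
  ],
  "roles": [
    {
      "name": "aggregator",
      "isDataConsumer": false,
      "groupAssociation": [{ "param-channel": "red" }]
    },
    {
      "name": "trainer",
      "isDataConsumer": true,
      "groupAssociation": [{ "param-channel": "red" }]
    }
  ]
}
```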
Flame takes the configuration in Fig 4 and dataset information, and creates an actual topology that can be used for deployment. Fig 5 depicts a real topology that is expanded from the configuration shown in Fig 4, assuming that six datasets are used.
Here we present an example of a hierarchical FL use case in Fig 6. The roles section has three roles: top-aggregator, leaf-aggregator, and trainer. The top-aggregator role is associated with a channel named global-channel in the black group, so there is one top-aggregator worker. The leaf-aggregator role has two group associations, and thus two leaf-aggregator workers are created: one is associated with param-channel in the red group and global-channel in the black group, and the other with param-channel in the blue group and global-channel in the black group. Finally, the trainer role also has two groups; here we assume each group is associated with three datasets. The final expanded form therefore looks like the right diagram in Fig 6.
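Following the same approximate schema as before, the hierarchical setup described here might be sketched as below. Again, the field names are illustrative assumptions rather than Flame's exact format; what matters is that two channels and the red/blue group associations drive the expansion into one top-aggregator and two leaf-aggregators.

```json
{
  "channels": [
    {
      "name": "global-channel",
      "pair": ["top-aggregator", "leaf-aggregator"],
      "groupBy": { "type": "tag", "value": ["black"] }
    },
    {
      "name": "param-channel",
      "pair": ["leaf-aggregator", "trainer"],
      "groupBy": { "type": "tag", "value": ["red", "blue"] }
    }
  ],
  "roles": [
    { "name": "top-aggregator", "isDataConsumer": false },
    { "name": "leaf-aggregator", "isDataConsumer": false },
    { "name": "trainer", "isDataConsumer": true }
  ]
}
```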
Another key feature of Flame is its programming model, which is based on object-oriented programming (OOP) and directed acyclic graphs (DAGs). The concept of role aligns naturally with that of class in OOP; in other words, a role is a self-contained unit of work that can be formulated as a class. Moreover, a role (e.g., aggregator) can have many variants with high similarity among them, which makes core OOP functionality such as inheritance well suited to supporting diverse FL scenarios.
However, adopting OOP alone is insufficient for fast development. Many FL variants still share common tasks, such as incrementing a round or aggregating models, while differing in the order in which certain tasks execute, in new tasks inserted between others, and so on. With conventional OOP, such changes require modifying interdependent methods, which is hard to track and manage. We therefore use a DAG to express the interdependencies of small tasks (which we call tasklets) within a role. DAGs are widely adopted in software tasks and services such as workflow management, serverless function chaining, and network function chaining, offering flexibility as well as easy extensibility. Flame adopts the DAG concept by overriding the right-shift operator (>>) in Python.
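The operator-overloading trick can be illustrated in a few lines. The Tasklet class below is a toy, not Flame's SDK: overriding `__rshift__` records a dependency edge and returns the right operand, so chains like `a >> b >> c` compose left to right.

```python
# Toy tasklet chaining via the >> operator (a linear DAG for simplicity).
class Tasklet:
    def __init__(self, name, fn):
        self.name, self.fn, self.next = name, fn, None

    def __rshift__(self, other):
        self.next = other   # record the dependency edge a -> b
        return other        # return the right operand so a >> b >> c chains

    def run(self):
        node = self
        while node:         # execute tasklets in dependency order
            node.fn()
            node = node.next

log = []
a = Tasklet("init", lambda: log.append("init"))
b = Tasklet("train", lambda: log.append("train"))
c = Tasklet("aggregate", lambda: log.append("aggregate"))
a >> b >> c   # builds the chain a -> b -> c
a.run()
# log == ["init", "train", "aggregate"]
```

Returning `other` from `__rshift__` is the key detail: it makes the chained expression evaluate left to right while each link is recorded as an edge.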
Fig 7 depicts how tasklet chaining works in Flame. The example in the figure is an aggregator for a synchronous FL use case. Member methods of an aggregator instance are defined as tasklets within the context of a composer, and all the tasklets are connected via the right-shift operator. Flame also provides a Loop primitive so that some tasklets can be executed repeatedly until a condition is met (e.g., until self._work_done becomes true). As soon as the condition is met, the program exits the loop from whichever tasklet it was in. Note that all instance state is maintained in member variables, so passing the output of one tasklet as the input to the next is unnecessary.
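A minimal sketch of the Loop idea, under the assumption that state lives in the aggregator object's attributes (so tasklets pass nothing to each other). The `Aggregator` and `loop` below are stand-ins for illustration, not Flame's actual SDK classes.

```python
# Toy Loop primitive: repeat a segment of tasklets until a condition holds.
# All state is kept in the aggregator instance, mirroring the text above.
class Aggregator:
    def __init__(self, rounds):
        self.rounds, self.round, self._work_done = rounds, 0, False
        self.history = []

    def get(self):
        # stand-in for receiving local updates from trainers
        self.history.append(("get", self.round))

    def aggregate(self):
        # stand-in for aggregation; advances the round and checks completion
        self.history.append(("aggregate", self.round))
        self.round += 1
        self._work_done = self.round >= self.rounds

def loop(until, *tasklets):
    """Run tasklets in order, repeating the whole segment until the condition holds."""
    while not until():
        for t in tasklets:
            t()

agg = Aggregator(rounds=2)
loop(lambda: agg._work_done, agg.get, agg.aggregate)
# agg.history == [("get", 0), ("aggregate", 0), ("get", 1), ("aggregate", 1)]
```

Because the exit condition reads `agg._work_done` rather than a return value, any tasklet in the segment can flip the flag and end the loop.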
Given this base implementation of an aggregator for synchronous FL, extending it to other FL scenarios becomes easy. Next, we describe how Flame's programming model enables fast development of an asynchronous FL mechanism called FedBuff.
The first step is to inherit from the synchronous FL class (Fig 8). One of the key differences between synchronous and asynchronous FL is how an aggregator distributes weights and aggregates local updates from trainers. In asynchronous FL, aggregation can be carried out upon receipt of a local update from a single trainer, whereas synchronous FL must wait for all local updates from a selected set of trainers. Similarly, in asynchronous FL, distribution takes place continuously rather than in batches. Because of these differences, _aggregate_weights() and _distribute_weights() are overridden. Asynchronous FL also requires maintaining parameters such as the aggregation goal and concurrency, so internal_init() is overridden as well. The remaining methods of the parent class are used without modification.
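The inheritance pattern can be sketched as follows. The method names mirror those in the text, but the bodies are toy stand-ins (e.g., a FedBuff-style buffer that aggregates once an assumed goal of two updates arrives); this is not Flame's actual implementation.

```python
# Sketch of the subclassing step: the async aggregator overrides only the
# methods named in the text and inherits everything else.
class SyncAggregator:
    def internal_init(self):
        self.round = 0

    def _aggregate_weights(self, updates):
        # synchronous: all selected trainers' updates arrive together
        return [sum(col) / len(col) for col in zip(*updates)]

class AsyncAggregator(SyncAggregator):
    def internal_init(self):
        super().internal_init()
        self.agg_goal = 2   # FedBuff-style buffer size (illustrative value)
        self.buffer = []

    def _aggregate_weights(self, update):
        # asynchronous: aggregate as soon as the buffer reaches the
        # aggregation goal, instead of waiting for every trainer
        self.buffer.append(update)
        if len(self.buffer) < self.agg_goal:
            return None
        result = [sum(col) / len(col) for col in zip(*self.buffer)]
        self.buffer = []
        self.round += 1
        return result

agg = AsyncAggregator()
agg.internal_init()
assert agg._aggregate_weights([1.0, 2.0]) is None          # goal not yet met
assert agg._aggregate_weights([3.0, 4.0]) == [2.0, 3.0]    # buffer full: aggregate
```

The point is structural: three overridden methods capture the entire behavioral difference, while the parent class supplies the rest.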
Finally, to make this new aggregator work asynchronously, the tasklet composer also needs updating. Fig 9 illustrates the additional changes relative to the composer example in Fig 7. The key aspect of this composer is that two tasklets (task_put and task_get, which internally execute _distribute_weights() and _aggregate_weights(), respectively) run within another loop called asyncfl_loop. This loop checks whether the aggregation goal is met; if so, the aggregator moves to the next round. The outer loop (denoted loop) checks whether the predefined number of rounds is complete.
One caveat with the implementation in Fig 9 is code repetition: only small changes are needed (highlighted in red in the figure), yet many other lines of code are still required to chain all the tasklets. The SDK therefore provides APIs in the composer and tasklet classes to express the chain concisely, with methods such as insert_before(), insert_after(), remove(), and replace(). With these methods, code redundancy can be avoided. Fig 10 demonstrates that the number of lines decreases significantly compared to Fig 9.
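The idea behind these chain-editing methods can be shown with a toy, list-backed composer. The method names match those mentioned above, but the implementation and tasklet names are illustrative stand-ins for Flame's composer.

```python
# Toy composer: edit an inherited tasklet chain in place instead of
# re-writing every >> link. List-backed for simplicity.
class Composer:
    def __init__(self, names):
        self.chain = list(names)

    def insert_after(self, anchor, name):
        self.chain.insert(self.chain.index(anchor) + 1, name)

    def insert_before(self, anchor, name):
        self.chain.insert(self.chain.index(anchor), name)

    def remove(self, name):
        self.chain.remove(name)

    def replace(self, old, new):
        self.chain[self.chain.index(old)] = new

# Start from a chain inherited from the synchronous aggregator...
composer = Composer(["init", "put", "get", "aggregate", "update_round"])
# ...and adapt it with two calls instead of re-chaining every tasklet.
composer.insert_after("get", "check_agg_goal")
composer.replace("aggregate", "async_aggregate")
# composer.chain ==
# ["init", "put", "get", "check_agg_goal", "async_aggregate", "update_round"]
```

Each edit names only the tasklets involved, so a subclass expresses just its delta from the parent's chain.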
As demonstrated by the examples above, Flame offers a principled yet intuitive way to enable new federated learning algorithms and mechanisms. Moreover, for some variants of existing mechanisms, a developer doesn't even need to change Flame's core SDK; inheriting classes available from the SDK and adding new tasklets may be sufficient.
The key design principle of Flame is easy extensibility, which lets Flame go beyond training. Cisco Research actively conducts research on AI, and some of our projects, such as Responsible AI (RAI), have become open-source projects. Flame's abstraction and programming model can make solutions like RAI more easily pluggable into Flame. Flame is an active open-source project for federated learning under the Apache 2.0 License. We are dedicated to making it easier to use and to supporting emerging federated learning algorithms and mechanisms. We welcome feedback and contributions at all levels.