Controlling Federated Learning for Covertness

Authors: Adit Jain, Vikram Krishnamurthy

TMLR 2024

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Numerical results show that when the learner uses the optimal policy, an eavesdropper can only achieve a validation accuracy of 52% with no information and 69% when it has a public dataset with 10% positive samples, compared to 83% when the learner employs a greedy policy. The proposed methods are demonstrated on a novel application, covert federated learning (FL) on a text classification task using large language model embeddings. Our key numerical results are summarized in Table 2.
Researcher Affiliation | Academia | Adit Jain (EMAIL), Department of Electrical and Computer Engineering, Cornell University; Vikram Krishnamurthy (EMAIL), Department of Electrical and Computer Engineering, Cornell University
Pseudocode | Yes | Algorithm 1: Structured Policy Gradient Algorithm

Input: initial policy parameters Θ_0, perturbation parameter ω, iterations K, step size κ, scale parameter ρ, learning cost l, privacy cost c
Output: policy parameters Θ_K

procedure ComputeStationaryPolicy(ω, K, κ, ρ)
    for n = 1 ... K do
        Γ ← Bernoulli(1/2): 3|S_O||S_E| i.i.d. Bernoulli random variables
        Θ_n^+ ← Θ_n + Γω,  Θ_n^- ← Θ_n − Γω
        l̂ ← AvgCost(l, Θ_n)
        ∇̂l ← AvgCost(l, Θ_n^+) − AvgCost(l, Θ_n^-)
        ∇̂c ← AvgCost(c, Θ_n^+) − AvgCost(c, Θ_n^-)
        update Θ_n and ξ_n using (15)
    end for
end procedure

procedure AvgCost(J, Θ)
    ν̂ ← PolicyFromParameters(Θ)
    return (1/T) Σ_{t=1}^T J(ν̂(y_t), y_t)
end procedure
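Algorithm 1 uses a simultaneous-perturbation (SPSA-style) two-point gradient estimate. The following numpy sketch illustrates only that estimation-and-update loop for a single cost; the privacy cost, the multiplier ξ_n, and the exact update rule of Eq. (15) are omitted, and the quadratic toy cost, step sizes, and policy shape are illustrative placeholders, not the paper's formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_cost(J, theta, ys):
    """Average a per-sample cost J over a trajectory of observations ys."""
    return np.mean([J(theta, y) for y in ys])

def spsa_descent(J, ys, theta0, omega=0.1, kappa=0.01, K=500):
    """Two-point SPSA: perturb every coordinate at once by +/- omega
    along a random {-1, +1} direction and form a gradient estimate."""
    theta = theta0.copy()
    for _ in range(K):
        # Bernoulli(1/2) draws mapped to a Rademacher direction in {-1, +1}
        gamma = rng.integers(0, 2, size=theta.shape) * 2 - 1
        j_plus = avg_cost(J, theta + gamma * omega, ys)
        j_minus = avg_cost(J, theta - gamma * omega, ys)
        grad_hat = (j_plus - j_minus) / (2 * omega) * gamma
        theta -= kappa * grad_hat  # placeholder for the Eq. (15) update
    return theta

# Toy quadratic cost with minimum at theta = y
J = lambda theta, y: np.sum((theta - y) ** 2)
ys = [np.ones(4)]
theta_star = spsa_descent(J, ys, theta0=np.zeros(4))
```

Despite perturbing all coordinates with a single random direction per step, the estimate is unbiased for the true gradient of the quadratic cost, so the iterates drift to the minimizer.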
Open Source Code | Yes | Hate speech classification is still an open problem, and the achieved accuracy is barely satisfactory, but our aim was to demonstrate the application of our formulation. Our source code can be found at this anonymized link.
Open Datasets | Yes | We use Jigsaw's Unintended Bias in Toxicity Classification dataset for our experimental results. The dataset has 1.8 million public comments from the Civil Comments platform. The dataset was annotated by human raters for toxic conversational attributes, mainly rating the toxicity of each text on a scale of 0 to 1, with sub-categories for severe toxicity, obscenity, threat, insult, identity attack, and sexually explicit content. More information about the annotation process can be found on the Kaggle page for this dataset.
Dataset Splits | Yes | Each client has 5443 training samples and 1443 validation samples. For the experimental results, we consider N = 45 communication rounds (or queries) and M = 16 successful model updates (around 34% of the total queries). The original dataset is imbalanced, with 1,660,540 non-toxic samples and 144,334 toxic samples; for each experimental run, we take a random balanced subset with 144,334 toxic and 144,334 non-toxic samples. The eavesdropper accuracy is calculated using a balanced validation dataset of size 2886.
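The balanced-subset construction described above (all minority-class samples plus an equal-size random draw from the majority class) can be sketched with index selection; the synthetic label array below is a scaled-down stand-in for the Jigsaw data's 1,660,540 / 144,334 split:

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_subset(labels, rng):
    """Return shuffled indices of a class-balanced subset: every
    minority-class sample plus an equal-size random majority draw."""
    pos = np.flatnonzero(labels == 1)   # toxic (minority class)
    neg = np.flatnonzero(labels == 0)   # non-toxic (majority class)
    k = min(len(pos), len(neg))
    idx = np.concatenate([rng.choice(pos, k, replace=False),
                          rng.choice(neg, k, replace=False)])
    rng.shuffle(idx)
    return idx

# Synthetic imbalanced labels standing in for the full dataset
labels = np.array([0] * 1000 + [1] * 90)
idx = balanced_subset(labels, rng)  # 180 indices, 90 per class
```

Drawing a fresh subset per experimental run, as the paper does, amounts to calling `balanced_subset` with a new random state each time.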
Hardware Specification | No | 20 runs of the hate speech classification task took around 23 hours, whereas, within the same time frame, we could do 1040 runs of the image classification task.
Software Dependencies | No | To demonstrate the versatility of our methods, we optimize our neural network using Adam (Kingma & Ba, 2017) and run FedAvg (McMahan et al., 2017) instead of FedSGD. Using the preprocessed training data, we fine-tune our model to minimize the binary cross-entropy (BCE) loss.
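The FedAvg aggregation step referenced above (McMahan et al., 2017) averages client updates weighted by each client's sample count. A minimal sketch, with a flat parameter vector standing in for the actual model weights:

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """FedAvg aggregation: average client parameter vectors,
    weighted by each client's number of training samples."""
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()               # normalize to sum to 1
    stacked = np.stack(client_params)           # shape (num_clients, dim)
    return (weights[:, None] * stacked).sum(axis=0)

# Two clients with equal data (5443 training samples each, as above),
# so the aggregate is the plain mean of their parameters
p1, p2 = np.array([0.0, 2.0]), np.array([2.0, 0.0])
global_params = fedavg([p1, p2], [5443, 5443])
```

With equal client sizes this reduces to an unweighted mean; unequal sizes tilt the global model toward clients holding more data.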
Experiment Setup | Yes | Our architecture used for training involves the following layer sequence: a pre-trained BERT layer which outputs a 128-length embedding, a fully connected 128-neuron-wide linear layer with ReLU activation, a dropout layer with a rate of 10^-1, and finally, a linear layer classifying the text as hate speech or not. We consider the logit loss function. We use the following hyperparameters for training: learning rate 10^-3, training batch size of 40, and validation batch size of 20.
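As a sketch of the classifier head that sits on top of the BERT embedding, the forward pass below uses randomly initialized numpy weights in place of trained parameters; the 128-d input is a placeholder for the BERT output, and inverted dropout is shown only in training mode:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def classifier_head(emb, W1, b1, W2, b2, dropout_rate=0.1, train=False):
    """128-d embedding -> 128-wide linear + ReLU -> dropout -> 2-way logits."""
    h = relu(emb @ W1 + b1)
    if train:  # inverted dropout, active only during training
        mask = rng.random(h.shape) > dropout_rate
        h = h * mask / (1.0 - dropout_rate)
    return h @ W2 + b2  # logits for {not hate speech, hate speech}

# Random weights standing in for the trained model
W1, b1 = rng.normal(scale=0.1, size=(128, 128)), np.zeros(128)
W2, b2 = rng.normal(scale=0.1, size=(128, 2)), np.zeros(2)
logits = classifier_head(rng.normal(size=128), W1, b1, W2, b2)
```

The logits feed the BCE-style loss during training; at inference the predicted class is simply `logits.argmax()`.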