Fine-Tuning a Decentralized Large Language Model for Privacy-Sensitive University Data
Kilian Lorenz, Pascal Bürklin, Jay Kim, Klemens Schnattinger, Sascha Reining, Nathan Peterson, Agha Husain

Abstract
This work refines a decentralized large language model (LLM) for fine-tuning on privacy-sensitive university data. Devolved AI models, designed to operate across multiple distributed nodes, offer a promising approach to handling sensitive information: data remains localized at its source while the nodes collaboratively train a global model. The key challenge addressed in this study is adapting and fine-tuning a decentralized LLM to work effectively with the heterogeneous, privacy-restricted datasets typical of university environments, such as student records, research data, and administrative information. Our approach enhances the LLM’s handling of domain-specific language through targeted fine-tuning on anonymized university datasets. The model is further optimized for efficient decentralized learning, preserving data privacy while improving model performance. Advanced techniques such as differential privacy and secure aggregation are incorporated to strengthen data protection during fine-tuning.
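As a minimal sketch of how differential privacy and secure aggregation can be combined in decentralized fine-tuning (not the implementation described in this paper), each node clips and noises its local update before masking it with zero-sum pairwise masks, so the aggregator recovers only the average. All names, shapes, and parameter values below (privatize, pairwise_masks, CLIP_NORM, NOISE_STD) are illustrative assumptions.

```python
# Illustrative sketch only: per-node DP noise plus pairwise additive masking
# (secure aggregation). Names and parameters are assumptions, not the
# authors' implementation.
import numpy as np

CLIP_NORM = 1.0   # max L2 norm per node update (DP clipping bound)
NOISE_STD = 0.1   # Gaussian noise scale; tunes the privacy/utility trade-off

def privatize(update: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Clip a node's model update and add Gaussian noise (DP-SGD style)."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, CLIP_NORM / (norm + 1e-12))
    return clipped + rng.normal(0.0, NOISE_STD, size=update.shape)

def pairwise_masks(num_nodes: int, dim: int, seed: int = 0) -> list[np.ndarray]:
    """Zero-sum pairwise masks: node i adds m_ij, node j subtracts it, so
    individual updates are hidden but the sum is unchanged."""
    rng = np.random.default_rng(seed)
    masks = [np.zeros(dim) for _ in range(num_nodes)]
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            m = rng.normal(size=dim)
            masks[i] += m
            masks[j] -= m
    return masks

# Each node privatizes its local update, then masks it before sending.
rng = np.random.default_rng(42)
local_updates = [rng.normal(size=8) for _ in range(4)]  # stand-ins for real gradients
masks = pairwise_masks(num_nodes=4, dim=8)
sent = [privatize(u, rng) + m for u, m in zip(local_updates, masks)]

# The aggregator sees only masked updates; the masks cancel in the average.
global_update = np.mean(sent, axis=0)
```

Because the masks sum to zero, the aggregator can compute the average update but never an individual node’s contribution, while the clipping and noise bound how much any single record can influence the global model.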
A notable innovation of our work is a comprehensive Devolved AI product that not only manages decentralized fine-tuning but also employs an LLM as a judge to score model improvements. The product automates the end-to-end process, from data ingestion and model fine-tuning to evaluation, using the judge LLM to provide objective, detailed feedback on model performance. Initial results show that the refined LLM achieves high accuracy on downstream tasks, including automated document summarization, query answering, and policy generation, without compromising data privacy. This research highlights the potential of decentralized AI systems in privacy-sensitive domains and paves the way for scalable, secure AI solutions in academic institutions. Future work will focus on extending the model to broader educational datasets and further optimizing the fine-tuning frameworks and evaluation methods employed by the Devolved AI product.
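A minimal sketch of the LLM-as-judge idea, assuming a generic judge callable rather than the Devolved AI product’s actual interface: the judge scores candidate answers against references on a fixed rubric, and the mean score delta between the baseline and the fine-tuned model quantifies the improvement. The rubric wording, 1-10 scale, and function names (judge_score, improvement) are assumptions for illustration.

```python
# Illustrative sketch: using an LLM as a judge to score model improvements.
# The rubric, scale, and `judge` callable are assumptions, not the product's API.
from typing import Callable

RUBRIC = (
    "You are an impartial judge. Score the candidate answer from 1 (poor) to "
    "10 (excellent) for factual accuracy, completeness, and clarity relative "
    "to the reference. Reply with the integer score only."
)

def judge_score(judge: Callable[[str], str], question: str,
                reference: str, candidate: str) -> int:
    """Ask the judge LLM for a 1-10 score of `candidate` vs. `reference`."""
    prompt = (
        f"{RUBRIC}\n\nQuestion: {question}\n"
        f"Reference answer: {reference}\nCandidate answer: {candidate}\nScore:"
    )
    reply = judge(prompt)
    digits = "".join(ch for ch in reply if ch.isdigit())
    return max(1, min(10, int(digits))) if digits else 1  # clamp to rubric range

def improvement(judge: Callable[[str], str], eval_set,
                old_model, new_model) -> float:
    """Mean score delta of the fine-tuned model over the baseline."""
    deltas = [
        judge_score(judge, q, ref, new_model(q))
        - judge_score(judge, q, ref, old_model(q))
        for q, ref in eval_set
    ]
    return sum(deltas) / len(deltas)
```

Requesting a bare integer keeps the judge’s feedback machine-readable for automated pipelines; in practice the rubric could additionally request a short rationale so that scores remain auditable.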