Abstract
The convergence of multi-modal sensing and constrained computing on personal devices demands efficient algorithmic frameworks for fusing heterogeneous data streams—visual, depth, inertial, and biometric—under stringent latency, memory, and power constraints. This paper presents a unified mathematical framework for on-device multi-modal fusion, grounded in constrained convex optimization and operator splitting theory. We formulate the fusion problem as minimization of a composite objective function, coupling modality-specific loss terms through linear consensus constraints, regularized by structured sparsity and low-rank priors. The central contribution is a Preconditioned Asynchronously Parallel Alternating Direction Method of Multipliers (PAP-ADMM), tailored for architectures with heterogeneous computational loads across modalities. We derive closed-form solutions for proximal operators associated with logistic regression, group lasso, and nuclear norm regularization, which are ubiquitous in personalization and security tasks. Convergence analysis establishes a non-asymptotic rate of O(1/k) under bounded delay conditions. Extensive experiments on synthetic benchmarks and a concrete application—fusing RGB and depth features for contactless palm-print authentication—validate the framework's efficacy. The proposed solver achieves up to 3.5x speedup over synchronous ADMM and reduces memory footprint by 45% on embedded hardware, while maintaining or improving accuracy. This work provides both theoretical foundations and practical tools for developing efficient, private, and robust on-device intelligent systems.
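To make the closed-form proximal operators mentioned above concrete, the following is a minimal sketch of the two standard ones: block soft-thresholding for the group lasso penalty and singular value soft-thresholding for the nuclear norm. It is not the paper's implementation. The function names `prox_group_lasso` and `prox_nuclear_norm`, the threshold parameter `tau`, and the representation of groups as index lists are assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def prox_group_lasso(v, groups, tau):
    """Block soft-thresholding: prox of tau * sum_g ||v_g||_2.

    `groups` is a list of non-overlapping index lists (an assumed layout).
    """
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    for idx in groups:
        block = v[idx]
        norm = np.linalg.norm(block)
        # A group whose norm is at most tau is zeroed out entirely;
        # otherwise it is shrunk toward the origin by tau.
        if norm > tau:
            out[idx] = (1.0 - tau / norm) * block
    return out


def prox_nuclear_norm(M, tau):
    """Singular value soft-thresholding: prox of tau * ||M||_*."""
    U, s, Vt = np.linalg.svd(np.asarray(M, dtype=float), full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # shrink each singular value by tau
    return (U * s_shrunk) @ Vt           # rebuild the matrix from shrunk values


if __name__ == "__main__":
    v = np.array([3.0, 4.0, 0.1, 0.1])
    print(prox_group_lasso(v, [[0, 1], [2, 3]], tau=1.0))
    print(prox_nuclear_norm(np.diag([3.0, 0.5]), tau=1.0))
```

In an ADMM-style solver, operators of this kind would typically be applied once per iteration to the regularized variable, with `tau` set from the regularization weight and the penalty parameter. How PAP-ADMM schedules these updates across modalities is not specified in this abstract.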


