Hernandez Morales JJ, Mentzos G, Hannig F, Balaskas K, Zervakis G, Henkel J, Teich J (2026)
Publication Language: English
Publication Status: In review
Publication Type: Unpublished / Preprint
Future Publication Type: Journal article
Publication year: 2026
Publisher: arXiv
URI: https://arxiv.org/abs/2604.16113v2
DOI: 10.48550
Open Access Link: https://arxiv.org/abs/2604.16113v2
The tiny machine learning (TinyML) domain represents a paradigm shift
towards local, on-device inference under stringent resource constraints.
The primary goal of TinyML is to integrate intelligence
into tiny, low-cost devices under strict resource, energy, and latency
constraints. However, the ultra-resource-constrained nature of these
devices can lead to increased inference execution time, which can be
detrimental in latency-critical applications. At the same time, TinyML
applications are often associated with sensitive data. As such, latency
optimization approaches that rely on training samples are infeasible
when such data is unavailable, proprietary, or sensitive, highlighting a
pressing need for optimization approaches that do not require access to
the training dataset and can be applied directly to pre-trained models.
Replacing costly multiplications with more hardware-efficient
operations, such as shifts and additions, has been proposed as an
effective method for reducing inference latency. However, post-training
power-of-two (Po2) approaches are scarce and, in many cases, lead to
unacceptable accuracy loss.
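As an illustrative sketch of the Po2 idea (not the paper's specific method; function name and exponent range are assumptions for this example), the snippet below rounds each weight to the nearest signed power of two in the log domain, so that a multiply by a quantized weight reduces to a bit shift in integer hardware:

```python
import numpy as np

def quantize_po2(w, min_exp=-8, max_exp=0):
    """Round each weight to the nearest signed power of two (nearest in
    the log domain), mapping near-zero weights to exactly zero."""
    sign = np.sign(w)
    mag = np.abs(w)
    # Guard log2 against zero; weights below the smallest representable
    # Po2 magnitude are zeroed out afterwards.
    exp = np.clip(np.round(np.log2(np.maximum(mag, 2.0 ** (min_exp - 1)))),
                  min_exp, max_exp)
    q = sign * (2.0 ** exp)
    q[mag < 2.0 ** (min_exp - 1)] = 0.0
    return q

# With Po2 weights, a hardware multiply becomes a shift:
# x * 2**k  ==  x << k   (for integer x and non-negative k)
```

Applied post-training, this kind of rounding needs no training data, which is why the accuracy loss it introduces is the central difficulty the abstract points to.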
In this work, we propose a framework that applies approximate matrix
decomposition to a given CNN in order to optimize hardware
implementations subject to strict constraints, without any need for
re-training or fine-tuning steps. The genetic algorithm-driven framework
explores different matrix decompositions and resulting multiplier-less
CNN accelerator designs for FPGA targets. A comprehensive evaluation of
different TinyML benchmarks demonstrates our framework's efficacy in
generating latency-optimized implementations that satisfy strict
accuracy and resource constraints, achieving an average 33% latency
improvement with an average accuracy loss of 1.3% compared to typical
systolic array-based FPGA accelerators.
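The abstract does not detail the decomposition family the genetic algorithm searches over; as a hedged illustration of what an approximate matrix decomposition looks like in general, the sketch below (assuming NumPy; not the authors' algorithm) factors a weight matrix via truncated SVD into W ≈ A·B, where the rank trades reconstruction accuracy against compute:

```python
import numpy as np

def low_rank_approx(W, rank):
    """One candidate approximate decomposition: truncated SVD, W ~= A @ B.
    A is m x r and B is r x n, so a matrix-vector product costs
    r*(m + n) multiply-accumulates instead of m*n."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))   # toy stand-in for a CNN weight matrix
A, B = low_rank_approx(W, rank=4)
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```

A search procedure such as the paper's genetic algorithm could treat the choice of decomposition (and parameters like the rank here) as design points, evaluating each resulting multiplier-less accelerator against the accuracy and FPGA resource constraints.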
APA:
Hernandez Morales, J.J., Mentzos, G., Hannig, F., Balaskas, K., Zervakis, G., Henkel, J., & Teich, J. (2026). Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition. (Unpublished, In review).
MLA:
Hernandez Morales, Jose Juan, et al. Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition. Unpublished, In review. 2026.