Ensembling diverse neural architectures consistently improves predictive performance, but traditional stacking methods that aggregate raw feature maps introduce massive parameter overhead, leading to severe overfitting in data-scarce domains. While using output logits reduces this dimensionality, it strips away the rich representational context needed for dynamic, sample-aware model weighting. Bridging the gap between feature-rich meta-learning and extreme architectural constraint is essential for deploying robust ensembles in specialized fields such as medical imaging and neural decoding.
## Approach
We propose Spatially Pooled Feature Stacking (SPFS), a meta-learning framework that aggregates base-learner representations using Global Average Pooling (GAP) prior to ensemble weighting. Instead of concatenating high-dimensional feature maps or relying solely on output logits, SPFS extracts the pre-classification feature maps from diverse base models and applies GAP to generate compact, spatially invariant summaries. These pooled vectors are then concatenated and fed into a lightweight multi-layer perceptron meta-learner, adapting the feature-based weighting strategy from [Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models](/paper/art_98b9d176d3dc4047b40fa89db19e8207) but with drastically reduced fan-in requirements. This approach provides the meta-learner with rich, label-agnostic visual context to dynamically weight base models while acting as a strong structural regularizer against overfitting, echoing the architectural constraints seen in [Zhejiang University at ImageCLEF 2019 Visual Question Answering in the Medical Domain](/paper/art_0010063fae5649b19c6255cfe8595ab3).
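The SPFS forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the proposed implementation: the base-model channel counts, hidden width, and weight initialization are hypothetical placeholders, and a real system would learn the MLP parameters by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gap(feature_map):
    """Global Average Pooling: (C, H, W) feature map -> (C,) vector."""
    return feature_map.mean(axis=(1, 2))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def spfs_forward(feature_maps, base_logits, W1, b1, W2, b2):
    """One SPFS forward pass for a single sample (illustrative sketch).

    feature_maps: list of K pre-classification maps, shapes (C_k, H_k, W_k)
    base_logits:  (K, num_classes) output logits from the K base models
    W1, b1, W2, b2: parameters of the lightweight MLP meta-learner
    Returns the dynamically weighted ensemble logits.
    """
    # Pool each base model's feature map and concatenate the compact summaries.
    pooled = np.concatenate([gap(f) for f in feature_maps])   # (sum_k C_k,)
    # MLP maps the pooled visual context to per-model weights.
    h = np.maximum(0.0, W1 @ pooled + b1)                     # ReLU hidden layer
    model_weights = softmax(W2 @ h + b2)                      # (K,), sums to 1
    # Sample-aware weighted combination of the base logits.
    return model_weights @ base_logits

# Hypothetical setup: three backbones with different channel widths.
channels, num_classes, hidden = [512, 1280, 2048], 2, 64
feature_maps = [rng.standard_normal((c, 7, 7)) for c in channels]
base_logits = rng.standard_normal((len(channels), num_classes))

fan_in = sum(channels)  # GAP keeps fan-in at sum of channels, not C*H*W
W1 = rng.standard_normal((hidden, fan_in)) * 0.01
b1 = np.zeros(hidden)
W2 = rng.standard_normal((len(channels), hidden)) * 0.01
b2 = np.zeros(len(channels))

ensembled = spfs_forward(feature_maps, base_logits, W1, b1, W2, b2)
```

Note that the meta-learner's fan-in is the sum of channel counts (here 3,840) rather than the sum of flattened map sizes, which is the structural regularization SPFS relies on.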
## Experimental Plan
We will evaluate SPFS on data-scarce medical imaging benchmarks, specifically the ISIC 2017 skin lesion classification dataset and the ImageCLEF VQA-Med challenge. The primary hypothesis is that SPFS will outperform both simple logit-averaging and dense feature-concatenation stacking by preventing meta-learner overfitting while maintaining sample-aware dynamic weighting. Baselines will include standard logit-based SVM stacking as used in [RECOD Titans at ISIC Challenge 2017](/paper/art_4a60769bd2de488a894141545a1b2875), simple unweighted averaging, and dense feature-based stacking without GAP. Performance will be measured using AUC-ROC for classification and BLEU/accuracy for VQA, alongside a comparative analysis of meta-learner parameter counts and validation loss trajectories to explicitly quantify the regularization benefits of the GAP integration.
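The planned parameter-count comparison can be made concrete with simple arithmetic. The backbone names and feature-map shapes below are illustrative assumptions (typical ImageNet-style 7×7 pre-classification maps), not values from the experimental setup; the point is the order-of-magnitude gap in the meta-learner's first-layer fan-in.

```python
# Hypothetical pre-classification feature-map shapes (channels, height, width)
# for three assumed base backbones; shapes are illustrative, not prescribed.
shapes = {
    "resnet50":        (2048, 7, 7),
    "densenet121":     (1024, 7, 7),
    "efficientnet_b0": (1280, 7, 7),
}
hidden = 64  # assumed hidden width of the MLP meta-learner

# Dense feature-concatenation stacking: flatten every map before concatenating.
dense_fan_in = sum(c * h * w for c, h, w in shapes.values())
# SPFS: GAP collapses each map to its channel dimension first.
gap_fan_in = sum(c for c, _, _ in shapes.values())

dense_first_layer_params = dense_fan_in * hidden
gap_first_layer_params = gap_fan_in * hidden
reduction_factor = dense_fan_in // gap_fan_in  # equals H*W when maps share size
```

Under these assumed shapes, GAP shrinks the first dense layer's fan-in by a factor of 49 (from 213,248 inputs to 4,352), which is the quantity the validation-loss and parameter-count analysis is designed to connect to reduced overfitting.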