🔗 https://t.co/2GqskP6kQE
Multi-Level Bottom-Top and Top-Bottom Feature Fusion for Crowd Counting
Crowd counting presents enormous challenges in the form of large variations in
scale, both within individual images and across the dataset. These issues are further
exacerbated in highly congested scenes. Approaches based on straightforward
fusion of multi-scale features from a deep network seem to be obvious solutions
to this problem. However, these fusion approaches do not yield significant
improvements in the case of crowd counting in congested scenes. This is usually
due to their limited ability to combine multi-scale features effectively for
problems like crowd counting. To overcome this, we focus on how to
efficiently leverage information present in different layers of the network.
Specifically, we present a network that involves: (i) a multi-level bottom-top
and top-bottom fusion (MBTTBF) method to combine information from shallower to
deeper layers and vice versa at multiple levels, (ii) scale complementary
feature extraction blocks (SCFB) involving cross-scale residual functions to
explicitly enable flow of complementary features from adjacent conv layers
along the fusion paths. Furthermore, in order to increase the effectiveness of
the multi-scale fusion, we employ a principled way of generating scale-aware
ground-truth density maps for training. Experiments conducted on three datasets
that contain highly congested scenes (ShanghaiTech, UCF_CC_50, and UCF-QNRF)
demonstrate that the proposed method outperforms several recent
methods on all three datasets.
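The "scale-aware ground-truth density maps" mentioned above can be illustrated with a common technique from the crowd-counting literature: geometry-adaptive Gaussian kernels, where each annotated head contributes a Gaussian whose width scales with the distance to its nearest neighbors. This is a minimal sketch under that assumption; the function name, the `k` and `beta` parameters, and the fallback sigma are illustrative, and the paper's exact procedure may differ.

```python
# Hypothetical sketch: scale-aware density map via geometry-adaptive Gaussians.
# sigma for each head is proportional to its mean k-NN distance, so densely
# packed (distant, small) heads get narrow kernels and sparse heads get wide ones.
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(points, shape, k=3, beta=0.3):
    """Sum one Gaussian per annotated head point (x, y); returns an HxW map."""
    h, w = shape
    dmap = np.zeros((h, w), dtype=np.float32)
    pts = np.asarray(points, dtype=np.float32)
    for i, (x, y) in enumerate(pts):
        if len(pts) > 1:
            # Distance to the k nearest other heads approximates local head size.
            d = np.sqrt(((pts - pts[i]) ** 2).sum(axis=1))
            sigma = beta * np.sort(d)[1:k + 1].mean()
        else:
            sigma = 15.0  # illustrative fallback for an isolated head
        delta = np.zeros((h, w), dtype=np.float32)
        delta[int(np.clip(y, 0, h - 1)), int(np.clip(x, 0, w - 1))] = 1.0
        # gaussian_filter uses a normalized kernel, so mass (the count) is preserved.
        dmap += gaussian_filter(delta, sigma)
    return dmap

dm = density_map([(10, 12), (14, 12), (40, 40)], (64, 64))
print(dm.sum())  # integrates to the head count, here ≈ 3.0
```

The key property for training is that the map integrates to the ground-truth count while encoding per-head scale, which is what makes the supervision "scale-aware."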