Thesis Defense
Date:
The central challenge in computer vision is to extract semantically meaningful information from visual inputs such as digital images or video frames, so that a computer or machine can act on those inputs according to a pre-defined scenario. Recently, object detection and semantic segmentation, two representative computer vision tasks, have made great progress in both recognition accuracy and inference speed. However, almost all current object detection and semantic segmentation models are evaluated in a closed environment, i.e., within a single dataset. Evaluation within a single dataset only demonstrates that an algorithm is effective under that closed environment; it cannot show the algorithm's effectiveness in an open environment, where the test scenarios contain data-level and label-level distributions never seen during the training phase. Therefore, since such distribution differences inevitably arise in practice, object detection and semantic segmentation in open environments are becoming the focus of current and future research.
Object detection and semantic segmentation in an open environment are challenging, mainly because these tasks must recognize and locate multiple object instances in scenes with more complex data distributions and more variable object scales. In addition, existing work has the following four drawbacks. First, as representative downstream recognition tasks in computer vision, object detection and semantic segmentation usually take high-resolution images as model input; alleviating the intra-domain data distribution differences caused by such high-resolution inputs is key to successfully deploying a source-domain model in the target domain. Second, the object scales and semantic relationships of foreground instances are inconsistent across domains; when pre-trained knowledge is transferred from one domain to another, this inconsistency usually degrades performance in the target scene significantly. Third, the inherent transferability of each semantic category may differ greatly across domains, so a criterion is needed to evaluate per-category model transferability and enable class-wise knowledge transfer. Fourth, almost all models assume that training and test data share the same category-label distribution, yet in some real applications novel classes appear that were never seen during training. Overall, the main contents and innovations of this research are as follows:
- A curriculum-style, local-to-global cross-domain adaptation strategy is proposed. The curriculum-style adaptation proceeds from easy to hard according to adaptation difficulties obtained from an entropy-based score for each patch of a target-domain image, and thus aligns local patches within an image well. The local-to-global adaptation performs feature alignment from locally semantic to globally structural feature discrepancies, using a semantic-level domain classifier and an entropy-level domain classifier to reduce these cross-domain discrepancies.
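As a minimal illustration of the curriculum-style idea, the sketch below scores each patch of a softmax probability map by its mean pixel-wise entropy and orders patches from easy (low entropy) to hard (high entropy). All names and the exact scoring form are hypothetical; this is not the thesis implementation.

```python
import numpy as np

def patch_entropy_scores(prob_maps, patch_size):
    """Score each non-overlapping patch of a per-class probability map
    by its mean pixel-wise Shannon entropy (higher = harder).

    prob_maps: (C, H, W) array of per-class probabilities.
    """
    C, H, W = prob_maps.shape
    # Pixel-wise entropy, normalised to [0, 1] by log(C).
    ent = -np.sum(prob_maps * np.log(prob_maps + 1e-12), axis=0) / np.log(C)
    scores = []
    for i in range(0, H - patch_size + 1, patch_size):
        for j in range(0, W - patch_size + 1, patch_size):
            scores.append(ent[i:i + patch_size, j:j + patch_size].mean())
    return np.asarray(scores)

def curriculum_order(scores):
    """Easy-to-hard curriculum: process low-entropy patches first."""
    return np.argsort(scores)
```

In a full pipeline, such an ordering would decide which target-domain patches are adapted first, before harder, more ambiguous regions.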
- An adversarial, foreground-aware Densely Semantic Enhancement Module (DSEM) is proposed. The DSEM is pluggable into different region-free detectors and achieves densely semantic feature matching via an adversarial learning strategy. To emphasize the important regions of an image, the DSEM learns to predict a transferable foreground enhancement mask that suppresses background disturbance. Moreover, since region-free detectors recognize objects of different scales using multi-layer feature maps, the DSEM encodes multi-scale representations across different domains.
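The masking operation itself can be sketched as follows: a small head predicts per-location foreground logits, and the resulting sigmoid mask re-weights the feature map so background locations are suppressed. This shows only the mask application; in the thesis the mask is learned adversarially, and all names here are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def foreground_enhance(features, mask_logits):
    """Re-weight a feature map with a predicted foreground mask.

    features:    (C, H, W) feature map from one detector level.
    mask_logits: (H, W) raw foreground scores from a mask head.
    Likely-foreground locations are kept (mask near 1), background
    locations are suppressed (mask near 0).
    """
    mask = sigmoid(mask_logits)          # (H, W), values in (0, 1)
    return features * mask[None, :, :]   # broadcast over channels
```

For multi-scale detectors, the same operation would be applied independently at each feature-pyramid level.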
- A class-wise transferable Joint Adaptive Detection Framework (JADF) is proposed. Previous works mainly align the marginal distribution through unsupervised cross-domain feature matching, ignoring each feature's categorical and positional information, which can be exploited for conditional alignment. In contrast, the JADF aligns both marginal and conditional distributions between domains without introducing any extra hyper-parameters. Furthermore, to account for the transferability of each object class, a metric for class-wise transferability assessment is proposed and incorporated into the JADF objective for domain adaptation.
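One plausible way to assess class-wise transferability (an assumption for illustration, not the metric defined in the thesis) is to measure, per class, how much that class's features confuse a domain discriminator: discriminator outputs near 0.5 suggest well-aligned, transferable features.

```python
import numpy as np

def class_transferability(domain_probs, labels, num_classes):
    """Hypothetical per-class transferability score in [0, 1].

    domain_probs: (N,) discriminator probability that each sample's
                  feature comes from the source domain.
    labels:       (N,) predicted class index of each sample.
    A class is scored 1 when the discriminator is maximally confused
    (prob = 0.5) and 0 when it is fully certain (prob = 0 or 1).
    """
    scores = np.zeros(num_classes)
    for c in range(num_classes):
        p = domain_probs[labels == c]
        if p.size == 0:
            continue  # class absent in this batch: leave score at 0
        scores[c] = np.mean(1.0 - 2.0 * np.abs(p - 0.5))
    return scores
```

Such scores could weight a per-class adaptation loss so that poorly transferable classes receive more (or more cautious) alignment.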
- A Sample-centric Feature Generation (SFG) approach is proposed. This work offers a new direction for effectively exploiting widely available unlabeled data: pseudo-label-driven, sample-centric, feature-level generation. First, a semi-supervised meta-generator produces derivative features centered on each pseudo-labeled sample, enriching intra-class feature diversity. Second, the sample-centric generation constrains the generated features to be compact and close to the pseudo-labeled sample, ensuring inter-class feature discriminability. Further, a reliability assessment (RA) metric is developed to weaken the influence of generated outliers on model learning.
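The two core ideas, compact generation around a pseudo-labeled feature and down-weighting of outliers, can be sketched as below. The Gaussian perturbation and the distance-based reliability weight are illustrative stand-ins (the thesis uses a learned meta-generator and its own RA metric); all names are hypothetical.

```python
import numpy as np

def sample_centric_generate(feature, num_gen, sigma, rng):
    """Generate derivative features centered on one pseudo-labeled
    sample. Small Gaussian perturbations keep the generated features
    compact and close to the original sample."""
    noise = rng.normal(0.0, sigma, size=(num_gen, feature.shape[0]))
    return feature[None, :] + noise

def reliability_weights(generated, prototype, tau=1.0):
    """Reliability weights in (0, 1]: decay with distance to an
    (assumed) class prototype, so generated outliers contribute
    less to model learning."""
    d = np.linalg.norm(generated - prototype[None, :], axis=1)
    return np.exp(-d / tau)
```

In training, each generated feature would then contribute to the loss in proportion to its reliability weight.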
Key words: Object detection, semantic segmentation, data distribution difference, class-label distribution difference, unsupervised learning, semi-supervised learning