


Title:
FINE-TUNING OF TRANSDUCTIVE FEW-SHOT LEARNING METHODS USING MARGIN-BASED UNCERTAINTY WEIGHTING AND PROBABILITY REGULARIZATION
Document Type and Number:
WIPO Patent Application WO/2024/091317
Kind Code:
A2
Abstract:
Disclosed herein is a novel method for improving transductive fine-tuning for few-shot learning using margin-based uncertainty weighting and probability regularization. Margin-based uncertainty is designed to assign low loss weights to wrongly predicted samples and high loss weights to correct ones. Probability regularization adjusts the probability of each testing sample by a scale vector, which quantifies the difference between the class marginal distribution and the uniform distribution.

Inventors:
SAVVIDES MARIOS (US)
TAO RAN (US)
CHEN HAO (US)
Application Number:
PCT/US2023/029854
Publication Date:
May 02, 2024
Filing Date:
August 09, 2023
Assignee:
UNIV CARNEGIE MELLON (US)
International Classes:
G06N3/08
Attorney, Agent or Firm:
CARLETON, Dennis M. (US)
Claims:
Attorney Docket: 8350.2023-016WO

[0039] In doing so, each sample from the query set obtains a unique scale vector b, which allows per-sample probability regularization. Meanwhile, aligning the estimated marginal probability of $x \cup S$ to uniform avoids direct regularization on the class marginal probability of the whole query set. This allows the probability regularization to remain theoretically effective even when the actual testing set is not uniform. The uniform prior serves as a solid regularization to enforce class balance during fine-tuning.

[0040] By solving the issue of class-imbalanced predictions in few-shot learning, TF-MP enhances real-world few-shot applications. The margin-based uncertainty weighting provides a better measurement of the uncertainty in predictions, supported by theoretical and empirical analysis.

[0041] As would be realized by one of skill in the art, many variations in the designs discussed herein fall within the intended scope of the invention. Moreover, it is to be understood that the features of the various embodiments described herein are not mutually exclusive and can exist in various combinations and permutations, even if such combinations or permutations were not made express herein, without departing from the spirit and scope of the invention. Accordingly, the method and system disclosed herein are not to be taken as limitations on the invention but as an illustration thereof. The scope of the invention is defined by the claims which follow.

Claims

1.
A method of improving few-shot learning for a machine learning model comprising: pretraining a feature extractor of the machine learning model on a training dataset; performing transductive fine-tuning of the machine learning model using a test dataset; training the model on a new class having few samples; predicting a probability of a correct classification for each class for each sample; wherein wrongly-predicted samples from the new class are assigned low loss weights and correctly-predicted samples from the new class are assigned high loss weights.

2. The method of claim 1 wherein the assigned per-sample loss weights are entropy-based.

3. The method of claim 2 wherein the entropy quantifies an uncertainty of a probability of a correct prediction for the sample, wherein a larger uncertainty implies a lower confidence level, resulting in a lower loss weight for the sample.

4. The method of claim 2 further comprising: determining a margin between probabilities for two classes for each sample; wherein the two classes are the classes having the highest and second-highest probability of a correct prediction for the sample.

5. The method of claim 4 wherein a smaller margin indicates a larger uncertainty of a correct prediction for the sample.

6. The method of claim 4 wherein the highest and second-highest probabilities are normalized.

7. The method of claim 6 wherein the entropy-based loss weight is a function of the margin for each sample.

8. The method of claim 7 further comprising: regularizing the probabilities for each testing sample.

9. The method of claim 8 further comprising: obtaining a scale vector for each testing sample.
Description:
APPLICATION FILED UNDER THE PATENT COOPERATION TREATY AT THE UNITED STATES RECEIVING OFFICE for Fine-tuning of Transductive Few-Shot Learning Methods Using Margin-Based Uncertainty Weighting and Probability Regularization

Applicant: Carnegie Mellon University (ID2023-016)
Inventors: Marios Savvides, Ran Tao, Hao Chen
Prepared By: Dennis M. Carleton, Principal, KDW Firm PLLC, 2601 Weston Pkwy., Suite 103, Cary, NC 27513, 919-396-5643

Fine-tuning of Transductive Few-Shot Learning Methods Using Margin-Based Uncertainty Weighting and Probability Regularization

Related Applications

[0001] This application is a non-provisional of, and claims the benefit of, U.S. Provisional Patent Application No. 63/396,655, filed August 10, 2022, entitled “Transductive Few-Shot Classification With Decisive Weighting and Fairness Scaling”, the contents of which are incorporated herein in their entirety.

Government Interest

[0002] This invention was made with United States Government support under contract AW911NF20D0002 awarded by the U.S. Army. The U.S. Government has certain rights in the invention.

Background

[0003] Deep learning has made vital progress in various architecture designs, optimization techniques, data augmentation, and learning strategies, and has demonstrated great potential in applications to real-world scenarios. However, applications of deep learning generally require a large amount of labeled data, which is time-consuming to collect and costly to label manually.

[0004] Few-shot learning (FSL) is a machine learning framework that enables a pre-trained model to generalize over new categories of data not seen during training using only a few labeled samples per class. FSL has become increasingly essential to significantly alleviate the dependence on data acquisition.
Recent attention on FSL over out-of-distribution data poses a challenge in obtaining efficient algorithms that perform well in cross-domain situations. Fine-tuning a pre-trained feature extractor with a few samples has the potential to solve this challenge.

[0005] However, having only a few training samples leads to a biased estimation of the true data distribution. The biased learning during few-shot fine-tuning can further mislead the model into learning an imbalanced class marginal distribution. To verify this, the largest difference (LD) between the numbers of per-class predictions with a uniform testing set is quantified. If the fine-tuned model learns a balanced class marginal distribution, then with a uniform testing set LD should approach zero. However, empirical results show the opposite. As shown in FIG.1, even with prior art methods, LD can be significantly over 10 in practice. The fine-tuned models in FSL suffer from severely imbalanced categorical performance. In other words, the learned class marginal distribution of few-shot fine-tuned models is largely imbalanced and biased. Solving this issue is critical to maintaining the algorithm’s robustness to different testing scenarios. Classes with fewer predictions carry low accuracy, and this issue of fine-tuned models could yield a fatal failure for testing scenarios that favor these classes.

[0006] FIG.1 shows that models fine-tuned with current state-of-the-art methods learn an imbalanced class marginal distribution. In the empirical experiments, a uniform testing set is utilized, and the largest difference (LD) between per-class predictions is used to quantify whether the learned class marginal probability is balanced. Data are from sub-datasets in Meta-Dataset with 100 episodes for each dataset and 10 per-class testing samples. With prior art methods, LD is over 10.
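The LD metric described above is straightforward to compute. The sketch below is a hypothetical illustration (one plausible reading of LD: the gap between the most- and least-predicted class counts; the toy prediction arrays are invented, not experimental data):

```python
import numpy as np

def largest_difference(predictions: np.ndarray, num_classes: int) -> int:
    """Largest difference (LD) between per-class prediction counts.

    With a uniform testing set, a perfectly balanced model yields
    identical counts for every class, so LD approaches zero.
    """
    counts = np.bincount(predictions, minlength=num_classes)
    return int(counts.max() - counts.min())

# Toy illustration: 5 classes, 10 testing samples per class (uniform set),
# but the model's predictions collapse onto a few classes.
imbalanced = np.array([0] * 22 + [1] * 14 + [2] * 8 + [3] * 4 + [4] * 2)
balanced = np.repeat(np.arange(5), 10)

print(largest_difference(imbalanced, 5))  # 20 -> imbalanced marginal
print(largest_difference(balanced, 5))    # 0  -> balanced marginal
```

A balanced model drives this number toward zero regardless of how accurate it is, which is why LD isolates the class-marginal bias rather than overall accuracy.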
Summary of the Invention

[0007] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended as an aid in determining the scope of the claimed subject matter.

[0008] Disclosed herein is a method providing an improvement to transductive fine-tuning for few-shot learning that effectively uses unlabeled testing data. The imbalanced categorical performance in FSL motivates two solutions.

[0009] The first solution uses per-sample loss weighting through margin-based uncertainty. As shown in FIG.2, even using the same number of per-class training data yields extremely imbalanced prediction results. This indicates that each sample contributes to the final performance differently. Therefore, the unlabeled testing samples are weighted according to their uncertainty scores. The use of the margin in the entropy computation is disclosed to compress the utilization of wrong predictions.

[0010] The second solution is probability regularization. As the ideal performance should be categorically balanced, the probability for each testing data point is explicitly regularized. Precisely, the categorical probability of each testing sample is adjusted by a scale vector, which quantifies the difference between the class marginal distribution and the uniform distribution. The class marginal distribution is estimated by combining each query sample with the complete support set.

[0011] The method for improving the fine-tuning of transductive few-shot learning using margin-based uncertainty weighting and probability regularization (TF-MP) effectively reduces the largest difference between per-class predictions by around 5 samples and further improves per-class accuracy by approximately 2.1%. This is shown in FIG.1.
Meanwhile, TF-MP shows robust cross-domain performance boosts on Meta-Dataset, demonstrating its potential in real applications.

[0012] There are thus two novel aspects disclosed herein for improving transductive fine-tuning: utilizing margin-based uncertainty to weight each unlabeled testing data point in the loss objective, compressing the utilization of possibly wrong predictions; and regularizing the categorical probability for each testing sample to pursue a more balanced class marginal during fine-tuning.

[0013] FIG.2 is an illustration of TF-MP. The results of a 1-shot 10-way classification are empirically evaluated on the correct/predicted number of per-class predictions. The model without TF-MP presents severely imbalanced categorical performance even with the same number of per-class training samples. Using the margin-based uncertainty disclosed herein, the loss of each unlabeled testing data point is weighted during fine-tuning, compressing the utilization of wrongly predicted testing data. The categorical probability for each testing data point is regularized to pursue balanced class-wise learning during fine-tuning. Using TF-MP, the difference between per-class predictions reduces from 21.3% to 14.4%, with per-class accuracy improved from 4.5% to 4.9%. Results are averaged over 100 episodes in Meta-Dataset.

Brief Description of the Drawings

[0014] By way of example, specific exemplary embodiments of the disclosed systems and methods will now be described, with reference to the accompanying drawings, in which:

[0015] FIG.1 is a graph showing the largest difference between the number of per-class predictions with a uniform testing set for prior art methods and the method of the present invention.

[0016] FIG.2 contains graphs illustrating the benefit of the disclosed methods.

[0017] FIG.3 contains graphs showing a 3-class illustration of uncertainty scores computed by both margin-based entropy and regular entropy.

Detailed Description

[0018] Transductive few-shot learning uses the unlabeled query set (testing images) along with the support set (training images) to make up for the lack of training data. Disclosed herein is a framework for performing transductive fine-tuning and a disclosure of TF-MP.

[0019] First, the terminology and episode setting in FSL will be formally described. For one episode in FSL, the training and testing sets are referred to as the support and query set, respectively. Let $(x, y)$ denote the pair of an input $x$ with its ground-truth one-hot label $y \in \mathbb{R}^C$, where $C$ is the number of classes. The support set is then represented as $S = \{(x_i, y_i)\}_{i=1}^{N_s}$. The query set is denoted as $Q = \{x_i\}_{i=1}^{N_q}$, where the ground-truth labels are unknown if used in a transductive manner, and $N_s$ and $N_q$ are the total numbers of samples in the support set and query set, respectively.

[0020] A feature extractor $f_\theta$ is first pre-trained on a meta-training dataset, and transductive fine-tuning is conducted on the meta-test dataset within each episode. We denote $p_\theta(y|x)$ as the categorical probabilities over the $C$ classes, which is the output from the softmax layer of the model:

$p_\theta(y = c \mid x) = \exp(z_c) / \sum_{j=1}^{C} \exp(z_j)$   (1)

where: $z_c = \langle w_c, f_\theta(x) \rangle$; $c \in \{1, \dots, C\}$; and the dot product between $w_c$ and $f_\theta(x)$ is the logit for class $c$.

[0021] $w_c$ is the novel class prototype that is initialized as the mean feature from the support set $S$ for every iteration. A model with parameters $\theta$ is learned to classify $S$ and $Q$ as measured by the following criterion:

$\theta^*(S, Q) = \arg\min_\theta \left[ \frac{1}{N_s} \sum_{(x,y) \in S} \mathcal{L}_s(x, y) + \frac{1}{N_q} \sum_{x \in Q} \mathcal{L}_q(x) \right]$   (2)

[0022] The loss $\mathcal{L}_s(x, y)$ for the labeled support set is the cross-entropy loss, and the loss $\mathcal{L}_q(x)$ for the unlabeled query set is constructed as a weighted entropy:

$\mathcal{L}_q(x) = \omega(p_\theta(y|x)) \times H(p_\theta(y|x))$   (3)

where: $\omega$ denotes the per-sample loss weight; and $H(p_\theta(y|x)) = -p_\theta(y|x) \log(p_\theta(y|x))$ is the entropy loss.

[0023] The loss on unlabeled data can be generally represented as $H(p_\theta(y|x)) = -\tilde{y} \log(p_\theta(y|x))$. As widely used in semi-supervised learning, there are two choices of $\tilde{y}$: when $\tilde{y} = \arg\max(p_\theta(y|x))$ it is referred to as a pseudo-label, whereas when $\tilde{y} = p_\theta(y|x)$ it is noted as a soft label.

[0024] In prior art transductive fine-tuning, the soft label is utilized with $\omega = 1$ for every testing image, and the entropy minimization is conducted on the logit space. Different from the prior art, $\mathcal{L}_q(x)$ is directly optimized on the feature space, and $\omega(p_\theta(y|x))$ is used to compress the utilization of wrong predictions. Probability regularization is applied on $p_\theta(y|x)$ before forwarding it to $H(p_\theta(y|x))$.

Margin-based Uncertainty Weighting

[0025] Margin-based uncertainty is designed to assign low loss weights to wrongly predicted samples and high loss weights to correct ones. However, the generally used entropy-based weighting may not truly reflect whether a sample has a wrong prediction. Therefore, margin-based uncertainty weighting is used to compress the utilization of wrongly predicted testing data.

[0026] The class with the maximum probability $p_{\max}$ is assigned as the predicted class.
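Before turning to the confidence measure, the baseline objective of Eqs. (1)-(3) can be illustrated with a minimal sketch. This is not the patented implementation: the random features, dimensions, and uniform placeholder weights $\omega = 1$ are hypothetical stand-ins for a real pre-trained extractor and for the margin-based weights described below:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical episode: C=3 classes, 1-shot support, 4 query samples,
# with toy "feature extractor" outputs of dimension d=5.
rng = np.random.default_rng(0)
support_feats = rng.normal(size=(3, 1, 5))   # (C, shots, d)
query_feats = rng.normal(size=(4, 5))        # (N_q, d)

# Class prototypes w_c: mean feature of each class in the support set.
prototypes = support_feats.mean(axis=1)       # (C, d)

# Eq. (1): logits are dot products <w_c, f_theta(x)>; softmax gives p_theta(y|x).
logits = query_feats @ prototypes.T           # (N_q, C)
probs = softmax(logits)

# Eq. (3): per-sample weighted entropy on the unlabeled query set
# (soft-label form), here with placeholder weights omega = 1.
omega = np.ones(len(probs))
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
query_loss = (omega * entropy).mean()         # second term of Eq. (2)
print(round(float(query_loss), 4))
```

In a full episode this query loss would be summed with the cross-entropy loss on the labeled support set, per Eq. (2), and minimized over the model parameters.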
Thus, $p_{\max}$ is referred to as the confidence, which indicates the confidence level of the categorical prediction. The other index used to indicate the confidence level of the prediction is the entropy of the predicted probabilities. In semi-supervised learning, an entropy-based per-sample loss weight is used:

$\omega(P) = 1 - E(P)$   (4)

where: $P = p_\theta(y|x)$; and $E(P)$ refers to the normalized entropy:

$E(P) = -\frac{1}{\log C} \sum_{i=1}^{C} p_i \log p_i$   (5)

where: $\sum_{i=1}^{C} p_i = 1$; $P = [p_1, p_2, \dots, p_C]$; and $C$ is the number of classes.

[0027] $E(P)$ is normalized to $[0, 1]$ because the entropy $-\sum_{i=1}^{C} p_i \log p_i$ is scaled by its maximum value $\log C$. The entropy of $p_\theta(y|x)$ quantifies the uncertainty of the probabilities. Larger uncertainty generally indicates a lower confidence level the sample carries towards its class prediction, consequently leading to a lower loss weight $\omega(P)$. However, Eq. (5) indicates that the uncertainty on the whole probability distribution may not be ideal for distinguishing whether a prediction is wrong.

[0028] Intuitively, wrong predictions are more likely to be made when the model produces similar probabilities for two classes. In other words, the margin $\Delta p$ between the maximum and second maximum probability can largely reflect how uncertain an example is about its prediction. A smaller margin indicates larger uncertainty about the prediction, which indicates a higher possibility that the prediction is wrong.

[0029] Margin information is reflected in the entropy-based uncertainty measurement. When $p_{\max}$ is fixed, the margin $\Delta p$ ranges from $\min(\Delta p) = p_{\max} - (1 - p_{\max})$, reached when the remaining probability mass falls on a single class, to $\max(\Delta p) = p_{\max} - \frac{1 - p_{\max}}{C - 1}$, reached when the remaining mass is spread uniformly over the other $C - 1$ classes.
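As a concrete check of Eqs. (4)-(5), the following sketch (hypothetical probability vectors; NumPy assumed available) computes the normalized entropy and the resulting per-sample loss weight:

```python
import numpy as np

def normalized_entropy(p) -> float:
    """Eq. (5): entropy of P, scaled into [0, 1] by its maximum, log C."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # 0 * log 0 is taken as 0
    return float(-(nz * np.log(nz)).sum() / np.log(len(p)))

def entropy_weight(p) -> float:
    """Eq. (4): per-sample loss weight omega(P) = 1 - E(P)."""
    return 1.0 - normalized_entropy(p)

# A confident prediction earns a larger loss weight than an ambiguous one.
print(round(entropy_weight([0.9, 0.05, 0.05]), 3))   # high confidence
print(round(entropy_weight([0.4, 0.35, 0.25]), 3))   # near-uniform
```

The near-uniform vector receives a weight close to zero, so such samples contribute little to the fine-tuning loss under this scheme.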
For $\max(\Delta p)$, the entropy is:

$E_{\max(\Delta p)}(P) = E_{\min(\Delta p)}(P) + \frac{(1 - p_{\max}) \log(C - 1)}{\log C}$   (6)

[0031] As $\frac{(1 - p_{\max}) \log(C - 1)}{\log C}$ is non-negative, Eq. (6) reveals that samples with the largest margin $\max(\Delta p)$ carry larger entropy-based uncertainty scores than samples with $\min(\Delta p)$, which contradicts the information implied by the margin.

[0032] To resolve this contradiction, only the top-2 probabilities are used in Eq. (5). The maximum and second maximum probabilities are first normalized by dividing by their sum to satisfy the requirement $\sum_i p_i = 1$ in Eq. (5). $\tilde{p}_{\max}$ and $\Delta\tilde{p}$ denote the normalized confidence and margin, which are further used in Eq. (7). The margin-based uncertainty is defined as:

$\tilde{E}(P) = -\frac{1}{\log 2} \left[ \tilde{p}_{\max} \log \tilde{p}_{\max} + (\tilde{p}_{\max} - \Delta\tilde{p}) \log(\tilde{p}_{\max} - \Delta\tilde{p}) \right]$   (7)

[0033] Eq. (7) combines the margin with the entropy form. When the margin $\Delta p$ is fixed, $\tilde{E}(P)$ is non-decreasing with the confidence $p_{\max}$; when the confidence $p_{\max}$ is fixed, $\tilde{E}(P)$ is non-increasing in the margin $\Delta p$. In doing so, the margin-based entropy score consistently reflects the confidence level $p_{\max}$ as well as the margin $\Delta p$, as shown in FIG.3. By focusing on the uncertainty delivered by the margin in $P$, it achieves more substantial compression of the utilization of wrong predictions than entropy-based loss weights.

[0034] FIG.3 shows a 3-class illustration of uncertainty scores computed by both margin-based entropy and regular entropy. The change in uncertainty scores with respect to confidence and margin is plotted. Entropy assigns lower uncertainty scores over the minimum-margin area (lighter red), while margin-based entropy assigns uncertainty scores consistent with the information conveyed by confidence and margin: higher uncertainty scores (darker red) over the low-confidence (0.4 - 0.5) and small-margin areas.
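The contrast between the two scores can be reproduced numerically. The sketch below (NumPy assumed) implements the normalized entropy of Eq. (5) and the margin-based entropy of Eq. (7), and evaluates both on two hypothetical 3-class probability vectors of the kind discussed in connection with FIG.3:

```python
import numpy as np

def regular_entropy(p) -> float:
    """Eq. (5): full-distribution entropy, normalized by log C."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum() / np.log(len(p)))

def margin_entropy(p) -> float:
    """Eq. (7): entropy of the top-2 probabilities, renormalized to sum
    to 1 and scaled by log 2 (the 2-class maximum entropy)."""
    p = np.sort(np.asarray(p, dtype=float))[::-1]
    top2 = p[:2] / p[:2].sum()          # normalized p_max and second max
    nz = top2[top2 > 0]
    return float(-(nz * np.log(nz)).sum() / np.log(2))

# Two hypothetical 3-class predictions with the same confidence p_max = 0.6:
p_small_margin = [0.6, 0.4, 0.0]   # margin 0.2 -> should look uncertain
p_large_margin = [0.6, 0.2, 0.2]   # margin 0.4 -> should look confident

print(regular_entropy(p_small_margin), margin_entropy(p_small_margin))
print(regular_entropy(p_large_margin), margin_entropy(p_large_margin))
```

On the small-margin vector the margin-based score exceeds the full-distribution entropy, while on the large-margin vector it is lower, matching the ordering the margin implies.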
Compared with entropy, margin-based entropy increases the uncertainty score of $p = [0.6, 0.4, 0]$ (margin = 0.2) from 0.61 to 0.98 and decreases the uncertainty score of $p = [0.6, 0.2, 0.2]$ (margin = 0.4) from 0.86 to 0.81.

Probability Regularization

[0035] As previously discussed, the learned class marginal distribution from a few-shot fine-tuned model is severely imbalanced. Therefore, the categorical probability for each testing sample is explicitly regularized, as will now be disclosed.

[0036] The probability regularization is explicitly conducted on the predicted probability $p_\theta(y|x)$ for each testing data point. First, with $x \in Q$, the learned class marginal distribution is estimated using the set $x \cup S$, which is constructed by combining each testing data point with the whole support set. A unique scale vector $b \in \mathbb{R}^C$ is obtained for each testing sample by aligning the estimated marginal probability with a uniform prior:

$b = u / \bar{p}_{x \cup S}$ (element-wise division)   (8)

where: $u \in \mathbb{R}^C$ represents the uniform prior; $\bar{p}_{x \cup S}$ is the class marginal probability estimated on $x \cup S$; and $b$ is a scale vector quantifying the difference between the estimated marginal distribution and the uniform prior.

[0037] Furthermore, $b$ is used to conduct probability regularization on $q = p_\theta(y|x)$ as:

$\hat{q} = \mathrm{normalize}(q * b)$   (9)

where: $\mathrm{normalize}(v)_c = v_c / \sum_{j=1}^{C} v_j$; and $*$ denotes element-wise multiplication.

[0038] $(q * b)$ applies re-scaling on $q$ to reduce the difference between the estimated marginal distribution and the uniform prior.
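The regularization step of Eqs. (8)-(9) can be sketched as follows. This is a minimal illustration assuming the class marginal has already been estimated; the toy marginal and query probabilities are hypothetical:

```python
import numpy as np

def regularize(q, marginal):
    """Rescale one query sample's probabilities q by b = u / marginal
    (Eq. (8)), then renormalize so the result is again a distribution
    (Eq. (9))."""
    q = np.asarray(q, dtype=float)
    u = np.full(len(q), 1.0 / len(q))   # uniform prior over C classes
    b = u / marginal                    # Eq. (8): per-class scale vector
    scaled = q * b                      # element-wise rescaling
    return scaled / scaled.sum()        # Eq. (9): renormalize

# Toy example: the estimated marginal over-represents class 0.
marginal = np.array([0.5, 0.3, 0.2])
q = np.array([0.6, 0.3, 0.1])
q_hat = regularize(q, marginal)
print(q_hat)  # class 0 is suppressed, under-predicted classes are boosted
```

Because each query sample gets its own scale vector (estimated on its own $x \cup S$), the regularization is per-sample and does not force the whole query set toward a uniform marginal.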