Tiya Vaj
1 min read · Apr 9, 2024

Pseudo-labeling, also known as self-training, is a semi-supervised learning technique used when labeled data is scarce or expensive to obtain. It involves leveraging a small amount of labeled data along with a larger amount of unlabeled data to improve model performance.

Here’s how pseudo-labeling generally works:

1. Initial Training: The model is first trained on the small labeled dataset, learning to make predictions based on the provided labels.

2. Prediction on Unlabeled Data: After the initial training, the trained model is used to make predictions on the unlabeled data. These predictions are treated as pseudo-labels.

3. Combining Labeled and Pseudo-labeled Data: The pseudo-labeled data (unlabeled data with predicted labels) is added to the original labeled dataset.

4. Re-training: The model is then re-trained using the combined dataset, which now includes both the original labeled data and the pseudo-labeled data.

5. Iterative Process: Steps 2–4 are repeated iteratively, with the model being re-trained on the expanded dataset in each iteration.
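The loop above can be sketched in a few lines. The example below is a toy illustration, not a production recipe: it stands in for "the model" with a simple nearest-centroid classifier on 1-D data (the data, class centers, and iteration count are all invented for the demo), but the train → pseudo-label → combine → re-train cycle is exactly steps 1–5.

```python
import numpy as np

def centroid_predict(centroids, X):
    # Assign each point to the nearest class centroid
    # (a toy stand-in for a trained model's predictions).
    d = np.abs(X[:, None] - centroids[None, :])  # distance to each centroid
    return d.argmin(axis=1)

# Toy 1-D data: class 0 clusters near -2, class 1 near +2.
rng = np.random.default_rng(0)
X_lab = np.array([-2.1, -1.9, 2.0, 2.2])  # small labeled set (step 1)
y_lab = np.array([0, 0, 1, 1])
X_unl = np.concatenate([rng.normal(-2, 0.5, 50),
                        rng.normal(2, 0.5, 50)])  # larger unlabeled pool

X, y = X_lab.copy(), y_lab.copy()
for _ in range(3):  # step 5: repeat steps 2-4 for a few iterations
    # Steps 1/4: "train" by computing per-class centroids from current labels.
    centroids = np.array([X[y == c].mean() for c in (0, 1)])
    # Step 2: predict pseudo-labels for the unlabeled pool.
    pseudo = centroid_predict(centroids, X_unl)
    # Step 3: combine the original labeled data with the pseudo-labeled data.
    X = np.concatenate([X_lab, X_unl])
    y = np.concatenate([y_lab, pseudo])

print(centroids)  # the learned class centers, refined over the iterations
```

Each pass re-estimates the "model" (here, the centroids) on an expanded dataset, so the decision boundary is shaped by far more points than the four originally labeled ones.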

The idea behind pseudo-labeling is that the model’s predictions on the unlabeled data can provide useful information, effectively increasing the amount of labeled data available for training. This process can help improve the model’s performance, especially when there is a scarcity of labeled data. However, it requires careful consideration of the reliability of the pseudo-labels and the potential for error propagation.
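A common way to limit error propagation is to keep only pseudo-labels the model is confident about. The sketch below assumes the model outputs class probabilities (the probability matrix and the 0.9 cutoff are illustrative choices, not fixed values from the technique):

```python
import numpy as np

def confident_mask(probs, threshold=0.9):
    # Keep only pseudo-labels whose top predicted probability clears the
    # threshold, so uncertain (likely wrong) labels are never trained on.
    return probs.max(axis=1) >= threshold

# Hypothetical predicted class probabilities for 4 unlabeled examples.
probs = np.array([[0.97, 0.03],
                  [0.55, 0.45],
                  [0.10, 0.90],
                  [0.60, 0.40]])
mask = confident_mask(probs)
print(mask)  # only the first and third predictions pass the 0.9 cutoff
```

Only the rows where `mask` is `True` would be added to the training set in that iteration; the rest stay in the unlabeled pool and may be admitted later once the re-trained model becomes more confident about them.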




Ph.D. research scholar in NLP, passionate about data-driven work for social good. Let's connect here.