Exploring Word Importance in Language Models: A Case Study Using Perturbation-based Explainability Techniques

Tiya Vaj
4 min read · Feb 26, 2024

This article delves into the inner workings of language models by employing perturbation-based explainability techniques to analyze the importance of individual words within a given sentence.

Through a step-by-step exploration of the processing pipeline, readers will gain insights into how language models interpret text data and make predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics.pairwise import cosine_similarity

# Function to generate perturbations of the input sentence
def generate_perturbations(sentence):
    words = sentence.split()
    perturbations = []
    for i in range(len(words)):
        perturbation = words[:i] + ['[MASK]'] + words[i+1:]
        perturbations.append(' '.join(perturbation))
    return perturbations

# Function to get embeddings for a given sentence (replace this with an actual implementation)
def get_embedding(sentence):
    # Dummy function: returns a random 300-dimensional vector as an example
    return np.random.rand(300)

# Create a dataset with a single sentence
sentence = "The sun is shining brightly in the clear blue sky."

# Define a language-based model (replace this with your model)
# Assume 'model' is your language-based model that predicts sentiment or some other classification
def language_model_predict(sentence):
    # Dummy function: returns a random prediction as an example
    return np.random.rand()

# Generate perturbations for the sentence
perturbations = generate_perturbations(sentence)

# Get embeddings for the original sentence and the perturbations
original_embedding = get_embedding(sentence)
perturbations_embedding = [get_embedding(p) for p in perturbations]

# Calculate predictions for the perturbations
predictions = [language_model_predict(p) for p in perturbations]

# Calculate similarity between the original instance and the perturbations
similarity_scores = cosine_similarity(original_embedding.reshape(1, -1),
                                      np.array(perturbations_embedding))

# Weight each perturbation's prediction by its similarity to the original
weighted_predictions = np.array(predictions) * similarity_scores.flatten()

# Fit an explainable (surrogate) model
explainable_model = LinearRegression()
explainable_model.fit(perturbations_embedding, weighted_predictions)

# Interpret coefficients (note: with embedding features, there is one
# coefficient per embedding dimension rather than one per word)
feature_importances = explainable_model.coef_
```

Let me explain the processing steps involved in the code above:

1. Generating Perturbations: The `generate_perturbations` function takes a sentence as input and creates perturbations by replacing each word with a special token (`[MASK]`). This results in a list of perturbations, where each perturbation represents the absence of a specific word from the original sentence.
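As a quick sanity check, here is what the perturbation list looks like for a short sentence, using the same masking logic as above:

```python
def generate_perturbations(sentence):
    # Replace each word in turn with [MASK], producing one perturbation per word
    words = sentence.split()
    return [' '.join(words[:i] + ['[MASK]'] + words[i + 1:])
            for i in range(len(words))]

for p in generate_perturbations("The sun is shining"):
    print(p)
# [MASK] sun is shining
# The [MASK] is shining
# The sun [MASK] shining
# The sun is [MASK]
```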

2. Getting Embeddings: The `get_embedding` function is a placeholder for obtaining embeddings (numerical representations) for a given sentence. In the provided example, a random embedding vector of size 300 is returned for demonstration purposes. In a real-world scenario, you would replace this function with an actual implementation that provides meaningful embeddings based on a language model or other techniques.
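If you do not have a pretrained encoder at hand, one deterministic stand-in (purely for experimentation, not a real embedding model) is to average fixed per-word vectors seeded from a hash of each word:

```python
import zlib
import numpy as np

def get_embedding(sentence, dim=300):
    # Toy stand-in for a real sentence encoder: each word gets a fixed
    # pseudo-random vector (seeded by a CRC32 hash of the word), and the
    # sentence embedding is the average of its word vectors.
    vecs = []
    for word in sentence.lower().split():
        rng = np.random.default_rng(zlib.crc32(word.encode('utf-8')))
        vecs.append(rng.standard_normal(dim))
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

emb = get_embedding("The sun is shining")
print(emb.shape)  # (300,)
```

Unlike the purely random placeholder, this stand-in is deterministic, so repeated calls on the same sentence agree, which matters when comparing perturbations. In practice you would replace it with a real encoder.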

3. Predicting with the Language Model: The `language_model_predict` function simulates predictions made by a language-based model (such as a sentiment analysis model) for a given sentence. In the provided example, a random prediction value is returned as a placeholder.
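To make the pipeline runnable end-to-end without a trained model, the placeholder can be swapped for a toy lexicon-based scorer (the word list below is invented for illustration only):

```python
# Invented mini-lexicon for illustration only
POSITIVE_WORDS = {"sun", "shining", "brightly", "clear"}

def language_model_predict(sentence):
    # Toy stand-in for a sentiment model: the fraction of words that
    # appear in the positive lexicon (punctuation stripped).
    words = [w.strip('.,!?').lower() for w in sentence.split()]
    return sum(w in POSITIVE_WORDS for w in words) / max(len(words), 1)

print(language_model_predict("The sun is shining brightly in the clear blue sky."))
# 0.4
```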

4. Calculating Similarity Scores: Cosine similarity is computed between the embedding of the original sentence and the embeddings of the perturbations. This measures the similarity between the original instance and each perturbation in terms of their embeddings.
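Cosine similarity measures the angle between two vectors rather than their magnitudes, ranging from -1 to 1. A tiny worked example:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

original = np.array([[1.0, 0.0]])
perturbed = np.array([[1.0, 1.0],   # 45 degrees away from the original
                      [0.0, 1.0]])  # orthogonal to the original
print(cosine_similarity(original, perturbed))
# [[0.70710678 0.        ]]
```

A lower score means the perturbation's embedding has moved further from the original, i.e. removing that word changed the representation more.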

5. Weighting Predictions: The predictions made by the language model for each perturbation are multiplied by the corresponding cosine similarity score. This results in weighted predictions, where the importance of each perturbation is adjusted based on its similarity to the original sentence.

6. Fitting Explainable Model: A linear regression model (in this case, `LinearRegression` from scikit-learn) is fitted to the embeddings of the perturbations and their weighted predictions. This model aims to explain the relationship between the perturbation embeddings and the weighted predictions.

7. Interpreting Coefficients: The coefficients of the fitted linear regression model serve as feature importances. Note that because the surrogate here is fitted on embeddings, each coefficient strictly corresponds to an embedding dimension; to attribute importance directly to individual words, LIME-style methods instead fit the surrogate on binary word-presence features, so that each coefficient maps to one word. Higher coefficient magnitudes indicate greater importance, suggesting that the corresponding feature has a stronger influence on the model's decisions.
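The per-word attribution described in step 7 can be sketched LIME-style with binary presence features. All predictions and similarity weights below are made-up numbers for illustration; here the similarities enter as `sample_weight` in the surrogate fit (as LIME does) rather than multiplying the predictions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

words = ["The", "sun", "is", "shining", "brightly"]
n = len(words)

# Binary interpretable features: row i is the perturbation with word i
# masked out (0 = absent, 1 = present).
X = np.ones((n, n)) - np.eye(n)

# Hypothetical model outputs for each perturbation: masking "sun"
# (row 1) hurts the score the most.
predictions = np.array([0.90, 0.35, 0.88, 0.60, 0.85])

# Hypothetical similarity weights between the original and each perturbation.
similarities = np.array([0.95, 0.70, 0.94, 0.85, 0.92])

surrogate = LinearRegression()
surrogate.fit(X, predictions, sample_weight=similarities)

# Coefficient j is the importance of word j's presence: the word whose
# masking caused the biggest drop ("sun") gets the largest coefficient.
ranking = sorted(zip(words, surrogate.coef_), key=lambda t: -t[1])
print(ranking[0][0])  # sun
```

Using sample weights keeps the regression targets on the model's original output scale, whereas multiplying predictions by similarity (as in the pipeline above) changes the quantity being regressed; both are valid weighting schemes, but they are not equivalent.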

Overall, this processing pipeline enables us to analyze the importance of individual words in the original sentence and understand how their absence affects the predictions of the language-based model. It provides insights into the model’s behavior and helps in explaining its decisions.

When masking the word "sun" yields the highest feature importance, it suggests that "sun" plays a significant role in the model's predictions. Interpret this cautiously, however: high feature importance means that removing "sun" substantially changes the model's output, but it does not necessarily make "sun" the most important word in the sentence overall. Other words and context-specific factors also shape the predictions, so feature importance should be combined with other analyses and domain knowledge before drawing conclusions about the importance of specific words.

If masking the word "sun" results in the lowest cosine similarity compared to masking other words, while also showing the highest feature importance, the two metrics are in fact telling a consistent story.

The high feature importance of "sun" indicates that its absence has a significant impact on the model's predictions: the word plays a crucial role in the model's decisions.

The low cosine similarity when masking "sun" means that removing it shifts the sentence embedding furthest away from the original. In other words, "sun" contributes heavily to the semantic content of the sentence. (Note that low similarity signals a large change in the representation, not a small one.)

One subtlety is that in the weighting scheme above, a low similarity score down-weights that perturbation's prediction in the surrogate fit, so a semantically central word contributes a heavily discounted sample. This interplay highlights the complexity of interpreting different explainability metrics and underscores the importance of considering multiple factors when analyzing model behavior.


Tiya Vaj

Ph.D. Research Scholar in Informatics, passionate about data-driven approaches for social good. Let's connect: https://www.linkedin.com/in/tiya-v-076648128/