Attention-based topic modeling in the Thai language

Jul 31, 2023

The attention-based topic model is a neural network designed to identify and extract topics from a given text. It combines two main components: word embeddings and an attention mechanism.

1. Word Embeddings: Word embeddings represent words as numerical vectors that capture their semantic meaning. In this model, each word in the input text is mapped to a dense vector called a word embedding (300 dimensions in the code below). Once trained, the embeddings place words with similar meanings close together in the vector space.

2. Attention Mechanism: The attention mechanism lets the model focus on specific parts of the input sequence while processing it. It assigns more weight to certain words based on their relevance to the current context. In the attention-based topic model, it is used to highlight the words that are most relevant to a particular topic.
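
To make the two components concrete, here is a minimal pure-Python sketch of attention weighting over word vectors. The vectors and relevance scores are made-up toy values, and the function names `softmax` and `attention_pool` are illustrative only; they are not part of the model code in this post.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(vectors, scores):
    """Weighted average of word vectors, using softmax(scores) as the weights."""
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
    return weights, pooled

# Toy 2-dimensional embeddings for three words (hypothetical values).
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical relevance scores: the third word is judged most relevant.
relevance = [0.0, 0.0, 2.0]

weights, pooled = attention_pool(word_vectors, relevance)
print(weights)  # the third weight is the largest
print(pooled)   # the pooled vector is pulled toward the third word
```

The word with the highest score dominates the pooled vector, which is the sense in which attention "highlights" relevant words.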

Now, let’s break down how the model works:

1. Preprocessing: Thai is written without spaces between words, so the input paragraph is first split into words with a tokenizer (PyThaiNLP's word_tokenize). Thai stopwords (common words that add little meaning to the text) are then removed.

2. Word Embeddings: The filtered paragraph is then tokenized again to create a vocabulary of unique words. Each word is assigned a unique index, and these indices are used to create word embeddings. The word embeddings are initialized randomly and adjusted during training to better represent the semantic meaning of words.

3. LSTM: The word embeddings are fed into a Long Short-Term Memory (LSTM) layer. The LSTM is a type of recurrent neural network that can capture the sequential nature of the input text and retain information from previous words.

4. Attention Layer: The output of the LSTM is passed through an attention layer, which computes scores from the LSTM hidden states to indicate what matters most for identifying the topic. In the code below, this layer is a single linear projection applied to the LSTM's final hidden state, so it produces one score per topic rather than one score per word.

5. Softmax: The softmax function converts the attention scores into a probability distribution, so that every value lies between 0 and 1 and all values sum to 1.

6. Topic Identification: The final output of the model is a set of attention-based topic probabilities. Each probability corresponds to a different topic, and together, they indicate the likelihood of the input text belonging to each topic.

7. Visualization: To better understand the model’s word embeddings, t-SNE is applied to reduce the high-dimensional word embeddings to 2D, allowing us to visualize the words in a 2D space. This helps identify clusters of words that are related in meaning.
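
Step 4 describes scoring every word in the sequence. A minimal PyTorch sketch of that idea follows. The class name `TokenAttentionTopicModel`, the separate `score` and `to_topics` layers, and the toy sizes are assumptions made for illustration; this is one possible reading of the description, not the code used in this post.

```python
import torch
import torch.nn as nn

class TokenAttentionTopicModel(nn.Module):
    """Hypothetical variant: scores every LSTM hidden state, not only the last one."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_topics):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)            # one attention score per token
        self.to_topics = nn.Linear(hidden_dim, num_topics)

    def forward(self, x):
        hidden, _ = self.lstm(self.embeddings(x))         # (batch, seq_len, hidden_dim)
        scores = self.score(hidden).squeeze(-1)           # (batch, seq_len)
        word_weights = torch.softmax(scores, dim=1)       # sums to 1 over the sequence
        context = (word_weights.unsqueeze(-1) * hidden).sum(dim=1)   # (batch, hidden_dim)
        topic_probs = torch.softmax(self.to_topics(context), dim=1)  # (batch, num_topics)
        return topic_probs, word_weights

torch.manual_seed(0)
model = TokenAttentionTopicModel(vocab_size=50, embed_dim=16, hidden_dim=8, num_topics=5)
token_ids = torch.randint(0, 50, (1, 12))  # one made-up sequence of 12 token indices
topic_probs, word_weights = model(token_ids)
print(topic_probs.shape, word_weights.shape)
```

Returning `word_weights` alongside `topic_probs` makes it possible to inspect which tokens the model attended to, although with random, untrained weights the values carry no meaning.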

Trained on a larger dataset, the model can learn to associate specific words with different topics and identify the dominant topics in a given text. The visualization helps us explore how words are distributed and clustered according to their meaning and their relevance to various topics.
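
The t-SNE step can be sketched as follows. A random matrix stands in for trained embeddings, the labels `w0`, `w1`, ... are placeholders, and the helper `plot_embeddings` is a hypothetical name; with real data the matrix would come from the model's embedding layer and the labels from the vocabulary.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for a trained embedding matrix: 30 words x 300 dimensions (random values).
embedding_matrix = rng.normal(size=(30, 300)).astype(np.float32)

# perplexity must be smaller than the number of words being projected.
tsne = TSNE(n_components=2, perplexity=5, init="random", random_state=0)
points_2d = tsne.fit_transform(embedding_matrix)

def plot_embeddings(points, labels):
    """Scatter the 2D points and annotate each one with its word label."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(points[:, 0], points[:, 1], s=12)
    for (x, y), label in zip(points, labels):
        ax.annotate(label, (x, y), fontsize=8)
    ax.set_title("t-SNE projection of word embeddings")
    return fig

labels = [f"w{i}" for i in range(len(embedding_matrix))]
fig = plot_embeddings(points_2d, labels)
print(points_2d.shape)
```

Calling `fig.savefig("tsne_words.png")` would write the plot to disk. Rendering Thai labels usually requires configuring matplotlib with a font that contains Thai glyphs.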

Code

import torch
import torch.nn as nn
from pythainlp.tokenize import word_tokenize
from pythainlp.corpus.common import thai_stopwords
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import matplotlib as mpl

# Sample Thai healthcare paragraph
paragraph = """ มะเร็งเป็นกลุ่มของโรคที่ซับซ้อนและน่ากลัวที่สามารถกระทำต่อส่วนใดๆ ของร่างกาย โรคนี้เกิดขึ้นเมื่อเซลล์ปกติมีการเกิดการกลายพันธุ์ทางพันธุกรรม นำไปสู่การเจริญเติบโตและแบ่งเบาของเซลล์ที่ไม่ควบคุม การเจริญเติบโตที่ไม่ควบคุมนี้ทำให้เกิดก้อนหรือเนื้องอก ซึ่งอาจแพร่กระจายไปสู่เนื้อเยื่อรอบข้างและอาจกระทำการกระจายไปยังส่วนอื่นของร่างกายผ่านกระบวนการขยายอาการที่เรียกว่า การมีลัทธิ์"""

# Get the list of Thai stopwords
stopwords = thai_stopwords()

# Tokenize the paragraph into words
words = word_tokenize(paragraph)

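# Remove stopwords and whitespace-only tokens from the paragraph
filtered_words = [word for word in words if word not in stopwords and word.strip()]

# Tokenize the filtered paragraph again; joining with spaces introduces
# whitespace tokens, so drop them
tokens = [tok for tok in word_tokenize(" ".join(filtered_words)) if tok.strip()]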

# Create a vocabulary and mapping for the tokens
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}  # sorted() keeps the index assignment reproducible
num_vocab = len(vocab)

# Convert words to indices in the paragraph
paragraph_indices = torch.tensor([vocab[word] for word in tokens], dtype=torch.long)

# Define the attention-based topic model
class AttentionTopicModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_topics):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.attention = nn.Linear(hidden_dim, num_topics)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        embedded = self.embeddings(x)
        lstm_out, _ = self.lstm(embedded)
        attention_scores = self.attention(lstm_out[:, -1, :])
        attention_probs = self.softmax(attention_scores)
        return attention_probs

# Hyperparameters
embedding_dim = 300
hidden_dim = 64
num_topics = 5

# Create the model
model = AttentionTopicModel(num_vocab, embedding_dim, hidden_dim, num_topics)

# Sample input to the model
input_paragraph = paragraph_indices.unsqueeze(0)  # Add batch dimension

# Get the attention-based topic probabilities for the input paragraph
topic_probs = model(input_paragraph)

print("Input Paragraph:", paragraph)
print("Topic Probabilities:", topic_probs)

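# Print the top words for each topic (heuristic: score every token's LSTM state
# with the topic layer and keep the highest-scoring tokens per topic)
with torch.no_grad():
    hidden_states, _ = model.lstm(model.embeddings(input_paragraph))  # (1, seq_len, hidden_dim)
    word_scores = model.attention(hidden_states)[0]  # (seq_len, num_topics)
top_k = min(5, len(tokens))
top_words_per_topic = torch.topk(word_scores, top_k, dim=0).indices.T  # (num_topics, top_k)
for i, topic_words in enumerate(top_words_per_topic):
    print(f"Topic {i+1}:")
    for word_idx in topic_words:
        print(tokens[word_idx.item()], end=" ")
    print("\n")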

Result: The input paragraph was about cancer, and the topic modeling output also pointed to cancer.

Note: The code needs a better design to handle large amounts of data; as written, it runs on a single paragraph and includes no training step.


Written by Tiya Vaj