In today’s healthcare landscape, the ability to automatically categorize diseases based on their characteristics can provide significant value for medical research, clinical decision support systems, and healthcare IT. In this article, I’ll walk through a comprehensive analysis comparing different approaches to disease classification using Natural Language Processing and Machine Learning techniques.
Medical knowledge is complex and multifaceted. Diseases have various symptoms, risk factors, signs, and subtypes that collectively define their characteristics. My objective was to determine whether diseases could be effectively categorized into broad groups (Neurological/Endocrine, Respiratory/Infectious, Cardiovascular, etc.) using machine learning models trained on these features.
I approached this problem by comparing two distinct methods of representing disease characteristics:
- TF-IDF (Term Frequency-Inverse Document Frequency): This NLP technique weighs the importance of a term in a document relative to the whole corpus. For this dataset, each disease was treated as a document and its characteristics (symptoms, risk factors, and so on) as terms.
- One-Hot Encoding: A more explicit representation where each feature is represented as a binary value (present/absent).
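Both representations can be sketched with scikit-learn. The toy disease descriptions below are illustrative stand-ins, not the article's actual dataset:

```python
# Sketch: building TF-IDF and binary (one-hot-style) representations.
# The disease "documents" here are hypothetical examples.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Each document concatenates a disease's symptoms, risk factors, etc.
diseases = [
    "headache tremor memory_loss insulin_resistance",  # neurological/endocrine
    "cough fever shortness_of_breath contagion",       # respiratory/infectious
    "chest_pain hypertension arrhythmia smoking",      # cardiovascular
]

# TF-IDF: weights each term by how informative it is across the corpus.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(diseases)

# Binary presence/absence: CountVectorizer with binary=True.
onehot = CountVectorizer(binary=True)
X_onehot = onehot.fit_transform(diseases)

print(X_tfidf.shape, X_onehot.shape)
```

Both vectorizers share the same vocabulary here, so the matrices have identical shapes; the difference is in the values (continuous weights vs. 0/1 flags).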
The TF-IDF matrix turned out sparser (a higher proportion of zeros) than the One-Hot matrix, while its continuous weights captured more nuanced relationships between terms and diseases.
Working with high-dimensional data presents challenges for visualization and modeling. I applied two dimensionality reduction techniques:
- Principal Component Analysis (PCA): Works well for dense data matrices
- Truncated Singular Value Decomposition (SVD): Better suited for sparse matrices like the TF-IDF representations
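The practical distinction is that TruncatedSVD operates on sparse input directly, while PCA centers the data and therefore needs a dense array. A minimal sketch, using a random sparse matrix in place of the real TF-IDF features:

```python
# Sketch: PCA on a dense matrix vs. TruncatedSVD on a sparse one.
# Matrix dimensions are illustrative, not the real dataset's.
from scipy.sparse import random as sparse_random
from sklearn.decomposition import PCA, TruncatedSVD

X_sparse = sparse_random(100, 500, density=0.02, random_state=0)  # TF-IDF-like

# TruncatedSVD accepts sparse input directly (no mean-centering step).
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(X_sparse)

# PCA centers the data first, so it needs a dense array.
pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X_sparse.toarray())

print(X_svd.shape, X_pca.shape)
```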
One of the most striking findings was that TF-IDF features reduced via Truncated SVD produced visibly more separable clusters in the 2D projection than the One-Hot features did.
This visual separation suggested that TF-IDF might be better at capturing the underlying relationships between diseases in the same category.
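Visual separation can also be quantified. One common choice is the silhouette score, where values closer to 1 indicate well-separated clusters; the synthetic blobs below stand in for the SVD-reduced disease features:

```python
# Sketch: measuring cluster separability of a 2D projection with a
# silhouette score (synthetic data, not the article's actual features).
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated clusters standing in for disease categories.
X_2d, labels = make_blobs(n_samples=150, centers=3, n_features=2,
                          cluster_std=1.0, random_state=0)

score = silhouette_score(X_2d, labels)  # in [-1, 1]; higher = more separable
print(round(score, 3))
```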
I evaluated two classification models:
- K-Nearest Neighbors (KNN): A non-parametric method that classifies based on proximity to training examples
- Logistic Regression: A probabilistic model that estimates the probability of class membership
For KNN, I experimented with different distance metrics (Euclidean, Manhattan, Cosine) and values of k (3, 5, 7). For both models, I performed 5-fold cross-validation to ensure robust evaluation.
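This tuning loop maps naturally onto scikit-learn's GridSearchCV, which runs the cross-validation for every metric/k combination. A sketch on synthetic data standing in for the TF-IDF features:

```python
# Sketch: tuning KNN's distance metric and k with 5-fold cross-validation.
# Synthetic features stand in for the TF-IDF matrix; labels are category ids.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=30, n_informative=10,
                           n_classes=3, random_state=0)

param_grid = {
    "n_neighbors": [3, 5, 7],
    "metric": ["euclidean", "manhattan", "cosine"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_)
```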
The results showed that:
- TF-IDF consistently outperformed One-Hot encoding across most metrics
- For KNN, the Cosine distance metric generally performed better, suggesting the importance of directional similarity rather than absolute distance
- Logistic Regression with TF-IDF features achieved the highest overall F1-score
These results suggest a few broader lessons:
- Feature Representation Matters: TF-IDF's superior performance indicates that weighting terms by their importance across the corpus provides more discriminative power than binary presence/absence encoding.
- Dimensionality Reduction Choice Is Critical: Truncated SVD preserved more explained variance on our sparse TF-IDF matrices than PCA did, leading to better separation of disease categories.
- Moderate Classification Performance: The models showed promise (F1-scores above 0.7), but there is clear room for improvement, reflecting the inherent complexity of medical categorization.
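The headline evaluation boils down to a cross-validated F1 score for Logistic Regression. A minimal sketch of that measurement, on synthetic data rather than the article's TF-IDF features:

```python
# Sketch: 5-fold cross-validated macro-F1 for Logistic Regression.
# Synthetic features stand in for the real TF-IDF representation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=50, n_informative=20,
                           n_classes=4, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="f1_macro")
print(round(scores.mean(), 3))
```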
Several practical challenges shaped the analysis:
- Sample Size: Some disease categories had limited samples, requiring careful filtering to keep cross-validation valid.
- Feature Quality: The quality and completeness of disease characteristics varied across the dataset.
- Categorization Subjectivity: The boundaries between disease categories can be inherently fuzzy.
My analysis demonstrates that NLP techniques like TF-IDF, combined with appropriate dimensionality reduction and classification models, can provide meaningful categorization of diseases based on their characteristics. This approach has potential applications in medical knowledge organization, clinical decision support, and healthcare data systems.
For future work, incorporating more sophisticated text embedding techniques, such as medical-domain-specific word embeddings or transformer models, could further improve classification performance.