In today’s healthcare landscape, the ability to automatically categorize diseases based on their characteristics can provide significant value for medical research, clinical decision support systems, and healthcare IT. In this article, I’ll walk through a comprehensive analysis comparing different approaches to disease classification using Natural Language Processing and Machine Learning techniques.
Medical knowledge is complex and multifaceted. Diseases have various symptoms, risk factors, signs, and subtypes that collectively define their characteristics. My objective was to determine whether diseases could be effectively categorized into broad groups (Neurological/Endocrine, Respiratory/Infectious, Cardiovascular, etc.) using machine learning models trained on these features.
I approached this problem by comparing two distinct methods of representing disease characteristics:
- TF-IDF (Term Frequency-Inverse Document Frequency): This NLP technique weighs the importance of a term in a document relative to the whole corpus. For this dataset, each disease was treated as a document and its characteristics (symptoms, risk factors, and so on) as terms.
- One-Hot Encoding: A more explicit representation where each feature is represented as a binary value (present/absent).
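Both representations can be sketched with scikit-learn. The toy disease descriptions below are illustrative stand-ins, not the article's actual dataset:

```python
# Sketch: building TF-IDF and binary (one-hot-style) representations.
# The disease "documents" here are hypothetical examples.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Each document concatenates a disease's symptoms, risk factors, etc.
diseases = [
    "headache tremor memory_loss insulin_resistance",  # neurological/endocrine
    "cough fever shortness_of_breath contagion",       # respiratory/infectious
    "chest_pain hypertension arrhythmia smoking",      # cardiovascular
]

# TF-IDF: weights each term by how informative it is across the corpus.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(diseases)

# Binary presence/absence: CountVectorizer with binary=True.
onehot = CountVectorizer(binary=True)
X_onehot = onehot.fit_transform(diseases)

print(X_tfidf.shape, X_onehot.shape)
```

Both vectorizers share the same vocabulary here, so the matrices have identical shapes; the difference is in the values (continuous weights vs. 0/1 flags).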
The TF-IDF matrix turned out sparser (a higher proportion of zeros) than the One-Hot matrix, while its continuous weights captured more nuanced relationships between terms and diseases.
Working with high-dimensional data presents challenges for visualization and modeling. I applied two dimensionality reduction techniques:
- Principal Component Analysis (PCA): Works well for dense data matrices
- Truncated Singular Value Decomposition (SVD): Better suited for sparse matrices like the TF-IDF representations
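The practical distinction is that TruncatedSVD operates on sparse input directly, while PCA centers the data and therefore needs a dense array. A minimal sketch, using a random sparse matrix in place of the real TF-IDF features:

```python
# Sketch: PCA on a dense matrix vs. TruncatedSVD on a sparse one.
# Matrix dimensions are illustrative, not the real dataset's.
from scipy.sparse import random as sparse_random
from sklearn.decomposition import PCA, TruncatedSVD

X_sparse = sparse_random(100, 500, density=0.02, random_state=0)  # TF-IDF-like

# TruncatedSVD accepts sparse input directly (no mean-centering step).
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(X_sparse)

# PCA centers the data first, so it needs a dense array.
pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X_sparse.toarray())

print(X_svd.shape, X_pca.shape)
```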
One of the most striking findings was that TF-IDF features reduced via Truncated SVD produced visibly more separable clusters in the 2D projection than the One-Hot features did.
This visual separation suggested that TF-IDF might be better at capturing the underlying relationships between diseases in the same category.
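Visual separation can also be quantified. One common choice is the silhouette score, where values closer to 1 indicate well-separated clusters; the synthetic blobs below stand in for the SVD-reduced disease features:

```python
# Sketch: measuring cluster separability of a 2D projection with a
# silhouette score (synthetic data, not the article's actual features).
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated clusters standing in for disease categories.
X_2d, labels = make_blobs(n_samples=150, centers=3, n_features=2,
                          cluster_std=1.0, random_state=0)

score = silhouette_score(X_2d, labels)  # in [-1, 1]; higher = more separable
print(round(score, 3))
```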
I evaluated two classification models:
- K-Nearest Neighbors (KNN): A non-parametric method that classifies based on proximity to training examples
- Logistic Regression: A probabilistic model that estimates the probability of class membership
For KNN, I experimented with different distance metrics (Euclidean, Manhattan, Cosine) and values of k (3, 5, 7). For both models, I performed 5-fold cross-validation to ensure robust evaluation.
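This tuning loop maps naturally onto scikit-learn's GridSearchCV, which runs the cross-validation for every metric/k combination. A sketch on synthetic data standing in for the TF-IDF features:

```python
# Sketch: tuning KNN's distance metric and k with 5-fold cross-validation.
# Synthetic features stand in for the TF-IDF matrix; labels are category ids.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=30, n_informative=10,
                           n_classes=3, random_state=0)

param_grid = {
    "n_neighbors": [3, 5, 7],
    "metric": ["euclidean", "manhattan", "cosine"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_)
```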
The results showed that:
- TF-IDF consistently outperformed One-Hot encoding across most metrics
- For KNN, the Cosine distance metric generally performed better, suggesting the importance of directional similarity rather than absolute distance
- Logistic Regression with TF-IDF features achieved the highest overall F1-score
These results suggest a few broader lessons:
- Feature Representation Matters: TF-IDF's superior performance indicates that weighting terms by their importance across the corpus provides more discriminative power than binary presence/absence encoding.
- Dimensionality Reduction Choice Is Critical: Truncated SVD preserved more explained variance on our sparse TF-IDF matrices than PCA did, leading to better separation of disease categories.
- Moderate Classification Performance: The models showed promise (F1-scores above 0.7), but there is clear room for improvement, reflecting the inherent complexity of medical categorization.
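The headline evaluation boils down to a cross-validated F1 score for Logistic Regression. A minimal sketch of that measurement, on synthetic data rather than the article's TF-IDF features:

```python
# Sketch: 5-fold cross-validated macro-F1 for Logistic Regression.
# Synthetic features stand in for the real TF-IDF representation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=50, n_informative=20,
                           n_classes=4, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="f1_macro")
print(round(scores.mean(), 3))
```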
Several practical challenges shaped the analysis:
- Sample Size: Some disease categories had limited samples, requiring careful filtering to keep cross-validation valid.
- Feature Quality: The quality and completeness of disease characteristics varied across the dataset.
- Categorization Subjectivity: The boundaries between disease categories can be inherently fuzzy.
My analysis demonstrates that NLP techniques like TF-IDF, combined with appropriate dimensionality reduction and classification models, can provide meaningful categorization of diseases based on their characteristics. This approach has potential applications in medical knowledge organization, clinical decision support, and healthcare data systems.
For future work, incorporating more sophisticated text embedding techniques, such as medical-domain-specific word embeddings or transformer models, could further improve classification performance.