Challenges of Heterogeneity in Big Data: A Comparative Study of Classification in Large-Scale Structured and Unstructured Domains

2512.00298v1 cs.LG, cs.CL, cs.DC 2025-12-02
Authors:

González Trigueros Jesús Eduardo, Alonso Sánchez Alejandro, Muñoz Rivera Emilio, Peñarán Prieto Mariana Jaqueline, Mendoza González Camila Natalia

Abstract

This study analyzes the impact of heterogeneity ("Variety") in Big Data by comparing classification strategies across structured (Epsilon) and unstructured (Rest-Mex, IMDB) domains. A dual methodology was implemented: evolutionary and Bayesian hyperparameter optimization (Genetic Algorithms, Optuna) in Python for numerical data, and distributed processing in Apache Spark for massive textual corpora. The results reveal a "complexity paradox": in high-dimensional spaces, optimized linear models (SVM, Logistic Regression) outperformed deep architectures and Gradient Boosting. Conversely, in text-based domains, the constraints of distributed fine-tuning led to overfitting in complex models, whereas robust feature engineering, specifically Transformer-based embeddings (RoBERTa) and Bayesian Target Encoding, enabled simpler models to generalize effectively. This work provides a unified framework for algorithm selection based on data nature and infrastructure constraints.
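
The abstract names Bayesian Target Encoding without giving its formulation. As a rough illustration only, the sketch below uses one common variant: each category is replaced by its mean target value, shrunk toward the global mean in proportion to how few samples the category has. The function names, the `prior_weight` parameter, and the smoothing formula itself are assumptions made for this example, not details taken from the paper.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, prior_weight=10.0):
    """Fit a smoothed (shrinkage) target encoding.

    Hypothetical helper: the formula is a common textbook variant,
    not necessarily the one used by the authors.
    Returns (encoding, global_mean).
    """
    if len(categories) != len(targets):
        raise ValueError("categories and targets must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")

    # The global mean acts as the prior for every category.
    global_mean = sum(targets) / len(targets)

    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    # Posterior-style estimate: prior_weight pseudo-observations at the
    # global mean are blended with the observed targets of each category.
    encoding = {
        category: (sums[category] + prior_weight * global_mean)
        / (counts[category] + prior_weight)
        for category in counts
    }
    return encoding, global_mean


def transform(categories, encoding, global_mean):
    """Map categories to encoded values; unseen categories get the prior."""
    return [encoding.get(category, global_mean) for category in categories]
```

Rare categories stay close to the global mean, which limits how much a downstream linear model can overfit to them. In practice the encoding should be fitted on training folds only, so that target values do not leak into validation data.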
