The Emily18 Com Full Sets – 2021 archive constitutes a rich, multimodal corpus whose internal structure can be systematically described and analysed. Our exploratory pipeline reveals three well‑defined thematic clusters that correspond to distinct temporal phases of production. By openly sharing our processing scripts (GitHub: github.com/Emily18/2021‑full‑sets‑analysis) and the derived feature matrices, we invite other researchers to build on this work.


To capture cross‑modal relationships, we concatenated the three modality‑specific embeddings (image, audio, and text) feature‑wise and applied Principal Component Analysis (PCA), retaining the components that explain 95 % of the variance. This yielded a 128‑dimensional fused representation.
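The fusion step can be sketched as follows. This is a minimal illustration under stated assumptions, not the released pipeline: the function name `fuse_embeddings`, the embedding dimensions, and the random demo data are hypothetical, and PCA is computed directly from a singular value decomposition in NumPy. The number of retained components depends on the data, so the sketch will not reproduce the 128 dimensions reported above.

```python
import numpy as np


def fuse_embeddings(image_emb, audio_emb, text_emb, variance_to_keep=0.95):
    """Concatenate per-item embeddings and reduce them with PCA.

    Returns the projected data and the fraction of variance retained.
    """
    # Feature-wise concatenation: shape (n_items, d_image + d_audio + d_text).
    stacked = np.hstack([image_emb, audio_emb, text_emb])

    # PCA operates on mean-centred data.
    centered = stacked - stacked.mean(axis=0)

    # Rows of `components` are the principal axes, ordered by explained variance.
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()

    # Smallest number of components whose cumulative variance reaches the target.
    k = int(np.searchsorted(cumulative, variance_to_keep)) + 1
    k = min(k, len(cumulative))

    fused = centered @ components[:k].T
    return fused, float(cumulative[k - 1])


# Hypothetical demo data: 200 items with made-up embedding sizes.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(200, 32))
audio_emb = rng.normal(size=(200, 16))
text_emb = rng.normal(size=(200, 24))

fused, retained = fuse_embeddings(image_emb, audio_emb, text_emb)
print(fused.shape, round(retained, 3))
```

The sketch applies no per‑modality scaling before concatenation; whether the original pipeline normalised the embeddings first is not stated here.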

We computed basic statistics (counts, sizes, temporal distribution) using pandas.
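A small example of such descriptive statistics is given below. The manifest layout (one row per item, with the hypothetical columns `item_id`, `modality`, `size_bytes`, and `created_at`) and all values in it are assumptions made for illustration only; they are not data from the archive.

```python
import pandas as pd

# Hypothetical manifest: one row per archive item (illustrative values only).
manifest = pd.DataFrame(
    {
        "item_id": ["a1", "a2", "a3", "a4", "a5"],
        "modality": ["image", "image", "audio", "text", "image"],
        "size_bytes": [2_048_000, 1_536_000, 5_120_000, 12_000, 3_072_000],
        "created_at": pd.to_datetime(
            ["2021-01-15", "2021-01-20", "2021-03-02", "2021-03-18", "2021-07-09"]
        ),
    }
)

# Counts: number of items per modality.
counts = manifest["modality"].value_counts()

# Sizes: total and median file size per modality.
size_stats = manifest.groupby("modality")["size_bytes"].agg(["sum", "median"])

# Temporal distribution: number of items per calendar month.
per_month = manifest.groupby(manifest["created_at"].dt.to_period("M")).size()

print(counts, size_stats, per_month, sep="\n\n")
```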

We applied Hierarchical Density‑Based Spatial Clustering of Applications with Noise (HDBSCAN) to the fused vectors (min_cluster_size=30). The algorithm returned four groups: three clusters and a fourth group comprising < 1 % of items, which was labelled “Noise”.
