The Emily18 Com Full Sets – 2021 archive is a rich, multimodal corpus whose internal structure can be described and analysed systematically. Our exploratory pipeline reveals three well‑defined thematic clusters that correspond to distinct temporal phases of production. By openly sharing our processing scripts (GitHub: github.com/Emily18/2021‑full‑sets‑analysis) and the derived feature matrices, we invite the broader research community to build on this work.
To capture cross‑modal relationships, we concatenated the three modality‑specific embeddings (image + audio + text) for each item and applied Principal Component Analysis (PCA), retaining 95 % of the variance. This yielded a 128‑dimensional fused representation.
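The fusion step can be illustrated with a minimal sketch. It assumes scikit-learn's `PCA` and that the three embeddings are NumPy arrays with one row per item. The function name `fuse_embeddings`, the array shapes, and the synthetic data are hypothetical and are not taken from the released scripts.

```python
import numpy as np
from sklearn.decomposition import PCA


def fuse_embeddings(image_emb, audio_emb, text_emb, variance=0.95):
    """Concatenate per-item embeddings and reduce them with PCA.

    A float n_components in (0, 1) tells scikit-learn to keep the smallest
    number of components whose cumulative explained variance reaches it.
    """
    fused = np.hstack([image_emb, audio_emb, text_emb])
    pca = PCA(n_components=variance, svd_solver="full")
    reduced = pca.fit_transform(fused)
    return reduced, pca


# Synthetic stand-ins for the real embeddings (hypothetical dimensions).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(200, 16))
audio_emb = rng.normal(size=(200, 8))
text_emb = rng.normal(size=(200, 8))

reduced, pca = fuse_embeddings(image_emb, audio_emb, text_emb)
print(reduced.shape, pca.explained_variance_ratio_.sum())
```

The number of retained components depends on the data, so the 128 dimensions reported above are a property of the actual corpus rather than of this sketch.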
We computed basic statistics (counts, sizes, temporal distribution) using pandas.
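As a rough illustration of this step, the sketch below computes the three kinds of statistics from a per-item manifest. The manifest layout and its column names (`modality`, `size_bytes`, `created`) are assumptions made for the example, as is the toy data.

```python
import pandas as pd


def summarise_manifest(manifest):
    """Return item counts, size statistics, and monthly item counts."""
    # Number of items per modality.
    counts = manifest.groupby("modality").size()
    # Total, mean, and median size per modality.
    sizes = manifest.groupby("modality")["size_bytes"].agg(["sum", "mean", "median"])
    # Temporal distribution: number of items per calendar month.
    monthly = manifest.groupby(manifest["created"].dt.to_period("M")).size()
    return counts, sizes, monthly


# Toy manifest with hypothetical values.
manifest = pd.DataFrame(
    {
        "modality": ["image", "image", "audio", "text"],
        "size_bytes": [100, 300, 500, 50],
        "created": pd.to_datetime(
            ["2021-01-05", "2021-02-10", "2021-02-11", "2021-03-01"]
        ),
    }
)

counts, sizes, monthly = summarise_manifest(manifest)
print(counts, sizes, monthly, sep="\n\n")
```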
We applied Hierarchical Density‑Based Spatial Clustering of Applications with Noise (HDBSCAN) to the fused vectors (min_cluster_size=30). The algorithm returned four groups: three substantive clusters and one containing < 1 % of items, which was labelled “Noise”.