Who: Tanzir Islam Pial
Abstract: Sequences are fundamental structures underlying diverse domains, from natural language to the human life course. This thesis presents novel methodologies for analyzing sequence data in both textual and social registry contexts, with the aim of improving how sequential patterns are represented and interpreted across varied datasets.
In the first part, we address the challenge of semantic text alignment: identifying similar text segments between documents. We adapt sequence alignment algorithms from bioinformatics, traditionally used to compare DNA sequences, to the domain of Natural Language Processing (NLP). We develop a General Narrative Alignment Tool (GNAT) that couples the Smith-Waterman algorithm with modern text similarity metrics. We show that alignment scores from GNAT follow a Gumbel distribution, which makes it possible to assign a rigorous p-value to the significance of any alignment. GNAT is evaluated on four problem domains (summary-to-book alignment, translated book alignment, short story alignment, and plagiarism detection), demonstrating the effectiveness of the method.
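To make the idea concrete, the following is a minimal sketch of Smith-Waterman local alignment over text segments, followed by a Gumbel tail probability for the resulting score. It is an illustration under stated assumptions, not GNAT itself: the `jaccard` similarity is a simple stand-in for the modern text similarity metrics mentioned above, and the `baseline`, `gap`, `mu`, and `beta` values are hypothetical placeholders (in practice the Gumbel location and scale would be estimated from alignment scores of unrelated texts).

```python
import math


def jaccard(a, b):
    """Stand-in similarity (assumption): word-set overlap in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def smith_waterman(xs, ys, sim=jaccard, baseline=0.3, gap=0.5):
    """Local alignment of two lists of text segments.

    Each match contributes sim(x, y) - baseline, so dissimilar pairs
    score negatively; `baseline` and `gap` are hypothetical settings.
    Returns (best_score, list of aligned index pairs).
    """
    n, m = len(xs), len(ys)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = H[i - 1][j - 1] + sim(xs[i - 1], ys[j - 1]) - baseline
            H[i][j] = max(0.0, match, H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        match = H[i - 1][j - 1] + sim(xs[i - 1], ys[j - 1]) - baseline
        if math.isclose(H[i][j], match):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(H[i][j], H[i - 1][j] - gap):
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


def gumbel_p_value(score, mu, beta):
    """P(S >= score) for a Gumbel(mu, beta) null distribution of scores."""
    return 1.0 - math.exp(-math.exp(-(score - mu) / beta))


if __name__ == "__main__":
    book = ["the cat sat", "dogs bark loudly", "rain fell all night"]
    summary = ["unrelated opening", "the cat sat", "dogs bark loudly"]
    score, pairs = smith_waterman(book, summary)
    # mu and beta below are made-up values for illustration only.
    print(score, pairs, gumbel_p_value(score, mu=0.5, beta=0.3))
```

Subtracting a baseline from the similarity is one simple way to let unrelated segments lower the running score, which local alignment needs in order to terminate an alignment; other scoring schemes are equally plausible.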
In a subsequent application, we use GNAT with SBERT embeddings to study adaptation choices in 40 book-to-film scripts. The analysis yields insights into the screenwriting process, including narrative fidelity, dialogue significance, scene sequencing, and gender representation.
The final section expands the scope from text sequences to human life sequences using Dutch social registry data. By conceptualizing life events (such as education, employment, and family milestones) as sequences of temporal markers, we build BERT-inspired foundation models over a synthetic vocabulary in which each token represents a life event attribute. We pre-train these models with language modeling objectives akin to Masked Language Modeling, aiming to create life-course embeddings for diverse sociological prediction tasks. Preliminary results demonstrate the embeddings' utility in predicting outcomes such as income levels and changes in marital status. Future work will focus on enriching these embeddings through multiple pre-training objectives and on fine-tuning for complex tasks such as predicting political and personal beliefs from survey data.
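As a rough illustration of what a masked-token objective over life events could look like, the sketch below encodes a life course as token IDs and applies the standard BERT masking recipe. Everything specific here is an assumption for illustration: the event names in `EVENTS`, the 15% selection rate, and the 80/10/10 replacement split come from the original BERT setup, not from the thesis, which only says the objectives are akin to Masked Language Modeling.

```python
import random

# Hypothetical synthetic vocabulary: one token per life event attribute.
SPECIAL = ["[PAD]", "[MASK]"]
EVENTS = [
    "EDU_SECONDARY", "EDU_BACHELOR", "JOB_START", "JOB_END",
    "MARRIAGE", "CHILD_BIRTH", "MOVE_CITY",
]
VOCAB = {tok: idx for idx, tok in enumerate(SPECIAL + EVENTS)}
MASK_ID = VOCAB["[MASK]"]
EVENT_IDS = [VOCAB[e] for e in EVENTS]


def encode(events):
    """Map a chronologically ordered list of event names to token IDs."""
    return [VOCAB[e] for e in events]


def mask_tokens(ids, rng, mask_prob=0.15):
    """Return (inputs, labels) for a masked-token prediction objective.

    Each position is selected with probability `mask_prob`. A selected
    position keeps its original ID as the label and its input becomes
    [MASK] (80%), a random event token (10%), or stays unchanged (10%).
    Unselected positions get a label of None and are not predicted.
    """
    inputs = list(ids)
    labels = [None] * len(ids)
    for pos, tok in enumerate(ids):
        if rng.random() >= mask_prob:
            continue
        labels[pos] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = MASK_ID
        elif roll < 0.9:
            inputs[pos] = rng.choice(EVENT_IDS)
        # Otherwise leave the input token unchanged.
    return inputs, labels


if __name__ == "__main__":
    life = encode(["EDU_SECONDARY", "EDU_BACHELOR", "JOB_START",
                   "MARRIAGE", "MOVE_CITY", "CHILD_BIRTH"])
    inputs, labels = mask_tokens(life, random.Random(0), mask_prob=0.5)
    print(inputs)
    print(labels)
```

A model trained this way would be asked to recover the labelled tokens from the corrupted inputs, and its hidden states could then serve as the life-course embeddings described above.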