Author: Ezau Faridh Torres Torres
Advisors: Dr. Adrian Pastor López Monroy and Dr. Fernando Sanchez Vega
Course: Natural Language Processing
Institution: CIMAT – Centro de Investigación en Matemáticas
Term: Spring 2025
This repository contains all course assignments and the final project from the graduate-level class Natural Language Processing at CIMAT (Spring 2025). The course covered core techniques in modern NLP, from classical text preprocessing to deep learning-based models for classification and sequence labeling. The final project involved building a multimodal hateful meme classifier that combines CLIP with textual inversion.
- Repository Structure
- Technical Stack
- Datasets Used
- Overview of Assignments
- Assignment 1 – Corpus Construction and Preprocessing
- Assignment 2 – Basic Text Mining and SVM Classification
- Assignment 3 – Feature Selection and Text Visualization
- Assignment 4 – Language Modeling from Political Speeches
- Assignment 5 – Neural Language Modeling
- Assignment 6 – Hierarchical Attention Network
- Final Project – Multimodal Meme Classification with CLIP and Textual Inversion
- Tests
- Learning Outcomes
- References
- Contact
Each assignment includes:
- A single .ipynb notebook with code and commentary
- Manually added visualizations or outputs within the notebook
- Optional supporting files (e.g., pretrained embeddings, tokenizers)
Developed and tested in Python 3.11, using the following tools across assignments and tests:
- NLP Libraries: nltk, spaCy, gensim, torchtext, pysentimiento
- Machine Learning & Deep Learning: scikit-learn, PyTorch, transformers, lightning
- Text Processing: re, collections, TweetTokenizer, emoji, ftfy
- Pretrained Models: GloVe, Word2Vec, bert-base-multilingual-cased, PlanTL-GOB-ES/roberta-base-bne, pysentimiento/robertuito-base-uncased, CLIP
- Visualization: matplotlib, seaborn, wordcloud, t-SNE, attention heatmaps, confusion_matrix, word frequency histograms
- Auxiliary: argparse, glob, json, numpy, os, pandas, random, scipy, tqdm, xml.etree.ElementTree
- Environment: Jupyter Notebook (for interactive development)
Note: Most notebooks are self-contained and reproducible, with controlled randomness when applicable.
- Presidential Press Conferences (Scraped): Official transcripts from amlo.presidente.gob.mx and gob.mx used in Assignments 1 and 4
- MEX-A3T 2020 Subtask 1: Spanish tweets for text classification tasks in Assignments 2–3 and 5
- PAN Author Profiling (CLEF 2017): Multilingual tweet-based dataset used in Assignment 6 for nationality classification
- Hateful Memes Challenge & HarMeme: Used in the final project
The following section presents a concise overview of each task, highlighting its primary objective:
Automates the creation of a text corpus from presidential press conferences through web scraping and HTML parsing with wget and BeautifulSoup. The resulting plain-text files serve as a foundation for later NLP tasks and include basic error handling during extraction.
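The extraction step described above can be sketched as follows. This is a minimal illustration, not the notebook's code: it swaps BeautifulSoup for the standard-library html.parser so it runs without dependencies, it assumes the transcript text lives in `<p>` tags, and the names `ParagraphExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags (assumed page layout)."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._buf = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def html_to_text(html):
    """Return the page's paragraphs as plain text, one per line.

    Basic error handling: any parsing failure yields an empty string
    so a batch job can skip the page and continue.
    """
    try:
        parser = ParagraphExtractor()
        parser.feed(html)
        parser.close()
        return "\n".join(parser.paragraphs)
    except Exception:
        return ""


page = "<html><body><h1>T</h1><p>Hola  <b>mundo</b></p><p> </p><p>Adiós</p></body></html>"
print(html_to_text(page))
```

Each page would then be written to its own plain-text file; downloading the pages (with wget in the original assignment) is left out of the sketch.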
Explores the construction of Bag-of-Words and bigram-based representations for text classification, using custom tokenization, frequency-based filtering, and Support Vector Machines (SVM) to evaluate performance through precision, recall, and confusion matrices.
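A dependency-free sketch of the representation step is shown below. It assumes whitespace tokenization and a hypothetical `min_freq` threshold; the notebook's actual tokenizer and cutoffs may differ. The resulting count vectors are the kind of input that could be passed to an SVM such as scikit-learn's `LinearSVC`.

```python
from collections import Counter


def tokenize(text):
    # Simplified stand-in for the custom tokenizer: lowercase + whitespace split.
    return text.lower().split()


def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocab(docs, min_freq=2):
    """Map every unigram and bigram seen at least min_freq times to a column index."""
    counts = Counter()
    for doc in docs:
        toks = tokenize(doc)
        counts.update(toks)
        counts.update(ngrams(toks, 2))
    kept = sorted(term for term, c in counts.items() if c >= min_freq)
    return {term: i for i, term in enumerate(kept)}


def vectorize(doc, vocab):
    """Return the count vector of a document over the filtered vocabulary."""
    toks = tokenize(doc)
    counts = Counter(toks + ngrams(toks, 2))
    vec = [0] * len(vocab)
    for term, c in counts.items():
        if term in vocab:
            vec[vocab[term]] = c
    return vec


docs = ["el gato come", "el gato duerme", "un perro come"]
vocab = build_vocab(docs, min_freq=2)
print(vocab)
print([vectorize(d, vocab) for d in docs])
```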
Implements frequency-based feature selection and dimensionality reduction to improve text classification, alongside visualizations such as word clouds and t-SNE for lexical exploration and representation analysis.
Builds sentence-level corpora from political transcripts and explores n-gram language models, evaluating their ability to generate coherent sequences and estimate sentence probabilities.
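As an illustration of sentence probability under an n-gram model, here is a small bigram model with add-one (Laplace) smoothing. The smoothing choice, the `<s>`/`</s>` boundary tokens, and the class name `BigramLM` are assumptions made for the sketch; the assignment may use other orders or smoothing schemes.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram language model with add-one smoothing (illustrative only)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for s in sentences:
            toks = self._pad(s)
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab_size = len(self.unigrams)

    @staticmethod
    def _pad(sentence):
        return ["<s>"] + sentence.lower().split() + ["</s>"]

    def prob(self, prev, word):
        # P(word | prev) = (count(prev, word) + 1) / (count(prev) + |V|)
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)

    def log_prob(self, sentence):
        toks = self._pad(sentence)
        return sum(math.log(self.prob(a, b)) for a, b in zip(toks, toks[1:]))


lm = BigramLM(["el pueblo manda", "el pueblo decide"])
print(lm.prob("el", "pueblo"))
print(lm.log_prob("el pueblo manda"), lm.log_prob("manda pueblo el"))
```

A sentence that follows the word order seen in training receives a higher log-probability than a scrambled version of the same words.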
Implements a word-level neural language model using pretrained embeddings, trained on short tweet sequences. Includes nearest neighbor queries in embedding space, text generation, sentence likelihood estimation, and perplexity comparisons against probabilistic baselines.
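Perplexity comparisons like the one mentioned above reduce to the same formula for any model that assigns a probability to each token: the exponential of the average negative log-likelihood. The helper below is a generic sketch (the name `perplexity` and the list-of-log-probabilities interface are assumptions), not the notebook's evaluation code.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / n)


# A model that spreads probability uniformly over 4 choices per token
# is exactly as uncertain as a fair 4-way guess.
uniform = [math.log(0.25)] * 4
confident = [math.log(0.9)] * 4
print(perplexity(uniform), perplexity(confident))
```

Lower values mean the model is less surprised by the text, which is why the same held-out sentences should be used for every model being compared.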
Trains a hierarchical neural model with word- and tweet-level attention mechanisms for user profiling based on multilingual tweet sequences. Evaluates model performance using F1-score and interprets attention weights for qualitative analysis.
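The two attention levels can be sketched in plain Python. This is a simplified illustration: it scores each vector with a dot product against a context vector and omits the learned tanh projection and the recurrent encoders of a full hierarchical attention network. The function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Weighted average of vectors; weights come from similarity to context."""
    scores = [sum(x * c for x, c in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def encode_user(tweets, word_context, tweet_context):
    """Word-level attention builds tweet vectors; tweet-level attention builds the user vector."""
    tweet_vectors = [attention_pool(words, word_context)[0] for words in tweets]
    return attention_pool(tweet_vectors, tweet_context)


# Two tweets: the first has two word vectors, the second has one.
tweets = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
user_vec, tweet_weights = encode_user(tweets, [0.0, 0.0], [0.0, 0.0])
print(user_vec, tweet_weights)
```

The returned weights are what an attention heatmap would display: one weight per word within a tweet and one weight per tweet within a user.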
Implements the ISSUES framework for hateful meme classification by combining a frozen CLIP model with textual inversion techniques and a two-stage training strategy. The system disentangles visual and textual embeddings and fuses them via a Combiner network to produce the multimodal representation used for classification.
Applies text preprocessing, exploratory analysis, and feature selection techniques to thousands of tourist reviews from 10 landmarks in Guanajuato. Includes sentiment classification based on rating scores, frequency-based word filtering, and TF-IDF + Chi² for identifying discriminative terms across destinations.
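To show how Chi² ranks discriminative terms, the sketch below computes the textbook 2x2 statistic for one term and one class from presence/absence counts. This is an assumption about the formulation: library implementations such as scikit-learn's `chi2` operate on feature values (for example TF-IDF weights) and give different numbers. The toy documents and the labels "A" and "B" are invented for the example.

```python
def chi2_term(docs, labels, term, target):
    """Chi-square of term presence vs. membership in the target class.

    docs: list of token sets, labels: list of class labels.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1          # term present, target class
        elif has_term:
            b += 1          # term present, other class
        elif in_class:
            c += 1          # term absent, target class
        else:
            d += 1          # term absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0          # term or class never varies: no evidence
    return n * (a * d - b * c) ** 2 / denom


docs = [{"mina", "tunel"}, {"mina"}, {"museo"}, {"museo", "arte"}]
labels = ["A", "A", "B", "B"]
for term in ("mina", "tunel", "xyz"):
    print(term, chi2_term(docs, labels, term, "A"))
```

Terms that appear in every document of one class and in none of the other get the highest score, which is the behavior wanted when looking for words that distinguish one destination from the rest.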
Implements a multitask neural pipeline for predicting both gender and nationality from Spanish-language tweets using RoBERTuito and TF-IDF features. The model is trained with joint loss, incorporates a Transformer-based encoder and sparse lexical features, and is evaluated using joint accuracy and F1 metrics across both tasks.
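One common way to train two heads jointly is to combine their cross-entropy losses into a single scalar. The sketch below assumes a weighted sum with a hypothetical `alpha` parameter and also shows joint accuracy, where an example counts only if both predictions are correct. The source does not state the weighting actually used, so treat both function signatures as illustrative.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_loss(gender_logits, gender_y, nation_logits, nation_y, alpha=0.5):
    # alpha balances the two tasks; 0.5 weights them equally (assumed default).
    return (alpha * cross_entropy(gender_logits, gender_y)
            + (1 - alpha) * cross_entropy(nation_logits, nation_y))


def joint_accuracy(gender_pred, gender_true, nation_pred, nation_true):
    """Fraction of users for whom both gender and nationality are correct."""
    hits = sum(
        gp == gt and np_ == nt
        for gp, gt, np_, nt in zip(gender_pred, gender_true, nation_pred, nation_true)
    )
    return hits / len(gender_true)


loss = joint_loss([0.0, 0.0], 0, [0.0, 0.0, 0.0, 0.0], 1)
acc = joint_accuracy([0, 1, 1], [0, 1, 0], [2, 3, 1], [2, 0, 1])
print(loss, acc)
```

In a PyTorch model the same idea would be expressed by summing two loss tensors before calling backward, so that gradients from both tasks reach the shared encoder.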
Through this course, I developed hands-on skills in:
- Constructing text classification pipelines with custom tokenization and feature extraction
- Training and evaluating classical models (Naive Bayes, SVMs) and neural models (BiGRU, Transformers, HAN)
- Designing hierarchical and multitask neural networks for user profiling
- Using attention mechanisms to interpret model behavior
- Applying pretrained multilingual embeddings and fine-tuning transformer-based encoders
- Implementing neural language models and evaluating them via perplexity and sentence likelihood
- Combining vision-language models (CLIP) with textual inversion for multimodal classification
- Writing reproducible research code and presenting results effectively with visualizations
- Giovanni Burbi, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto Del Bimbo. Mapping Memes to Words for Multimodal Hateful Meme Classification. arXiv:2310.08368, 2023. https://arxiv.org/abs/2310.08368
- 📧 Email: ezau.torres@cimat.mx
- 💼 LinkedIn: linkedin.com/in/ezautorres




