Workshop: Interactive Topic Analysis with Multi-Lingual Embeddings in Communalytic

Where: Groningen, University of Groningen
When: September 5, 2024, 14:00-17:00.
Registration: Mandatory via this link. Attendance is capped at 25 participants.
Credits: OSL ReMAs and PhDs can obtain 1 EC by preparing for the workshop, participating, and submitting a short assignment (assignment instructions available here).

The workshop is organized by The Patio, The Centre for Digital Humanities, the Netherlands Research School for Literary Studies (OSL), and The Groningen Research Institute for the Study of Culture (ICOG). It will be delivered by Prof. dr. Anatoliy Gruzd and Philip Mai, Co-Directors of the Social Media Lab at Toronto Metropolitan University.

Objectives

This hands-on tutorial will introduce users to Communalytic, a research tool developed by the Social Media Lab for studying online communities and discourse. The session will include an overview of Communalytic’s features and a step-by-step guide on using Communalytic’s built-in topic analysis module.

By the end of the tutorial, participants will know how to use a large language model (LLM) to transform social media data into vectors of numbers known as embeddings. The tutorial will also show attendees how to visualize the resulting vectors via Nomic Atlas, a third-party tool that enables users to represent and explore embeddings in an interactive map with labels assigned automatically based on the semantic similarity of the posts’ content.
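To make the idea of "vectors of numbers" concrete, here is a minimal sketch in Python. It is not how Communalytic or an LLM computes embeddings: as a simplifying assumption it uses plain word counts over a shared vocabulary, and the example posts and the function names `embed` and `cosine_similarity` are hypothetical. The point is only that once posts are vectors, their similarity can be measured numerically.

```python
import math
from collections import Counter


def embed(post, vocabulary):
    """Toy stand-in for an embedding: word counts over a fixed vocabulary."""
    counts = Counter(post.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical posts, invented for illustration only.
posts = [
    "the election debate was heated",
    "heated debate before the election",
    "my cat sleeps all day",
]

# Shared vocabulary: every distinct word across all posts, in a fixed order.
vocabulary = sorted({word for post in posts for word in post.lower().split()})
vectors = [embed(post, vocabulary) for post in posts]

similar = cosine_similarity(vectors[0], vectors[1])
unrelated = cosine_similarity(vectors[0], vectors[2])
print(f"posts 0 and 1: {similar:.2f}")
print(f"posts 0 and 2: {unrelated:.2f}")
```

Word counts only detect shared vocabulary. LLM-based embeddings are intended to place posts with similar meaning close together even when they use different words, which is also what makes them suitable for multilingual data.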

Considering the interdisciplinary nature of this area, we welcome participants from a wide range of disciplines, including (but not limited to) Digital Humanities, Information Science, Communication, Literary Studies, Media Studies, Education, Journalism, Management, Political Science, Psychology and Sociology.


Background

Current topic modelling techniques such as Latent Dirichlet Allocation (LDA) and BERTopic often identify abstract topics that are challenging for human analysts to interpret because they are not descriptive. This is caused in part by the fact that the topics these techniques produce are typically defined by a set of tokens and their probabilities (Fig 1). To overcome these limitations, this tutorial introduces an alternative approach using embeddings and clustering.

This method has a distinct advantage: it gives researchers a high-level map of posts clustered by semantic similarity, while still letting them zoom in on specific clusters and examine the underlying posts (Fig 2).
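The cluster-then-inspect workflow can be sketched as follows. This is an illustration under stated assumptions, not the algorithm used by Communalytic or Nomic Atlas: the two-dimensional embeddings are made up by hand, the posts are hypothetical, and `cluster_by_similarity` is a deliberately simple greedy grouping with an arbitrary similarity threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_by_similarity(vectors, threshold=0.8):
    """Greedy grouping: add each vector to the first cluster whose first
    member is at least `threshold` similar; otherwise start a new cluster.
    Returns a list of clusters, each a list of indices into `vectors`."""
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical posts with hand-made 2-D "embeddings" (assumed values).
posts = [
    "Vote counting starts tonight",
    "Polls close at 8 pm",
    "New ramen place downtown",
    "Best noodles in the city",
]
embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

clusters = cluster_by_similarity(embeddings)

# "Zooming in": list the underlying posts of each cluster.
for number, cluster in enumerate(clusters, start=1):
    print(f"Cluster {number}:")
    for index in cluster:
        print(f"  - {posts[index]}")
```

Because each cluster keeps the indices of its posts, an analyst can always go from a group on the map back to the original texts, rather than interpreting a list of tokens and probabilities.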



Fig 1: Example of Topic Modelling Visualization based on LDA.


Fig 2: Example of Visualization of Social Media Posts based on Embeddings.

 

Agenda

  1. Introduction to Communalytic and Data Collection from Social Media (20 min)
  2. Representing Posts as Embeddings (20 min)
  3. Projecting and Visualizing Embeddings (20 min)
  4. Break (15 min)
  5. Hands-on Part (60 min)

Participants need a laptop with internet access and a modern web browser to participate in the tutorial. The primary tool to be used during the tutorial is Communalytic, which runs from within a web browser and does not require any additional software.

Upon completion of the tutorial, participants should be able to: 1) collect publicly available social media data from platforms such as Reddit, Telegram and Mastodon using Communalytic, and 2) conduct a topic analysis of the collected data.