SHL Open Workshop: Text Data Preparation

Date and time


Digital Humanities Lab, School of Media, Film and Music

Silverstone Building

University of Sussex



United Kingdom

Text Data Preparation

Workshop Leader: Jack Pay (University of Sussex)

The purpose of this workshop is to introduce researchers and interested parties to two key aspects of data preparation. A common problem when starting work on large-scale text processing is that the text can be noisy, hard to analyse, or difficult to structure in a machine-readable manner.

In this workshop we will cover two common examples of problematic texts: crawled or downloaded web documents written in HTML, and (poorly) OCR’d texts taken from a historical corpus. These examples are used to introduce participants to the tools and methods of web scraping and data wrangling.

The workshop will comprise a presentation and a semi-practical session. The presentation will introduce the key problems and the methods used to solve them, and the practical session will present an illustrative example solution.

This workshop is not intended as a complete tutorial on how to prepare data. It serves as an introduction, giving participants enough knowledge of the available tools to begin working on these problems themselves.

If you have any questions about this workshop please email the series convenor, Ben Roberts, in the first instance:

