Sheffield Digital Humanities Congress 2016: #dhcshef 100 Most Frequent Terms

A view of the #dhcshef 2016 dataset created with Martin Hawksey’s TAGS Explorer

The Sheffield Digital Humanities Congress 2016 was held from the 8th to the 10th of September 2016 at the University of Sheffield. The full conference programme is available here: http://www.hrionline.ac.uk/dhc.

The event’s official hashtag was the same as in previous editions, #dhcshef.

I collected the public Tweets tagged with #dhcshef that were posted between Monday September 05 2016 at 17:54:58 +0000 and Saturday September 10 2016 at 23:37:06 +0000. This time I used Tweepy 3.5.0, a Python wrapper for the Twitter API, for the collection. For comparison I also used, as usual, Martin Hawksey’s TAGS, which gave similar results (I only collected Tweets from accounts with at least 1 follower).
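As a rough illustration of the time window and follower filter described above, here is a minimal sketch in Python. It is not the script I used: it assumes each Tweet is a dictionary shaped like the Twitter API v1.1 JSON (with created_at and user.followers_count fields), and the keep_tweet helper and the sample records are hypothetical.

```python
from datetime import datetime, timezone

# Twitter's v1.1 API reports created_at like "Mon Sep 05 17:54:58 +0000 2016".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Collection window stated in the post (UTC).
WINDOW_START = datetime(2016, 9, 5, 17, 54, 58, tzinfo=timezone.utc)
WINDOW_END = datetime(2016, 9, 10, 23, 37, 6, tzinfo=timezone.utc)


def keep_tweet(tweet, min_followers=1):
    """Return True if the Tweet is inside the window and meets the follower floor."""
    created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
    followers = tweet["user"]["followers_count"]
    return WINDOW_START <= created <= WINDOW_END and followers >= min_followers


# Hypothetical sample records, not taken from the real dataset.
sample = [
    {"created_at": "Thu Sep 08 09:15:00 +0000 2016", "user": {"followers_count": 120}},
    {"created_at": "Thu Sep 08 09:16:00 +0000 2016", "user": {"followers_count": 0}},
    {"created_at": "Sun Sep 11 08:00:00 +0000 2016", "user": {"followers_count": 50}},
]

kept = [t for t in sample if keep_tweet(t)]
print(len(kept), "of", len(sample), "sample Tweets kept")
```

Only the first sample record passes: the second comes from an account with no followers and the third falls after the end of the window.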

As on previous occasions I extracted the text and usernames from this dataset and used Voyant Tools for a basic text analysis. The dataset contained 1,479 Tweets posted by 256 different accounts; 841 of those Tweets were RTs. The text of the Tweets made up a corpus of 26,094 total words and 3,057 unique word forms.

I used Voyant’s Terms tool to get the most frequent terms, applying an edited English stop words list that included Twitter and congress-specific terms (this means that words expected to be frequent like ‘digital’, ‘humanities’, ‘congress’, ‘sheffield’, as well as usernames, project names and people’s names, were filtered out). I exported a list of the 500 most frequent terms and then manually refined the data to remove any remaining names of people or projects. (The term matching is not case sensitive, so I may have made mistakes; further disambiguation and refining would be required.) If you are interested I previously detailed a similar methodology here.
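For readers who would like to try something similar without Voyant, the counting step can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the tokenisation is a simple regular expression and will not match Voyant’s exactly, the stop list shown is a shortened, hypothetical stand-in for the edited list I actually used, and the sample texts are invented.

```python
import re
from collections import Counter

# Hypothetical, shortened stop list: a few English function words plus
# Twitter and congress-specific terms of the kind filtered out in the post.
STOP_WORDS = {
    "the", "a", "at", "of", "and", "rt",
    "dhcshef", "digital", "humanities", "congress", "sheffield",
}


def tokenize(text):
    """Lowercase the text, drop URLs and @usernames, and return word tokens."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return re.findall(r"[a-z]+(?:['’][a-z]+)?", text)


def top_terms(texts, n=100):
    """Return the n most frequent terms that are not in the stop list."""
    counts = Counter(
        token
        for text in texts
        for token in tokenize(text)
        if token not in STOP_WORDS
    )
    return counts.most_common(n)


# Invented sample texts, not taken from the real dataset.
sample_texts = [
    "Great keynote at the Digital Humanities Congress #dhcshef",
    "RT @someone: Great keynote at the Digital Humanities Congress #dhcshef",
    "Open data and open access at #dhcshef https://example.org/x",
]

for term, count in top_terms(sample_texts, n=5):
    print(term, count)
```

Names of people and projects would still need to be removed by hand afterwards, as described above.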

Here’s my resulting list of the 100 most frequent terms.

Term            Count
great           106
project         98
data            76
research        64
students        63
word            58
funding         55
work            55
just            53
spread          53
use             51
opportunity     48
text            47
historical      46
oa              46
looking         45
open            45
editions        44
pedagogy        40
academic        38
access          36
keynote         36
like            36
analysis        35
follow          35
using           34
book            33
new             33
projects        33
university      33
important       32
innovation      32
today           32
tomorrow        32
early           31
minimal         31
paper           31
south           31
content         30
excellent       30
love            30
social          30
look            29
talking         29
tools           29
discussing      28
global          28
grants          28
london          28
network         28
review          28
forward         27
libraries       27
resources       27
sudan           27
history         26
talk            26
books           25
online          25
programme       25
really          25
teach           25
teaching        25
digitisation    24
issues          24
tactical        24
archive         23
critique        23
make            23
different       22
need            22
peer            22
session         22
cultural        21
heritage        21
starts          21
studies         21
value           21
art             20
cool            20
don’t           20
good            20
live            20
press           20
start           20
arts            19
available       19
colleagues      19
delegates       19
going           19
metadata        19
presenting      19
day             18
digitised       18
let’s           18
networks        18
notes           18
person          18
started         18
begins          17

Please bear in mind that RTs count as Tweets, and therefore the repetition implicit in RTs directly affects the frequent term counts. Which terms made it into the top 100 reflects my own bias (I personally didn’t want to see how many times ‘digital’ or ‘humanities’ was repeated), but the individual term counts remain the same regardless.
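To give a sense of how much RTs can inflate a count, here is a small Python sketch comparing term counts with and without RTs. It assumes, as a simplification, that an RT can be recognised by its text starting with "RT @"; the helper names and sample texts are hypothetical and the numbers are not from the #dhcshef dataset.

```python
from collections import Counter


def term_counts(texts):
    """Count whitespace-separated, lowercased words across all texts."""
    return Counter(word for text in texts for word in text.lower().split())


def is_retweet(text):
    # Assumption: an RT is recognised by the conventional "RT @" prefix.
    return text.startswith("RT @")


# Invented sample texts: one original Tweet, two RTs of it, one unrelated Tweet.
sample_texts = [
    "great project",
    "RT @a: great project",
    "RT @b: great project",
    "new data",
]

with_rts = term_counts(sample_texts)
without_rts = term_counts(t for t in sample_texts if not is_retweet(t))

print("great, RTs included:", with_rts["great"])
print("great, RTs excluded:", without_rts["great"])
```

In this toy sample ‘great’ is counted three times when RTs are included but only once when they are excluded, which is the effect to keep in mind when reading the list above.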

I appreciate that the stop words selection is indeed subjective (deictics like ‘tomorrow’ or ‘today’ may well mean very little). It’s up to the reader to judge whether such a listing offers any insights at all; as Twitter moves relentlessly and such data remains a moving target, I’d like to believe that collecting and looking into frequent terms offers at least another point of view, if not a gateway, into how a particular academic event is represented/discussed/reported on Twitter. Perhaps it’s my enjoyment of poetry that makes me think that seeing words out of context (or recontextualised) like this can offer some kind of food for thought or creativity.

Interestingly, the dataset showed user_lang metadata values other than en or en-GB: de, es, fr, it, nl and ru were also present, albeit in the minority. The dataset also showed that some sources are clearly identified as bots.

I am fully aware this would be more interesting and useful if others could replicate the text analysis through access to the source dataset I used. There are many interesting types of analysis that could be run on a dataset such as this, and many aspects of the data to focus on. I am simply sharing this post right now as a quick indicative update after the event concluded.