#TheDataDebates: A Quick Twitter Data Summary

Screenshot of an interactive visualisation of a #TheDataDebates archive created with Martin Hawksey's TAGSExplorer

1 October 2016 Update: I have now deposited on figshare a CSV file with timestamps, source and user_lang metadata of the archived tweets.

Priego, Ernesto (2016): #TheDataDebates Tweet Timestamps, Source, User Language. figshare. https://dx.doi.org/10.6084/m9.figshare.3976731.v1. Retrieved: 10 03, Oct 01, 2016 (GMT)

‘Social Media Data: What’s the use’ was the title of a panel discussion held at The British Library, London, on Wednesday 21 September 2016, 18:00 – 20:00. The official hashtag of the event was #TheDataDebates.

I made a collection of Tweets tagged with #TheDataDebates published publicly between 12/09/2016 09:06:52 and 22/09/2016 09:55:03 (BST).

Again I used Tweepy 3.5.0, a Python wrapper for the Twitter API, for the collection. Learning to mine with Python has been fun and empowering. To compare results I also used, as usual, Martin Hawksey’s TAGS; the results were equal. Having the collected data already in a spreadsheet saved me time. I only collected Tweets from accounts with at least one follower.
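The two collection constraints described above (the time window and the minimum of one follower) can be sketched as a simple filter. This is only an illustration, not the script used for the archive: it assumes each Tweet is a dictionary shaped like the Twitter API's JSON (a `created_at` string and a nested `user` object with `followers_count`), the helper names are hypothetical, and the sample records are made up.

```python
from datetime import datetime, timedelta, timezone

# BST is UTC+1; the window matches the first and last timestamps of the archive.
BST = timezone(timedelta(hours=1))
WINDOW_START = datetime(2016, 9, 12, 9, 6, 52, tzinfo=BST)
WINDOW_END = datetime(2016, 9, 22, 9, 55, 3, tzinfo=BST)


def parse_created_at(value):
    """Parse a Twitter-style timestamp such as 'Wed Sep 21 18:05:00 +0000 2016'."""
    return datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")


def keep_tweet(tweet, min_followers=1):
    """Return True if the Tweet falls inside the window and its author has enough followers."""
    created = parse_created_at(tweet["created_at"])
    followers = tweet["user"]["followers_count"]
    return WINDOW_START <= created <= WINDOW_END and followers >= min_followers


# Made-up records for illustration only; they are not taken from the archive.
sample = [
    {"created_at": "Wed Sep 21 18:05:00 +0000 2016", "user": {"followers_count": 120}},
    {"created_at": "Wed Sep 21 18:10:00 +0000 2016", "user": {"followers_count": 0}},
    {"created_at": "Fri Sep 23 10:00:00 +0000 2016", "user": {"followers_count": 50}},
]
kept = [t for t in sample if keep_tweet(t)]
print(len(kept))
```

Comparing timezone-aware datetimes means the UTC timestamps returned by the API can be checked directly against a window expressed in BST.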

Here’s a summary of the collection:

First Tweet in Archive 12/09/2016 09:06:52
Last Tweet in Archive 22/09/2016 09:55:03
Number of Tweets
Number of links
Number of RTs
Number of accounts
From the main archive I was able to focus on number of Tweets per source and user language setting.
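Counts per source and per user language setting like the ones summarised here could be produced along these lines. This is a rough sketch under stated assumptions: the column names (`created_at`, `source`, `user_lang`) and the sample rows are invented for illustration, and the tag-stripping step only matters if the source field still carries the HTML anchor that the Twitter API wraps around the client name.

```python
import csv
import io
import re
from collections import Counter

# Made-up rows for illustration only; the column names are assumptions.
SAMPLE_CSV = '''created_at,source,user_lang
21/09/2016 18:05:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",en
21/09/2016 18:06:10,"<a href=""http://twitter.com"" rel=""nofollow"">Twitter Web Client</a>",en-gb
21/09/2016 18:07:30,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",es
'''

TAG_RE = re.compile(r"<[^>]+>")


def clean_source(value):
    """Strip any HTML tags so that only the client name remains."""
    return TAG_RE.sub("", value).strip()


def count_column(rows, column, transform=lambda v: v):
    """Count how often each (optionally transformed) value appears in a column."""
    return Counter(transform(row[column]) for row in rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
source_counts = count_column(rows, "source", clean_source)
lang_counts = count_column(rows, "user_lang")

# List sources from most to least frequent.
for name, n in source_counts.most_common():
    print(name, n)
```

To run it against a real file, the `io.StringIO(SAMPLE_CSV)` part would be replaced with an opened CSV file.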


source Count
Twitter for iPhone
Twitter Web Client
Twitter for Android
Twitter for iPad
UK Trends
Mobile Web (M5)
Twitter for Windows Phone
Big Data news flow
Lt RTEngine

User Language Setting (user_lang)

user_lang Count Notes






6 of it are spam






both spam


The summary above is of the raw collection, so not all the activity it reflects is ‘human’ or relevant: some of the accounts tweeting have been identified as bots tweeting spam (a less human-readable hashtag could potentially have avoided such spamming, given the relatively low activity). Except where I identified spam Tweets, in this post I have not looked at the Tweets’ text data (i.e. I haven’t shared here any text or content analysis); I may do so if I have time in the near future. As Retweets were counted as Tweets in this archive, a more specific and precise analysis would have to filter them from the dataset.
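Filtering Retweets out of such a dataset could be sketched as follows. It is a heuristic, not a definitive method: it assumes each Tweet is a dictionary that either carries the `retweeted_status` key the Twitter API attaches to Retweets or, as in a spreadsheet export, only has a `text` field beginning with "RT @". The function name and the sample Tweets are made up.

```python
def is_retweet(tweet):
    """Heuristic: API payloads carry 'retweeted_status'; exported text starts with 'RT @'."""
    if "retweeted_status" in tweet:
        return True
    return tweet.get("text", "").startswith("RT @")


# Made-up Tweets for illustration only; they are not taken from the archive.
sample = [
    {"text": "Looking forward to the panel #TheDataDebates"},
    {"text": "RT @example_user: Looking forward to the panel #TheDataDebates"},
    {"text": "Great discussion tonight #TheDataDebates", "retweeted_status": {"id": 1}},
]

originals = [t for t in sample if not is_retweet(t)]
retweets = [t for t in sample if is_retweet(t)]
print(len(originals), len(retweets))
```

The text-prefix check will miss quote Tweets and manual Retweets written in other styles, so any counts based on it are best treated as indicative.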

I am fully aware this would be more interesting and useful if there were opportunities for others to replicate the analysis through access to the source dataset I used. There are lots of interesting types of analysis that could be run and data to focus on in such a dataset as this. As in previous posts about other events, I am simply sharing this post right now as a quick indicative update published only a few hours after the event concluded.

It was pointed out last night that “social media data mining is starting but still has a way to go to catch up with hard analytical methodologies.” A post like this does not claim to employ such methodologies; it simply seeks to contribute to the debate with evidence that may hopefully inspire other studies. Perhaps it’s a two-way process, and “hard analytical methodologies” (and researchers’ and users’ attitudes regarding cultural paradigms around ethics, privacy, consent, statistical significance) also have a way to go to catch up with new/recent pervasive forms of data creation and dissemination that perhaps require different, media-, community- and content-specific approaches to doing research.

Other Considerations [I am reusing my own text from previous posts here]

Both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might “over-represent the more central users”, not offering “an accurate picture of peripheral activity” (González-Bailon et al., 2012). Apart from the filters and limitations already declared, it cannot be guaranteed that each and every Tweet tagged with #TheDataDebates during the indicated period was analysed. The dataset was shared for archival, comparative and indicative educational research purposes only.

Only content from public accounts, obtained from the Twitter Search API, was analysed. The source data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account. These posts and the resulting dataset contain the results of analyses of Tweets that were published openly on the Web with the queried hashtag; the content of the Tweets is the responsibility of the original authors. Original Tweets are likely to be the copyright of their individual authors, but please check individually. This work is shared to archive, document and encourage open educational research into scholarly activity on Twitter.

No private personal information was shared. The collection, analysis and sharing of the data has been enabled and allowed by Twitter’s Privacy Policy. The sharing of the results complies with Twitter’s Developer Rules of the Road. A hashtag is metadata users choose freely to use so their content is associated, directly linked to and categorised with the chosen hashtag.

The purpose and function of hashtags is to organise and describe information/outputs under the relevant label in order to enhance the discoverability of the labeled information/outputs (Tweets in this case). Tweets published publicly by scholars or other professionals during academic conferences or events are often publicly tagged (labeled) with a hashtag dedicated to the event in question. This practice used to be confined to a few ‘niche’ fields; it is increasingly becoming the norm rather than the exception. Though not every reason for Tweeters’ use of hashtags can be generalised or predicted, it can be argued that scholarly Twitter users form specialised, self-selecting public professional networks that tend to observe scholarly practices and accepted modes of social and professional behaviour. In general terms it can be argued that scholarly Twitter users willingly and consciously tag their public Tweets with a conference hashtag as a means to network and to promote, report from, reflect on, comment on and generally contribute publicly to the scholarly conversation around conferences.

As Twitter users, conference Twitter hashtag contributors have agreed to Twitter’s Privacy and data sharing policies. Professional associations like the Modern Language Association and the American Psychological Association recognise Tweets as citable scholarly outputs. Archiving scholarly Tweets is a means to preserve this form of rapid online scholarship that otherwise can very likely become unretrievable as time passes; Twitter’s search API has well-known temporal limitations for retrospective historical search and collection. Beyond individual Tweets as scholarly outputs, the collective scholarly activity on Twitter around a conference or academic project or event can provide interesting insights for the contemporary history of scholarly communications. Though this work has limitations and might not be thoroughly systematic, it is hoped it can contribute to developing new insights into a discipline’s public concerns as expressed on Twitter over time.


González-Bailon, Sandra and Wang, Ning and Rivero, Alejandro and Borge-Holthoefer, Javier and Moreno, Yamir, Assessing the Bias in Samples of Large Online Networks (December 4, 2012).  Available at SSRN: http://dx.doi.org/10.2139/ssrn.2185134

Priego, Ernesto (2016) #WLIC2016 Most Frequent Terms Roundup. figshare. https://dx.doi.org/10.6084/m9.figshare.3749367.v2

AHRC [ahrcpress]. (2016, Sep 21). Social media data mining is starting but still has a way to go to catch up with hard analytical methodologies #TheDataDebates [Tweet]. Retrieved from https://twitter.com/ahrcpress/status/778652767636389888

Priego, Ernesto (2016): #TheDataDebates Tweet Timestamps, Source, User Language. figshare. https://dx.doi.org/10.6084/m9.figshare.3976731.v1 Retrieved: 10 03, Oct 01, 2016 (GMT)
