Data Critique

The dataset used for this project was published on Kaggle by user Hibrahimag1, where it can be viewed and downloaded.

The data comes from TwitchTracker.com, a site that provides analytics on individual streamers and on games streamed on Twitch. The author of the dataset used a Python script and Selenium, a browser automation tool commonly used for web scraping, to collect the data into a usable dataset. In other words, a program systematically extracted the fields from the website, cleaned erroneous values, merged similar values, and standardized spellings and capitalization. The data is a mix of numerical and categorical fields, including the streamer's language, games played, average stream duration, number of followers, and average viewers per stream, all of which can be used to analyze trends between games and viewership.
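The cleaning steps described above can be sketched in a few lines of Python. This is only an illustration, not the author's actual script: the field names, the sample rows, and the alias table are all hypothetical, and the real scraper may handle these cases differently.

```python
# Hypothetical raw rows, shaped like what a scraper might return (not real data).
RAW_ROWS = [
    {"name": "StreamerA", "language": " english ",
     "most_streamed_game": "just chatting", "avg_viewers": "12,345"},
    {"name": "StreamerB", "language": "ESPAÑOL",
     "most_streamed_game": "Just Chatting", "avg_viewers": "n/a"},
    {"name": "StreamerC", "language": "Spanish",
     "most_streamed_game": "JUST  CHATTING", "avg_viewers": "980"},
]

# Assumed alias table for merging values that mean the same thing.
LANGUAGE_ALIASES = {"español": "Spanish", "espanol": "Spanish"}


def standardize_text(value):
    """Collapse repeated whitespace and apply consistent capitalization."""
    return " ".join(value.split()).title()


def parse_count(value):
    """Turn a string like '12,345' into 12345; return None if it is not a number."""
    try:
        return int(value.replace(",", "").strip())
    except ValueError:
        return None


def clean_rows(rows):
    """Drop rows with bad numbers, merge aliases, and standardize text fields."""
    cleaned = []
    for row in rows:
        viewers = parse_count(row["avg_viewers"])
        if viewers is None:
            continue  # erroneous numeric value: skip the row
        key = " ".join(row["language"].split()).lower()
        language = LANGUAGE_ALIASES.get(key, key.title())
        cleaned.append({
            "name": row["name"],
            "language": language,
            "most_streamed_game": standardize_text(row["most_streamed_game"]),
            "avg_viewers": viewers,
        })
    return cleaned


if __name__ == "__main__":
    for row in clean_rows(RAW_ROWS):
        print(row)
```

In this sketch, rows whose viewer count cannot be parsed are dropped outright. Whether the original script drops, repairs, or flags such rows is not something we can confirm from the dataset alone.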

TwitchTracker.com’s overall purpose is to provide an extensive breakdown of performance and engagement statistics on the Twitch platform for those interested in its data trends. On the site, users can filter Twitch’s most viewed or streamed content in tabs based on channel, games, clips, or subscribers. Drawing directly from the site, the author ranks the top 1000 Twitch streamers based on an average of followers, views, streaming time, and returning viewers. For each row, the author includes the streamer’s name, language spoken, type of content, most streamed game, average stream duration, followers gained in the past 30 days, average viewers per stream, and average games per stream. The 30-day window used for this dataset runs from April 23 to May 23, 2024.

As far as we could tell, no outside funding was involved in the collection of the dataset. Harun Ibrahimagić is the sole uploader, copyright holder, and author of the MIT-licensed scraper that harvested the figures from TwitchTracker. He runs the project entirely from his personal GitHub account, with no sponsors, partners, or grant acknowledgements. His LinkedIn profile supports the claim that the project was self-initiated: he is a second-year Computer Science student at the University of Sarajevo’s Faculty of Electrical Engineering. Ibrahimagić is well equipped to compile a dataset like this, as he sharpens his Python and C++ skills through volunteer teaching (Teaching Assistant, Plus Ultra high-school coding academy, Sep 2024–Dec 2024) and community leadership (Youth Coordinator, Mreža mladih Islamic Community of Zenica, Jan 2023–Oct 2024). Both of those roles are unpaid, which suggests that the dataset was produced with free tools, adequate know-how, and spare time rather than institutional resources or commercial backing.

Our main dataset is split into several categories, the main ones being stream duration, followers, and games or activity. Ranking streamers on these measures makes it harder for smaller streamers to gain attention and popularity relative to the big streamers. Another effect is that it quietly promotes popular or mainstream genres while marginalizing creative or niche gaming communities. Lastly, one of the major categories is language. The top two languages in the dataset are English and Spanish, which suggests that material in “globalized” languages is more popular or valued, perpetuating global language hegemonies and marginalizing creators from linguistically underrepresented backgrounds.

The way this dataset is organized by category shapes what we can and cannot interpret from it. The dataset focuses primarily on numbers, such as followers, viewers, and stream hours, so, like all datasets, it ultimately frames success in a very stats-driven way. If this dataset were all we had, we would miss a lot of what actually makes Twitch streamers special: the personalities, the community vibe, the drama, and the real emotional connections between streamers and their audiences. Nothing in the dataset itself indicates viewer loyalty, interaction, or the kinds of communities that audiences form outside of Twitch. It turns streamers into numbers on a list instead of showing them as people with unique styles or goals. Additionally, since the dataset only covers the top 1000, it completely overlooks smaller creators, who make up the majority of the platform. The dataset pushes us to focus solely on the biggest and most visible streamers, but if we simply follow that approach, we miss the broader story about who is truly on Twitch and how the platform operates.

Check out our narrative next!

Narrative