BubbleTube

DOWNLOAD

In order to anonymize the data, YouTube video and channel IDs have been mapped to integers that are consistent across all datasets. See the paper for a detailed description of the methodology.

We used about 650 YouTube videos as seeds.

Half are related to the 2019 European Parliament election; the other half were picked based on their popularity on Reddit and Wikipedia. The file ./seeds.csv maps the video IDs to their sample origin.

We gathered YouTube's suggestions for these seeds, every 10 minutes, at least 2000 times (~ 15 days).

Each line of ./long_crawl.csv logs the video ID of the seed, followed by the ordered list of suggestions found. This data reveals a clear plateau of highly frequent suggestions.
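As a rough illustration of how the plateau could be inspected, the sketch below counts how often each suggestion appears for each seed. It assumes the rows are comma-separated with no header line, which is not stated above; the function name suggestion_frequencies and the sample rows are made up for the example.

```python
import csv
from collections import Counter, defaultdict
from io import StringIO


def suggestion_frequencies(lines):
    """Count, per seed, how often each suggested video ID appears.

    `lines` is any iterable of text lines (an open file works).
    Assumption: comma-separated, seed ID first, suggestions after it.
    """
    counts = defaultdict(Counter)
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        seed, suggestions = row[0], row[1:]
        counts[seed].update(s for s in suggestions if s)
    return counts


# Made-up sample rows standing in for ./long_crawl.csv
sample = StringIO("1,10,11,12\n1,10,12,13\n2,20,21\n")
freqs = suggestion_frequencies(sample)
print(freqs["1"].most_common())
```

With the real file, the same function could be called as suggestion_frequencies(open("long_crawl.csv", newline="")), and the sorted counts for a seed should show the plateau as a group of suggestions with similarly high frequencies.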

For each seed, we recursively crawled frequent suggestions up to depth 3.

Each line in ./recursice-crawls/{videoID}.csv logs the node's video ID, the video ID of a suggestion belonging to the plateau, and the depth at which it was suggested.
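One possible way to work with a recursive crawl is to load it as an adjacency list. The sketch below assumes three comma-separated fields per row (node, suggestion, depth) and skips any row whose depth is not an integer, such as a header line if one exists; load_crawl_edges and the sample rows are hypothetical.

```python
import csv
from collections import defaultdict
from io import StringIO


def load_crawl_edges(lines):
    """Build {node: [(suggestion, depth), ...]} from crawl rows.

    Assumption: each row is `node,suggestion,depth`, comma-separated.
    """
    graph = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # malformed or blank row
        try:
            depth = int(row[2])
        except ValueError:
            continue  # e.g. a header line, if the file has one
        graph[row[0]].append((row[1], depth))
    return graph


# Made-up sample rows standing in for one per-seed crawl file
sample = StringIO("1,10,1\n1,11,1\n10,42,2\n42,1,3\n")
graph = load_crawl_edges(sample)
print(dict(graph))
```

Keeping the depth alongside each edge makes it easy to restrict the graph to, say, the first two levels of the crawl.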

Lastly, for each visited page, we gathered some metadata about the video.

Each line in ./video_metadata.csv logs: video ID, channel ID, number of subscribers, number of views, category ID (see ./categories_ids.csv), number of likes, number of dislikes, and age of the video (in seconds).
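A minimal sketch of reading these records follows. The column names in COLUMNS are labels chosen for this example, in the order listed above; they are not taken from the file itself. The sketch also assumes comma-separated rows with no header line.

```python
import csv
from io import StringIO

# Hypothetical column labels, in the order described above
COLUMNS = [
    "video_id", "channel_id", "subscribers", "views",
    "category_id", "likes", "dislikes", "age_seconds",
]


def read_metadata(lines):
    """Yield one dict per row, keyed by the labels in COLUMNS."""
    for row in csv.DictReader(lines, fieldnames=COLUMNS):
        yield row


# Made-up sample row standing in for ./video_metadata.csv
sample = StringIO("1,7,1000,50000,25,300,12,864000\n")
records = list(read_metadata(sample))

# The age is given in seconds; 86400 seconds make one day
age_days = int(records[0]["age_seconds"]) / 86400
print(records[0]["video_id"], age_days)
```

All values are read as strings, so numeric fields need an explicit conversion, as done for the age here.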

Tubes and Bubbles

Topological confinement of recommendations on YouTube