In order to anonymize the data, YouTube video and channel IDs have been mapped to integers that are consistent across all datasets. See the paper for a detailed description of the methodology.
- We considered ~ 650 YouTube videos as seeds.
- We gathered YouTube's suggestions for these seeds, every 10 minutes, at least 2000 times (~ 15 days).
- For each seed, we recursively crawled frequent suggestions up to depth 3.
- Lastly, for each visited page, we gathered some metadata about the video.
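The recursive step above can be pictured as a depth-limited breadth-first traversal. The following is only an illustrative sketch, not the crawler used to build the dataset: the `crawl` function, the `frequent_suggestions` callback and the toy graph are hypothetical, and it assumes that the seed sits at depth 0 and its direct suggestions at depth 1.

```python
from __future__ import annotations

from collections import deque
from typing import Callable, Iterable


def crawl(seed: int,
          frequent_suggestions: Callable[[int], Iterable[int]],
          max_depth: int = 3) -> list[tuple[int, int, int]]:
    """Breadth-first crawl returning (node, suggestion, depth) edges.

    `frequent_suggestions` is a hypothetical callback that returns the
    frequent suggestions of a video; it stands in for the real crawler.
    """
    edges = []
    visited = {seed}
    queue = deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # nodes at the depth limit are not expanded
        for suggestion in frequent_suggestions(node):
            edges.append((node, suggestion, depth + 1))
            if suggestion not in visited:
                visited.add(suggestion)
                queue.append((suggestion, depth + 1))
    return edges


# Toy suggestion graph standing in for real crawl results (made-up IDs).
toy = {1: [2, 3], 2: [3, 4], 3: [1], 4: [5], 5: [6]}
edges = crawl(1, lambda video: toy.get(video, []))
```

In this toy run, video 5 is reached at depth 3, so its own suggestion (video 6) is never visited.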
Half of the seeds are related to the 2019 European Parliament election; the other half were picked based on their popularity on Reddit and Wikipedia. The file ./seeds.csv maps the video IDs to their sample origin.
Each line of ./long_crawl.csv logs the video ID of the seed, followed by the ordered list of suggestions found. This data reveals a clear plateau of highly frequent suggestions.
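One way to look for the plateau is to compute, for each seed, the fraction of crawls in which each suggestion appears. The sketch below makes several assumptions that the description above does not confirm: fields are comma-separated, there is no header row, and all IDs are integers. The function name `suggestion_rates`, the sample rows and the 0.75 threshold are illustrative only.

```python
import csv
import io
from collections import Counter


def suggestion_rates(rows):
    """Map each seed to {suggestion: fraction of that seed's crawls listing it}."""
    counts = {}         # seed -> Counter of suggestions
    crawls = Counter()  # seed -> number of logged crawls
    for row in rows:
        if not row:
            continue  # skip blank lines
        seed = int(row[0])
        suggestions = [int(field) for field in row[1:] if field]
        crawls[seed] += 1
        # set() so a video listed twice in one crawl is counted once
        counts.setdefault(seed, Counter()).update(set(suggestions))
    return {seed: {video: n / crawls[seed] for video, n in counter.items()}
            for seed, counter in counts.items()}


# Made-up rows in the assumed format: seed ID, then ordered suggestions.
sample = io.StringIO("7,11,12,13\n7,11,12,14\n7,11,15,12\n7,11,16,17\n")
rates = suggestion_rates(csv.reader(sample))

# Arbitrary threshold chosen for illustration, not taken from the paper.
plateau = sorted(video for video, rate in rates[7].items() if rate >= 0.75)
```

On the real file, the same function could be fed `csv.reader(open("long_crawl.csv", newline=""))`, provided the format assumptions hold.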
Each line in ./recursice-crawls/{videoID}.csv logs the node's video ID, the video ID of a suggestion belonging to the plateau, and the depth at which it was suggested.
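A recursive crawl file can be loaded as a small suggestion graph. This is a minimal sketch assuming comma-separated rows of exactly three integer fields (node, suggestion, depth) and no header row; `load_recursive_crawl` is a hypothetical helper, and keeping the smallest depth for a repeated edge is a choice made here, not something the dataset prescribes.

```python
import csv
import io
from collections import defaultdict


def load_recursive_crawl(lines):
    """Return {node: {suggestion: depth}} from rows of node, suggestion, depth."""
    graph = defaultdict(dict)
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        node, suggestion, depth = (int(field) for field in row)
        previous = graph[node].get(suggestion)
        # If an edge is logged more than once, keep the smallest depth.
        if previous is None or depth < previous:
            graph[node][suggestion] = depth
    return dict(graph)


# Made-up rows in the assumed format.
sample = io.StringIO("1,2,1\n1,3,1\n2,4,2\n4,5,3\n4,5,2\n")
graph = load_recursive_crawl(sample)
```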
Each line in ./video_metadata.csv logs: video ID, channel ID, number of subscribers, number of views, category ID (see ./categories_ids.csv), number of likes, number of dislikes, and age of the video (in seconds).
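The metadata columns can be given names when loading the file. The sketch below assumes comma-separated rows with no header, the column order listed above, and integer values in every field (rows with missing values would need extra handling). The `VideoMetadata` class and `read_metadata` function are hypothetical names introduced for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class VideoMetadata:
    # Field order follows the column order described above (assumed).
    video_id: int
    channel_id: int
    subscribers: int
    views: int
    category_id: int
    likes: int
    dislikes: int
    age_seconds: int

    @property
    def age_days(self) -> float:
        """Age of the video converted from seconds to days."""
        return self.age_seconds / 86400


def read_metadata(lines):
    """Parse rows into VideoMetadata records, skipping blank lines."""
    return [VideoMetadata(*(int(field) for field in row))
            for row in csv.reader(lines) if row]


# Made-up row in the assumed format.
sample = io.StringIO("42,7,1000,250000,25,3000,120,172800\n")
videos = read_metadata(sample)
```

The `category_id` field can then be looked up in ./categories_ids.csv, whose layout is not described here.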