Reddit is a lively place to start conversations on any topic with other anonymous users in the Reddit community. Understandably, though, it takes some initial effort on the user's part to figure out how to get going on Reddit, a.k.a. the front page of the internet.
With thousands of user-created communities to pick from (called subreddits in Reddit lingo), a new user will likely be at a loss wondering how to pick the communities relevant to them, or what sort of content gets discussed in the communities that look interesting.
In this post, I explore popular subreddits in the news space worldwide, and then the popular ones in the Indian news space. In the subsequent post, I'll talk about personalized subreddit recommendations per user, identified using patterns generalized from current Reddit users' post and comment behavior across subreddits. In both posts, I use Google BigQuery both as a source for a tidy dataset and as a powerful cloud-based compute tool.
To construct the queries and shape the overall data story, I have adapted the query construct from a blog post by Felipe Hoffa, BigQuery Developer Advocate at Google. The clusters of subreddits handpicked for the analysis below were identified by mining the post and comment activity of Reddit users. The source dataset covers Reddit user activity in September 2017.
Note: You can access any of the subreddits below by appending its /r/ name (for example, /r/news) to reddit.com.
/r/politics + /r/news + /r/worldnews + /r/inthenews + /r/The_Donald
/r/The_Donald + /r/conspiracy + /r/uncensorednews + /r/Conservative
/r/india + /r/indianews + /r/IndiaSpeaks + /r/bakchodi
/r/science + /r/EverythingScience + /r/space + /r/energy + /r/Futurology + /r/artificial + /r/technology
/r/Cricket + /r/fantasyfootball + /r/LiverpoolFC + /r/nba + /r/nfl + /r/soccer
/r/movies + /r/television + /r/boxoffice
Top subreddits where global mainstream media gets referenced
Top subreddits where Indian mainstream media gets referenced
Top subreddits where science and tech publications get referenced
Code-Preview
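# Packages used below
library(bigrquery)  # query_exec()
library(dplyr)      # %>% and mutate()
library(ggplot2)    # plotting
library(ggthemes)   # theme_economist()

# Placeholder (not from the original post): your own Google Cloud billing project ID
project <- "your-project-id"

# Query to extract the relevant summary from the full dataset
sql <- "
SELECT domain, subreddit, count_dom, COUNT(*) posts FROM (
  SELECT id, domain, subreddit, COUNT(*) OVER(PARTITION BY domain) count_dom
  FROM `fh-bigquery.reddit_posts.2017_09`
  WHERE score > 25
  AND domain NOT IN (
    'puu.sh', 'zkillboard.com', 'gifsound.com', 'youtu.be', 'bato.to', 'archive.is', 'archive.fo',
    'pbs.twimg.com', 'streamable.com', 'cdn.awwni.me')
  AND NOT over_18
  AND subreddit IN ('politics', 'news', 'worldnews', 'inthenews', 'The_Donald')
)
WHERE count_dom > 100
GROUP BY 1, 2, 3
ORDER BY 4 DESC"

# Standard SQL is requested here, so the table name is quoted with backticks
# rather than legacy-SQL square brackets. Depending on the bigrquery version,
# this argument may be named use_legacy_sql instead.
out_1 <- query_exec(sql, project = project, useLegacySql = FALSE)

# Horizontal stacked bars: posts per domain, split by subreddit
out_1 %>%
  mutate(domain = reorder(domain, count_dom)) %>%
  ggplot(mapping = aes(x = domain, y = posts, fill = subreddit)) +
  geom_bar(stat = "identity", width = .6) +
  coord_flip() +
  theme_economist()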
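To make the query's logic easier to follow, here is a minimal sketch that mimics its three steps (window count per domain, filter on that count, then group and count) in base R. It runs on a made-up six-row table, so the data, the domain names, and the threshold of 1 (instead of 100) are illustrative assumptions rather than anything taken from the Reddit dataset.

```r
# Toy stand-in for the reddit_posts table (made-up rows, for illustration only)
posts <- data.frame(
  id        = 1:6,
  domain    = c("a.com", "a.com", "a.com", "b.com", "b.com", "c.com"),
  subreddit = c("news", "news", "politics", "news", "politics", "news"),
  stringsAsFactors = FALSE
)

# Step 1: equivalent of COUNT(*) OVER (PARTITION BY domain).
# Every row gets the total number of posts for its domain.
posts$count_dom <- ave(posts$id, posts$domain, FUN = length)

# Step 2: equivalent of WHERE count_dom > 100 (threshold lowered to 1 for the toy data).
kept <- posts[posts$count_dom > 1, ]

# Step 3: equivalent of GROUP BY domain, subreddit, count_dom with COUNT(*) posts.
summary_tbl <- aggregate(id ~ domain + subreddit + count_dom, data = kept, FUN = length)
names(summary_tbl)[names(summary_tbl) == "id"] <- "posts"

# Equivalent of ORDER BY posts DESC.
summary_tbl <- summary_tbl[order(-summary_tbl$posts), ]
print(summary_tbl)
```

The point of the window count is that count_dom reflects a domain's total across the whole cluster, so rarely linked domains are dropped before the per-subreddit breakdown is plotted.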
The dataset for this analysis is available here, and information on getting started with Google BigQuery is here.
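The same summary query can be reused for the other subreddit clusters listed earlier by swapping in a different subreddit list. The following is a minimal sketch, assuming a hypothetical helper named build_domain_query() that is not part of the original post or of bigrquery. It uses standard-SQL backtick quoting for the table name and leaves out the domain exclusion list for brevity.

```r
# Hypothetical helper (an assumption for illustration): builds the
# domain-by-subreddit summary query for any cluster of subreddits.
build_domain_query <- function(subreddits, min_score = 25, min_domain_posts = 100) {
  # Quote each subreddit name and join the names into a SQL IN (...) list
  in_list <- paste0("'", subreddits, "'", collapse = ", ")
  sprintf(
    "SELECT domain, subreddit, count_dom, COUNT(*) posts FROM (
  SELECT id, domain, subreddit, COUNT(*) OVER(PARTITION BY domain) count_dom
  FROM `fh-bigquery.reddit_posts.2017_09`
  WHERE score > %d
  AND NOT over_18
  AND subreddit IN (%s)
)
WHERE count_dom > %d
GROUP BY 1, 2, 3
ORDER BY 4 DESC",
    as.integer(min_score), in_list, as.integer(min_domain_posts)
  )
}

# Usage: build the query text for the Indian news cluster
sql_india <- build_domain_query(c("india", "indianews", "IndiaSpeaks", "bakchodi"))
cat(sql_india)
```

The resulting string can be passed to query_exec() in place of sql. The subreddit names are pasted into the query verbatim, so this is only suitable for a trusted, hand-picked list like the clusters in this post.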