Reddit is a lively place to start conversations on almost any topic with other anonymous users in the Reddit community. Understandably, though, it takes some initial effort on a new user's part to figure out how to get going on Reddit, a.k.a. "the front page of the internet".

With thousands of user-created communities to pick from (called subreddits in Reddit lingo), a new user will likely be at a loss as to how to pick the communities relevant to them, or how to figure out what sort of content gets discussed in the communities that look interesting.

In this post, I explore popular subreddits in the news space across the world, and then the popular ones in the Indian news space. In the subsequent post, I'll talk about personalized subreddit recommendations for individual users, identified using patterns generalized from current Reddit users' post and comment behavior across subreddits. In both posts, I use Google BigQuery as the source of a tidy dataset and as a powerful cloud-based compute tool.

To construct the queries and shape the overall data story, I have adapted the query construct from a blog post by Felipe Hoffa, BigQuery Developer Advocate at Google. The clusters of subreddits handpicked for analysis below were identified by mining the post and comment activity of Reddit users. The source dataset covers Reddit user activity in September 2017.

Note: You can access any of the subreddits below by appending its /r/ name to reddit.com. E.g., the /r/politics subreddit can be accessed at www.reddit.com/r/politics


/r/politics + /r/news + /r/worldnews + /r/inthenews + /r/The_Donald

Figure 1: Posts with score > 25, Domains with > 100 posts


/r/The_Donald + /r/conspiracy + /r/uncensorednews + /r/Conservative

Figure 2: Posts with score > 25, Domains with > 100 posts


/r/india + /r/indianews + /r/IndiaSpeaks + /r/bakchodi

Figure 3: Posts with score > 10, Domains with > 20 posts


/r/science + /r/EverythingScience + /r/space + /r/energy + /r/Futurology + /r/artificial + /r/technology

Figure 4: Posts with score > 10, Domains with > 15 posts


/r/Cricket + /r/fantasyfootball + /r/LiverpoolFC + /r/nba + /r/nfl + /r/soccer

Figure 5: Posts with score > 10, Domains with > 15 posts


/r/movies + /r/television + /r/boxoffice

Figure 6: Posts with score > 10, Domains with > 15 posts


Top subreddits where global mainstream media gets referenced

Figure 7: Posts with score > 25, Subreddits with > 100 posts. Excl /r/politics


Top subreddits where Indian mainstream media gets referenced

Figure 8: Posts with score > 10, Subreddits with > 10 posts. Excl /r/india


Top subreddits where science and tech publications get referenced

Figure 9: Posts with score > 5, subreddits with > 20 posts.


Code-Preview

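# Packages used below: BigQuery client, data wrangling, plotting, Economist theme
library(bigrquery)
library(dplyr)
library(ggplot2)
library(ggthemes)

# Assumption: placeholder billing project ID; replace with your own GCP project
project <- "your-gcp-project-id"

# Query to extract the relevant summary from the full dataset (Figure 1 cluster).
# The table name is quoted with backticks because the query runs as standard SQL
# (useLegacySql = FALSE); the bracketed [table] form is legacy SQL only.
sql <- "
SELECT domain, subreddit, count_dom, COUNT(*) posts FROM (
  -- count_dom: qualifying posts per domain across the selected subreddits
  SELECT id, domain, subreddit, COUNT(*) OVER(PARTITION BY domain) count_dom
  FROM `fh-bigquery.reddit_posts.2017_09`
  WHERE score > 25
  AND domain NOT IN (
  'puu.sh', 'zkillboard.com', 'gifsound.com', 'youtu.be', 'bato.to', 'archive.is', 'archive.fo',
  'pbs.twimg.com', 'streamable.com', 'cdn.awwni.me')
  AND NOT over_18
  AND subreddit IN ('politics', 'news', 'worldnews', 'inthenews', 'The_Donald')
  )
WHERE count_dom > 100
GROUP BY 1, 2, 3
ORDER BY 4 DESC"

# Run the query and pull the summary into a data frame
out_1 <- query_exec(sql, project = project, useLegacySql = FALSE)

# Horizontal stacked bars: one bar per domain, ordered by domain post count,
# with segments coloured by subreddit
out_1 %>%
  mutate(domain = reorder(domain, count_dom)) %>%
  ggplot(mapping = aes(x = domain, y = posts, fill = subreddit)) +
  geom_bar(stat = "identity", width = .6) +
  coord_flip() +
  theme_economist()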
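
Going by their captions, the domain charts in Figures 2 to 6 appear to differ from Figure 1 only in the subreddit list and the two thresholds, so the query could be generated rather than copied. Below is a minimal sketch of that idea. The helper `build_domain_query()` and its argument names are hypothetical (they are not part of the original analysis), and it assumes the same domain exclusion list is applied to every cluster.

```r
# Hypothetical helper (not from the original post): builds the per-domain
# summary query for any subreddit cluster and pair of thresholds.
build_domain_query <- function(subreddits, min_score, min_domain_posts,
                               table = "fh-bigquery.reddit_posts.2017_09",
                               exclude_domains = c(
                                 "puu.sh", "zkillboard.com", "gifsound.com",
                                 "youtu.be", "bato.to", "archive.is",
                                 "archive.fo", "pbs.twimg.com",
                                 "streamable.com", "cdn.awwni.me")) {
  # Turn a character vector into a quoted, comma-separated SQL list
  sql_list <- function(x) paste(sprintf("'%s'", x), collapse = ", ")

  sprintf("
SELECT domain, subreddit, count_dom, COUNT(*) posts FROM (
  SELECT id, domain, subreddit, COUNT(*) OVER(PARTITION BY domain) count_dom
  FROM `%s`
  WHERE score > %d
  AND domain NOT IN (%s)
  AND NOT over_18
  AND subreddit IN (%s)
  )
WHERE count_dom > %d
GROUP BY 1, 2, 3
ORDER BY 4 DESC",
    table,
    as.integer(min_score),
    sql_list(exclude_domains),
    sql_list(subreddits),
    as.integer(min_domain_posts))
}

# Example: thresholds and subreddits taken from the Figure 3 caption
sql_india <- build_domain_query(
  subreddits = c("india", "indianews", "IndiaSpeaks", "bakchodi"),
  min_score = 10,
  min_domain_posts = 20
)
cat(sql_india)
```

The resulting string can be passed to `query_exec()` in place of `sql` above. The values are pasted straight into the SQL text, which is fine for a fixed, hand-written list of subreddits but should not be used with untrusted input.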

The dataset for this analysis is available here, and information on getting started with Google BigQuery is here.