
Saturday, December 31, 2016

Most Visited Websites

In the past I came across a post that showed how to list your most-used bash commands. Learning this was very enlightening, and today I had a thought in the exact same vein:

What are my most visited websites?

The answer took longer to get at than the bash version (since Firefox stores history in a SQLite file), but the process was exactly the same: get your history, isolate the domain part of each url, count the number of occurrences of each domain, and finally sort. I ended up using Python 2 for the entire process.

History


Firefox stores history, bookmarks, and download history in one SQLite file called places.sqlite. On Windows this file is located in AppData\Roaming\Mozilla\Firefox\Profiles\*hash*.default.
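
If you'd rather not dig for the profile folder by hand, a small sketch like the following can find it. It assumes Python 3, a Windows machine where the APPDATA environment variable points at AppData\Roaming, and a profile folder ending in .default as described above. The function name find_places_files is my own and not part of any library.

```python
import glob
import os


def find_places_files(profiles_dir):
    """Return every places.sqlite found in a *.default profile folder."""
    pattern = os.path.join(profiles_dir, "*.default", "places.sqlite")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Assumption: on Windows, APPDATA points at the AppData/Roaming folder.
    appdata = os.environ.get("APPDATA", "")
    profiles = os.path.join(appdata, "Mozilla", "Firefox", "Profiles")
    for path in find_places_files(profiles):
        print(path)
```

If nothing is printed, the profile folder probably lives somewhere else or uses a different suffix on your machine.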

I had trouble at this step because I had never used SQL before. Thankfully Python makes it very easy. Using the sqlite3 library I opened a connection and then a cursor. The file has a lot of tables in it, but the only one I ended up needing was moz_places. The urls are under the column url. To get all the urls, run the following code:

urls = []
cursor.execute("SELECT url FROM moz_places;")
for url in cursor.fetchall():
    urls.append(url[0])


The expression url[0] is needed because each row comes back as a one-element tuple.
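
The connection and cursor setup mentioned above isn't shown in the snippet, so here is a minimal sketch of what it might look like. It is written for Python 3, and the helper name read_urls is a placeholder of mine. If Firefox is running, the file may be locked, so pointing this at a copy of places.sqlite is probably the safer bet.

```python
import sqlite3


def read_urls(db_path):
    """Return every url stored in the moz_places table of db_path."""
    connection = sqlite3.connect(db_path)
    try:
        cursor = connection.cursor()
        cursor.execute("SELECT url FROM moz_places;")
        # Each row is a one-element tuple, so take its only field.
        return [row[0] for row in cursor.fetchall()]
    finally:
        # Always release the file, even if the query fails.
        connection.close()
```

Calling read_urls with the path to your own places.sqlite should give back a plain list of url strings, ready for the next step.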

Getting the Domain


My gut instinct was to use a regex to grab the domain from the full url; however, I found out Python includes the urlparse module, which does this for you.

from urlparse import urlparse
domains = []
for url in urls:
    domains.append(urlparse(url).netloc)
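
For anyone following along on Python 3, the urlparse module was folded into urllib.parse, but the call works the same way. The sketch below uses a few made-up sample urls to show what netloc returns. Note that urls without a host part come back as an empty string, so it may be worth skipping those.

```python
from urllib.parse import urlparse

# Made-up sample urls standing in for the real history.
urls = [
    "https://www.reddit.com/r/python/",
    "http://i.imgur.com/abc123.png",
    "about:config",
]

domains = []
for url in urls:
    netloc = urlparse(url).netloc
    if netloc:  # urls such as about:config have no host part
        domains.append(netloc)

print(domains)  # ['www.reddit.com', 'i.imgur.com']
```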


Counter


From here I was tempted to print the domains to stdout and just reuse the solution for the most used bash commands (*output* | sort | uniq -c | sort -nr). I didn't, though, and instead elected to finish the process entirely in Python. To emulate the uniq command I simply used Python's Counter class.

from collections import Counter
domain_frequencies = Counter(domains)
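
To make the comparison with sort | uniq -c concrete, here is a small sketch on a made-up list of domains (Python 3 syntax). Counter behaves like a dictionary that maps each distinct item to the number of times it appeared.

```python
from collections import Counter

# Made-up sample data standing in for the real list of domains.
domains = [
    "imgur.com", "github.com", "imgur.com",
    "imgur.com", "github.com", "twitter.com",
]

domain_frequencies = Counter(domains)

print(domain_frequencies["imgur.com"])    # 3
print(domain_frequencies["github.com"])   # 2
print(domain_frequencies["example.org"])  # 0, missing keys count as zero
```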


Sort


I spent longer on this step than I care to admit. I almost wrote a sorting method from scratch, which would have been painful. Thankfully the Counter class comes with a built-in method, most_common, that returns the entries as (domain, count) pairs sorted from most to least frequent.

most_visited = domain_frequencies.most_common(20)
# most_common returns a list of (domain, count) pairs, not a dict
for idx, (domain, count) in enumerate(most_visited, 1):
    print idx, "-", domain


Summary


That's all it took! I didn't expect the solution to take so little effort. What started as intellectual curiosity turned into a fun programming experience. By the way, here are my top 20 most visited domains:

1 - i.imgur.com
2 - www.google.com
3 - imgur.com
4 - www.youtube.com
5 - www.reddit.com
6 - narioko.tk
7 - gfycat.com
8 - i.redd.it
9 - www.netflix.com
10 - en.wikipedia.org
11 - i.reddituploads.com
12 - twitter.com
13 - i.4cdn.org
14 - cdn.awwni.me
15 - www.mangastream.to
16 - www.minecraftforum.net
17 - out.reddit.com
18 - github.com
19 - www.curse.com
20 - stackoverflow.com


I'm not surprised at the image domains ranking so high. The usual suspects also fill the list: google, youtube, reddit, netflix, etc. I'm somewhat surprised by minecraftforum and curse being this high on the list. I suppose they rank up there because of all the effort I put into finding mods for Minecraft.

More refined url parsing would return better results. I neglected to remove www from the urls, which split up some domains. Other domains are split across multiple subdomains. Tumblr, for example, is not ranked high on my list because each blog shows up under its own subdomain, like idiots-blackboard.tumblr.com.

Of course, this doesn't include all the domains I've visited. Much of my browsing also happens on my laptop and my phone. Not to mention any private browsing :)
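
As a rough sketch of the www fix mentioned above, stripping a leading www. before counting merges those split entries. This uses Python 3 and made-up sample data, and the helper name strip_www is my own. It deliberately does not try to fold subdomains like i.imgur.com into imgur.com, since doing that correctly needs more than string slicing.

```python
from collections import Counter
from urllib.parse import urlparse


def strip_www(netloc):
    """Drop a leading 'www.' so www.reddit.com and reddit.com count together."""
    if netloc.startswith("www."):
        return netloc[len("www."):]
    return netloc


# Made-up sample urls standing in for the real history.
urls = [
    "https://www.reddit.com/r/python/",
    "https://reddit.com/r/firefox/",
    "https://www.google.com/search?q=sqlite",
    "https://www.reddit.com/",
]

counts = Counter(strip_www(urlparse(url).netloc) for url in urls)
for rank, (domain, visits) in enumerate(counts.most_common(), 1):
    print(rank, "-", domain, visits)
# 1 - reddit.com 3
# 2 - google.com 1
```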
