Scraping Facebook Private Messages

Facebook allows you to download a dump of your data, which is supposed to include private messages. For some reason it was extremely incomplete for me (less than 1% of a particular thread with about 10000 messages). This Python script will scrape a particular private message/chat thread. Expect maybe half an hour for a long thread.


  • Facebook keeps changing their layout, login method, etc. This will obviously break the script.
  • Currently unimplemented: a way to dump all private messages, with everyone.
  • Facebook seems to sometimes detect rapid automated-looking behaviour like this and refuse a connection.
  • Older messages display date but not time. I decided to ignore time data for consistency.

Idiot’s guide: install Python 2 if on Windows and save the script as “”. Get Beautiful Soup 4 and place /bs4 in the same directory. Go find the token ID of the message thread you want to dump (look for something like “tid=id.12345” in the URL). Then, run

python password tokenid

substituting in your own arguments. You now have a new directory that will contain all the dumped HTML pages (you can delete these if you want), a progress file that lets you resume a partial dump, and an XML file with your parsed message dump.


import sys
import os
import re
import urllib
import urllib2
import cookielib
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
import xml.sax.saxutils

def escape(string,escapeTable={}):
	return xml.sax.saxutils.escape(string,escapeTable).encode("ascii", "xmlcharrefreplace")

def parseDate(string):
	#hopefully I havent missed any date formats
	now =
	formats = [
		("%b %d, %Y",''),
		("%b %d, %Y",", %d" % now.year),
		("%b %d at %I:%M%p, %Y",''),
		("%b %d at %I:%M%p, %Y",", %d" % now.year)]
	for f in formats:
			return datetime.strptime(string.strip()+f[1],f[0]).date()
		except ValueError:
	weekday = string.split()[0]
	for day in range(7):
		testDay = now - timedelta(days=day)
		if weekday == "Yesterday" and day == 1:
		if datetime.strftime(testDay,"%A") == weekday:

def parseMessage(soup):
	for i in"strong"):
		#strip name and subject
	for i in"div"):
			if not
		except KeyError: pass
	for i in"br"):
	#return soup.get_text().strip()
	return re.sub(u'\xad','',re.sub("\n ",'\n',soup.get_text().strip())) #im not sure what the weird unicode is

def main():
	#check arguments
	if len(sys.argv) != 4:
	user = sys.argv[1]
	password = sys.argv[2]
	tid = sys.argv[3]

	#initialize url opener
	opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
	opener.addheaders = [('User-agent','Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')] #facebook formats pages differently in each browser

	#login to facebook
	data = urllib.urlencode({'email' : user, 'pass' : password})
	page ='', data)
	if not'Logout',
		print 'Login Failed'

	#save to a message-specific directory
	if not os.path.isdir(tid):
	counter = 1
	while os.path.isfile(tid + "/dump%d.xml"%counter):
		counter += 1
	fileHandler = open(tid + "/dump%d.xml"%counter, 'a')
	fileHandler.write('<!--?xml version="1.0" ?-->')

		progressData = [line for line in open(tid+"/progress")]
		pageNo = int(progressData[0])
		url =  progressData[1]
	except IOError:
		pageNo = 1
		url = '' % (tid, tid)
	page =
	html =

	while 1:
		if not url: break
		page =
		url = None
		html =
		soup = BeautifulSoup(html)
		fileHandler = open(tid+"/page%d.html"%pageNo, 'w')
		pageNo += 1

		#loop through elements in thread div
		for div in"#messageGroup")[0].find_all("div",recursive=False):
			if div.get("id") == "see_older_messages":
			if div.get("id") == "see_newer_messages":
				fileHandler = open(tid+"/progress", 'w')
				url = "" + div.a.get('href')
				fileHandler.write(str(pageNo) + "\n" + url)

			#this is an actual message, let's parse it
			messageTag ="div")[0] #get rid of any link blurb or whatever
			strong ="strong")
			author = strong[0].string
			subject = strong[1].string if len(strong) > 1 else ''
			date = str(parseDate("div")[-1].abbr.string.strip()))
			message = parseMessage(messageTag)

			fileHandler = open(tid + "/dump%d.xml"%counter, 'a')
			fileHandler.write('\n<![CDATA[\n' % (escape(author, {'"' : "&quot;"}), escape(subject, {'"' : "&quot;"}), date)) 			fileHandler.write(escape(message)) 			fileHandler.write("\n]]>")

def usage():
	print 'Usage: ' + sys.argv[0] + ' email password tid'

if __name__ == '__main__':
Posted in Uncategorized | Leave a comment

Cute Combinatorial Problem

This was question 9 in the Sydney University Mathematical Society Problem Competition 2012. My solution was quite different to the model solution — I use a decomposition technique common in graph theory and algorithm runtime analysis.


For a permutation \sigma of \{1,2,3,\cdots,n\}, a break of \sigma is an element k of \{1,2,3,\cdots,n\} such that \sigma(\{1,2,3,\cdots,k\})=\{1,2,3,\cdots,n\}. The score of \sigma is the square of the number of breaks. Show that the average score of all permutations of \{1,2,3,\cdots,n\} tends to 0 as n tends to infinity.


First, note that

\begin{array}{rcl} \displaystyle\sum_{j=1}^{n-1}{n \choose i}^{-1} & = & \displaystyle\frac{2}{n}+\sum_{j=2}^{n-2}{n \choose i}^{-1}\\ & \le & \displaystyle\frac{2}{n}+(n-3){n \choose 2}^{-1}\\ & = & \displaystyle\frac{2}{n}+\frac{2\left(n-3\right)}{n\left(n-1\right)}\\ & \rightarrow & \displaystyle0\mbox{ as }n\rightarrow\infty.\end{array}

As a corollary, we can find {M} so that

\displaystyle \sum_{j=1}^{n-1}{n \choose i}^{-1}\le M

for all {n}.

Now, let {\sigma} vary uniformly at random in {S_{n}} and let {q(i,\sigma)} be an indicator function that is 1 if {i} is a break of {\sigma}, and 0 otherwise. We note that {i} and {j} are breaks of {\sigma} if and only if {\sigma} is a permutation when restricted to {\left\{ 1,\dots,i\right\} }, {\left\{ i+1,\dots,j\right\} } and {\left\{ i+1,\dots,n\right\} }. This says:

\displaystyle \mathbb{E}\left[q(i,\sigma)q(j,\sigma)\right]=\mbox{Pr}\left(i\mbox{ and }j\mbox{ are breaks of \ensuremath{\sigma}}\right)=\frac{i!\left(j-i\right)!\left(n-j\right)!}{n!}.

Then, by the linearity of expectation:

\begin{array}{rcl} \displaystyle\mathbb{E}\left[\mbox{score}\left(\sigma\right)\right] & = & \displaystyle\mathbb{E}\left[\left(\sum_{i=1}^{n-1}q\left(i,\sigma\right)\right)^{2}\right]\\ & = & \displaystyle\sum_{i=1}^{n-1}\sum_{j=1}^{n-1}\mathbb{E}\left[q(i,\sigma)q(j,\sigma)\right]\\ & \le & \displaystyle 2\sum_{j=1}^{n-1}\sum_{j=i}^{n-1}\mathbb{E}\left[q(i,\sigma)q(j,\sigma)\right]\\ & = & \displaystyle 2\sum_{j=1}^{n-1}\sum_{j=i}^{n-1}\frac{i!\left(j-i\right)!\left(n-j\right)!}{n!}\\ & = & \displaystyle 2\sum_{j=1}^{n-1}{n \choose i}^{-1}\sum_{j=i}^{n-1}{n-j \choose j-i}^{-1}\\ & = & \displaystyle 2\sum_{j=1}^{n-1}{n \choose i}^{-1}\left(1+\sum_{q=1}^{n-i-1}{n-i \choose q}^{-1}\right)\\ & \le & \displaystyle 2\left(M+1\right)\sum_{j=1}^{n-1}{n \choose i}^{-1}\\ & \rightarrow & \displaystyle 0\mbox{ as }n\rightarrow\infty \end{array}

and the squeeze theorem yields the required result.

Posted in Uncategorized | Leave a comment