Facebook allows you to download a dump of your data, which is supposed to include private messages. For some reason it was extremely incomplete for me (less than 1% of a particular thread with about 10000 messages). This Python script will scrape a particular private message/chat thread. Expect maybe half an hour for a long thread.

Issues:

• Facebook keeps changing their layout, login method, etc. This will obviously break the script.
• Currently unimplemented: a way to dump all private messages, with everyone.
• Facebook seems to sometimes detect rapid automated-looking behaviour like this and refuse a connection.
• Older messages display date but not time. I decided to ignore time data for consistency.

Idiot’s guide: install Python 2 if on Windows and save the script as “dump.py”. Get Beautiful Soup 4 and place /bs4 in the same directory. Go find the token ID of the message thread you want to dump (look for something like “tid=id.12345” in the URL). Then, run

 python dump.py email@email.com password tokenid 

substituting in your own arguments. You now have a new directory that will contain all the dumped HTML pages (you can delete these if you want), a progress file that lets you resume a partial dump, and an XML file with your parsed message dump.

#!/usr/bin/python

import sys
import os
import re
import urllib
import urllib2
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
import xml.sax.saxutils

def escape(string,escapeTable={}):
return xml.sax.saxutils.escape(string,escapeTable).encode("ascii", "xmlcharrefreplace")

def parseDate(string):
#hopefully I havent missed any date formats
now = datetime.now()
formats = [
("%b %d, %Y",''),
("%b %d, %Y",", %d" % now.year),
("%b %d at %I:%M%p, %Y",''),
("%b %d at %I:%M%p, %Y",", %d" % now.year)]
for f in formats:
try:
return datetime.strptime(string.strip()+f[1],f[0]).date()
except ValueError:
continue
weekday = string.split()[0]
for day in range(7):
testDay = now - timedelta(days=day)
if weekday == "Yesterday" and day == 1:
return testDay.date()
if datetime.strftime(testDay,"%A") == weekday:
return testDay.date()
return now.date()

def parseMessage(soup):
for i in soup.select("strong"):
#strip name and subject
i.clear()
for i in soup.select("div"):
try:
i['data-sigil']
if not i.br:
i.append("\n")
except KeyError: pass
for i in soup.select("br"):
i.replace_with('\n')
#return soup.get_text().strip()
return re.sub(u'\xad','',re.sub("\n ",'\n',soup.get_text().strip())) #im not sure what the weird unicode is

def main():
#check arguments
if len(sys.argv) != 4:
usage()
user = sys.argv[1]
tid = sys.argv[3]

#initialize url opener
opener.addheaders = [('User-agent','Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')] #facebook formats pages differently in each browser
urllib2.install_opener(opener)

data = urllib.urlencode({'email' : user, 'pass' : password})
sys.exit(1)

#save to a message-specific directory
if not os.path.isdir(tid):
os.makedirs(tid)
counter = 1
while os.path.isfile(tid + "/dump%d.xml"%counter):
counter += 1
fileHandler = open(tid + "/dump%d.xml"%counter, 'a')
fileHandler.write('<!--?xml version="1.0" ?-->')
fileHandler.close()

try:
progressData = [line for line in open(tid+"/progress")]
pageNo = int(progressData[0])
url =  progressData[1]
except IOError:
pageNo = 1
page = opener.open(url)

while 1:
if not url: break
page = opener.open(url)
url = None
soup = BeautifulSoup(html)
fileHandler = open(tid+"/page%d.html"%pageNo, 'w')
fileHandler.write(html)
fileHandler.close()
pageNo += 1

#loop through elements in thread div
for div in soup.select("#messageGroup")[0].find_all("div",recursive=False):
if div.get("id") == "see_older_messages":
continue
fileHandler = open(tid+"/progress", 'w')
fileHandler.write(str(pageNo) + "\n" + url)
fileHandler.close()
break

#this is an actual message, let's parse it
messageTag = div.select("div")[0] #get rid of any link blurb or whatever
strong = messageTag.select("strong")
author = strong[0].string
subject = strong[1].string if len(strong) > 1 else ''
date = str(parseDate(div.select("div")[-1].abbr.string.strip()))
message = parseMessage(messageTag)

fileHandler = open(tid + "/dump%d.xml"%counter, 'a')
fileHandler.write('\n<![CDATA[\n' % (escape(author, {'"' : "&quot;"}), escape(subject, {'"' : "&quot;"}), date)) 			fileHandler.write(escape(message)) 			fileHandler.write("\n]]>")
fileHandler.close()

def usage():
print 'Usage: ' + sys.argv[0] + ' email password tid'
sys.exit(2)

if __name__ == '__main__':
main()


## Cute Combinatorial Problem

This was question 9 in the Sydney University Mathematical Society Problem Competition 2012. My solution was quite different to the model solution — I use a decomposition technique common in graph theory and algorithm runtime analysis.

### Problem:

For a permutation $\sigma$ of $\{1,2,3,\cdots,n\}$, a break of $\sigma$ is an element $k$ of $\{1,2,3,\cdots,n\}$ such that $\sigma(\{1,2,3,\cdots,k\})=\{1,2,3,\cdots,n\}$. The score of $\sigma$ is the square of the number of breaks. Show that the average score of all permutations of $\{1,2,3,\cdots,n\}$ tends to 0 as $n$ tends to inﬁnity.

### Solution:

First, note that

$\begin{array}{rcl} \displaystyle\sum_{j=1}^{n-1}{n \choose i}^{-1} & = & \displaystyle\frac{2}{n}+\sum_{j=2}^{n-2}{n \choose i}^{-1}\\ & \le & \displaystyle\frac{2}{n}+(n-3){n \choose 2}^{-1}\\ & = & \displaystyle\frac{2}{n}+\frac{2\left(n-3\right)}{n\left(n-1\right)}\\ & \rightarrow & \displaystyle0\mbox{ as }n\rightarrow\infty.\end{array}$

As a corollary, we can find ${M}$ so that

$\displaystyle \sum_{j=1}^{n-1}{n \choose i}^{-1}\le M$

for all ${n}$.

Now, let ${\sigma}$ vary uniformly at random in ${S_{n}}$ and let ${q(i,\sigma)}$ be an indicator function that is 1 if ${i}$ is a break of ${\sigma}$, and 0 otherwise. We note that ${i}$ and ${j}$ are breaks of ${\sigma}$ if and only if ${\sigma}$ is a permutation when restricted to ${\left\{ 1,\dots,i\right\} }$, ${\left\{ i+1,\dots,j\right\} }$ and ${\left\{ i+1,\dots,n\right\} }$. This says:

$\displaystyle \mathbb{E}\left[q(i,\sigma)q(j,\sigma)\right]=\mbox{Pr}\left(i\mbox{ and }j\mbox{ are breaks of \ensuremath{\sigma}}\right)=\frac{i!\left(j-i\right)!\left(n-j\right)!}{n!}.$

Then, by the linearity of expectation:

$\begin{array}{rcl} \displaystyle\mathbb{E}\left[\mbox{score}\left(\sigma\right)\right] & = & \displaystyle\mathbb{E}\left[\left(\sum_{i=1}^{n-1}q\left(i,\sigma\right)\right)^{2}\right]\\ & = & \displaystyle\sum_{i=1}^{n-1}\sum_{j=1}^{n-1}\mathbb{E}\left[q(i,\sigma)q(j,\sigma)\right]\\ & \le & \displaystyle 2\sum_{j=1}^{n-1}\sum_{j=i}^{n-1}\mathbb{E}\left[q(i,\sigma)q(j,\sigma)\right]\\ & = & \displaystyle 2\sum_{j=1}^{n-1}\sum_{j=i}^{n-1}\frac{i!\left(j-i\right)!\left(n-j\right)!}{n!}\\ & = & \displaystyle 2\sum_{j=1}^{n-1}{n \choose i}^{-1}\sum_{j=i}^{n-1}{n-j \choose j-i}^{-1}\\ & = & \displaystyle 2\sum_{j=1}^{n-1}{n \choose i}^{-1}\left(1+\sum_{q=1}^{n-i-1}{n-i \choose q}^{-1}\right)\\ & \le & \displaystyle 2\left(M+1\right)\sum_{j=1}^{n-1}{n \choose i}^{-1}\\ & \rightarrow & \displaystyle 0\mbox{ as }n\rightarrow\infty \end{array}$

and the squeeze theorem yields the required result.