Monday 18 June 2012

How to think outside the box - literally

This blog post is about innovation and creativity. More precisely, it's about a stupid little technique that once helped me be creative and achieve something unexpected: coming up with a quantum physics method, and publishing a paper, with barely any training in physics.


Most big ideas and breakthroughs are unexpected. They are not part of your plan for next week and, in the case of a company, probably not on your roadmap for next year. It is important to have a roadmap, but it is just as crucial to realise that your potentially biggest achievements are not yet on it. Keeping that in mind, what can you do right now to trigger the next big idea? This may be your moment: follow these instructions. It may take some time, but bear with me. I mean, really, try this: if you can't commit to trying something creative right now, when you have nothing better to do than read this crappy blog post, then when will you?


My thinking box
Step 1: To be creative, you want to go beyond your comfort zone: you want to think outside the box. What's the first thing you need for that? A box. That's right. If you want to think outside the box, you have to have the box first. So go and get one now. A cardboard box, large or small, will easily do. I used the one pictured on the right, which I found in the recycling bin in the office. Resume reading when you've found your box.


Step 2: Now that you have the box, what's inside? Inside the box are the things you are comfortable thinking about. If you have a to-do list, a project plan, a roadmap for the next year(s), those all go inside the box. Everything that you anticipate happening to you and your work, everything predictable, goes inside the box.


Step 3+: When you're ready and have the box, you can start thinking outside of it. You will likely need a pencil to take notes; I simply sketch stuff on the wall of the box. You may prefer a quiet place to think: this time, try a busy place. Remember, you're outside your comfort zone now. The rest is up to you. Let me know how you saved the world using this technique.

Wednesday 6 June 2012

My big fat data-driven wedding: how to decide when to get married

There's all this hype about the importance of using data to support critical business decisions. But few things can be more important in someone's life than choosing the partner they plan to spend the rest of their life with. So look no further for a data-driven approach to support this critical decision: when should you stop looking and finally get married?

The formal setup of the problem is as follows: there are \( N \) potential marriage partners out there, and I want to choose the one and only partner of my life. My goal, of course, is to settle with no-one but the very best partner to marry. I can 'date' each candidate sequentially in a random order. When I date someone, I can evaluate how she ranks in comparison to all previous partners I dated. Based on this, I can either go on and marry her, in which case the process stops, or reject her and carry on dating more partners. Once a partner is rejected, she cannot be called back again. The problem is hard because it's assumed that I know nothing about a partner before I have actually dated her. Some might immediately recognise the classic problem of exploration versus exploitation hidden in here. So when should I stop dating and marry someone?

Let's say I want to maximise the probability of ending up with the best partner as wife. Firstly, there's no point in selecting someone who's not the best so far, because that also implies she cannot be the best overall. Also, the earlier you marry, the higher the chance of missing out on someone without even dating her. It turns out the optimal strategy is as follows: date the first \( k \) candidates and reject them irrespective of their 'performance'. Then keep on dating partners, and choose the first one who ranks above everyone I dated before. The threshold \( k \) depends on \( N \); asymptotically it's about a third of all candidates ( \( k \approx \frac{1}{e}N \), see slides for details).

I gave a talk about this theory in our lab at Cambridge not long after getting married, and did a quick back-of-the-envelope calculation to see how close my marriage decision was to the optimal. The theory predicts that, based on my criteria, I should've dated roughly 196,132 partners before getting married. Well, I leave it to you to guess whether I did that or not :) But at the end of the day I ended up making the best decision, so it doesn't matter.

I got recently reminded of this problem as I was thinking about making instant decisions in the PeerPerks platform, which, as it turns out, can be modelled as a more complicated version of the optimal marriage problem, called the probabilistic submodular multiple-choice secretary problem. In PeerPerks, we allow brands to offer freebies to influential users of social media. Our job is to make sure the promotion generates maximal success based on various indicators. Every once in a while a user turns up on the website and authenticates using their twitter or facebook. At this point we have access to their social media profile, so it's like we've dated that person: we have a little insight into which topics she's interested in, what kind of people she's connected to, etc. Based on this instant evaluation, our algorithm decides whether or not the person qualifies to receive a perk.
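If you want to convince yourself of the \( \frac{1}{e} \) rule for the classic version of the problem, a quick Monte Carlo check does the job. This is a minimal sketch of mine (not from the original talk or slides; names and parameters are made up for illustration):

```python
import random

def best_partner_probability(N, k, trials=20000):
    """Estimate P(ending up with the overall best) when the first k candidates
    are rejected and we then marry the first one better than everyone seen so far."""
    wins = 0
    for _ in range(trials):
        ranks = random.sample(range(N), N)   # random dating order; rank N-1 is the best
        best_seen = max(ranks[:k]) if k > 0 else -1
        chosen = ranks[-1]                   # if nobody beats best_seen, we're stuck with the last
        for r in ranks[k:]:
            if r > best_seen:
                chosen = r
                break
        wins += (chosen == N - 1)
    return wins / trials

# The optimum should sit near k = N / e, with success probability around 1/e ~ 0.37
N = 50
for k in (5, 10, 18, 25, 40):
    print(k, round(best_partner_probability(N, k), 3))
```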
 

Tuesday 6 March 2012

Sock gnomes versus laws of large numbers


Whilst finishing one of my previous posts on causality and influence, I was pairing up socks that had just come out of the laundry. I was outraged to observe that, yet again, I was hardly able to pair up even half of the socks. 'This cannot happen by chance, we must have sock gnomes who deliberately mess up the socks,' I was thinking to myself. Clearly, I'm not alone with this problem:

But then I realised I may be wrong, maybe it CAN happen by chance. Maybe it's in fact EXPECTED to happen, and if I could perfectly pair up all socks, I ought to be more surprised. So I started formulating the problem: 
We have \( n \) perfectly distinguishable pairs of socks in our laundry bag. We decide to do our laundry, but because our washing machine has finite capacity, we can only wash \( k \leq 2n \) individual socks. We select the particular socks to be washed by drawing from the laundry bag uniformly at random, without replacement. After this procedure, our sample of \( k \) socks will contain some number \( Y_{n,k} \) of socks whose pair is not in the sample. What is the expected value of \( Y_{n,k} \) as a function of \( n \) and \( k \)?
First I wanted to give an analytic answer, but I had waaay too many other things to use my brainpower for, so sorry, this blog post is based on Monte Carlo estimates that I could quickly hack together. (For those of you who don't know, Monte Carlo is a fancy name for simulating a bunch of random outcomes and then taking averages.)
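The quick hack itself isn't in the post, but the estimate really is only a few lines. Here's a minimal Python sketch of the kind of Monte Carlo estimate described above (function and variable names are mine):

```python
import random
from collections import Counter

def expected_unpaired(n, k, trials=10000):
    """Monte Carlo estimate of E[Y_{n,k}]: draw k socks out of n pairs
    uniformly at random without replacement, count the socks whose pair
    didn't make it into the sample, and average over many trials."""
    socks = [pair for pair in range(n) for _ in range(2)]  # two socks per pair id
    total = 0
    for _ in range(trials):
        sample = random.sample(socks, k)
        counts = Counter(sample)
        total += sum(1 for c in counts.values() if c == 1)  # lonely socks
    return total / trials

# n = 2 pairs, k = 2 socks washed: should come out close to 4/3
print(expected_unpaired(2, 2))
```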


I came up with a visualisation to explain what's going on. Here's a simple (and ugly) version of the plot for the case of only \( n=2 \) pairs of socks, with an explanation below.




2 pairs of socks means a total of 4 individual socks in our laundry bag that we may select to wash. The x axis shows the number of socks washed, \( k \), which therefore ranges between 0 and 4. The y axis shows the number of socks left unpaired. The gray rectangles show the distribution of unpaired socks, with white meaning probability 0 and black meaning probability 1 (see the colorbar on the right).
If we wash 0 socks, then with probability 1 we have 0 unpaired socks, hence the black rectangle at (0,0).
If we wash a single sock, we obviously can't pair it up with anything, thus with probability one, we have 1 unpaired sock, hence the black rectangle at (1,1).
If we wash two socks, then with probability ~0.65 those two won't be a pair, so we're left with two unpaired socks, hence the dark gray rectangle at (2,2); with probability ~0.35 we can actually pair them up and are left with 0 unpaired socks, hence the lighter gray rectangle at (2,0). It's impossible to have a single unpaired sock, hence (2,1) is white.
I hope you get it by now and can finish the reasoning all the way to the end, and figure out why the graph is always symmetric.
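For what it's worth, the exact numbers behind those Monte Carlo estimates for \( n=2 \), \( k=2 \) are easy to work out (this little calculation is my addition, not part of the original post): there are \( \binom{4}{2}=6 \) equally likely ways to pick two socks out of four, and only two of those picks are matching pairs, so

\[ P(Y_{2,2}=0) = \frac{2}{6} = \frac{1}{3} \approx 0.33, \qquad P(Y_{2,2}=2) = \frac{4}{6} = \frac{2}{3} \approx 0.67. \]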


The blue dotted line shows the maximum possible number of unpaired socks: it's easy to show that you can never have more than \( \min(k, 2n-k) \) unpaired socks, so everything above the blue dotted line should be white. The red line shows the expected number of unpaired socks as a function of \( k \), the number of socks we have washed (remember, \( n \) is now fixed).

Below is the same kind of graph for the case where we have \( n=10 \) pairs of socks. Notice the checkerboard-like pattern, which is due to the fact that the parity of the number of unpaired socks always matches the parity of the number of socks washed. Pretty pleasing to the eye, isn't it?
The expected number of unpaired socks starts to be a smoother function of the number of socks washed, and it becomes smoother still as we increase the number of pairs to 50, as the following figure shows:
The shape of the expectation looks the same as a function of \( \frac{k}{2n} \); the variance around this mean, however, has decreased substantially.
If we further increase the number of socks to 200 pairs, the variance practically shrinks to zero:
And for a little animation putting all this together for growing values of \( n \) click here.

So what's the answer? Well, based on Monte Carlo simulations I conjecture that the following law of large numbers holds for sock pairing:


As \( n \rightarrow \infty\) and \( k \rightarrow \infty\) such that \( \lambda = \frac{k}{2n}\) is fixed, the fraction of unpaired socks converges to a deterministic value with probability one:


\[ \frac{Y_{n,k}}{2n} \stackrel{1}{\rightarrow} \lambda\left(1-\lambda\right)\]
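A quick back-of-the-envelope check of this conjecture (my own sketch, not part of the original Monte Carlo analysis): for any sock that made it into the wash, the probability that its pair was left behind is \( \frac{2n-k}{2n-1} \), so by linearity of expectation

\[ E[Y_{n,k}] = k\,\frac{2n-k}{2n-1}, \qquad \frac{E[Y_{n,k}]}{2n} = \frac{k}{2n}\cdot\frac{2n-k}{2n-1}, \]

and with \( \lambda = \frac{k}{2n} \) fixed, the right-hand side tends to \( \lambda(1-\lambda) \) as \( n \rightarrow \infty \), which agrees with the conjectured limit. (The almost-sure convergence itself would need a concentration argument on top of this expectation calculation.)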


So the expected number of socks that you can't pair up can be as high as half of the number of socks you have washed, and my analysis allowed me to rule out the alternative hypothesis of sock gnomes messing up my socks.


Take home message #1: Before you start believing in sock gnomes, satan, god, and other supernatural powers, check whether events that seem unlikely really cannot be explained by a more parsimonious probabilistic model.


Take home message #2: If you want to be able to pair up your socks, buy loads of the same colour and you won't have such problems.

Sunday 29 January 2012

Flattr: should we use it in academia?


I have just found the service that I have been looking for for a while: Flattr. It's a social micropayment platform, an evolution of "Buy me a coffee" buttons that started appearing on blogs a few years ago. 
It allows you to donate and distribute a small amount of money each month to authors of content that you "liked" that month.


Here's how it works: you sign up on Flattr, top up your account with a small amount, let's say $5 a month, then you come back to my blog and press the Flattr button on the right. At the end of the month, your $5 is distributed equally among the authors of the things you Flattr'ed, so I get my share of your $5. This way, the small contributions add up, and the community can reward the authors of content they frequently read and are inspired by. I personally think that, details and implementation apart, this general idea is brilliant. I'm definitely going to top up my account and suggest that people whose stuff I read add a Flattr button.
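Mechanically, the split is trivial; here is a toy sketch of my own (an illustration of the idea only, not Flattr's actual accounting) of dividing a fixed monthly budget equally among the things you clicked on:

```python
def split_budget(budget, flattred_authors):
    """Toy model: divide a fixed monthly budget equally across every Flattr
    click, so an author you flattered twice gets two shares."""
    share = budget / len(flattred_authors)
    payout = {}
    for author in flattred_authors:
        payout[author] = payout.get(author, 0) + share
    return payout

# $5 spread over four clicks, two of them on the same blog
print(split_budget(5.0, ["alice-blog", "bob-blog", "alice-blog", "carol-podcast"]))
```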


I actually first thought of such a social redistribution system in terms of academic funding, and almost launched a "buy me a coffee" button on my publications page to make a point. Now, having found Flattr, I started thinking again about how such a system could be used for a more equal redistribution of funding in academia. Let's review some current inefficiencies in academic research funding:


Inflexible fellowship values: Money in academia is often awarded in discrete chunks, especially for junior positions. You apply for various grants, fellowships and scholarships, and you either get the whole thing for x years or you get nothing. If, for example, you got a Gates Scholarship to study for a PhD in Cambridge, the amount you receive is practically fixed for three years (other than following inflation): you can't cease to be a Gates scholar if you do poorly, and you can't become one if you perform better. It's pretty much fixed and independent of your performance for 3-4 years. This lack of adaptation may generate random inequalities: some fellowships are only awarded every second year, for example; and it's close to impossible to get a decent academic salary in certain countries, independently of the quality of research being done.

Poor metrics: In my opinion academia struggles with a big unsolved problem: how to attribute authority, and how to measure and reward someone's contribution, inspiration and impact? The most widespread metrics are the number of citations, the h-index, impact factors, eigenfactors, etc. These are crude, heuristic measures of influence and authority. They certainly make some intuitive sense and vaguely correlate with what we really wanted to measure, but they are not a solution, just complicated features. If we want a fairer system we have to establish new standards for measuring influence, importance, etc. (Well, first of all we have to agree on what we want to measure.)


Discrete random variables: When measuring output, there are way too many discrete variables involved: Is your paper accepted, yes or no? Does this paper cite your relevant papers appropriately, yes or no? Which journal did your stuff appear in? These binary decisions flatten the opinion formed about your paper: if your paper is accepted, it means some undisclosed set of reviewers who looked at it didn't find a mistake and liked it. But how much did they like it? Were they absolutely amazed by it, or did they just not care enough to reject it? Were the reviewers respected academics, or were they incompetent? What does the rest of the community think? If I want to state that, in my opinion, a particular piece of research represents a huge step forward in machine learning, pretty much my only option to measurably reward the authors is to cite their paper; but maybe my work is on something completely different, so that would be inappropriate.


All in all, the measurement of scientific output is inherently inefficient because of these binary/discrete decisions. Everybody with an engineering background knows that discretisation means loss of information: the measurement of academic output is discretised rather arbitrarily, with little consideration given to effective information transfer. And this loss of bandwidth constrains the system of academic rewards to be either slow and accurate (you have to grow old to get recognition) or fast but inaccurate (you will be judged on the basis of a few noisy random variables). Often, I think, it's both slow and inaccurate :)


Does Flattr suggest a solution? Can something like Flattr help improve academic funding and the assignment of credit? Imagine you get a fellowship, but you have to re-allocate 5% of your earnings each month to fellow scientists according to how important their contributions are to your field. Can this redistribution be done in such a way that equilibrium salaries are commensurate with the value of each scientist? Can we create an incentive-compatible academia, where it is in everyone's best interest to honestly report whose research they find valuable? A toy sketch of what I have in mind follows below; let me know what you think in the comments.
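To make the equilibrium question concrete, here is a toy model, entirely my own sketch (all names and numbers are made up): everyone receives their base grant plus the 5% that colleagues redistribute according to fixed appreciation weights, and iterating this gives a PageRank-style equilibrium earnings vector.

```python
import numpy as np

def equilibrium_earnings(base, appreciation, share=0.05, iters=200):
    """Toy model: each scientist receives a base grant plus a share of each
    colleague's earnings, split according to row-normalised appreciation
    weights. Fixed-point iteration converges to the equilibrium earnings."""
    W = appreciation / appreciation.sum(axis=1, keepdims=True)  # W[i, j]: fraction of i's 5% given to j
    e = base.astype(float).copy()
    for _ in range(iters):
        e = base + share * (W.T @ e)   # base funding + incoming redistributions
    return e

# Three scientists on identical grants; scientists 0 and 1 value scientist 2's work most
base = np.array([100.0, 100.0, 100.0])
appreciation = np.array([[0.0, 1.0, 3.0],
                         [1.0, 0.0, 3.0],
                         [1.0, 3.0, 0.0]])
print(equilibrium_earnings(base, appreciation).round(2))
```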



Wednesday 25 January 2012

Observe, don't just see, Dr Watson

The other day I was reading Sherlock Holmes stories on the train. The following quite famous quote, in which Holmes explains to Dr Watson his extraordinary inference and deduction skills, caught my eye:
Watson: And yet I believe that my eyes are as good as yours.  
Holmes: Quite so. You see, but you do not observe...
In my interpretation this means that everybody is capable of seeing the same evidence, but what makes a good detective is the ability to focus on the relevant aspects and put the pieces together to reconstruct the stories underlying the observations.


I started thinking about the relevance of this philosophy to data science and data-driven enterprises. Many companies, governments, etc. now realise the value of data: the more data they have about their users or citizens, the better position they are in when making decisions - in theory. Take, for example, the bazillion startups based on the business model "People like to engage in activity X. We will make their experience better by leveraging data from social networks". But can they really realise the potential in that data? My impression is that many of them can't. These companies see, but they do not observe: they may be sitting on terabytes of data, but they're incapable of using it to tell the right story about their users.


For that, they need real data scientists, who are a scarce resource. Not every company can afford the luxury of an experienced data science team playing with their data. Data scientists are the private detectives of the modern world: they are presented with evidence (the data) and are asked to uncover the hidden story that explains it all. In most cases Sherlock Holmes had to find out who the murderer was and what their motivations were; data scientists (or rather, the algorithms they build) try to figure out what made you share a link on Facebook, whether your recent credit card transaction was made by you or was fraudulent, or how to tell from thousands of USB stick descriptions that "four gigabytes" and "4GB" mean the same thing.


For the average reader, the above examples would be somewhat less exciting than any of the murder cases Sir Arthur Conan Doyle's famous fictional consulting detective has ever worked on. But hey, maybe that's only because no-one with sufficient writing talent has picked this up yet: "Twelve Adventures of Hercule Bayes, the Consultant Data Scientist". I'd buy it.

Wednesday 18 January 2012

Influence and causality: castration doesn't make you live longer

This is hopefully obvious to most readers, but in my experience many people still confuse the following two statements: "whenever A holds, B is more likely to hold" and "A causes B". They are not the same. The first statement expresses statistical dependence, the second causal influence.


Let me demonstrate this distinction with the following example: if you look at statistics, you may observe that "people who do not have testicles live longer". Clearly, this doesn't imply that if you do have testicles you should cut them off to live longer. (I mean, really, please don't try this at home.) It simply reflects the fact that women tend not to have testicles and they also tend to live longer than men. Despite this extreme example - and several others - clearly demonstrating the possible implications of misinterpreting statistical dependence as causal influence, the distinction is very often overlooked not only by laypeople and journalists but, very sadly, even by scientists and policy-makers.


When analysing social influence of people, blogs and news sites, the distinction between causation and dependence is highly relevant: we may observe that whenever a particular user shares something in a social network, the volume of activity around the topic - on average - increases. But this alone does not imply that the user is actually directly responsible for this increase in activity, nor that she or he is influential.


Fortunately, in social networks there are ways to explicitly record causal influence: for example, if Bob retweets Alice's message, or shares her post, it is very likely that there was a direct causal relationship between Alice's activity and Bob's activity. But often such influences remain implicit in the data: instead of explicitly forwarding Alice's message, Bob may just post the same piece of information without explicitly acknowledging Alice as a source of inspiration. These situations make it very hard (although not impossible - that's my job) to disambiguate between the alternative stories explaining the data: was Bob influenced by Alice, or is it just a coincidence that they both shared the same piece of information, each influenced by third-party sources?


The most powerful, although usually very costly, way of detecting causal influence is through intervention: to go back to our castration example, this amounts to cutting off a few guys' testicles, implanting them into women, and then measuring how long these patients live. If you can do that - set the value of a few variables and observe what happens to the rest of the variables - you really are in a better position to detect causal relationships.
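A toy simulation (entirely my own illustration, obviously not a real study) makes the difference between observing and intervening concrete: lifespan depends only on sex, having testicles is also determined by sex, so the observational comparison makes "no testicles" look beneficial, while randomly assigning the "treatment" makes the gap vanish.

```python
import random

def person(intervene=None):
    """Lifespan depends only on sex; testicles are determined by sex unless
    we intervene and assign them at random (the 'randomised castration trial')."""
    sex = random.choice(["male", "female"])
    testicles = (sex == "male") if intervene is None else intervene()
    lifespan = random.gauss(79 if sex == "female" else 74, 6)
    return testicles, lifespan

def mean_lifespan(population, with_testicles):
    values = [life for t, life in population if t == with_testicles]
    return sum(values) / len(values)

obs = [person() for _ in range(100_000)]
rct = [person(intervene=lambda: random.random() < 0.5) for _ in range(100_000)]

# Observational: 'no testicles' looks ~5 years better (really just the sex difference)
print(mean_lifespan(obs, False) - mean_lifespan(obs, True))
# Interventional: assigning testicles at random makes the difference disappear
print(mean_lifespan(rct, False) - mean_lifespan(rct, True))
```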


In a recently published study, Facebook's data team did just that in the context of social networks: they intervened. During their experiment in 2010, they randomly filtered users' news feeds to see how likely users were to share certain content when they do vs. when they do not see their friends' related activity. Unsurprisingly for Facebook, the scale of the experiment was humongous: it involved around 250 million users and 78 million shared URLs, amounting to over 1 billion (as in thousand million, \(10^9\)) user-URL pairs. This randomised experiment allowed them to gain unprecedented qualitative insights into the dynamics of social influence: the effect of peer influence on the latency of sharing times, the effect of multiple friends sharing the same piece of information, and connections between tie strength and influence. I encourage everyone interested in either causality or influence in social networks to look at the paper.


Finally, to illustrate how hard inferring the presence of influence is, just consider this blog post: the underlying truth is that I first read about the Facebook study on TechCrunch, then I looked it up on Google News and chose to read the New Scientist coverage, which finally pointed me to the Facebook note and the paper. Now, had I not included this final paragraph, just imagine how hard it would've been to identify these sources of influence. Well, this is my (or rather, my algorithms') job.



Monday 9 January 2012

Influentomics: the science of social influence

Coming up with new -omics has been a fast-growing trend in modern science over the past decade. It's not that hard: you simply start calling a field of science of your picking - i.e. whatever you, your close collaborators, friends and scientific-political allies are working on - somethingomics. The collection of data, objects and concepts that the pioneers of somethingomics study is then collectively called THE somethingome.
The terms genomics, proteomics and transcriptomics were the first ones I came across in my bioinformatics studies during my undergraduate years. The term genome is certainly amongst the most widely adopted ones outside science, especially since the human genome project.
Like genomics, most -omics refer to a subfield of biology: connectomics studies patterns of connectivity between neurons in the brain; vaccinomics studies the genetic underpinning of vaccine development, etc. Take a look at the long list of omics on this website or at omics.org. Some omes and omics are even trademark names (I won’t cite them here for fear of misusing TM signs and being sued).
But -omics have started appearing beyond the boundaries of biology: the legalome refers to the whole set of laws in a society. The Human Speechome Project looks at how children learn to speak, with the speechome being the collection of all the speech and speech-related interactions a child is exposed to from birth. It’s now over a year since culturomics was born (see the Science paper): culturomists look for patterns in the culturome, the vast collection of printed books mankind has produced - and Google then kindly digitised - since 1800. But my favourite -omics of all is arguably PhDcomics, which I follow more actively than any other.
So, following in the footsteps of great thinkers, I hereby declare my very own new pet omics, influentomics: the study of social influence. The influentome is the entirety of social interactions between groups of people and all the social actions that can be used to detect or infer how members of a community influence each other’s actions and opinions. Some of this is available as observable data and can form the basis of analysis; a large part of it remains unobserved.
Today, the single most useful observed (or partially observed) subset of the influentome is the vast collection of social data about people on twitter, facebook and similar sites. These social platforms are to influentomics what high-throughput microarrays were to transcriptomics: suddenly high volumes of fairly well organised data are available, opening the door to more sophisticated data-driven insights into influence than ever before. And it is not only the scale and resolution of these datasets that is different. The limited range of social actions one can exercise on twitter or facebook also helps in formulating theories: on twitter you can tweet, retweet, mention, use hashtags, follow and unfollow. On facebook you can become friends, subscribe, like, post, repost, comment, become a fan, etc. All of these are well-defined, canonical social actions from which it is far easier to infer patterns of influence than from less structured data such as books, reviews and journal articles.
So it is no surprise that we are seeing a surge of interest in mining the influentome, both in academia and in business. A month ago I joined PeerIndex, a fine example of the young companies trying to leverage social data to provide measures of social influence. I’m looking forward to facing the machine learning challenges that my newly declared field, influentomics, has to offer.