inFERENCe: January 2012

Sunday 29 January 2012

Flattr: should we use it in academia?

I have just found the service that I have been looking for for a while: Flattr. It's a social micropayment platform, an evolution of "Buy me a coffee" buttons that started appearing on blogs a few years ago. It allows you to donate and distribute a small amount of money each month to authors of content that you "liked" that month.

Here's how it works: you sign up on Flattr, top up your account with a small amount, let's say $5 a month, then you come back to my blog and press the Flattr button on the right. At the end of the month, your $5 will be distributed equally to authors of things you Flattr'ed, so I get my share of your $5. This way, the small contributions add up, and the community can reward authors of content they frequently read and who inspire them. I personally think, details and implementation apart, this general idea is brilliant. I'm definitely going to top up my account and suggest people, whose stuff I read to add a flattr button.

I actually first thought of such a social redistribution system in terms of academic funding, and almost launched a buy me a coffee button on my publications page to make a point. Now, having found Flattr, I started thinking again about how could such a system be used for a more equal redistribution of funding in academia. Let's review some current inefficiencies in academic research funding:

Inflexible fellowship values: Money in academia is often awarded in discrete chunks, especially for junior positions. You apply for various grants, fellowships, scholarships, and you either get the whole thing for x years or you get nothing. If you for example got a Gates Scholarship to study for a PhD in Cambridge, you got it for three years, the amount you get is practically fixed for 3 years (other than following inflation), you can't cease to be a Gates scholar if you do poorly, you can't become one if you perform better. It's pretty much fixed and independent of your performance for 3-4 years. This lack of adaptation may generate random inequalities: there are some fellowships that are only awarded every second year for example; it's close to impossible to get a decent academic salary in certain countries, independently of the quality of research being done.

Poor metrics: In my opinion academia struggles with a big unsolved problem: How to attribute authority, how to measure and reward someone's contribution, inspiration, impact? The most widespread metrics are number of citations, h-index, impact factors, eigenfactors, etc. These are crude, heuristic measures of influence and authority. They certainly make some intuitive sense and vaguely correlate with what we really wanted to measure, but these are not a solution, just complicated features. If we want a fairer system we have to establish new standards for measuring influence/importance/etc. (Well, first of all we have to agree what we want to measure)

Discrete random variables: When measuring output, there are way too many discrete variables involved: Is your paper accepted? Yes or No? Does this paper cite relevant papers of yours appropriately? Yes or No? Which journal did your stuff appear in? These binary decisions rectify the opinion formed about your paper: if your paper is accepted, it means some undisclosed set of reviewers who looked at it didn't find a mistake and liked it. But how much did they like it? Were they absolutely amazed by it, or did they just not care enough to reject it? Were the reviewers respected academics, or were they incompetent? What does the rest of the community think? If I want to state that in my opinion a particular piece of research represents huge step forward in machine learning, pretty much my only option to measurably reward the authors is to cite their paper, but maybe my work is on something completely different, so it's inappropriate.

All in all, the measurement of scientific output is inherently inefficient because of these binary/discrete decisions involved. Everybody with engineering background knows that discretisation means loss of information: measurement of academic output is discretised rather arbitrarily, with little consideration given to effective information transfer. And this loss of bandwith constrains the system of academic rewards to be either slow and accurate (you have to grow old to get recognition) or fast but inaccurate (you will be judged on the basis of a few noisy random variables). Often I think it's both slow and inaccurate :)

Does Flattr suggest a solution? Can something like Flattr help improve academic funding and assignment of credits? Imagine you get a fellowship, and you have to re-allocate 5% of your earnings each month to fellow scientists according to how important their contributions are to your field? Can this redistribution be done in such a way that equilibrium salaries are commensurate with the value of each scientist? Can we create an incentive-compatible academia, where it is in everyone's best interest to honestly report whose research they find valuable? Let me know what you think in comments.

Wednesday 25 January 2012

Observe, don't just see, Dr Watson

The other day I was reading stories of Sherlock Holmes on the train. The following quite famous quote in which Holmes explains to Dr Watson his extraordinary inference and deduction skills caught my eyes:

Watson: And yet I believe that my eyes are as good as yours.

Holmes: Quite so. You see, but you do not observe...

In my interpretation this means that everybody is capable of seeing the same evidence, but what makes a good detective is the ability to focus on relevant aspects and putting these pieces together to reconstruct stories underlying their observations.

I started thinking about the relevance of this philosophy to data science and data-driven enterprises. Many companies, governments, etc now realise the value of data: the more data they have about their users/citizens, the better position they are in when making decisions - in theory. Take a look at for example the bazillion of startups based on the business model: "People like to engage in activity X. We will make their experience better by leveraging data from social networks". But can they really realise the potential in data? My impression is that many of them can't. These companies see; but they do not observe: they may be sitting on terabytes of data, but they're incapable of using it to tell the right story about their users.

For that, they need real data scientists, who are a scarce resource. Not every company can afford the luxury to have an experienced data science team playing with their data. Data scientists are private detectives of the modern world: they are presented with evidence (the data) and they are asked to uncover the hidden story that explains it all. In most cases, Sherlock Holmes had to find out who the murderer was and what their motivations were, data scientists (or rather, the algorithms they build) try to figure out what made you share a link on Facebook; whether your recent credit card transaction was made by you or it was fraud; or figure out from thousands of USB stick descriptions that "four gigabytes" and "4GB" mean the same thing.

For the average reader, the above examples would be somewhat less exciting than any of the murder cases Sir Arthur Conan Doyle's famous fictional consulting detective has ever worked on. But hey, maybe that's only because no-one with sufficient writing talent has picked this up yet: "Twelve Adventures of Hercule Bayes, the Consultant Data Scientist". I'd buy it.

Wednesday 18 January 2012

Influence and causality: castration doesn't make you live longer

Hopefully obvious to most readers, but in my experience many people tend to still confuse the following two statements: "whenever A holds, B is more likely to hold" and "A causes B". They are not the same. The first statement expresses statistical dependence, the second causal influence.

Let me demonstrate this distinction with the following example: if you look at statistics, you may observe, that "people who do not have testicles live longer". Clearly, this doesn't imply that if you do have testicles then you should cut them off to live longer. (I mean, really, please don't try this at home). It simply reflects the fact that women tend not to have testicles and they also tend to live longer than men. Despite this extreme example - and several others - clearly demonstrating the possible implications of misinterpreting statistical dependence as causal influence, the distinction is very often overlooked not only by common people and journalists but, very sadly, even by scientists and policy-makers.

When analysing social influence of people, blogs and news sites, the distinction between causation and dependence is highly relevant: we may observe that whenever a particular user shares something in a social network, the volume of activity around the topic - on average - increases. But this alone does not imply that the user is actually directly responsible for this increase in activity, nor that she or he is influential.

Fortunately, in social networks, there are ways to explicitly record causal influence: for example, if Alice retweets Bob's message, or shares his post, it is very likely that there was direct causal relationship between Bob's activity and Alice's activity. But often such influences remain implicit in the data: instead of explicitly forwarding Alice's message, Bob may just post the same piece of information without explicitly acknowledging Alice as a source of inspiration. These situations make it very hard (although not impossible, that's my job) to disambiguate between alternative stories explaining the data: was Bob influenced by Alice, or is it just a coincidence that they both shared the same piece of information being influenced by third party sources.

The most powerful, although usually very costly, way of detecting causal influence is through intervention: to go back to our castration example, this amounts to cutting a few guys' testicles and implanting them into women and then measuring how long these patients lived. If you can do that - set the value of a few variables and observe what happens to the rest of the variables - you really are in a better position to detect causal relationships.

In a recently published study, Facebook's data team did just that in the context of social networks: they intervened. During their experiment in 2010, they randomly filtered users' news feed to see how likely they were to share certain content in situations when they do vs. when they do not see their friends related activities. Unsurprisingly from facebook, the scale of the experiment was humongous: it involved around 250 million users, 78 million shared URLs amounting to over 1 billion (as in thousand million, $10^9$) user-URL pairs. This randomised experiment allowed them to gain unprecedented qualitative insights into the dynamics of social influence: the effect of peer influence on the latency of sharing times; the effect of multiple friends sharing the same piece of information; connections between tie strength and influence. I encourage everyone interested in either causality or influence in social networks to look at the paper.

Finally, just to illustrate how hard inferring the presence of influence is, just consider this blog post: The underlying truth is that I first read about the Facebook study on TechCrunch, then I looked it up on Google News, chose to read the New Scientist coverage which finally pointed me to the facebook note and the paper. Now, had I not included this final paragraph, just imagine how hard it would've been to identify these sources of influence. Well, this is my (or rather, my algorithms') job.

Monday 9 January 2012

Influentomics: the science of social influence

Coming up with new -omics has been a fast growing trend in modern science in the past decade. It’s not that hard: you simply start calling a field of science of your picking, i.e. what you, your close collaborators, friends and scientific-political allies are working on somethingomics. The collection of data, objects, concepts that pioneers of somethingomics study is then collectively called THE somethingome.

The terms genomics, proteomics, transcriptiomics were the first ones I came across in my biolinformatics studies during undergraduate years. The term genome certainly is amongst the most widely adopted ones outside science, especially since the human genome project.

Like genomics, most -omics refer to a subfield of biology: connectomics studies patterns of connectivity between neurons in the brain; vaccinomics studies the genetic underpinning of vaccine development, etc. Take a look at the long list of omics on this website or at omics.org. Some omes and omics are even trademark names (I won’t cite them here for fear of misusing TM signs and being sued).

But omics’ have started appearing beyond the boundaries of biology: The legalome refers to the whole set of laws in a society. The Human Speechome Project looks at how children learn to speak, with the speechome being the collection of all the speech and speech-related interactions a child is exposed to since her birth. It’s now over a year since culturomics was born (see Science paper): culturomists look for patterns in the culturome: the vast amount of printed books mankind has produced - and Google then kindly digitised - since 1800. But my favourite -omics of all is arguably PhDcomics, which I follow more actively than any other.

So following the footsteps of great thinkers, I hereby declare my very own new pet omics, influentomics: the study of social influence. The influentome is the entirity of social interaction between groups of people and all social actions that can be used to detect or infer how members of a community influence each other’s actions and opinions. Some of this is available as observable data and can form the basis of analysis, a large part of it remains unobserved.

Today, the single most useful observed (or partially observed) subset of the influentome is the vast collection of social data about people on twitter, facebook and similar sites. These social platforms are to influentomics what high throughput microarrays were to transcriptomics: suddenly high volumes of fairly well organised data are available, opening the door for more sophisticated data-driven insights into influence than ever before. It is not only the scale and resolution of these datasets which is different. The limited range of social actions one can exercise on twitter or facebook also helps formulating theories: on twitter you can tweet, retweet, mention, use hashtags, follow, unfollow. On facebook you can become friends, now you can subscribe, like, post, repost, comment, become a fan, etc. All these are well-defined canonical social actions from which it is far easier to infer patterns of influence than on the basis of less structured data such as books, reviews and journal articles.

So it is no surprise we are seeing a surge of interest in mining the influentome; both in academy and in business. A month ago I joined PeerIndex, a fine example of young companies that try to leverage social data to provide measures of social influence. I’m looking forward to face the machine learning challenges that my newly declared field, influentomics, has to offer.