At total-impact, we love data. So we get a lot of it, and we show a lot of it, like this:
There’s plenty of data here. But we’re missing another thing we love: stories supported by data. The Wall Of Numbers approach tells much, but reveals little.
One way to fix this is to Use Math to condense all of this information into just one easy-to-understand number. Although this approach has been popular, we think it’s a huge mistake. We are not in the business of assigning relative values to different metrics; the whole point of altmetrics is that depending on the story you’re interested in, they’re all valuable.
So we (and from what they tell us, our users) just want to make those stories more obvious—to connect the metrics with the story they tell. To do that, we suggest categorizing metrics along two axes: engagement type and audience. This gives us a handy little table:
Now we can make way more sense of the metrics we’re seeing. “I’m being discussed by the public” means a lot more than “I seem to have many blogs, some twitter, and a ton of Facebook likes.” We can still show all the data (yay!) in each cell—but we can also present context that gives it meaning.
Of course, that context is always going to involve an element of subjectivity. I’m sure some people will disagree about elements of this table. We categorized tweets as public, but some tweets are certainly from scholars. Sometimes scholars download html, and sometimes the public downloads PDFs.
Those are good points, and there are plenty more. We’re excited to hear them, and we’re excited to modify this based on user feedback. But we’re also excited about the power of this framework to help people understand and engage with metrics. We think it’ll be essential as we grow altmetrics from a source of numbers into a source of data-supported stories that inform real decisions.
In the previous post we assumed we had a list of 100 papers to use as baseline for our percentile calculations. But what papers should be on this list?
It matters: not to brag, but I’m probably a 90th-percentile chess player compared to a reference set of 3rd-graders. The news isn’t so good when I’m compared to a reference set of Grandmasters. This is a really important point about percentiles: they’re sensitive to the reference set we pick.
The best reference set to pick depends on the situation, and the story we’re trying to tell. Because of this, in the future we’d like to make the choice for total-impact reference sets very flexible, allowing users to define custom reference sets based on query terms, doi lists, and so on.
For now, though, we’ll start simply, with just a few standard reference sets to get going. Standard reference sets should be:
- easily interpreted
- not too high impact nor too low impact, so gradations in impact are apparent
- applicable to a wide variety of papers
- amenable to large-scale collection
- available as a random sample if large
For practical reasons we focus first on the last three points. Total-impact needs to collect reference samples through automated queries. This will be easy for the diverse products we track: for Dryad datasets we’ll use other Dryad datasets, for GitHub code repositories we’ll use other GitHub repos. But what about for articles?
Unfortunately, few open scholarly indexes allow queries by scholarly discipline or keywords… with one stellar exception: PubMed. If only all of research had a PubMed! PubMed’s eUtils API lets us query by MeSH indexing term, journal title, funder name, all sorts of things. It returns a list of PMIDs that match our queries. The API doesn’t return a random sample, but we can fix that (code). We’ll build ourselves a random reference set for each publishing year, so a paper published in 2007 would be compared to other papers published in 2007.
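As a rough sketch of how this might look (the query string, sample size, and function names here are illustrative, not total-impact’s actual code): fetch the PMIDs matching an eSearch query for a given publication year, then draw our own uniform random sample, since eSearch returns results in its own order rather than at random.

```python
import json
import random
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pmids_matching(query, year, retmax=100000):
    """Return PMIDs for articles matching `query`, published in `year`."""
    params = urlencode({
        "db": "pubmed",
        "term": "({0}) AND {1}[pdat]".format(query, year),
        "retmode": "json",
        "retmax": retmax,
    })
    with urlopen("{0}?{1}".format(EUTILS, params)) as resp:
        return json.load(resp)["esearchresult"]["idlist"]

def reference_sample(pmids, n=100, seed=2007):
    """eSearch doesn't return a random sample, so draw one ourselves."""
    rng = random.Random(seed)  # seeded so a reference set is reproducible
    return rng.sample(pmids, min(n, len(pmids)))

# e.g. reference_sample(pmids_matching("nature[journal]", 2007))
```

The seed makes a given year’s reference set reproducible, so the same paper gets the same percentile on repeated lookups.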
What specific PubMed query should we use to derive our article reference set? After thinking hard about the first three points above and doing some experimentation, we’ve got a few top choices:
- any article in PubMed
- articles resulting from NIH-funded research, or
- articles published in Nature.
All of these are broad, so they are roughly applicable to a wide variety of papers. Even more importantly, people have a good sense for what they represent — knowing that a metric is in the Xth percentile of NIH-funded research (or Nature, or PubMed) is a meaningful statistic.
There is of course one huge downside to PubMed-inspired reference sets: they focus on a single domain. Biomedicine is a huge and important domain, so that’s good, but leaving out other domains is a real loss. We’ll definitely be keeping an eye on other solutions to derive easy reference sets (a PubMed for all of Science? An open social science API? Or hopefully Mendeley will include query by subdiscipline in its api soon?).
Similarly, the Nature set covers only a single publisher—and one that’s hardly representative of all publishing. As such, it may feel a bit arbitrary.
Right now, we’re leaning toward using NIH-funded papers as our default reference set, but we’d love to hear your feedback. What do you think is the most meaningful baseline for altmetrics percentile calculations?
(This is part 5 of a series on how total-impact will give context to the altmetrics we report.)
Let’s take the definitions from our last post for a test drive on tweeted percentiles for a hypothetical set of 100 papers, presented here in order of increasing readership with our assigned percentile ranges:
- 10 papers have 0 tweets (0-9th percentile)
- 40 papers have 1 tweet (10-49th)
- 10 papers have 2 tweets (50-59th)
- 20 papers have 5 tweets (60-79th)
- 1 paper has 9 tweets: (80th)
- 18 papers have 10 tweets (81-98th)
- 1 paper has 42 tweets (99th)
If someone came to us with a new paper that had 0 tweets, given the sample described above we would assign it to the 0-9th percentile (using a range rather than a single number because we roll like that). A new paper with 1 tweet would be in the 10th-49th percentile. A new paper with 9 tweets is easy: 80th percentile.
If we got a paper with 4 tweets we’d see it falls between the datapoints in our reference sample — the 59th and 60th percentiles — so we’d round down and report it as the 59th percentile. If someone arrives with a paper that has more tweets than anything in our collected reference sample, we’d give it the 100th percentile.
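These rules fit in a few lines of code. Here’s a minimal sketch (the function name is ours, and it assumes a reference sample of exactly 100 items so counts translate directly into percentiles):

```python
# The 100-paper reference sample described above, by tweet count.
reference = ([0] * 10 + [1] * 40 + [2] * 10 + [5] * 20 +
             [9] + [10] * 18 + [42])

def tweet_percentile(reference, tweets):
    """Percentile range for a new paper, assuming len(reference) == 100."""
    if tweets > max(reference):
        return (100, 100)                  # beats everything we've seen
    below = sum(1 for x in reference if x < tweets)
    ties = sum(1 for x in reference if x == tweets)
    if ties:
        return (below, below + ties - 1)   # ties reported as a range
    return (below - 1, below - 1)          # between datapoints: round down

tweet_percentile(reference, 0)   # (0, 9)
tweet_percentile(reference, 1)   # (10, 49)
tweet_percentile(reference, 9)   # (80, 80)
tweet_percentile(reference, 4)   # (59, 59)
tweet_percentile(reference, 50)  # (100, 100)
```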
(This is part 4 of a series on how total-impact will give context to the altmetrics we report.)
Normalizing altmetrics by percentiles seems so easy! And it is… except when it’s not.
Our first clue that percentiles have tricky bits is that there is no standard definition for what percentile means. When you get an 800/800 on your SAT test, the testing board announces you are in the 98th percentile (or whatever) because 2% of test-takers got an 800… their definition of percentile is the percentage of tests with scores less than yours. A different choice would be to declare that 800/800 is the 100th percentile, representing the percentage of tests with scores less than or equal to yours. Total-impact will use the first definition: when we say something is in the 50th percentile, we mean that 50% of reference items had strictly lower scores.
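The two definitions are easy to confuse, so here they are side by side on a made-up score pool (`pct_below` is the definition total-impact uses; both names are ours):

```python
def pct_below(scores, s):
    """Percent of scores strictly below s: total-impact's definition."""
    return 100 * sum(1 for x in scores if x < s) // len(scores)

def pct_at_or_below(scores, s):
    """Percent of scores less than or equal to s: the other definition."""
    return 100 * sum(1 for x in scores if x <= s) // len(scores)

sat = [800] * 2 + [700] * 98  # toy pool where 2% of takers score 800

pct_below(sat, 800)        # 98: the testing board's announcement
pct_at_or_below(sat, 800)  # 100: same score, other definition
```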
Another problem: how should we represent ties? Imagine there were only ten SAT takers: one person got 400, eight got 600s, and one person scored 700. What is the percentile for the eight people who scored 600? Well…it depends.
- They are right in the middle of the pack so by some definitions they are in the 50th percentile.
- An optimist might argue they’re in the 90th percentile, since only 10% of test-takers did better.
- And by our strict definition they’d be in the 10th percentile, since they only beat the bottom 10% outright.
The problem is that none of these are really wrong; they just don’t include enough information to fully understand the ties situation, and they break our intuitions in some ways.
What if we included the extra information about ties? The score for a tie could instead be represented by a range, in this case the 10th-89th percentile. Altmetrics samples have a lot of ties: many papers receive only one tweet, for example, so representing ties accurately is important. Total-impact will take this range approach, representing ties as percentile ranges. Here’s an example, using PubMed Central citations:
Finally, what to do with zeros? Impact metrics have many zeros: many papers have never been tweeted. Here, the range solution also works well. If your paper hasn’t been tweeted, but neither have 80% of papers in your field, then your percentile range for tweets would be 0-79th. In the case of zeros, when we need to summarize as a single number, we’ll use 0.
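A minimal sketch of the range idea, using the ten hypothetical SAT takers from above (the function name is ours; integer percentiles assumed for simplicity):

```python
def percentile_range(scores, s):
    """Represent s's percentile as a range: the lower bound counts scores
    strictly below s, the upper bound adds the scores tied with s."""
    n = len(scores)
    below = sum(1 for x in scores if x < s)
    ties = sum(1 for x in scores if x == s)
    low = 100 * below // n
    high = 100 * (below + ties) // n - 1
    return (low, high)

sat = [400] + [600] * 8 + [700]  # the ten hypothetical SAT takers

percentile_range(sat, 600)  # (10, 89): the eight-way tie as a range
percentile_range(sat, 400)  # (0, 9): bottom scores, like zeros, work too
```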
We’ll take these definitions for a test-drive in the next post.
(part 3 of a series on how total-impact plans to give context to the altmetrics it reports)
Total-impact is in the process of incorporating as a non-profit, which means (among other things) we need to form a Board of Directors.
It has been a tough decision, since we are lucky to know tons of qualified people, and there’s not even a standard number of people to pick. After much discussion, though, we decided two things:
- Small is better than big. We like being light and agile, and fewer people is consistent with that.
- Aim high. Worst that can happen is they say no.
The first point led us to a board of four people: the two of us and two more. The second point led us to ask our best-case-scenario choices, Cameron Neylon and John Wilbanks. Both these guys are whip smart, extraordinary communicators, and respected leaders in the open science communities. Both have extensive experience working with researchers and funders (our users) at all levels. And both mix principle and pragmatism in their writing and work in a way that resonates with us. These people are, along pretty much every dimension, our dream board.
And we are immensely excited (I am literally bouncing in my seat at the coffee shop as I write this) to now publicly announce: they both said yes. So welcome, Cameron and John. We’re excited to start changing how science works, together.
In the last post we talked about the need to give raw counts context on expected impact. How should this background information be communicated?
Our favourite approach: percentiles.
Try it on for size: Your paper is in the 88th percentile of CiteULike bookmarks, relative to other papers like it. That tells you something, doesn’t it? The paper got a lot of bookmarks, but there are some papers with more. Simple, succinct, intuitive, and applicable to any type of metric.
Percentiles were also the favoured approach for context in the “normalization” breakout group at altmetrics12, and have already popped up as a total-impact feature request. Percentiles have been explored in scientometrics for journal impact metrics, including in a recent paper by Leydesdorff and Bornmann [http://dx.doi.org/10.1002/asi.21609, free preprint PDF.] The abstract says “total impact” in it, did you catch that? :)
As it turns out, actually implementing percentiles for altmetrics isn’t quite as simple as it sounds. We have to make a few decisions about how to handle ties, and zeros, and sampling, and how to define “other papers like it”…. stay tuned.
(part 2 of a series on how total-impact plans to give context to the altmetrics it reports)
How many tweets is a lot?
Total-impact is getting pretty good at finding raw numbers of tweets, bookmarks, and other interactions. But these numbers are hard to interpret. Say I’ve got 5 tweets on a paper—am I doing well? To answer that, we must know how much activity we expect on a paper like this one.
But how do we know what to expect? To figure this out, we’ll need to account for a number of factors:
First, expected impact depends on the age of the paper. Older papers have had longer to accumulate impact: an older paper is likely to have more citations than a younger paper.
Second, especially for some metrics, expected impact depends on the absolute year of publication. Because papers often get a spike in social media attention at the time of publication, papers published in years when a social tool is very popular receive more attention on that tool than papers published before or after the tool was popular. For example, papers published in years when twitter has been popular receive more tweets than papers published in the 1980s.
Third, expected impact depends on the size of the field. The more people there are who read papers like this, the more people there are who might Like it.
Fourth, expected impact depends on the tool adoption patterns of the subdiscipline. Papers in fields with a strong Mendeley community will have more Mendeley readers than papers published in fields that tend to use Zotero.
Finally, expected impact levels depends on what we mean by papers “like this.” How do we define the relevant reference set? Other papers in this journal? Papers with the same indexing terms? Funded under the same program? By investigators I consider my competition?
There are other variables too. For example, a paper published in a journal that tweets all its new publications will get a twitter boost, an Open Access paper might receive more Shares than a paper behind a paywall, and so on.
Establishing a clear and robust baseline won’t be easy, given all of this complexity! That said, let’s start. Stay tuned for our plans…
Total-impact is in early beta. We’re releasing early and often in this rapid-push stage, which means that we (and our awesome early-adopting users!) are finding some bugs.
As a result of early code, a bit of bad data had made it into our total-impact database. It affected only a few items, but even a few is too many. We’ve traced it to a few issues:
- our wikipedia code called the wikipedia api with the wrong type of quotes, in some cases returning partial matches
- when pubmed can’t find a doi and the doi contains periods, it turns out that the pubmed api breaks the doi into pieces and tries to match any of the pieces. Our code didn’t check for this.
- a few DOIs were entered with null and escape characters that we didn’t handle properly
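For illustration, here’s a hedged sketch of the kind of input check that would catch the last two bugs before a DOI reaches a downstream API (the regex and function name are ours, not total-impact’s actual code):

```python
import re

# A DOI is "10.", a 4-9 digit registrant code, "/", and a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def clean_doi(raw):
    """Reject identifiers containing control/escape characters, or with
    a shape that APIs like PubMed's might break into pieces and mangle."""
    doi = raw.strip()
    if any(ord(c) < 32 for c in doi):
        raise ValueError("control characters in DOI: %r" % raw)
    if not DOI_PATTERN.match(doi):
        raise ValueError("does not look like a DOI: %r" % raw)
    return doi
```

Validating at the door like this is cheaper than tracing bad identifiers through the database afterward.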
We’ve fixed these and redoubled our unit tests to find these sorts of bugs earlier in the future…. but how to purge the bad data currently in the database?
Turns out that the data architecture we had been using didn’t make this easy. A bad pubmed ID propagated through our collected data in ways that were hard for us to trace. Arg! We’ve learned from this, and taken a few steps:
- deleted the problematic Wikipedia data
- deleted all the previously collected PubMed Central citation counts and F1000 notes
- deleted 56 items from collections because we couldn’t rederive the original input string
- updated our data model to capture provenance information so this doesn’t happen again!
What does this mean for a total-impact user? You may notice fewer Wikipedia and PubMed Central counts than you saw last week if you revisit an old collection. Click the “update” button at the top of a collection and accurate data will be re-collected.
It goes without saying: we are committed to bringing you Accurate Data (and radical transparency on both our successes and our mistakes :) ).
We created total-impact’s “@totalimpactdev” Twitter account a while ago, as a way to keep our small group of developers and early users in the loop about changes to the code. Since then, total-impact has matured past the point where only developers care.
So, we’re updating our Twitter handle accordingly: we’re now tweeting from @totalimpactorg. If you follow us already, no need to change anything. If you don’t, do! Our codebase and feature list are improving almost daily, and our Twitter feed is a great way to stay up to date.