I've been reading reviews on Destructoid for a long time now, probably since they started doing 'em. About a year ago, amidst much debate about the validity of review scores from various sites (a debate that continues today), Aaron Linde posted the first Destructoid Review Guide, and there was much rejoicing.
At least there was for a short while. It wasn't long until the complaints started back up again. People claimed that games were being rated too low solely to get hits, rather than to reflect the quality of the games. So a few months ago, Jim Sterling released the Destructoid Review Guide Version 2.0, reminding us that yes, Destructoid editors use the full ten point scale, and that they don't rate games low just for controversy.
I have collected data on the past year of reviews, since Aaron posted the first review guide, up through the most recent review on Mortal Kombat vs. DC Universe, and as I will show, not only are they not rating games lower than they should, it is quite possible that even the Destructoid editors overrate games, utilizing the maligned "IGN Scale" of 5 to 10. Without further ado, I present histograms of the review scores awarded by the Destructoid staff. The first histogram (at the top of the page) shows the data between Aaron Linde's guide posting and Jim Sterling's guide posting, the second shows the data between Jim's review guide and today, and the third shows all of the data since the first review guide was posted.
The first chart shows an interesting, almost linear relation of score to occurrence between the scores of two and nine, with each subsequent score being more common than the last. It shows only two ones awarded, both of which were for Eternity's Child. It should be noted that the half-point scores generally occur less often than the whole-number scores.
The second chart doesn't contain quite as many data points, so it isn't as smooth as the first. It still shows a surprising trend: eight and nine dominate the chart, each with twice as many occurrences as the next most common score. Only two games scored between zero and two (Facebreaker and Warriors Orochi 2), whereas thirty-four games scored between eight and ten.
The third chart is just the sum of the first two. With the most data points, it shows an obvious bias toward the right side of the chart, or the higher review scores.
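For anyone who wants to try this at home, here is a rough sketch of how a histogram like these could be tallied in Python. The function names and the sample scores are made up for illustration (they are not my actual data set), and I am assuming scores fall on half-point steps from zero to ten.

```python
from collections import Counter

def score_histogram(scores, step=0.5, low=0.0, high=10.0):
    """Count how many reviews landed on each possible score from low to high."""
    counts = Counter(scores)
    n_bins = int(round((high - low) / step)) + 1
    bins = [low + i * step for i in range(n_bins)]
    # Scores that never occur still get a bin, with a count of zero.
    return [(b, counts.get(b, 0)) for b in bins]

def print_histogram(hist):
    """Print one text bar per score."""
    for score, count in hist:
        print(f"{score:4.1f} | {'#' * count}")

# Hypothetical scores for two periods; NOT the real review data.
period_one = [9, 8, 8.5, 7, 9, 4, 6.5]
period_two = [8, 9, 10, 2, 7.5]

# The combined chart is just both periods counted together.
hist_all = score_histogram(period_one + period_two)
print_histogram(hist_all)
```

Counting the two periods together gives the same result as adding the two separate histograms bin by bin, which is all the third chart is.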
So what causes this? Frankly, I could come up with a number of reasons. While Destructoid is large, it still isn't IGN or Gamespot, whose employees review and score every single game that is released, and so the Destructoid review team is probably more likely to play and review only the games that they are likely to enjoy. It could be that in general, games that come out these days are mediocre at worst, and really awful games only come out every once in a while. Or, the most dire possibility: perhaps some of the reviewers don't quite understand the review guides themselves, and are still grading on a five to ten scale.
Jim Sterling wrote in his review guide, "We try and review games based on what we like or know about most so if, for example, Colette rates Chocobo Cloud Smiles 2: Chocobonkers a 2 out of 10, you know it's got to be bad." While that's true, this review philosophy is potentially faulty, because if, for example, Colette rates Chocobo Cloud Smiles 2: Chocobonkers an eight out of ten, then it doesn't really tell us anything. Sure, a chocobo fanatic loved it, but what does that mean for the rest of us?
Let's step back for a moment and consider the second point though. I've given it a bit of thought at this point (in case the graphs weren't indication enough), but how should the histograms be shaped? Do we consider a score of five to be "average" or "mediocre"? If a score of five means that the game is average, then the histogram should show a nice bell curve, with four, five, and six being most common, while zero and ten are least common. If a score of five means the game is mediocre, or in other words that it is as good as it is bad, as enjoyable as it is unenjoyable, as exciting as it is boring, then it is possible for the histogram to be shaped in any way imaginable. I bring this up because depending on how one defines the full scale, there may not be anything necessarily wrong with the review scores, no matter how interesting.
Still, I wanted to more fully explore the possibility that some editors are more "at fault" for the score buffing phenomenon apparent in the total histograms. So for every editor in my data set who has written ten or more reviews, I have tabulated separate histograms. These follow, with commentary for each.
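As a rough sketch of the bookkeeping involved, the per-editor breakdown could be done along these lines in Python. The function name, the editor labels, and the scores below are all placeholders I invented for the example; only the "ten or more reviews" cutoff comes from my actual process.

```python
from collections import defaultdict

def per_editor_summary(reviews, min_reviews=10):
    """Group (editor, score) pairs by editor and summarize the prolific ones."""
    by_editor = defaultdict(list)
    for editor, score in reviews:
        by_editor[editor].append(score)

    summary = {}
    for editor, scores in by_editor.items():
        # Skip editors with too few reviews to say anything meaningful.
        if len(scores) < min_reviews:
            continue
        summary[editor] = {
            "count": len(scores),
            "mean": round(sum(scores) / len(scores), 2),
            "lowest": min(scores),
            "below_five": sum(1 for s in scores if s < 5),
        }
    return summary

# Hypothetical reviews; NOT the real data set.
reviews = [("Editor A", s) for s in [9, 8, 8.5, 7, 9, 6, 8, 9, 7.5, 8]]
reviews += [("Editor B", s) for s in [3, 9]]

summary = per_editor_summary(reviews)
for editor, stats in summary.items():
    print(editor, stats)
```

With this made-up input, "Editor B" is dropped for having only two reviews, and "Editor A" is reported with an average score, a lowest score, and a count of scores below five, which are the same figures I quote for each editor below.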
Arguably one of the most called out of the review staff for giving undeserved low scores is Reverend Anthony. His scores, however, are the most spread out of all of the editors highlighted here. Despite his negative infamy, his average score awarded is 6.71, nearly two points above an average score of five. Some points to note are that he has awarded as many threes as he has seven-point-fives and nine-point-fives. Even though his is the most spread out, it is still biased toward the high end, with only one game getting a review score between zero and two-point-five.
Brad's histogram is undoubtedly the most like a bell curve centered around five out of any of the reviewers. I've become particularly interested in his reviews after seeing these data, because they show a lot of promise for the proper use of the one to ten scale, but the histogram really needs more data points in it before we can draw any conclusions. One thing you may note is that he almost never awards half-point scores, an approach I really prefer.
And now we begin to see the more telling data of skewing toward high scores. Aside from two games (Baroque and Infinite Undiscovery), Colette has not given any games a score less than seven. Her average score awarded is 7.38, and her most common scores awarded are seven-point-five, eight, and nine.
Dale's histogram is almost identical to Colette's, with one game getting a four (Zoids Assault), and no others receiving a score less than six. His average score awarded is 7.80, with his most common scores being eight and nine.
Among all of the charts I have prepared, Dick McVengeance's (Brad Rice's) has probably the most interesting shape. It's clearly bimodal, but I'm not sure what we can conclude from that. It's as if he is actually reviewing on a binary scale; either he likes a game or he doesn't. Even so, his average score is a full point and a half above five, and his most commonly awarded score is eight-point-five.
Jim and Anthony are tied for total number of reviews within this data set at twenty-six reviews each in the past year (that's one every other week on average, as it turns out), and Jim is another common target for accusations of doling out lower scores than were deserved. However, his histogram is not that different from Colette's or Dale's, except that he has given a total of four games scores less than five. It is still pretty heavily biased toward the right, with an average over a point and a half above five, and most common scores of seven and nine.
I'll be honest; part of the reason I was motivated to look at individual editors' scores in this analysis was because I feel that Jonathan is not critical enough of the games he reviews. The histogram really speaks for itself (but that won't stop me from pointing out various parts). The lowest score he has awarded in the past year of reviewing is a six (for Samba de Amigo), and it's the only score he has given out less than seven-point-five. His average is way up at 8.44, and his most commonly awarded score (given twice as often as the next most common scores) is a nine.
Lastly, we've got the histogram for Destructoid's Editor-in-Chief. It's like Anthony's in that his scores are fairly evenly distributed (aside from an overabundance of eights), but it's unlike Anthony's in that it appears Nick is using the IGN five-to-ten scale. The entire left half of the graph is blank, and his average score is 7.61, just about halfway between five and ten.
So there you have it. What do you think about this? Is the high score biasing because the reviewers review games that they are more likely to enjoy? Is it just that some people are extremely enthusiastic about games and they give everything a high score? Is it a case of the reviewers not adhering to the two posted review guides? And perhaps the most important question: is this an issue that requires addressing, or should we just go on as we have been, calibrating our judgment of scores based on the author or ignoring them altogether?
FOOTNOTE: This analysis is not meant as an attack on any of the reviewers. I love them all as people, and most of them as writers too!
(# 0) on 11/23/2008 19:27
(# 1) on 11/23/2008 19:32
(# 2) on 11/23/2008 19:35
(# 3) on 11/23/2008 19:39
*head explodes*
(# 4) on 11/23/2008 19:42
I know people might start rushing to accuse you of being a troll or of flaming us or something, but this is one of the most well thought-out pieces I've seen in a long time. You put a shitload of effort into this -- congrats. We may just have to kill you now.
FIXED
Seriously though. Excellent analysis. Terrific read.
(# 5) on 11/23/2008 19:43
Dexter345 is a nerd.
J/k man, not really, but this must have taken a lot of work and I think it is something everyone should read before they bitch about review scores.
(# 6) on 11/23/2008 19:43
But seriously, I love the scoring system, but I've started to doubt how strictly the reviewers adhere to it. Dead Space getting not one, but two 9's was the straw that broke the camel's back for me.
It's still 9 times out of 10 the best place to go for reviews IMO, but the fact of the matter is they DO seem to use somewhat of an IGN-ish scale. It's more of a spin on an already existing method rather than the revolutionary scale we were told about.
And you rock.
(# 7) on 11/23/2008 19:45
(# 8) on 11/23/2008 19:45
Goddamn I fail.
(# 9) on 11/23/2008 19:58
Seriously, though, this provides some wonderful insight into Destructoid's reviews, and not just because we don't yet have any sort of reviews database (which makes your surely painstaking process much more clear, and much more significant). Great analysis, man -- constructive criticism is always appreciated!
(# 10) on 11/23/2008 19:59
(# 11) on 11/23/2008 20:05
(# 12) on 11/23/2008 20:08
I think that the 10 point review scale is a horrible way to review games. There is way too much choice, therefore when you go low (if you do at all) it can look very skewed. The 5 point/stars system or what 1UP does are what I consider to be the best systems for reviewing just about anything.
(# 13) on 11/23/2008 20:09
(# 14) on 11/23/2008 20:13
(# 15) on 11/23/2008 20:19
And this has already been said, but this is an excellent analysis. Great work.
(# 16) on 11/23/2008 20:22
Looks good to me. I gather from this that of games reviewed, less games are complete shit than not and that Destructoid does not give mediocre games lower scores than they deserve "just because". Also, that reviewers don't review an equal amount of games and probably mostly review games that people have interest in, which usually is not the case with horrible games.
If I would be in there my histogram would probably be heavier on the right side too, purely because if I could choose what to spend my time on, play and review, I'd pick games that most likely are pretty good. And while disappointments do happen (way too often), they're honestly not that often so earth-shatteringly horrible that it brings a game you had high hopes for all the way down to below average (5). Very simple, but just my opinion. Keep the good reviews coming!
(# 17) on 11/23/2008 20:27
Great write up.
(# 18) on 11/23/2008 20:27
To explain why my graph is like that, I tend to pick the games that I review, and don't get too many "assigned" to me. So, I either pick a game that I know is going to be worth it (hopefully), or one that I know is going to be really fucking bad. Like, I expect Persona 4 to score well, because the previous title was so damned good.
Also, it seems like you're missing a few numbers. Like, for example, I gave Persona 3 a 10.
(# 19) on 11/23/2008 20:30
(# 20) on 11/23/2008 20:31
(# 21) on 11/23/2008 20:31
(# 22) on 11/23/2008 20:38
That being said, I think your definition of 5 being "average" is fascinating. I always tend to look at it as if 5 is a mediocre game, not as if the average game is a 5. That's a really interesting distinction to make, and one I'd never really considered.
(# 23) on 11/23/2008 20:39
(# 24) on 11/23/2008 20:46
Why in the world can you write the worst comments in the history of ever and then come back with one like that? You make my head hurt...
(# 25) on 11/23/2008 20:47
Still, what possessed you to go through all this trouble? It's fantastic, but damn son!
<3
(# 26) on 11/23/2008 20:54
(# 27) on 11/23/2008 20:55
(# 28) on 11/23/2008 21:02
I believe Anthony is more confident in his opinions and thus more willing to write lower score reviews. Other editors, like most gamers, focus on the good games... why bother with the trash.
Great work Dexter!
(# 29) on 11/23/2008 21:05
(# 30) on 11/23/2008 21:16
(# 31) on 11/23/2008 21:27
but its hard for gamers to review games. we like them. i could never do a review, cause if a game isn't totally broken, i'll find something to like. its a lot like pie. even when its bad, its still pie, and better than vegetables. or histograms.
(# 32) on 11/23/2008 21:33
Once they get every game for free and are big enough not to use Metacritic anymore the reviews will use stars and everyone will be happy and bellcurvy.
(# 33) on 11/23/2008 21:37
Fantastic work, Dexter. Very enlightening. I hope people take your caveats about possible reasons for this distribution to heart as well.
(# 34) on 11/23/2008 21:41
Also, nice work D.
(# 35) on 11/23/2008 21:44
(# 36) on 11/23/2008 21:51
(# 37) on 11/23/2008 22:02
(# 38) on 11/23/2008 22:14
(# 39) on 11/23/2008 22:19
"[...]the Destructoid review team is probably more likely to play and review games only the games that are likely to enjoy."
Like you said, they aren't reviewing every game to come out; there's a long list of those every year.
Page 1: http://en.wikipedia.org/wiki/Category:2008_video_games
The games reviewed are in no way chosen as a random sample from all titles released. Because of that, the averages of the scores have no extrapolative significance and they mean absolutely nothing.
Frankly, I think that if you were to tabulate the averages of all games released, the "high scores" that you're highlighting in this post would then be the outliers in your histograms. Would you make a post implying that Destructoid might not be fully utilizing their own 10-point scale when the far majority would be between 1-5?
Maybe, just maybe, there's some kind of method to the madness that is their process in choosing which games to review! Could it be!?!?
(# 40) on 11/23/2008 22:24
(# 41) on 11/23/2008 22:26
@Joseph: One of the things I've been wondering about (and I mention in the writeup) is that the assumption that a score of five means a game is average means that the scores should be centered around it, but that assumption is not necessarily true.
@Nick: No need to defend yourself, and indeed, the idea that you tend to review big-name games which are typically more likely to be actually good has been brought up as the most likely cause of the score buffing.
@Anthony: I put ten shitloads of work into this.
@Everybody else: Wow, I didn't really expect this kind of response for this article. Thanks for the responses. Perhaps I'll collect them and make a histogram out of 'em.
(# 42) on 11/23/2008 22:30
Cunning bastard.
(# 43) on 11/23/2008 22:33
Since we're not in a single office, some editors get PR-sent games more often than others, and since the team as a whole isn't reviewing the same games, it is hard to read editor trends. Also, editors buy their own games 9/10 times and since we're not funded we primarily review games we've been looking forward to or that PR companies share with us. So there's a lot of variables that make it tricky to weigh against.
It would be more productive to see the same data set (accounting for actual games reviewed). I'm curious if that would reveal that we're more likely to use the 1-10 scale, or would that only show that we agreed that the good games were good? It's hard to say.
(# 44) on 11/23/2008 22:33
I really thought I would see a nice bell like shape for the full time histogram, based on my believing of harsh good reviewers on this site.
But, after a quick glance to the latest reviewed titles I started to imagine a more left oriented shape, really orientated to the 4 as the highest peak and not the 8 and 9 scores.
Yes, it could be like Dick McVengance and Joseph stated, games that they spent 60 bucks on, and not shit to review.
IT could be, but I am starting to think that the scale is not well implemented, and should be well revised by the reviewers.
Come on! Games aren't that great lately.
Also, sorry for my English rapping.
(# 45) on 11/23/2008 22:49
(# 46) on 11/23/2008 22:50
[overly complicated part]First, we have to find the score averages of every reviewed game in the past two years and average them (minus Destructoid review scores) to discover the "industry bias". I'm guessing you'll find the average game gets a 6.8 or so. You'll probably also find that the bias follows a bell shape as you get away from this score median; it's really easy to bump something up from a 6.5 to a 7 without anyone complaining, but bumping up a score from a 9 to a 9.5 will draw criticism if it's not surrounded by a lot of 9.5s from other reviews. In other words, there still won't be many games (proportionately) in that last 9-10 score range.
Then, once we get a rough bias curve, we can limit our set to the body of games reviewed by Destructoid, and compare the Dtoid score on a given game to the industry average and see what emerges.[/overly complicated part]
A quick-and-dirty way to approximate these results is simply to do histograms of dtoid game reviews vs. the gamerankings.com average for the same games. It's not as statistically accurate, but it's way easier.
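For what it's worth, the quick-and-dirty version might look something like this in Python. This is only a sketch: the function name, game titles, and numbers are invented, and it assumes the aggregate averages have already been rescaled to the same ten-point scale as the Dtoid scores.

```python
def score_differences(dtoid_scores, aggregate_scores):
    """Return {game: dtoid score minus aggregate average} for games in both sets."""
    return {
        game: round(dtoid_scores[game] - aggregate_scores[game], 2)
        for game in dtoid_scores
        if game in aggregate_scores
    }

# Hypothetical titles and scores, for illustration only.
dtoid = {"Game A": 9.0, "Game B": 4.0, "Game C": 7.5}
aggregate = {"Game A": 8.5, "Game B": 6.0, "Game D": 7.0}

diffs = score_differences(dtoid, aggregate)
mean_diff = sum(diffs.values()) / len(diffs)

print(diffs)
print(f"average difference: {mean_diff:+.2f}")
```

A positive average difference would suggest dtoid scores run higher than the aggregate for the same games, and a negative one would suggest they run lower. The differences themselves could then be fed into a histogram.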
(# 47) on 11/23/2008 22:52
GOOD IDEA
(# 48) on 11/23/2008 22:55
Nice job.
(# 49) on 11/23/2008 22:58