an analysis of 3.9 million tf2 games

Sep 09 2025

this started all the way back in july when i was chatting with my friend labricecat, who was working on a new website, slurs.tf. for it she decided to download data from logs.tf, a popular website for tf2 logs, which in turn inspired me to do this! just a preface: even though i've played tf2, i'm barely knowledgeable about it, so make sure to scold me if i've missed something obvious

step 1: scraping

websites like this usually provide database dumps, so i checked for one beforehand, but didn't find anything

the script itself is pretty simple: logs.tf stores parsed log files under http://logs.tf/json/<log_id>, where log_id is a sequential integer, so writing up a small downloader in python was easy (tip: use requests.Session!). inspired by tom murphy vii and the elaborate TUIs he makes for his projects, i used enlighten to spice it up a bit
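for reference, the downloader boils down to something like this (a rough sketch: the url pattern is from above, but the error handling and file layout here are just placeholders, not my actual script):

```python
import requests

BASE_URL = "http://logs.tf/json/{}"  # log ids are sequential integers

def log_url(log_id: int) -> str:
    return BASE_URL.format(log_id)

def scrape(start: int, stop: int, out_dir: str = "logs") -> None:
    # a single Session reuses the underlying TCP connection across
    # requests, which adds up over millions of tiny downloads
    with requests.Session() as session:
        for log_id in range(start, stop):
            resp = session.get(log_url(log_id))
            if resp.ok:  # skip non-200 responses
                with open(f"{out_dir}/{log_id}.json", "wb") as f:
                    f.write(resp.content)
```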

the final look of the display, showing status codes for each game id, and the size of the downloaded json log

all in all it took about a week to download everything. a bit before finishing though, my friend found a database dump (here). whoops x3

another tip: when downloading a bunch of small files (and let's also say you can't use something smarter like a database, or you're lazy like me), write to a file system (be it a virtual one or an external drive) with a small block size, to minimize overhead (the space the file system uses beyond the file's actual contents, which can sometimes be more than 10x the file itself!)
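to put a number on that overhead: a file always occupies a whole number of blocks, so the waste is easy to compute (the 300-byte log here is a made-up size for illustration):

```python
import math

def disk_usage(file_size: int, block_size: int) -> int:
    # a file occupies a whole number of blocks, so a small file
    # mostly wastes its last (often only) block
    return max(1, math.ceil(file_size / block_size)) * block_size

disk_usage(300, 4096)  # -> 4096, over 13x the actual data
disk_usage(300, 512)   # -> 512, much less overhead
```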

step 2: processing

to speed up our various data analysis tasks down the line, we can run over all the downloaded files once, extract everything we want, and store it in a database, speeding up any later retrieval. this step wasn't very interesting; i just took values i thought would be useful and dumped them into a database

(step 3?) analysis

charts! so many charts! let's go (click on an image to view it)

the first one's a bit meta: this is the total logs uploaded per month. you can notice the absolute peak in january of 2016, the decline after it, and the eventual second spike during the pandemic. the sharp dip at the end is due to the data having a cutoff early in the month. otherwise uploads seem to hover at a stable 20k per month, meaning about 1 log uploaded every 2 minutes

actually checking the logs per minute, there are also spikes i can't seem to explain; the maximum on this chart is exactly at 2:30 UTC. now onto the match data itself

as you can see, BLU has a ~3% higher chance of winning. player count per team even seems to be inversely correlated with winning: when a team has more players than its opponent, it has a 30% chance of winning, while the opposing team has a 50% chance! i have no idea why, but on the topic of player count

the peaks are at common game modes like 2v2, 6v6 (making up 42% of all games!), and 9v9. next up

a bunch of flags and attributes were included, which are graphed out (green is true, red is false, gray is missing), along with this correlation matrix (the value indicates whether a pair of stats is positively correlated, uncorrelated, or inversely correlated)

finally, here's a ranking of the top 50 maps (click to view in full)

running a player simulation

my actual goal with this project was running a simulation, where i use an elo-like algorithm to try and guess the skill scores of the players, but before i show any results, a quick explanation is in order

rating system algorithms

a rating system algorithm, in essence, is an algorithm that takes some skill prediction for a player, gets fed data in the form of a match result, and returns an updated skill prediction, such that higher rated players have a higher chance of winning against lower rated players.

such systems usually work by calculating the "surprise" of the match result: if someone with a high score wins against someone with a low score, that doesn't really defy the algorithm's expectations, but if someone rated low wins against someone rated high, that means the players' ratings aren't accurate enough yet, and should be adjusted more drastically.
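as a concrete example, this is roughly how classic Elo turns that surprise into a rating update (k=32 is just a common choice of update speed, not anything specific to this project):

```python
def expected_score(r_a: float, r_b: float) -> float:
    # probability the Elo model assigns to player a beating player b
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    # "surprise" = actual result minus expected result; an upset
    # produces a big surprise and therefore a big rating swing
    surprise = (1.0 if a_won else 0.0) - expected_score(r_a, r_b)
    return r_a + k * surprise, r_b - k * surprise
```

a win by the favorite barely moves either rating, while an upset shifts both by a lot.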

you might've heard of Elo, but it generally only supports 1v1 matches, so i settled on an algorithm by Microsoft called TrueSkill, which is used in games like Halo, Gears of War and Forza Motorsport exactly for the purpose of team-based matches. the interesting part is that a TrueSkill score is actually comprised of 2 numbers, which represent the parameters of a normal distribution: mu (μ), the mean, and sigma (σ), the standard deviation. this translates as mu being the score and sigma the uncertainty. to get a single estimate, you use the conservative estimate formula,

μ − σ·k

where k is usually 3, meaning that we are ~99% sure that the player's actual skill is higher than the estimate. this also means that someone with a score of μ=5, σ=1 has a higher skill estimate than someone with a score of μ=10, σ=3

the default score for a player is μ=25, σ=25/3≈8.333, meaning that their conservative skill estimate starts at 0, and that's pretty much everything you need to know about TrueSkill
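in code, the conservative estimate is just this (plain python for illustration; the `trueskill` package on pypi implements the actual rating updates):

```python
def conservative_estimate(mu: float, sigma: float, k: float = 3.0) -> float:
    # "we're ~99% sure the player's real skill is at least this"
    return mu - k * sigma

conservative_estimate(25, 25 / 3)  # the default rating: starts at ~0
conservative_estimate(5, 1)        # -> 2.0
conservative_estimate(10, 3)       # -> 1.0, lower despite the higher mu
```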

running the algorithm

running this over the 3.9 million games (with ~20k outliers filtered) and 250k players took 3 days, for which i also designed a TUI

at this point i switched my terminal font to Iosevka

i wanted to assess the results' accuracy first, so i ran a few tests:

this is how often the match result was correctly predicted over time, showing that the more data the algorithm had, the more correct its predictions were. in total, its average accuracy was 65%, which could probably be fine-tuned to be higher. something a bit more representative is this plot

this is the difference in mu between the two teams plotted against how much more the winning team scored on average. as you can see, once the difference becomes positive, the average score difference jumps from 2 to 3 points, showing that the algorithm correctly assessed the team skills. the falloff at the edges is probably due to higher sigmas, meaning the difference in mu itself was too uncertain. while doing this analysis, i noticed an interesting pattern,

the average conservative score estimate decreased over time. my guess is that influxes of new players "feed" the total score pool, as you can see from the spike resembling the player activity during the pandemic

results

making a leaderboard with TrueSkill requires using the conservative skill estimate, but because of outliers i added the following restrictions:

with that, you get this:

# Name Rating μ σ Games played
1 b4nny 27.34 31.321 1.32678 15470
2 nR ryb 25.71 29.6099 1.29982 1734
3 kobe1920 25.15 29.0349 1.29486 5641
4 nR TEK36@i55 24.85 28.7026 1.28393 2121
5 delpo 24.08 27.7716 1.23051 4928
6 AAYYY 23.77 27.6449 1.29299 9588
7 Freestate 23.55 27.54 1.32844 2073
8 nook nook 23.46 27.5874 1.37543 2216
9 manic 23.38 27.5052 1.37439 4461
10 nR Flippy 23.02 26.9048 1.29502 1360
11 T' 22.61 26.2606 1.21573 2233
12 toemas_ 22.41 26.3641 1.31712 9795
13 bom 22.29 26.2162 1.30965 7483
14 heawing (●´ω`●) 22.26 25.9825 1.24144 1343
15 ShaDowBurn 22.17 26.0482 1.29266 3714
16 nR Zebbosai 22.11 25.9856 1.29186 3106
17 brian's conjecture 21.98 25.9557 1.32425 5930
18 FROYO habib @ RGL LAN 21.85 25.8737 1.33973 11006
19 VNC kn *pepis 21.73 25.5833 1.28563 5453
20 ᴱᵀ hng 21.69 25.8237 1.37941 1025

at the #1 spot is the renowned twitch streamer b4nny (Grant Vincent), at #2 is ryb (Ruben Ljungdahl), who is actually retired (you'll be able to see this soon), and at #3 is kobe1920 (Noah G.). another thing we can do is plot the skill estimate over time. here's an example:

in this plot i've also added the raw μ and σ in yellow (the filled-in area is the uncertainty), with the conservative estimate in blue. as you can see, in the beginning, with sigma high, the conservative estimate is quite low compared to mu, but as b4nny plays more games, the uncertainty decreases and the estimate gets closer to mu. plotting all 20 players,

zooming in, you might notice how ryb's activity basically stops (his skill doesn't change), which seems to happen to other players in the chart too.

here is the total distribution of all players' conservative ratings. you can see that it's a bell curve centered at 0, and that generally the more games you play, the higher your score is. and that's about it!

notes

i feel like this is one of my lesser works, it underdelivered a bit, but at least i got to make a few charts! x3 on that topic, you can see a few extra charts that i did (in alphabetical order):


© by nicole, licensed under CC BY-SA 4.0; go back