an analysis of 3.9 million tf2 games

Sep 09 2025

this started all the way back in july when i was chatting with my friend labricecat, who was working on a new website, slurs.tf. for it she decided to download data from logs.tf, a popular website for tf2 logs, which in turn inspired me to do this! just a preface: even though i've played tf2, i'm barely knowledgeable about it, so make sure to scold me if i've missed something obvious

step 1: scraping

websites like this usually provide database dumps, so i checked for one beforehand, but didn't find anything

the script itself is pretty simple: logs.tf stores parsed log files under http://logs.tf/json/<log_id>, where log_id is a sequential integer, so writing up a small downloader in python was easy (tip: use requests.Session!). inspired by tom murphy vii and the elaborate TUIs he makes for his projects, i used enlighten to spice it up a bit
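for reference, the downloader boils down to something like this (a rough sketch: the url pattern is from above, but the error handling and file layout here are just placeholders, not my actual script):

```python
import requests

BASE_URL = "http://logs.tf/json/{}"  # log ids are sequential integers

def log_url(log_id: int) -> str:
    return BASE_URL.format(log_id)

def scrape(start: int, stop: int, out_dir: str = "logs") -> None:
    # a single Session reuses the underlying TCP connection across
    # requests, which adds up over millions of tiny downloads
    with requests.Session() as session:
        for log_id in range(start, stop):
            resp = session.get(log_url(log_id))
            if resp.ok:  # skip non-200 responses
                with open(f"{out_dir}/{log_id}.json", "wb") as f:
                    f.write(resp.content)
```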

the final look of the display, showing status codes for each game id, and the size of the downloaded json log

all in all it took about a week to download everything. a bit before finishing though, my friend found a database dump (here). whoops x3

another tip: when downloading a bunch of small files (and let's also say you can't use something smarter like a database, or you're lazy like me), write to a file system (be it a virtual one or an external drive) with a small block size, to minimize overhead (the space the file system uses beyond the file's actual contents, which can sometimes be more than 10x the file itself!)
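to put a number on that overhead: a file always occupies a whole number of blocks, so the waste is easy to compute (the 300-byte log here is a made-up size for illustration):

```python
import math

def disk_usage(file_size: int, block_size: int) -> int:
    # a file occupies a whole number of blocks, so a small file
    # mostly wastes its last (often only) block
    return max(1, math.ceil(file_size / block_size)) * block_size

disk_usage(300, 4096)  # -> 4096, over 13x the actual data
disk_usage(300, 512)   # -> 512, much less overhead
```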

step 2: processing

to speed up our various data analysis tasks down the line, we can run over all the downloaded files once, extract everything we want, and store it in a database, speeding up any later retrieval. this step wasn't very interesting; i just took values i thought would be useful and dumped them into a database

(step 3?) analysis

charts! so many charts! let's go (click on an image to view it)

the first one's a bit meta: this is the total logs uploaded per month. you can notice the absolute peak in january of 2016, the decline after it, and the eventual second spike during the pandemic. the sharp dip at the end is due to the data having a cutoff early in the month. otherwise uploads seem to hover at a stable 20k per month, meaning about 1 log uploaded every 2 minutes

actually checking the logs per minute, there are also spikes i can't seem to explain; the maximum on this chart is exactly at 2:30 UTC. now onto the match data itself

as you can see, BLU has a ~3% higher chance of winning. player count per team even seems to be inversely correlated with winning: when a team has more players than its opponent, it has a 30% chance of winning, while the opposing team has a 50% chance! i have no idea why, but on the topic of player count

the peaks are at common game modes like 2v2, 6v6 (making up 42% of all games!), and 9v9. next up

a bunch of flags and attributes were included, which are graphed out (green is true, red is false, gray is missing), along with this correlation matrix (the value indicates whether a pair of stats is positively correlated, uncorrelated, or inversely correlated)

finally, here's a ranking of the top 50 maps (click to view in full)

running a player simulation

my actual goal with this project was running a simulation, where i use an elo-like algorithm to try and guess the skill scores of the players, but before i show any results, a quick explanation is in order

rating system algorithms

a rating system algorithm, in essence, is an algorithm that takes some skill prediction for a player, gets fed data in the form of a match result, and returns an updated skill prediction, such that higher rated players have a higher chance of winning against lower rated players.

such systems usually work by calculating the "surprise" of the match result: if someone with a high score wins against someone with a low score, that doesn't really defy the algorithm's expectations, but if someone rated low wins against someone rated high, that means the players' ratings aren't accurate enough yet, and should be adjusted more drastically.
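as a concrete example, this is roughly how classic Elo turns that surprise into a rating update (k=32 is just a common choice of update speed, not anything specific to this project):

```python
def expected_score(r_a: float, r_b: float) -> float:
    # probability the Elo model assigns to player a beating player b
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    # "surprise" = actual result minus expected result; an upset
    # produces a big surprise and therefore a big rating swing
    surprise = (1.0 if a_won else 0.0) - expected_score(r_a, r_b)
    return r_a + k * surprise, r_b - k * surprise
```

a win by the favorite barely moves either rating, while an upset shifts both by a lot.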

you might've heard of Elo, but it generally only supports 1v1 matches, so i settled on an algorithm by Microsoft called TrueSkill, which is used in games like Halo, Gears of War and Forza Motorsport exactly for the purpose of team-based matches. the interesting part is that a TrueSkill score is actually comprised of 2 numbers, which represent the parameters of a normal distribution: mu (μ), the mean, and sigma (σ), the standard deviation. this translates as mu being the score and sigma the uncertainty. to get a single estimate, you use the conservative estimate formula,

μ − σ·k

where k is usually 3, meaning that we are ~99% sure that the player's actual skill is higher than the estimate. this also means that someone with a score of μ=5, σ=1 has a higher skill estimate than someone with a score of μ=10, σ=3

the default score for a player is μ=25, σ=25/3≈8.333, meaning that their conservative skill estimate starts at 0, and that's pretty much everything you need to know about TrueSkill
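in code, the conservative estimate is just this (plain python for illustration; the `trueskill` package on pypi implements the actual rating updates):

```python
def conservative_estimate(mu: float, sigma: float, k: float = 3.0) -> float:
    # "we're ~99% sure the player's real skill is at least this"
    return mu - k * sigma

conservative_estimate(25, 25 / 3)  # the default rating: starts at ~0
conservative_estimate(5, 1)        # -> 2.0
conservative_estimate(10, 3)       # -> 1.0, lower despite the higher mu
```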

running the algorithm

running this over the 3.9 million games (with ~20k outliers filtered) and 250k players took 3 days, for which i also designed a TUI

at this point i switched my terminal font to Iosevka

i wanted to assess the results' accuracy first, so i ran a few tests:

this is how often the match result was correctly predicted over time, showing that the more data the algorithm had, the more correct its predictions were. in total, its average accuracy was 65%, which could probably be fine-tuned to be higher. something a bit more representative is this plot

this is the difference in mu between the two teams plotted against how much more the winning team scored on average. as you can see, once the difference becomes positive, the average score difference jumps from 2 to 3 points, showing that the algorithm correctly assessed the team skills. the falloff at the edges is probably due to higher sigmas, meaning the difference in mu itself was too uncertain. while doing this analysis, i noticed an interesting pattern,

the average conservative score estimate decreased over time. my guess is that influxes of new players "feed" the total score pool, as you can see from the spike resembling the player activity during the pandemic

results

making a leaderboard with TrueSkill requires using the conservative skill estimate, but because of outliers i added the following restrictions:

with that, you get this:

# Name Rating μ σ Games played
1 b4nny 27.34 31.321 1.32678 15470
2 nR ryb 25.71 29.6099 1.29982 1734
3 kobe1920 25.15 29.0349 1.29486 5641
4 nR TEK36@i55 24.85 28.7026 1.28393 2121
5 delpo 24.08 27.7716 1.23051 4928
6 AAYYY 23.77 27.6449 1.29299 9588
7 Freestate 23.55 27.54 1.32844 2073
8 nook nook 23.46 27.5874 1.37543 2216
9 manic 23.38 27.5052 1.37439 4461
10 nR Flippy 23.02 26.9048 1.29502 1360
11 T' 22.61 26.2606 1.21573 2233
12 toemas_ 22.41 26.3641 1.31712 9795
13 bom 22.29 26.2162 1.30965 7483
14 heawing (●´ω`●) 22.26 25.9825 1.24144 1343
15 ShaDowBurn 22.17 26.0482 1.29266 3714
16 nR Zebbosai 22.11 25.9856 1.29186 3106
17 brian's conjecture 21.98 25.9557 1.32425 5930
18 FROYO habib @ RGL LAN 21.85 25.8737 1.33973 11006
19 VNC kn *pepis 21.73 25.5833 1.28563 5453
20 ᴱᵀ hng 21.69 25.8237 1.37941 1025

at the #1 spot is the renowned twitch streamer b4nny (Grant Vincent), at #2 is ryb (Ruben Ljungdahl), who is actually retired (you'll be able to see this soon), and at #3 is kobe1920 (Noah G.). another thing we can do is plot the skill estimate over time. here's an example:

in this plot i've also added the raw μ and σ in yellow (the filled-in area is the uncertainty), with the conservative estimate in blue. as you can see, in the beginning, with sigma high, the conservative estimate is quite low compared to mu, but as b4nny plays more games, the uncertainty decreases and the estimate gets closer to mu. plotting all 20 players,

zooming in, you might notice how ryb's activity basically stops (his skill doesn't change), which seems to happen to other players in the chart too.

here is the total distribution of all players' conservative ratings. you can see that it's a bell curve centered at 0, and that generally the more games you play, the higher your score is. and that's about it!

notes

i feel like this is one of my lesser works, it underdelivered a bit, but at least i got to make a few charts! x3 on that topic, you can see a few extra charts that i did (in alphabetical order):


© by nicole, licensed under CC BY-SA 4.0; go back