NCAA D1 game model clustering and KPIs derived by machine learning

“Success leaves clues. People who succeed at the highest level are doing something differently than everyone else does.” Tony Robbins

The American college game is a different beast.

With clock stoppages extending ball in-play time by nearly 20 minutes and virtually unlimited substitutions, NCAA soccer requires a different perspective.

General principles of play remain the same, but because the game is so unique, college soccer coaches also have to immerse themselves in trends that are specific to their level of play. Sure, take a high-pressing, possession-dominant approach like we’ve seen from Manchester City and Pep Guardiola. That’s fine. But now, tailor it to this unique environment.

Yeah, sit a little deeper and create space to counterattack like Atlético Madrid and Diego Simeone. But know how to take the La Liga side’s model and adapt it to the college game.

Don’t stop there. Now, study the best teams that NCAA D1 soccer has to offer. Use teams with similar game models as a guide for tactical adjustments to your own game model.

That’s where this data analysis originates. A couple of D1 coaches have asked, “How does our game model relate to other programs? And who with a similar game model is pulling it off better than us?”

To those coaches, thanks for the inspiration and conversations. That sparked further conversations within Total Football Analysis and put Sathish Prasad V.T and me to work.

You’ll recognise Sathish’s name from the brilliant work that he’s done on the Total Football Analysis website, but he’s also our data analytics guru and is completing a Master’s in Quantitative Management at Duke University. After a great deal of discussion, he proposed we use his unique skill set and data analytics to cluster NCAA D1 men’s soccer programs into four distinct playing styles, ranging from the aesthetically pleasing to sides that bunker in defensively. From those four clusters, which were created from the style-of-play statistics in our NCAA spreadsheet and its 25k+ values, he applied his talent in machine learning to reverse engineer the significant statistical categories.

Through clustering and machine learning, we have identified statistical categories that carry greater significance within each game model, helping coaches better allocate their time to determine their programs’ KPIs. The game model-specific KPIs in this article are not exhaustive, but they are an excellent start and summarise the top three groups well (the fourth cluster lacked successful representation).

On to our findings.

Clustering, machine learning and general KPIs

Let’s start with a basic overview of the process.

Building out the spreadsheet was the first step. From there, we were able to use pass types and other statistics that correlate to styles of play rather than match results, which separated the teams into four distinct groups. As we reviewed the clusters, we found that certain key statistical categories were dominated by specific groups.
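To give a feel for the clustering step, here is a minimal sketch in Python. It is not the pipeline used for this article, which doesn't specify an algorithm; it assumes a plain k-means on standardised style metrics. The team names, the three metric columns and every value are invented for illustration, and the toy data uses three groups rather than the four found in the real analysis.

```python
import math

# Hypothetical style metrics per team: (possession %, long-pass share %, PPDA).
# Names and values are made up; they are not from the NCAA spreadsheet.
teams = {
    "Team A": (61.0, 9.0, 7.5),
    "Team B": (59.0, 10.0, 8.0),
    "Team C": (51.0, 13.0, 9.5),
    "Team D": (50.0, 14.0, 10.0),
    "Team E": (42.0, 19.0, 13.0),
    "Team F": (41.0, 20.0, 13.5),
}

def standardise(rows):
    """Convert each column to z-scores so no metric dominates by scale."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]

def init_centroids(points, k):
    """Deterministic farthest-point start, so results don't depend on a seed."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centroids = init_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels

names = list(teams)
labels = kmeans(standardise([teams[n] for n in names]), k=3)
clusters = dict(zip(names, labels))
```

Note that the inputs are style metrics only. Results such as points per game are deliberately left out, so the groups reflect how teams play rather than how often they win.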

This is where machine learning kicked in. Reverse engineering the data, Sathish used his coding magic to identify the categories that were most significant to each cluster. With those findings, we were able to go back to the data to validate the results proposed by his machine learning process.
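The article doesn't detail which machine learning method was used, so the sketch below swaps in a much simpler stand-in: a standardised mean difference. For each metric, it measures how far a cluster's average sits from the overall average, in overall standard deviations. The metric names and values are hypothetical.

```python
import statistics

# Hypothetical rows: (cluster label, {metric: value}). All values invented.
rows = [
    (1, {"progressive_passes_p90": 78, "long_pass_share": 9, "high_recoveries_p90": 14}),
    (1, {"progressive_passes_p90": 74, "long_pass_share": 10, "high_recoveries_p90": 13}),
    (2, {"progressive_passes_p90": 66, "long_pass_share": 13, "high_recoveries_p90": 11}),
    (2, {"progressive_passes_p90": 64, "long_pass_share": 14, "high_recoveries_p90": 10}),
    (3, {"progressive_passes_p90": 55, "long_pass_share": 19, "high_recoveries_p90": 7}),
    (3, {"progressive_passes_p90": 53, "long_pass_share": 20, "high_recoveries_p90": 6}),
]

def distinguishing_metrics(rows, cluster):
    """Rank metrics by the gap between the cluster mean and the overall mean,
    expressed in overall standard deviations (largest absolute gap first)."""
    scores = {}
    for metric in rows[0][1]:
        everyone = [stats[metric] for _, stats in rows]
        members = [stats[metric] for label, stats in rows if label == cluster]
        spread = statistics.pstdev(everyone) or 1.0
        scores[metric] = (statistics.mean(members)
                          - statistics.mean(everyone)) / spread
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

ranking = distinguishing_metrics(rows, cluster=1)
```

A positive score means the cluster sits above the overall average for that metric, a negative score below it. Whatever method is used, the validation step described above still applies: go back to the raw data and check that the flagged categories really do separate the groups.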

But we didn’t stop there.

The point of the game isn’t to play within a specific game model. That’s simply representative of our views of the sport and how we want our sides to interact within the chaos of a game environment.

The point is to win.

Going back to the clusters, we looked for the most successful teams within each game model. To add to the challenge, Power Conference schools were eliminated from contention. We looked outside of the ACC, Pac-12 and Big Ten to find our teams.

For readers unfamiliar with NCAA D1 soccer, the term mid-majors is perhaps better reflected in NCAA football and basketball. It’s a term for universities that fall outside of the five, now four, power conferences. The Power Conference schools are the “haves” in those respective sports. Their facilities are on par with top football clubs around the world, and you could argue they have more investment in support staff.

“Mid-majors” is a bit of a misnomer in college soccer. Look at any of the rankings throughout the season, and you’ll find equal, if not better, representation from the “mid-majors.” The only reason we are selecting teams from this group is to avoid the excuses of facilities and resources. Again, some mid-majors are virtually on par with the power conference teams, but it’s an excuse we want to eliminate.

Going back to the clusters, the first three featured very strong teams, whereas the fourth was populated by many of the weakest teams in the division. Since our objective is to identify statistical trends within NCAA men’s soccer’s most successful programs, our attention is on the first three clusters.

Our chart on points per game and possession percentage gives an idea of how we selected our representative teams for three clusters: New Hampshire, UNC Charlotte and Seton Hall. Each program had a very successful season, and we can see the variance in where they fall on the chart. Charlotte and Seton Hall are close in possession percentage; that’s where we can differentiate the two programs with more advanced metrics.

Even though each cluster has specific data that is more relevant to its game model and makes for better KPIs to track throughout the season, we did find that there are some general performance markers that are applicable across each grouping.

One of those is xG and xGA. When looking at successful programs, you’re going to find a direct statistical relationship between their performances in these two categories and the results — perhaps obvious, but worth noting.

Two other statistics that were significant across the clusters were field tilt and offensive duels. Even here, you can see that the significance of our highlighted teams’ placement is relative to their game model. It is critical for New Hampshire and teams like them to spend more time in the opposition’s half of the pitch, especially in or near the opponent’s box.

For a more balanced team like Charlotte, there is still an above-average number of box entries, but their approach doesn’t require quite the degree of touches in the box that you get with New Hampshire. For each of these key models, proficient 1v1 attacking is important, especially in the final third.

Finally, and this is perhaps the most interesting piece, Seton Hall’s field tilt is close to the average. Given that the teams in our third cluster will have less of the ball, they don’t need as many touches in the box as the first two groups, but the closer they are to the NCAA D1 average, the better their chances of getting results. And do notice how few 1v1 attacking duels they have. They want to get from Point A to Point B as efficiently as possible. Fewer attacks will inherently result in fewer 1v1 dribbles P90, but those quick counterattacking situations often limit the significance of 1v1 attacking and place greater reliance on off-the-ball movement and controlled passing.
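For readers who want to compute field tilt themselves, here is a small sketch. It assumes one commonly used definition, a team's share of all final-third passes in a match; the data behind this article may define it differently (touches, for example), and the counts below are hypothetical.

```python
def field_tilt(team_final_third_passes, opp_final_third_passes):
    """Team's share (%) of all final-third passes played in the match."""
    total = team_final_third_passes + opp_final_third_passes
    if total == 0:
        return 50.0  # neither side reached the final third: treat as even
    return 100.0 * team_final_third_passes / total

# Hypothetical match counts for two very different game models.
possession_side = field_tilt(120, 60)  # dominant territory, roughly two-thirds
counter_side = field_tilt(75, 75)      # right on the 50% line
```

A Cluster 3 side sitting near 50% is holding its own territorially despite having less of the ball, which is the pattern described for Seton Hall.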

Cluster 1 – possession-dominant New Hampshire

As we work through these groups, they are ordered from more possession to less possession. While the possession percentage isn’t necessarily significant and, in fact, is often overvalued, it does give a sense of each team’s approach to the attacking phases of the game. More possession tends to correlate to a greater need for effective ball circulation in tight spaces and creativity in the final third. In contrast, less possession will free space for a team to attack against minimal opposition.

With that in mind, we look at our first cluster, that of New Hampshire and the possession-dominant, high-pressing sides. Manchester City, Bayern Munich and Barcelona go into each match with the intention of being the aggressors, seizing the initiative and setting the tempo with their in- and out-of-possession tactics. Ideally, they can bend the opposition to their will. While they may make slight changes to their tactics based on the opponent, there is a very clear identity predicated on dictating the terms of the match. That’s the model.

For this group, possession percentage is one to track, but they also need to ensure that their possession is productive. Relating their possession percentage numbers to progressive passes P90 will give them an idea of how often they are breaking lines.

From an opposition analysis standpoint, pass-type wheels will also shed light on the opponent’s tendencies. This can be done on a per 100 passes scale or by putting the ratio relative to total passes in a percentile rank. In either case, the opposition’s tendencies will emerge. If they have an over-reliance on lateral passes, that could signal an inability to play through the press. Suppose there is a greater reliance on long passes or even forward passes. In that case, that is likely a sign of a more transitional team, requiring quick counterpressing and a sound rest defence to counteract their pathway forward.
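The per 100 passes scale is simple to build. The sketch below assumes a dictionary of raw pass counts for one opponent; the category names and numbers are hypothetical, and in real data the categories can overlap (a long pass is usually also a forward pass), so the rescaled values need not sum to 100.

```python
def per_100_passes(pass_counts):
    """Rescale raw pass-type counts to a per-100-passes basis."""
    total = pass_counts["total"]
    return {kind: 100.0 * count / total
            for kind, count in pass_counts.items() if kind != "total"}

# Hypothetical season totals for one opponent.
opponent = {"total": 400, "forward": 120, "lateral": 180, "back": 60, "long": 40}
profile = per_100_passes(opponent)
```

Here 45 of every 100 passes are lateral, the kind of skew that would prompt a closer look at how this opponent copes with a press.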

In our last attacking graph, we have positional attack efficiency, so the number of positional attacks P90 as well as the number that convert to shots. If positional attacks are the preference, the idea is to monitor how successful the side is in converting them to shots. When the numbers are off or lagging behind baselines, it’s a concrete sign that something’s off tactically.

Positional attack efficiency is important, but know that we can also relate positional attacks P90 to several metrics, including the percentage that convert to shots and field tilt. There are several different ways to measure whether positional attacks are effective, so this is an opportunity for programs to determine what’s of greater importance to them.
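As a rough sketch, positional attack efficiency can be tracked as the percentage of positional attacks that end in a shot and then compared against a baseline. The function name, the counts and the baseline are all hypothetical.

```python
def positional_attack_efficiency(positional_attacks, attacks_with_shot):
    """Percentage of positional attacks that end in a shot."""
    if positional_attacks == 0:
        return 0.0
    return 100.0 * attacks_with_shot / positional_attacks

# Hypothetical single-game counts for a team and a cluster baseline.
team_eff = positional_attack_efficiency(30, 9)
baseline_eff = positional_attack_efficiency(25, 5)
lagging = team_eff < baseline_eff  # flag to review tactics when True
```

The baseline is the program's call: it could be the cluster average, the figure posted by a model team, or the side's own previous season.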

The goal is to make sure that when they are in possession of the ball, they’re productive with it. Tracking the ratio of passes to more specific pass types, such as progressive passes, will also shed light on their variability in attack. When suffering against opponents who are willing to concede possession while defending in deeper spaces, a wheel to show a balance of pass types can shed some light on tendencies in possession. At some point, those forward pathways are sealed, requiring back passes to draw the opposition out or lateral passes to take advantage of an unbalanced opponent.

Finally, we have the defensive side. Again, this is not designed to be a comprehensive listing of KPIs. That may require a book or case study, but this does give a sense of key metrics relative to each game model. For Cluster 1, PPDA and high recoveries are critical. First, with the PPDA, priority number one is really their ability to prevent the opponent from taking advantage of the team’s expansive attacking shape. The basic ideas are to win the ball back, force a negative pass, or at least delay the opponent’s progress forward so that the side can recover to its more compact defensive shape.

In terms of high recoveries, it’s not essential that these teams commit to the high press, as we saw in the previous data analysis with Bryant University, but that is typically a distinct characteristic of this cluster. If a program commits to that game model, having an awareness of its relation to the top teams is vital.

Finally, these teams typically don’t concede many shots. That said, the teams within this cluster that tend to struggle often allow high-percentage shots against them. Failing to stop the opposition’s counterattacks and prevent them from entering the box leads to high-xG shots. For Cluster 1 teams, tracking xG per shot against will help evaluate the quality of the shots they are allowing.

One hypothetical example to consider is that of Team A vs Team B, each achieving an xG of 1.0 for a game. Team A takes 20 shots, so an average xG per shot of 0.05. Team B takes five shots, so the average value is 0.20. xG is equal, but if you had to pick one team’s shot quality, which would you take?

The goal is to win, so I would pick Team B. That’s where these possession-dominant teams have to prioritise solid rest defence and a quick counterpress. They simply can’t afford to allow high-calibre shots to the opponent, even with fewer total shots going against them. While the sample size from a single game is small, tracking xG per shot against over the course of the season, and even visualising it chronologically, will give an idea of how effective the team is with their in-possession structure and defensive transitions.
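The Team A vs Team B arithmetic, plus one way of tracking the number over a season, can be sketched as follows. The rolling window and the season log are assumptions made for illustration; the log values are invented.

```python
def xg_per_shot(total_xg, shots):
    """Average chance quality: total xG divided by the number of shots."""
    return total_xg / shots if shots else 0.0

team_a = xg_per_shot(1.0, 20)  # 0.05
team_b = xg_per_shot(1.0, 5)   # 0.2

def rolling_xg_per_shot_against(games, window=3):
    """Rolling xG per shot against over the last `window` games.
    `games` is a list of (xg_against, shots_against), oldest game first."""
    trend = []
    for i in range(window - 1, len(games)):
        chunk = games[i - window + 1 : i + 1]
        trend.append(xg_per_shot(sum(xg for xg, _ in chunk),
                                 sum(shots for _, shots in chunk)))
    return trend

# Hypothetical season log, oldest game first.
season = [(0.6, 8), (0.9, 6), (1.5, 6), (0.5, 10), (0.4, 8)]
trend = rolling_xg_per_shot_against(season)
```

Pooling xG and shots across the window, rather than averaging each game's ratio, keeps a game with very few shots from skewing the trend.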

Cluster 2 – tactically balanced UNC Charlotte

It’s tough to beat a tactically balanced side. It’s even tougher to determine outright KPIs for them or map out their in-and-out-of-possession tactics. That’s why I wrote a whole book about Zinedine Zidane’s 2019/20 Los Blancos side in “Revitalizing Real Madrid.” There are so many nuances to these tactically balanced teams, and they are often terribly misunderstood or misinterpreted.

Two of the benefits of tactical balance are clear principles in the approach without dictating every action and the ability to adapt to whatever the opposition shows. If the opponent is obstinate in playing one style regardless of who they’re playing or what the match state is, it’s easier to plan against them when you have a team that’s capable of playing in any style. Looking back at the recent Real Madrid teams of Zinedine Zidane and Carlo Ancelotti, they could just as easily have 75% possession against Cadiz or 35% against Manchester City. Their identity wasn’t predicated on having the ball or not having it. Rather, it was on playing to their collective strengths, containing the opposition’s top threats, and ruthlessly attacking their opponent’s weaknesses. There was still an understanding of themselves, but perhaps a greater understanding of how they related to the opposition.

And that’s the conundrum of the tactically balanced squads. Their identity is somewhat more difficult to pinpoint, leading to poor interpretations. That was an interesting theme in a conversation with Kevin Langan, the head coach at UNC Charlotte (“Charlotte” on the charts). Charlotte was one of the top sides in NCAA D1 men’s soccer and spent time in the top 10 this season, and its season statistics give an idea of why the side was difficult to pinpoint. While there typically wasn’t a significant spread in possession, there were times when Charlotte enjoyed 55 to 62% of the ball and other times when they found themselves in the 30s. When Charlotte had 61% of the ball against Memphis, they won. In the four games they held 41% possession or less, they recorded four wins.

The point wasn’t to achieve a certain percentage of possession or overemphasise any one phase of the attack. Instead, the objective was to implement the program’s principles of play in whichever terms they experienced in a given game.

Langan was kind enough to share his side’s five principles: 1) verticality, 2) tempo, 3) compactness, 4) winning 1v1s (especially when attacking in the final third), and 5) continuity between the phases. Their trainings are primarily concerned with game situations and are modelled after what they’ll see in the opposition. When I spoke with the assistant coaches over the summer, they stressed that the team emphasises the security of its structures in practice.

The obvious benefit is that the team has experience playing against the approach they’ll encounter in the next game. The less obvious benefit is the opportunity to train tactical IQ. This is not to say training one specific game model won’t lead to a higher soccer IQ. It does. But knowing who you are is one part of the equation. The other is knowing how your side is to interact with the opposition to solve the problems that they’ll pose. Training in and out of possession structure and experiencing the opposition’s tactics in training gives players the reps they need to effectively find solutions when points are on the line.

In terms of KPIs, we’ll share some of UNC Charlotte’s momentarily, but we do want to draw out our process as well. Looking at some of the data points that balanced sides can track, we do find that Charlotte was excellent in match tempo and average shot distance. Whether creating from the press or playing through the opposition, Charlotte routinely found a way to get into the box and create quality shots from close range.

The tempo of the match plays a big role here. Whether it’s a mistake that’s induced by Charlotte’s press or the application of the principle of verticality that allows them to quickly break lines, Charlotte’s approach gives opponents little time to get organised. If the opponent is reactive, the advantage is Charlotte’s.

Our next chart visualises the passes to long passes ratio and passes to the final third P90. We do get a sense of how the possession percentage doesn’t reflect Charlotte’s presence in the final third. Given a middling possession percentage, the objective is to still rate above average in passes to the final third, which ensures the attacks the team does have are routinely positioning them to attack the box. For a side like Charlotte that does not send a significant proportion of long passes, they need to connect their short and intermediate passes to consistently pose a threat in the final third.

Defensive tactics are also adaptable for these teams. They may have a preference for defending in specific regions of the pitch, but they are typically good at defending high, middle or low. For Charlotte, they were slightly above average in high recoveries and a little bit better in middle regains of possession. This is where the side could just as easily set out in a high press and attempt to win the ball in the final third, drop into a mid-block or use the high press to force the opposition to play long into an area dominated by Charlotte defenders.

Charlotte’s presence in that top right quadrant gives an idea of their priorities in defence. PPDA is one of their KPIs, and the objective is to keep it below 10. In 2023, that score was 8.27.
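A KPI like this is easy to check game by game. The sketch below assumes the commonly used definition of PPDA, opponent passes divided by the pressing team's defensive actions (tackles, interceptions, challenges and fouls), usually counted in the higher areas of the pitch. Data providers differ on the exact zone, so treat the formula as an assumption; the match counts are hypothetical.

```python
def ppda(opponent_passes, defensive_actions):
    """Passes allowed per defensive action; lower means a more intense press."""
    if defensive_actions == 0:
        return float("inf")
    return opponent_passes / defensive_actions

PPDA_TARGET = 10.0  # Charlotte's stated objective: keep PPDA below 10

# Hypothetical single-match counts, chosen only to demonstrate the check.
match_ppda = ppda(opponent_passes=248, defensive_actions=30)
match_on_target = match_ppda < PPDA_TARGET

# The 2023 season figure quoted in the article.
season_on_target = 8.27 < PPDA_TARGET
```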

Finally, we get to shots against and shots on target. Charlotte’s structure in and out of possession greatly limited the opposition. They’re not only one of the best in the country in terms of shots against P90, but they were also one of the best teams in shots against on target percentage. Forcing the opposition into longer-range shots and keeping numbers in front of the ball to block shots was a priority for Charlotte.

Charlotte is already ahead of the game with the sports analytics concentration from the school of data science. They have a team of students that meets with them each week to review performance and opposition data analysis. This award-winning group developed an xG model specifically for UNC Charlotte and has travelled abroad to present it.

The interactions with that team of students have helped UNC Charlotte establish its KPIs. Statistics like packing, field tilt based on the position of the ball, match tempo, PPDA and duels in the final third are key metrics for Charlotte. The analytics team also works on set pieces and possession within a set number of seconds after a throw-in. Langan and his assistants, Shane Carew, Charles Rodriguez and Austin Pack, work directly with the analytics team to innovate, assess performance and research questions that spark their curiosity.

In the big picture, this is a side with a highly professional setup. The principles of play are clear, data analytics helps Langan and his staff evaluate the team’s play, and the players are trained to identify and solve problems that the opposition poses.

While they’ve had their fair share of success on the pitch, Langan framed the program’s success in a different way: “When you look at our program, we can be proud of all the players playing professionally.”

That’s the goal. Between a tactically demanding game model and the advanced analytics happening behind the scenes, the players are well prepared for the demands of the professional game. Best of all, their story, game model and analytics show the cohesion of the program and give a better sense of its identity.

Cluster 3 – Simeone-style Seton Hall

Here, we have the lower-possession teams that can hurt opponents on the counterattack. Think Atlético Madrid and Diego Simeone or Massimiliano Allegri’s first tenure at Juventus in the Champions League: less possession, but a lethal, well-executed counterattack.

These teams prioritise defensive security above all. They’ll let the opposition retain possession in their half of the pitch, dropping off to protect space between their lines as well as behind them. To describe the approach in a single sentence, the goal is to minimise the space to defend while maximising the space to attack. Each team’s defensive shape also allows them to move more fluidly into attacking transitions, keeping numbers connected and attacking the opponent’s expansive shape before they can close the gaps.

In terms of shots P90 and xG per shot, the more shots, the better, but there is something of an acknowledgement that the style of play leads to less time in the attacking third and fewer shots as a result. With fewer shots in total, the objective is to maximise the quality of those shots. They need to be on frame, from close range and have a very high likelihood of finding the back of the net. If there’s one image to capture that approach, it’s shots and xG per shot, a graph that shows how phenomenal Seton Hall was in scoring goals despite limited chances.

And that’s critical for all Cluster 3 teams. While chances on goal may not be as plentiful, they can still be dangerous. As mentioned earlier, fewer shots but of a higher calibre can help a team overcome an opponent that enjoys more shots but of a lesser quality.

More precise passes in counterattacking situations are ideal, but teams in this grouping will also play an above-average number of long passes in relation to the total number of passes they play P90. That ratio of long passes to total passes is one worth tracking. Ideally, a number near the 75th percentile seems appropriate, as we see with Seton Hall.
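To see where a team's long pass ratio falls relative to the rest of the division, a percentile rank does the job. This sketch uses one simple definition, the percentage of teams at or below the value; other conventions exist, and the league values are invented.

```python
def percentile_rank(value, population):
    """Percentage of the population at or below `value`."""
    at_or_below = sum(1 for v in population if v <= value)
    return 100.0 * at_or_below / len(population)

# Hypothetical long-pass share (% of total passes) for eight teams.
league_long_pass_share = [9.0, 10.5, 12.0, 13.0, 14.5, 16.0, 18.0, 21.0]
team_share = 16.0
rank = percentile_rank(team_share, league_long_pass_share)  # 75.0
```

The same helper works for any of the metrics in this article, including the pass-type ratios mentioned for opposition analysis.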

‘Passes to the final third’ is an interesting one. Much like touches in the box or field tilt, the goal is to remain close to the median. High possession teams will typically enjoy more passes to the final third, but in measuring this statistic and performing at or just below the NCAA D1 average, Cluster 3 teams will have some reassurance that they are beating the initial wave of the opposition’s counterpress and positioning themselves to attack the box. And that’s the reason we like this measure for Cluster 3 teams. If they are struggling to beat the counterpress or not recovering those long pass attempts played up the field, they will struggle to even enter the final third. That’s a sign that training is needed in the earliest seconds of attacking transitions.

Defensively, total and low recoveries P90 are good statistics to track. In terms of recoveries P90, given the higher PPDA, so less immediate pressure on the ball, recoveries should be near the average line. It’s a sign that a team has constructed the press to limit the number of engagements necessary in a game, but also a sign they can take the ball off the opponent in their defensive third rather than allowing a shot on goal.

Defensively, there’s an element of frustrating the opposition to bait them into bringing more numbers forward. Teams don’t necessarily have to set up in a low block to achieve that objective. They can set a midfield line of confrontation, congest the centre of the pitch and force the opposition to commit numbers high and wide. Attempts to play into those high and wide players are often some of the best passes to intercept. There’s also the fact that when play is funnelled to one side, the press can condense space, recover the ball and then counterattack with numbers in support of the ball carrier against limited opposition.

The point here is that even though we have an above-average number of low recoveries, and Seton Hall was one of the national leaders in this statistic, teams can still set up in a mid-block or with a midfield line of confrontation to create those low recoveries.

Conclusion

Again, this is not an exhaustive list of KPIs. We’re really only scratching the surface here. Even with KPIs on hand, there’s still a matter of determining baselines. The three groups we selected offer excellent examples of archetypal baselines relative to the game model.

That’s a start.

In addition to performance analysis, this data can also help with opposition analysis and with developing an understanding of results-based outcomes within clusters and in relation to the other groups.

Remember, success leaves clues.

Looking at some of the most successful teams in the three most successful clusters, we have models to evaluate. While no game model is simply a copy-and-paste job, and we must account for personnel, these archetypes offer a foundation.

What you do with the information is up to you. Just know that the top D1s are innovating in this area. Jump on the trend or get left behind.