The Problem
To publish tournament reports efficiently, the decks players bring need to be classified into their community-acknowledged archetypes. Doing this manually for over 500 decks is time consuming and not repeatable.
The Solution in a Nutshell
To make publishing tournament reports practical, I built a tool that clusters decks based on a set of criteria and automatically detects a deck's archetype when given its card list.
Collecting Sample Data
Before I could start clustering decks, I had to collect a large sample of them. Fortunately, a good number of Shadowverse tournaments are held every week. Each participant brings two decks, and the tournament website publishes the deck list links for all participants. I built a web scraper to collect those deck links from the tournament websites, which gave me a large list of deck links to feed into the clustering process.
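A minimal sketch of what that scraper can look like, assuming a hypothetical tournament page URL and a hypothetical link pattern (the real selectors depend on the tournament site's markup):
import requests
from bs4 import BeautifulSoup

def collect_deck_links(tournament_url):
    """Collect deck list links from one tournament page (the selector is hypothetical)."""
    response = requests.get(tournament_url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Assume each participant entry links out to a deck list page
    return [a["href"] for a in soup.select("a[href*='deck']")]

tournament_pages = ["https://example.com/tournament/123"]  # hypothetical tournament page URLs
deck_links = []
for url in tournament_pages:
    deck_links.extend(collect_deck_links(url))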
The Process
While it would be ideal to start clustering decks straight from their deck links, some work must be done on the data before it can be fed to a clustering algorithm: the deck data needs to be vectorized, and I had to come up with good criteria to vectorize on. Initially, I vectorized based only on card names and the number of copies run in each deck. However, this ran into cases where similar deck types with different win conditions would not be told apart. The solution was to vectorize based on these criteria:
- Copies run of each card
- Keywords
- Special traits/races of cards
- Card types (Spells, Followers, or Amulets)
When testing vectorization with these criteria, decks with different win conditions were correctly identified as separate clusters.
Clustering Decks
For clustering decks, I tested several clustering algorithms from scikit-learn. In the end, I settled on Agglomerative Clustering with single linkage. The number of clusters changes depending on the format and diversity of each class in the game, but the default is set to 4.
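For reference, the scikit-learn call itself is short; a sketch assuming deck_vectors is the matrix of vectorized decks described in the next section:
from sklearn.cluster import AgglomerativeClustering

# deck_vectors: one row per deck, built with the vectorization process below
clustering = AgglomerativeClustering(n_clusters=4, linkage="single")
labels = clustering.fit_predict(deck_vectors)  # one cluster label per deck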
The Base Vector
While the full code for this project would be too long to fit in this blog post, I chose to include the vectorization process, as I felt it was the most crucial part of the project. For each criterion, I had to create a base vector: a one-dimensional array that maps card copies, traits, types, and keywords to the counts run in each deck. I will walk through how the base vector for the cards in the game was created, as it is the most essential piece.
- First, we get a list of all card ids in the game:
base_vector = [card["id"] for card in cards_list]  # assuming each card is a dict with an "id" field
Then, for each deck, we go through all the cards in the deck and register the number of copies played at that card's position in the base vector. For example, say the card id is 10101234, we run 3 copies of it, and its id sits at index 49 in the base vector. In this case, the vectorizer would do the following:
deck_vector = [0] * len(base_vector)  # start from the all-zero default vector
deck_vector[49] = 3                   # 3 copies of card 10101234
Notice how we create a copy of the base vector with every value set to 0. This is the default vector, where we assume every card is run at 0 copies. When a card is detected in the deck, its position is set to the number of copies run. In the end, each deck is represented as a vector in which each position indicates how many copies of the corresponding card are run. That way, the clustering algorithm can measure the Euclidean distance between vectors and separate the clusters.
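Putting it together, the vectorizer loop looks roughly like this; representing a parsed deck as a dict of card ids to copy counts is an assumption for the sketch:
def vectorize_counts(counts, base_vector):
    """Map a {id: count} dict onto the base vector; anything not in the deck stays at 0."""
    index_of = {value: i for i, value in enumerate(base_vector)}
    vector = [0] * len(base_vector)
    for value, count in counts.items():
        vector[index_of[value]] = count
    return vector

deck_vector = vectorize_counts({10101234: 3}, base_vector)  # 3 copies of card 10101234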
However, as I mentioned before, this was not accurate enough on its own. On top of the base vector, I added the three extra vectors to increase accuracy, which made decks different enough to warrant separate clusters when necessary. A recent example is a class that had two dominant decks sharing the same core cards but with starkly different win conditions. Since the win conditions came down to only two different cards, that alone was not enough to tell the decks apart. By introducing keywords, special traits, and card types, the clustering algorithm detected them as two different archetypes!
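In code terms, the three extra criteria reuse the same mapping idea: each criterion gets its own base vector, and the resulting count vectors are concatenated onto the card vector. A sketch, with the dict keys being illustrative names:
def vectorize_deck_full(deck_features, base_vectors):
    """deck_features and base_vectors are dicts keyed by criterion name."""
    combined = []
    for criterion in ("cards", "keywords", "traits", "types"):
        combined += vectorize_counts(deck_features[criterion], base_vectors[criterion])
    return combined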
Creating the Clusters
This would all be meaningless without actually creating the clusters, right?! The most efficient approach in terms of cost and deployment for this project was to rely on JSON files to store the clusters. The JSON file maps each detected archetype to an array of objects holding its card weights. For example, the archetype labeled rally looks like this:
"rally": [
{
"base_id": 126241020,
"weight": 1.0
},
{
"base_id": 124221020,
"weight": 0.95
},
{
"base_id": 124241010,
"weight": 0.95
},
{
"base_id": 123041020,
"weight": 0.75
},
{
"base_id": 125211020,
"weight": 0.68
},
{
"base_id": 125011010,
"weight": 0.61
},
{
"base_id": 123241010,
"weight": 0.59
},
{
"base_id": 126241030,
"weight": 0.59
},
{
"base_id": 125041020,
"weight": 0.56
},
{
"base_id": 123244010,
"weight": 0.56
},
{
"base_id": 122214010,
"weight": 0.54
},
{
"base_id": 122241010,
"weight": 0.54
},
{
"base_id": 125041030,
"weight": 0.49
},
{
"base_id": 122244010,
"weight": 0.42
},
{
"base_id": 122031020,
"weight": 0.37
},
{
"base_id": 125231020,
"weight": 0.36
},
{
"base_id": 126041020,
"weight": 0.32
}
]
The Classification Process
To detect a deck's archetype, each deck starts with a score of 0 for every archetype. Each card in the deck adds (number of copies x card weight in the cluster) to the score. For example, if a card's weight is 0.32 and it is run at 3 copies, it adds a total of 0.96 to the deck's score for that archetype. In the end, the archetype with the highest score is chosen as the deck's archetype.
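A sketch of that scoring step, assuming the clusters JSON from the previous section has been loaded and the deck is the same card-id-to-copies mapping used in the earlier sketches:
import json

def classify_deck(deck, clusters_path="clusters.json"):
    """Return the archetype whose weighted card overlap with the deck scores highest."""
    with open(clusters_path) as f:
        clusters = json.load(f)  # {"rally": [{"base_id": ..., "weight": ...}, ...], ...}

    scores = {}
    for archetype, card_weights in clusters.items():
        weights = {entry["base_id"]: entry["weight"] for entry in card_weights}
        # copies x weight for every deck card that appears in this cluster
        scores[archetype] = sum(
            copies * weights.get(card_id, 0.0) for card_id, copies in deck.items()
        )
    return max(scores, key=scores.get)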
Deploying the Classifier
The deployment constraints are why I picked JSON as the output format for the clusters instead of shipping a classification model. I stored the clusters in a Workers KV store and deployed a Worker that classifies a deck from its link. That way, the classifier integrates seamlessly into the main project where it is used.
Conclusion
I understand this was not a thorough rundown of the classifier, but I tried to go through the most crucial parts of the process. Learning and applying data science and machine learning concepts for this project was interesting, and unsurprisingly it increased my interest in the field, so I plan to improve the tool further as I learn more!