3. Content based recommender
Content based recommenders use information about the items to suggest other items with similar characteristics that the user may like. In this dataset, authors have tagged each of their works with several groups of tags, which include:
- Fandoms: A fandom here refers to a particular universe within which the work is set. For instance, within the Marvel fandom there are fandoms for the Thor movies, the Spiderman movies, the Ironman movies, etc. The fandom gives a general idea of which universe the work takes place in. Multiple fandoms can be assigned to a single work.
- Characters: One characteristic of this kind of work is that the characters are repeated over and over again, as the works always revolve around the same fandom. This group of tags gives a list of characters that are important to the story. Of course, this list is subjective and set by the author: some authors will give a complete list of characters regardless of their importance, while others will only list the main ones.
- Relationships: Many, if not most, of the works feature romantic relationships between different characters. These tags indicate between which characters such a relationship can be expected.
- Additional tags: These are the most varied and perhaps the most informative tags. They indicate aspects of the story and act as a kind of summary of the work. A good set of tags gives a good idea of the kind of work you're about to read. Still, the tags are free-form, and as such many may be unique to a single work or a handful of works.
To process all these tags and extract information from them, the script treats them as bags of words. These bags of words can be built in two different ways: either by counting how often each word appears in the whole set of words and keeping those that appear most often, or by calculating the term frequency–inverse document frequency (TF-IDF). Both metrics are calculated using the implementation found in sklearn. The script extracts an independent bag of words for each of the categories (or those categories selected by the user) and for the author name, in case this gives any further signal. The final bag of words used to calculate similarities is the aggregation of the different bags of words.
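A minimal sketch of how such an aggregated bag of words could be assembled with sklearn is shown below. This is not the repository's actual code: the toy `metadata` dict, its keys, and the function name are hypothetical stand-ins for the real metadata file.

```python
# Sketch: one vectorizer per tag category, concatenated into a single
# items x words matrix. The `metadata` dict and its keys are hypothetical.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

metadata = {
    "fandoms": ["marvel thor", "marvel ironman", "harry_potter"],
    "characters": ["thor loki", "tony_stark", "harry hermione"],
}

def build_bag_of_words(metadata, categories, use_tfidf=True, number_words=10000):
    blocks = []
    for category in categories:
        # Each work's tags for a category are one whitespace-joined string.
        Vectorizer = TfidfVectorizer if use_tfidf else CountVectorizer
        vectorizer = Vectorizer(max_features=number_words)  # keep top words only
        blocks.append(vectorizer.fit_transform(metadata[category]))
    # The final bag of words is the column-wise concatenation (aggregation)
    # of the per-category blocks.
    return hstack(blocks).tocsr()

item_matrix = build_bag_of_words(metadata, ["fandoms", "characters"])
```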
Once the bag of words has been built, for each user the script relates the works that the user has liked to the bag of words by taking the dot product of the two, obtaining a vector with different weights for each word depending on which works the user has read and the words associated with each of those works. This vector can then be used to find works with a similar word frequency. We use the cosine similarity implemented in sklearn to find the most similar works and recommend them to the user.
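Sketched below is how that profile-and-similarity step could look. `item_matrix` is the items x words matrix from the previous sketch; the function name and defaults are illustrative, not the script's actual code.

```python
# Sketch of the user-profile step described above; names are illustrative.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend_for_user(liked_items, item_matrix, k=10, min_similarity=0.0):
    # Summing the rows of the liked works equals the dot product of a binary
    # "likes" vector with the bag of words: a per-word weight profile.
    profile = np.asarray(item_matrix[liked_items].sum(axis=0))
    # Cosine similarity between the profile and every work in the catalogue.
    sims = cosine_similarity(profile, item_matrix).ravel()
    sims[liked_items] = -1.0  # never re-recommend what the user already read
    ranked = np.argsort(sims)[::-1][:k]
    # Optional similarity floor, in the spirit of the --minSimilarity option below.
    return [(i, sims[i]) for i in ranked if sims[i] >= min_similarity]
```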
In the schematic below, the different tokens are represented as colour, size and shape for the different items. The user of interest likes square things and shows a preference for bigger sizes, but does not show a particular preference for colour. Therefore the items recommended first will be square in shape, going from bigger to smaller. If a similarity filter is applied, it is possible that no more items are recommended because there is not enough similarity between the preferences of the user and the remaining items. The disadvantage of this method is that it is very slow compared to other methods. It can be sped up by using a pre-calculated set of item recommendations (see item to item recommendations in 7_additional_scripts).
In order to run this script you will need to execute the following command:
python3 3_content_based_similarity/content_similarity_recommender.py -i metadata_fics.clean.txt -t train.u2i.txt -o recom_3.txt
The -i option accepts the metadata file, and the -t option takes the user-to-item table; since we want to assess the performance of the method, this will be the training table. The -o option will provide the recommendations for each user. The other options the script accepts are the following:
- -w: indicates whether the bag of words will be calculated using tfidf or counts.
- --number_words: indicates the maximum number of tokens each independent bag of words will accept.
- -k: Number of recommendations
- --add_tags, --add_characters, --add_relationships, --add_authors, --add_fandoms. These options allow you to shape your bag of words by deciding which groups of tags you want to include. By default all tags will be included.
- --minSimilarity: There are instances where a user does not share a profile with any item, and in those instances the recommendation is random. To avoid such a situation, a minimum amount of similarity can be set for an item to be recommended.
- --print_vocabulary: This is informative and allows you to print the words that form the final bag of words and how often they appear.
An alternative approach is to first build item-to-item recommendations: for each item, you search for similar items using the same bag of words used before. The advantage is that you don't need to learn what the user likes; you just search for items that are similar to other items in terms of the tags they have. Once this is done, you look at which items a particular user liked and put together all the recommendations for those items. The items that are recommended most often are the ones you end up recommending to the user.
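Under the same assumptions as the earlier sketches, the item-to-item variant could look like this; in practice the neighbour lists would be precomputed once and reused across all users, which is what makes the approach faster.

```python
# Sketch of the item-to-item variant; names and defaults are illustrative.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def item_to_item_recommend(liked_items, item_matrix, k=10, n_neighbours=20):
    votes = {}
    for idx in liked_items:
        # Find the works most similar to each liked work ...
        sims = cosine_similarity(item_matrix[idx], item_matrix).ravel()
        sims[idx] = -1.0
        for neighbour in np.argsort(sims)[::-1][:n_neighbours]:
            if neighbour not in liked_items:
                # ... and count how often each candidate comes up.
                votes[neighbour] = votes.get(neighbour, 0) + 1
    # The works recommended most often across the liked items win.
    return sorted(votes, key=votes.get, reverse=True)[:k]
```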
I first ran the script on the train dataset, calculating the TF-IDF with all the tags included. The number of words was limited to 10,000 per category, which means that the final bag of words was formed by, at most, 50,000 words. I tested its performance by comparing it to the validation set and measuring f1@k and map@k, which were 0.0016 and 0.0002 respectively. I then tested whether different combinations of tags could provide better results, so I ran the script several times with all possible combinations. Curiously, the best result is obtained when using only tags belonging to the relationships category, with an f1@k of 0.0033 and a map@k of 0.0014. The second best result depends on the metric used to evaluate the system: according to f1@k, the second best result is obtained by using a combination of additional tags, relationships and author names (f1@k of 0.0031 and map@k of 0.0007), which basically means the results get worse with the additional data. Curiously, when looking at map@k, the second best combination is additional tags and characters (f1@k of 0.0018 and map@k of 0.0009). Still, none of these results come close to beating the non-personalized likes based recommender (f1@k of 0.0364 and map@k of 0.0090). I tried to improve the results by using counts instead of TF-IDF, but the results were either unchanged or worse. There was a slight improvement when using smaller bags of words, but it still goes nowhere near the results obtained by recommending the most liked works.
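For reference, this is one way map@k can be computed against the validation set. The input formats (per-user lists of recommended work ids and sets of liked work ids) are assumptions, not the repository's actual evaluation code.

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@k for one user: precision at each rank where a hit occurs, averaged."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(recommendations, ground_truth, k=10):
    """map@k: AP@k averaged over every user present in the validation set."""
    users = [u for u in recommendations if u in ground_truth]
    return sum(average_precision_at_k(recommendations[u], ground_truth[u], k)
               for u in users) / len(users)
```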
There are several reasons why this may be happening. The most obvious one is that the chosen number of words is not optimal, which could be solved by repeated testing with different numbers of words. The second is that the tastes of users are not that well defined: keep in mind that the service provided by this website is free, so anyone is able to read anything they want, which gives a lot of freedom to explore things you might not think you'd like at first. As such, there may not be enough signal to capture a clear preference, and such a simple model is not able to cope with it. Lastly, content based recommenders solve some of the issues found with cold-start users, where little information is available and we cannot rely on collaborative methods. The dataset we are using here contains the users with the largest amount of information, so we do not have any user that would, in theory, benefit from a content based approach. I ran a test to see whether the method would perform better on a dataset with cold-start users, but that was not the case.
The alternative approach, based on an initial item-to-item recommendation, works better, although the results stay at the level of the non-personalized likes (f1@k of 0.0351 and map@k of 0.0097). It is possible that tweaking the building of the bag of words could improve those results.