
7. Additional recommendations

Marina Marcet-Houben edited this page Sep 3, 2020 · 4 revisions

Item to item recommendations

While this project has focused on recommending items to users based on the opinions of other users or on the users' expected taste, there are other ways to recommend things. One of them is to use an item to recommend other, similar items. For instance, after reading a work I liked, I might want to search for similar works; as a first step I would normally check whether the same author has written anything else. Here I provide two different options that use similarities to recommend items. The first is based on the tags obtained for each story. Similar to what was done in the content-based recommender, the script will build a bag of words, producing an item-to-token matrix that is then used to find items that have a similar set of words.

To run this script you need to execute:

python3 additional_scripts/content_based_item2item.py -i metadata_fics.clean.txt -o recom_i2i.txt

The -i option accepts the metadata file, and the -o option names the output file, which will contain a list of similar items for each item. The other options of the script are the same as in the user content-based approach:

  • -w: indicates whether the bag of words will be calculated using tf-idf or counts.
  • --number_words: indicates the maximum number of tokens each independent bag of words will accept.
  • -k: number of recommendations to return.
  • --add_tags, --add_characters, --add_relationships, --add_authors, --add_fandoms: these options allow you to shape your bag of words by deciding which groups of tags to include. By default all tags are included.
  • --minSimilarity: there are instances where a user does not share a profile with any item, and in those instances the recommendation is random. To avoid this, a minimum similarity can be set for an item to be recommended.
  • --print_vocabulary: this is informative only; it prints the words that form the final bag of words and how often they appear.
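The bag-of-words comparison described above can be sketched in a few lines. This is a hypothetical illustration, not the project's script: the item names and tags are invented, and the real script builds its vocabulary with tf-idf or count vectorization over the metadata file.

```python
# Toy sketch of content-based item-to-item similarity: each item is reduced
# to a bag of tag tokens, and items are compared by cosine similarity.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # Dot product over shared tokens, divided by the two vector norms.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_similar(items: dict, target: str, k: int = 2):
    # Rank every other item by its similarity to the target item.
    scores = [(other, cosine(items[target], items[other]))
              for other in items if other != target]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

items = {
    "fic_a": Counter(["fluff", "angst", "au"]),
    "fic_b": Counter(["fluff", "au", "humor"]),   # shares two tags with fic_a
    "fic_c": Counter(["horror", "mystery"]),      # shares none
}
print(top_k_similar(items, "fic_a", k=2))
```

A --minSimilarity-style cutoff would simply drop pairs whose cosine score falls below the threshold before returning the list.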

The second way to obtain item similarities, and therefore to provide a list of similar items given an item of interest, is to use the matrix obtained when training the matrix factorization model. Implicit provides a method to quickly get the list of similar items from the trained model. This option was added to the 5_collaborative_recommender_implicit/collaborative_recommender_implicit.py script and can be called with the -o_i2i option, which expects a file name to which the recommendations for each item included in the model will be written. In this case similar items are found through the similarity of the users that have liked them rather than through an intrinsic characteristic of the item itself.
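Conceptually, this kind of similar-items lookup compares the latent factor vectors that matrix factorization learns for each item. The sketch below uses invented two-dimensional toy factors; the real model's factors come out of training on the user-item matrix.

```python
# Sketch of what a similar-items lookup over a factorization model does:
# items whose latent factor vectors (rows of the learned item matrix) point
# in similar directions are considered similar. Factors here are toy values.
from math import sqrt

item_factors = {
    "item_1": [0.9, 0.1],
    "item_2": [0.8, 0.2],   # close to item_1 in latent space
    "item_3": [0.1, 0.9],   # very different taste profile
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def similar_items(target, n=2):
    scores = [(other, cosine(item_factors[target], item_factors[other]))
              for other in item_factors if other != target]
    return sorted(scores, key=lambda p: p[1], reverse=True)[:n]

print(similar_items("item_1"))
```

Because the factors are learned from likes, two items end up close in this space when similar sets of users liked them, which is exactly the behaviour described above.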

Neither of the two methods has been evaluated, as that would require input from other users and an A/B test: we would use the recommender on one group of users and not on the other, and then check, for instance, whether the test group liked more items than the control group.

Recommending authors with LightFM

In this project I used the implicit library to build a matrix factorization recommender system. I chose it because it was well suited to the dataset I was provided, in the sense that I did not have a rating between users and items but simply a like, which means that the absence of a like could mean multiple things. Other libraries, such as LightFM, can work with such data, but they perform better with ranked data. To test this library I built a recommender for authors instead of works. The advantage of recommending authors is that a ranking can be made from the number of works by a given author that a user has liked. So, the data-cleaning script has an option that will create a user-to-author table in which the values are not 1 but instead range from 1 to 5. This is done by dividing the number of works a user has liked from a given author by the total number of works in the dataset written by said author, and then normalizing this value to a scale from 1 to 5. Note that the minimum value for a user that has liked at least one work by an author will always be 1 and not 0; the script reserves 0 for those cases in which the user has not given a like to any of the author's works. Additionally, not all authors are included: only authors with a minimum of five works are considered in this dataset.
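One plausible version of that 1-to-5 scoring is sketched below. The linear rescaling is an assumption; the cleaning script's exact normalization may differ, but the invariants stated above (0 reserved for non-readers, at least 1 for anyone with a like, authors under five works dropped) are preserved.

```python
# Hedged sketch of the user-to-author score: liked works divided by the
# author's total works, rescaled onto [1, 5]. The linear mapping is an
# assumption, not necessarily the script's exact formula.
def author_score(liked: int, total_works: int) -> float:
    if total_works < 5:
        # Authors below the five-work threshold are excluded from the dataset.
        raise ValueError("author has fewer than 5 works")
    if liked == 0:
        return 0.0                      # 0 is reserved for "no likes at all"
    fraction = liked / total_works
    return 1.0 + 4.0 * fraction        # assumed rescaling onto (1, 5]

print(author_score(5, 5))    # liked everything by the author
print(author_score(1, 10))   # liked a single work
print(author_score(0, 10))   # never liked this author
```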

To obtain a test dataset against which to compare predictions, we need a metadata matrix: a matrix that summarizes a set of characteristics of each author. As we do not have this information directly, we build it from the works written by the author. So, values such as number of hits, number of likes and so forth are calculated as the averages of those values over the corresponding works. Tags are inferred by taking all the tags from all the stories written by the author and keeping the 10 most common tags in each category, if present. The idea is that while each story has a particular set of tags, an author usually has a tendency to write similar things, and that tendency will be reflected in this set of tags.
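The author-profile construction can be sketched as follows. The field names and the works themselves are invented for illustration; the real script reads them from the metadata file and repeats the tag pooling per tag category.

```python
# Sketch of deriving an author profile from that author's works: numeric
# fields are averaged, and tags are pooled keeping the most common ones.
from collections import Counter

works = [
    {"hits": 100, "likes": 10, "tags": ["fluff", "au"]},
    {"hits": 300, "likes": 30, "tags": ["fluff", "angst"]},
]

def author_profile(works, top_n=10):
    profile = {
        # Numeric characteristics become per-work averages.
        "hits": sum(w["hits"] for w in works) / len(works),
        "likes": sum(w["likes"] for w in works) / len(works),
    }
    # Pool every tag across the author's works and keep the most common ones.
    tag_counts = Counter(t for w in works for t in w["tags"])
    profile["tags"] = [t for t, _ in tag_counts.most_common(top_n)]
    return profile

print(author_profile(works))
```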

I use the non-personalized recommendation script to obtain a baseline to improve on. As in the user-to-work recommendation, the number of likes is the best feature to base the prediction on. In this case it has an f1@k of 0.0039 and a map@k of 0.0003. I then used the recommender based on the LightFM library to build a recommender model.
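For readers unfamiliar with map@k, here is a standard formulation of the metric. The evaluation script's exact implementation may differ in details such as the denominator used for the average precision.

```python
# Hedged sketch of mean average precision at k (map@k): for each user,
# precision is accumulated at every rank where a relevant item appears.
def average_precision_at_k(recommended, relevant, k):
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i          # precision at this hit's position
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recs, all_relevant, k):
    users = list(all_recs)
    return sum(average_precision_at_k(all_recs[u], all_relevant[u], k)
               for u in users) / len(users)

recs = {"u1": ["a", "b", "c"], "u2": ["x", "y", "z"]}
rel = {"u1": {"a", "c"}, "u2": {"z"}}
print(map_at_k(recs, rel, k=3))
```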

First I let the program explore the hyperparameter space 100 times. To do that I execute the following script:

python3 additional_scripts/collaborative_recommender_lightFM.py -t train.u2a.txt -o explore -e -n 100

In this case we provide the user-to-author training table to the -t option. The -o option sets the first part of the output name; each recommendation file resulting from the exploration will be called explore_1.txt, explore_2.txt, and so on. The -e option indicates we want to explore the hyperparameter space, and the -n option indicates the number of times we want to do it. After running the script we evaluate the results with the batch option incorporated in the 6_evaluate/evaluate_predictions.py script, which provides the list of metrics for each exploration.
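A random exploration of this kind can be sketched generically. The parameter names mirror common LightFM options (no_components, learning_rate, loss), but the exact search space is an assumption, and train_and_score is a stand-in for training a model and evaluating its recommendations.

```python
# Generic sketch of random hyperparameter exploration: sample n
# configurations, score each one, and return them best-first.
import random

def sample_hyperparameters(rng):
    return {
        "no_components": rng.choice([16, 32, 64, 128]),
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "loss": rng.choice(["warp", "bpr"]),
    }

def explore(n, train_and_score, seed=0):
    rng = random.Random(seed)
    trials = []
    for _ in range(n):
        params = sample_hyperparameters(rng)
        trials.append((train_and_score(params), params))
    # Highest-scoring configuration first.
    return sorted(trials, key=lambda t: t[0], reverse=True)

# Dummy scorer for illustration: pretend more components always helps.
best = explore(100, lambda p: p["no_components"] / 128)[0]
print(best)
```

In the real workflow the scoring step is decoupled: each exploration writes an explore_N.txt file, and the evaluation script computes the metrics afterwards in batch.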

The best model achieves an f1@k of 0.1358 and a map@k of 0.0565, far above the baseline. Still, despite producing better recommendations than the baseline, there is a clear tendency to always recommend the same authors first. When checking recommendations from models with worse hyperparameters this effect is much more pronounced, and there are cases where different users are recommended the same set of authors over and over again, meaning the model is not really learning anything. Note that, while the variety in recommendations improves with better hyperparameters, we are still recommending a very small variety of authors (116 out of 8,272 possible).
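The variety figure above (116 distinct authors recommended out of 8,272) can be checked with a simple coverage count. The data here is invented for illustration; identical lists across users are the warning sign described above.

```python
# Sketch of a recommendation-coverage check: how many distinct authors
# ever appear in any user's recommendation list?
def recommendation_coverage(recs_per_user):
    distinct = set()
    for recs in recs_per_user.values():
        distinct.update(recs)
    return len(distinct)

recs = {
    "u1": ["auth_1", "auth_2"],
    "u2": ["auth_1", "auth_2"],   # identical to u1: a warning sign
    "u3": ["auth_1", "auth_3"],
}
print(recommendation_coverage(recs))
```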

LightFM, in addition to building pure matrix factorization models, can also incorporate item and user features into the model. In this dataset we do not have any user features, as we have no information about the users, but we can introduce item features. This has been implemented in the script so that, when a metadata file is provided, the script extracts the tags from the file, calculates the word frequencies and introduces them into the model. The performance of this addition has not been tested.
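The tag-to-feature step can be sketched as below. This builds only the intermediate {item: {tag: weight}} mapping with within-item relative frequencies as an assumed weighting; LightFM itself ultimately consumes item features as an item-by-feature sparse matrix.

```python
# Sketch of turning per-item tags into weighted item features: each tag's
# weight is its relative frequency within that item's tag list. The
# weighting scheme is an assumption for illustration.
from collections import Counter

def tag_features(item_tags):
    features = {}
    for item, tags in item_tags.items():
        counts = Counter(tags)
        total = sum(counts.values())
        features[item] = {t: c / total for t, c in counts.items()}
    return features

print(tag_features({"auth_1": ["fluff", "fluff", "angst"]}))
```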