Dataset description
We started working with several datasets from Kaggle, but we finally decided
to work with a dataset from SteamSpy, which offered the information we wanted for our study.
This dataset contains information about more than 25000 video games, including data about their sales, playtime, tags, developers and more.
Using this data, we are able to compare and extract information about thousands of applications in just a few seconds.
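As a quick illustration, the dataset can be loaded and inspected with a few lines of PySpark. This is only a sketch: the file name steam.csv and the exact column names are assumptions about the downloaded CSV, not guaranteed.

# Minimal sketch: load and inspect the SteamSpy dataset (column names assumed).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SteamDatasetOverview").getOrCreate()
games = spark.read.csv("steam.csv", header=True, inferSchema=True)

print(games.count())           # should report over 25000 rows
games.printSchema()            # columns such as name, price, tags, developer (assumed)
games.show(5, truncate=False)  # preview a few games

spark.stop()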
Tools we use
To build and analyze our data we use PySpark, both locally and on a Hadoop cluster.
These applications produce new, organised data, which is then processed by other Python scripts that use
the matplotlib library to generate diagrams to visualize the results.
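As a rough sketch of that pipeline (the column names developer and average_playtime are assumptions about the CSV, and the aggregation shown is only an example, not one of our actual scripts):

# Sketch of the PySpark -> matplotlib pipeline; column names are assumed.
from pyspark.sql import SparkSession, functions as F
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName("PlaytimeByDeveloper").getOrCreate()
games = spark.read.csv("steam.csv", header=True, inferSchema=True)

# Produce organised data: top 10 developers by average playtime.
top = (games.dropna(subset=["developer", "average_playtime"])
            .groupBy("developer")
            .agg(F.avg("average_playtime").alias("avg_playtime"))
            .orderBy(F.desc("avg_playtime"))
            .limit(10)
            .collect())
spark.stop()

# Visualize the processed data with matplotlib.
names = [row["developer"] for row in top]
values = [row["avg_playtime"] for row in top]
plt.barh(names, values)
plt.xlabel("Average playtime")
plt.title("Top 10 developers by average playtime")
plt.tight_layout()
plt.savefig("playtime_by_developer.png")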
Repository
Our whole project can be easily found in our GitHub repository.
Software Description
We have divided our project into two separate sections.
On the one hand, we have scripts that extract specific information from the data, such as average prices or
average playtime for different aspects of the games. On the other hand, we have developed a script that recommends specific video games
according to a player's preferences. Click on READ MORE to see more detailed information.
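To give an idea of the recommendation approach, here is a heavily simplified tag-overlap sketch. It is not the exact logic of GameRecommendation.py, and the column names (name, steamspy_tags) are assumptions about the dataset.

# Simplified tag-overlap recommendation sketch (NOT the exact GameRecommendation.py logic).
import sys
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("GameRecommendationSketch").getOrCreate()
games = spark.read.csv("steam.csv", header=True, inferSchema=True)

target_name = sys.argv[1]  # the game passed as argument, e.g. a title the player likes
target = games.filter(F.col("name") == target_name).first()
if target is None:
    raise SystemExit(f"Game '{target_name}' not found in steam.csv")
target_tags = set(target["steamspy_tags"].split(";"))

# Score every other game by how many tags it shares with the given game.
def shared_tags(tags):
    return len(target_tags & set(tags.split(";"))) if tags else 0

shared_udf = F.udf(shared_tags, IntegerType())
(games.filter(F.col("name") != target_name)
      .withColumn("shared", shared_udf("steamspy_tags"))
      .orderBy(F.desc("shared"))
      .select("name", "shared")
      .show(10, truncate=False))

spark.stop()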
User Guide
There are two ways to run our code: local mode and cluster mode, the latter
using an AWS (Amazon Web Services) instance. In this section we explain how to use
both options.
Local Mode
In order to run our applications in local mode using Ubuntu Linux or an
Ubuntu virtual machine on Windows, we need:
- Python installed
- Spark installed
- Steam.csv downloaded (you can find it in the dataset section of this page)
Once we have all the requirements, we can easily run the code with the following command line:
$spark-submit file_name.py "argument"
Where "argument" is only included for the execution of codes that need an argument, such as
GameRecommendation.py, which needs a game as an argument in order to operate correctly.
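For example, a run of the recommendation script could look like this (the game title below is just a placeholder):
$spark-submit GameRecommendation.py "Portal 2"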
Cluster Mode
In order to run our applications in cluster mode on an AWS instance, we need:
- Python installed in our instance
- Spark installed in our instance
- Steam.csv downloaded in our instance (you can find it in the dataset section of this page)
Once we have all the requirements, we can run the code with the following command line:
$spark-submit --num-executors N --executor-cores M file_name.py "argument"
Where:
- N is the number of worker nodes
- M is the number of cores per worker node
- "argument" is only included for the execution of codes that need an argument, such as
GameRecommendation.py, which needs a game as an argument in order to operate correctly.
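For example, a run using 4 worker nodes with 2 cores each (the game title is again just a placeholder):
$spark-submit --num-executors 4 --executor-cores 2 GameRecommendation.py "Portal 2"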
Performance
We tested the performance of our scripts both locally
and in an AWS cluster (m4.xlarge instances, to be precise), trying several different numbers of cores.
We tested the script GameRecommendation.py, since we consider it the most complex of our scripts
in terms of data processing. However, to our surprise, we did not obtain
the results we expected. We barely saw any difference across the tests, with execution
times of around 14 seconds for every run. After thinking about it, we came to the conclusion that the execution times
barely change because the data we are processing is not big enough to make a noticeable
difference when processing in a cluster.