Dataset description
We started working with several datasets from Kaggle, but we finally decided
to work with a dataset from SteamSpy, which offered the information we wanted for our study.
This dataset contains information about more than 25000 video games, including data about their sales, playtime, tags, developers and more.
Using this data, we are able to compare and extract information about thousands of applications in just a few seconds.
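As a quick illustration, the dataset can be loaded and inspected with a few lines of PySpark. This is only a sketch: the file name steam.csv and the exact column names are assumptions about the downloaded CSV, not guaranteed.

# Minimal sketch: load and inspect the SteamSpy dataset (column names assumed).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SteamDatasetOverview").getOrCreate()
games = spark.read.csv("steam.csv", header=True, inferSchema=True)

print(games.count())           # should report over 25000 rows
games.printSchema()            # columns such as name, price, tags, developer (assumed)
games.show(5, truncate=False)  # preview a few games

spark.stop()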
Tools we use
To build and analyze our data we use PySpark, both locally and on a Hadoop cluster.
These applications produce new, organised data, which is then processed by other Python scripts that use
the matplotlib library to generate diagrams to visualize the results.
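As a rough sketch of that pipeline (the column names developer and average_playtime are assumptions about the CSV, and the aggregation shown is only an example, not one of our actual scripts):

# Sketch of the PySpark -> matplotlib pipeline; column names are assumed.
from pyspark.sql import SparkSession, functions as F
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName("PlaytimeByDeveloper").getOrCreate()
games = spark.read.csv("steam.csv", header=True, inferSchema=True)

# Produce organised data: top 10 developers by average playtime.
top = (games.dropna(subset=["developer", "average_playtime"])
            .groupBy("developer")
            .agg(F.avg("average_playtime").alias("avg_playtime"))
            .orderBy(F.desc("avg_playtime"))
            .limit(10)
            .collect())
spark.stop()

# Visualize the processed data with matplotlib.
names = [row["developer"] for row in top]
values = [row["avg_playtime"] for row in top]
plt.barh(names, values)
plt.xlabel("Average playtime")
plt.title("Top 10 developers by average playtime")
plt.tight_layout()
plt.savefig("playtime_by_developer.png")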
Repository
Our whole project can be easily found in our GitHub repository.
Software Description
We have divided our project into two separate sections.
On the one hand, we have scripts that extract specific information from the data, such as average prices or
average playtime for different aspects of the games. On the other hand, we have developed a script that recommends specific video games
according to a player's preferences. Click on READ MORE to see more detailed information.
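To give an idea of the recommendation approach, here is a heavily simplified tag-overlap sketch. It is not the exact logic of GameRecommendation.py, and the column names (name, steamspy_tags) are assumptions about the dataset.

# Simplified tag-overlap recommendation sketch (NOT the exact GameRecommendation.py logic).
import sys
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("GameRecommendationSketch").getOrCreate()
games = spark.read.csv("steam.csv", header=True, inferSchema=True)

target_name = sys.argv[1]  # the game passed as argument, e.g. a title the player likes
target = games.filter(F.col("name") == target_name).first()
if target is None:
    raise SystemExit(f"Game '{target_name}' not found in steam.csv")
target_tags = set(target["steamspy_tags"].split(";"))

# Score every other game by how many tags it shares with the given game.
def shared_tags(tags):
    return len(target_tags & set(tags.split(";"))) if tags else 0

shared_udf = F.udf(shared_tags, IntegerType())
(games.filter(F.col("name") != target_name)
      .withColumn("shared", shared_udf("steamspy_tags"))
      .orderBy(F.desc("shared"))
      .select("name", "shared")
      .show(10, truncate=False))

spark.stop()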
User Guide
There are two ways to run our code: local mode and cluster mode, the latter
using an AWS (Amazon Web Services) instance. In this section we explain how to use
both options.
Local Mode
In order to run our applications in local mode using Ubuntu Linux or an
Ubuntu virtual machine on Windows, we need:
- Python installed
- Spark installed
- Steam.csv downloaded (you can find it in the dataset section of this page)
Once we have all the requirements, we can easily run the code with the following command line:
$spark-submit file_name.py "argument"
Where "argument" is only included for the execution of codes that need an argument, such as
GameRecommendation.py, which needs a game as an argument in order to operate correctly.
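For example, a run of the recommendation script could look like this (the game title below is just a placeholder):
$spark-submit GameRecommendation.py "Portal 2"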
Cluster Mode
In order to run our applications in cluster mode on an AWS instance, we need:
- Python installed in our instance
- Spark installed in our instance
- Steam.csv downloaded in our instance (you can find it in the dataset section of this page)
Once we have all the requirements, we can run the code with the following command line:
$spark-submit --num-executors N --executor-cores M file_name.py "argument"
Where:
- N is the number of worker nodes
- M is the number of cores per worker node
- "argument" is only included for the execution of codes that need an argument, such as
GameRecommendation.py, which needs a game as an argument in order to operate correctly.
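For example, a run using 4 worker nodes with 2 cores each (the game title is again just a placeholder):
$spark-submit --num-executors 4 --executor-cores 2 GameRecommendation.py "Portal 2"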
Performance
We tested the performance of our scripts both locally
and in an AWS cluster (m4.xlarge instances, to be precise), trying several different numbers of cores.
We tested the script GameRecommendation.py, since we consider it the most complex of our scripts
in terms of data processing. However, to our surprise, we did not obtain
the results we expected. We barely saw any difference across the tests, with execution
times of around 14 seconds for every run. After thinking about it, we came to the conclusion that the execution times
barely change because the data we are processing is not big enough to make a noticeable
difference when processing in a cluster.