In this study, we aim to analyze the frequency of letters in various languages using Hadoop and MapReduce. We’ll be diving into two different methods for data combination: the Combiner and In-Mapping Combining techniques. Additionally, we'll explore how changing the number of reducers affects performance.
We’ll also compare these distributed approaches with a local execution in Python, and in the end, we’ll use the best-performing distributed configuration to analyze a text in different languages. For our text, we've chosen the beloved classic "Pinocchio".
- Combiner Method: Using a combiner to reduce the data transferred between the mapper and the reducer.
- In-Mapping Combining: Combining data directly within the mapper to further reduce network load.
- Running the letter frequency analysis with different numbers of reducers to see how this configuration impacts performance.
- Comparing the performance of different configurations using metrics like execution time and resource utilization.
- Comparing these results with a local execution in Python to highlight the pros and cons of each approach.
- Applying the best configuration to analyze the letter frequency in "Pinocchio", examining the text in multiple languages.
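
To make the difference between the two combining strategies concrete, here is a minimal local sketch in plain Python rather than the project's actual Hadoop code. It simulates one mapper's output under both approaches. The names `map_plain`, `combine`, `InMapperCombiningMapper`, and `reduce_counts` are hypothetical and chosen only for illustration. The technique we call In-Mapping Combining is often referred to as in-mapper combining.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Pair = Tuple[str, int]


def map_plain(line: str) -> Iterator[Pair]:
    """Plain mapper: emit one (letter, 1) pair per alphabetic character."""
    for ch in line.lower():
        if ch.isalpha():
            yield ch, 1


def combine(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Combiner: sum the counts per letter over one mapper's output."""
    totals: Counter = Counter()
    for letter, count in pairs:
        totals[letter] += count
    yield from totals.items()


class InMapperCombiningMapper:
    """Mapper that aggregates in memory and emits only at cleanup time."""

    def __init__(self) -> None:
        self.totals: Counter = Counter()

    def map(self, line: str) -> None:
        # Nothing is emitted here; counts accumulate across input lines.
        for ch in line.lower():
            if ch.isalpha():
                self.totals[ch] += 1

    def cleanup(self) -> Iterator[Pair]:
        # Emit one pair per distinct letter seen by this mapper.
        yield from self.totals.items()


def reduce_counts(pairs: Iterable[Pair]) -> Dict[str, int]:
    """Reducer: sum all partial counts for each letter."""
    totals: Counter = Counter()
    for letter, count in pairs:
        totals[letter] += count
    return dict(totals)


if __name__ == "__main__":
    lines = ["C'era una volta", "un pezzo di legno"]

    plain = [pair for line in lines for pair in map_plain(line)]
    combined = list(combine(plain))

    mapper = InMapperCombiningMapper()
    for line in lines:
        mapper.map(line)
    in_mapper = list(mapper.cleanup())

    # Fewer pairs leaving the mapper means less data to shuffle.
    print(len(plain), len(combined), len(in_mapper))
```

Both strategies give the reducer the same totals. The difference is where the aggregation happens: a combiner runs on pairs the mapper has already emitted, while the in-mapper variant never emits the individual pairs at all, at the cost of holding the counts in the mapper's memory.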
Automated Analysis
- We’ve implemented shell scripts to automate the analysis tasks.
With the optimal configuration, we’ve analyzed the letter frequency in "Pinocchio", providing a detailed look at how letters are distributed across different languages.
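
As a rough illustration of what a letter frequency computation looks like on a single machine, here is a minimal Python sketch. It is not the project's actual script. The function name `letter_frequencies` is hypothetical, and the sketch assumes that text is lowercased and that accented characters count as distinct letters, which may differ from the choices made in the real analysis.

```python
from collections import Counter
from typing import Dict


def letter_frequencies(text: str) -> Dict[str, float]:
    """Return the relative frequency of each letter in `text`.

    Assumption: case is ignored and accented characters (e.g. 'è')
    are counted as their own letters.
    """
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: n / total for letter, n in sorted(counts.items())}


if __name__ == "__main__":
    sample = "C'era una volta un pezzo di legno"
    for letter, freq in letter_frequencies(sample).items():
        print(f"{letter}: {freq:.3f}")
```

Using relative frequencies instead of raw counts makes texts of different lengths comparable, which matters when the same book is analyzed in several languages.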
For a deeper dive into the project, check out the Documentation and the presentation slides, available above ☝🏻.
- Martina Fabiani
- Tommaso Falaschi
- Rossana Antonella Sacco