On the 5th of February 2019, the B4F Big Data team organised a training workshop on the implementation of a data lake stack within the domain of animal breeding. The workshop was held at Zodiac, Wageningen, and attracted 20 participants from the industry and WUR.

Since the start of B4F Precision Phenotyping in 2017, the Big Data team has focussed on Big Data analytics in a broad sense, with the aim of gaining exposure to, and learning from, state-of-the-art tools and methodologies in data science. One such tool for collecting, storing, and analysing data is the data lake. This training workshop was organised to familiarise B4F partners with this tool.
The workshop was split into a theoretical (morning) session and a practical (afternoon) session. The theoretical session gave an introduction to Big Data, how Big Data systems work, and when they are needed. It introduced the two main principles of Big Data, immutability and pure functions, as well as the two pillars of a Big Data system, i.e. Map-Reduce and Distributed File Systems (see Figure 1). These principles and pillars were then illustrated interactively with the ‘Greek Island Game’. Following the game, participants worked individually on a Map-Reduce tutorial in Apache Spark.
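
As a rough illustration of how immutability, pure functions, and Map-Reduce fit together, the following minimal sketch uses plain Python rather than the Apache Spark tutorial from the workshop. The input lines, the animal identifiers, and the function names are all hypothetical and chosen only for illustration.

```python
from functools import reduce

# Hypothetical raw input, held in a tuple so it cannot be modified in place.
RAW_LINES = (
    "turkey_1,12",
    "turkey_2,7",
    "turkey_1,5",
    "turkey_3,9",
    "turkey_2,3",
)


def map_line(line):
    """Pure function: parse one line into a (key, value) pair."""
    animal_id, steps = line.split(",")
    return (animal_id, int(steps))


def reduce_pair(totals, pair):
    """Pure function: return a new dict instead of mutating the accumulator."""
    key, value = pair
    return {**totals, key: totals.get(key, 0) + value}


def map_reduce(lines):
    """Map every line to a pair, then reduce the pairs to one total per key."""
    return reduce(reduce_pair, map(map_line, lines), {})


totals = map_reduce(RAW_LINES)
print(totals)
```

Because neither function changes its inputs or relies on shared state, the map step could in principle be run on separate chunks of the data on different machines and the partial results combined afterwards, which is the property a distributed system relies on.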
During the hands-on afternoon session, participants worked with a data lake stack that had been set up beforehand and filled with data from an animal experiment on locomotion scoring in turkeys (part of B4F Precision Phenotyping – Locomotion). Participants were introduced to the stack of tools used for this data lake (Apache Spark, Hadoop HDFS, Amazon Web Services, Jupyter, PixieDust, and Docker). They worked through multiple tutorials in interactive Jupyter Notebooks, from reading single sensor data files (collected during the locomotion trial), through transforming and visualising them, to saving the output. In addition to analysing these single sensor data files, it was demonstrated how easily this ‘Extract, Transform, and Load (ETL)’ procedure can be scaled to a multitude of sensor data files. Finally, the participants discovered the advantage of a customisable ‘machine learning’ pipeline, in which two sensor data output files were linked to the phenotype of interest (here, the locomotion score of turkeys).
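
The idea of writing an ETL step for one sensor file and then applying it unchanged to many files can be sketched as follows. This is a plain-Python stand-in using only the standard library, not the Spark-based notebooks used in the workshop; the file names, the `force` column, and the made-up values are assumptions for illustration only.

```python
import csv
import tempfile
from pathlib import Path
from statistics import mean


def extract(path):
    """Extract: read one sensor CSV file into a list of row dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def transform(rows):
    """Transform: summarise the hypothetical 'force' column of one file."""
    values = [float(row["force"]) for row in rows]
    return {"n_samples": len(values), "mean_force": mean(values)}


def load(summaries, out_path):
    """Load: write one summary row per input file to a CSV file."""
    fields = ["file", "n_samples", "mean_force"]
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(summaries)


def run_etl(in_dir, out_path):
    """Apply the same extract and transform steps to every CSV file found."""
    summaries = [
        {"file": path.name, **transform(extract(path))}
        for path in sorted(Path(in_dir).glob("*.csv"))
    ]
    load(summaries, out_path)
    return summaries


# Demonstration on two small files with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    in_dir = Path(tmp) / "raw"
    in_dir.mkdir()
    (in_dir / "sensor_a.csv").write_text("time,force\n0,1.0\n1,3.0\n")
    (in_dir / "sensor_b.csv").write_text("time,force\n0,2.0\n1,4.0\n2,6.0\n")
    out_path = Path(tmp) / "summary.csv"
    summaries = run_etl(in_dir, out_path)
    written = out_path.read_text()

print(summaries)
```

Scaling from one file to many only changes the list of paths passed through `extract` and `transform`; in a system such as Spark the same per-file logic would be distributed over a cluster instead of run in a loop.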

The handouts of this workshop and a tutorial for installing the training material can be obtained from the organisers (Ioannis.Athanasiadis@wur.nl or Dirkjan.Schokker@wur.nl).

Figure 1: Immutable data, pure functions, and Map-Reduce within a Big Data system (Hadoop Distributed File System)