Photo by Eric Han on Unsplash

Spark is a very popular framework for data processing. It has gradually displaced Hadoop for data analytics. In-memory processing can make it up to 100x faster than Hadoop MapReduce. One of the main advantages of Spark is that there is no longer any need to write MapReduce jobs. Moreover, the Spark engine is compatible with a large number of data sources (txt, JSON, XML, SQL and NoSQL data stores). Along with Hadoop, SQL, Python and R, Spark is one of the most sought-after skills for data scientists.
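To make the MapReduce point concrete, here is a minimal sketch in plain Python of the explicit map, shuffle and reduce phases of a word count. It uses neither Hadoop nor Spark, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical; it only illustrates the kind of plumbing a higher-level engine takes off your hands.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Chain the three phases; all names here are illustrative, not a real API."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["spark is fast", "hadoop is popular"]))
```

Even for a task this small, each phase has to be written out by hand, which is the overhead the paragraph above refers to.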

A Spark application is made of:


Photo by Quinten de Graaf on Unsplash

In software engineering, continuous integration and continuous deployment (CI/CD) is a growing trend that consists of frequently updating and releasing code via automation. Every change to the codebase is processed through an automated pipeline: it is tested and, once merged into the main branch, deployed (that is, a new version of your code is released).

The principle behind CI is that by testing every new addition to your codebase, you can catch bugs early and improve the quality of your code before it is deployed.
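As an illustration of the kind of check a CI pipeline might run on every change, here is a minimal sketch using Python's standard unittest module. The function under test (normalise) and the helper run_checks are hypothetical examples, not part of any particular pipeline.

```python
import unittest


def normalise(values):
    """Hypothetical function under test: scale numbers so that they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalise values that sum to zero")
    return [v / total for v in values]


class NormaliseTest(unittest.TestCase):
    def test_sums_to_one(self):
        self.assertAlmostEqual(sum(normalise([1, 2, 3])), 1.0)

    def test_rejects_zero_total(self):
        with self.assertRaises(ValueError):
            normalise([0, 0])


def run_checks():
    """Run the suite and report whether every test passed, as a CI step would."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormaliseTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


if __name__ == "__main__":
    # A non-zero exit code is what typically makes a pipeline step fail.
    raise SystemExit(0 if run_checks() else 1)
```

A pipeline would typically run a script like this on every push and refuse the merge when it exits with a non-zero status.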

The principle behind CD is: an efficient deployment pipeline will allow you to release more often with little…


Docker is an application for managing Linux containers on top of an existing OS. It provides a virtualisation layer (the Docker Engine), so any commands and operations run inside the container remain the same regardless of the OS on which Docker is set up. Docker relies solely on the host OS; as a result, the only compatibility issue we can run into is whether the OS supports Docker. This greatly simplifies sharing code and transitioning from development to production.

A great advantage of containers over traditional virtual machines is that they are lightweight, as they do…


Recent developments in data storage and processing have been motivated by the increasing amount and complexity of data available to individuals and companies. Most of these advancements require sophisticated and powerful hardware. Aspiring data scientists must be able to understand and master these new tools.

For instance, given the computing power required to train algorithms, training is usually carried out “on the cloud” (i.e. by remote access to a virtual machine), which avoids having to buy and maintain very expensive hardware, especially if peak usage is only occasional. …


Large data set with low variance:

original dataset:

k-nearest neighbours:


OpenCV is an open source computer vision library launched in 1999 by Intel Research. It is written in C++, but bindings for Python and Matlab are available. The project has been supported by Willow Garage since 2008 and is under active development. OpenCV provides tools for many computer vision applications such as image and gesture recognition, motion tracking and mobile robotics. Computer vision is closely related to machine learning, so OpenCV has a module that implements many traditional algorithms. More recently, OpenCV 3 added support for deep learning algorithms.

I decided to do some experiments quite close to the…


When I first started using Xcode for my C++ projects I was a bit overwhelmed by the settings interface. Sure, I had used Xcode heavily for Objective-C and then Swift projects, but there most of the Build Settings are already set up to run any iOS/macOS app. For C++, however, it gets a bit trickier. Here is a solution I found that works for C++ libraries built with CMake, make or Homebrew. I hope it turns out to be useful for anyone coding in C++ with Xcode.

Setting up Xcode for C++ projects is a four-step process:

a. defining the build…


Armadillo is a C++ template library with a very complete interface for matrix computation and linear algebra. It lets you quickly get working code without compromising on the performance of the language.

This tutorial covers installing the library (on macOS only for now), running some simple code and a tour of its rich interface. An alternative to Armadillo is Eigen.

Armadillo is developed and maintained by NICTA (National ICT Australia). Their goal was to provide an interface as easy to use as Matlab's for users of the low-level language that is…

René-Jean Corneille

Principal Data scientist. I write about Machine learning, C++ and Python coding.
