
Data Science @H2O.ai | Editor @wicds

Statistical tests and analysis can be confounded by a simple misunderstanding of the data

Statistics rarely offers a single “right” way of doing anything — Charles Wheelan in Naked Statistics

In 1996, Appleton, French, and Vanderpump published a study on the effect of smoking. The study followed 1314 English women over twenty years. Contrary to common belief, it showed that smokers tended to live longer than non-smokers. Even though I am not an expert on the effects of smoking on human health, this finding is disturbing. …
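The reversal described above is the textbook illustration of Simpson's paradox: a trend that holds within every subgroup can flip when the groups are pooled, here because the smokers in the sample skewed younger. The numbers below are made up for illustration only (they are not from the study) — a minimal sketch in pandas:

```python
import pandas as pd

# Illustrative, made-up counts: within every age band non-smokers fare
# better, yet the pooled rates flip because smokers skew younger.
df = pd.DataFrame({
    "age_band": ["18-44", "18-44", "65+", "65+"],
    "smoker":   [True, False, True, False],
    "n":        [400, 200, 50, 300],
    "deaths":   [20, 8, 40, 220],
})

# Pooled death rate: smokers look *better* overall...
agg = df.groupby("smoker")[["deaths", "n"]].sum()
agg["rate"] = agg["deaths"] / agg["n"]   # smokers 0.13 vs non-smokers 0.46

# ...but within each age band, smokers do worse.
by_age = df.set_index(["age_band", "smoker"])
by_age["rate"] = by_age["deaths"] / by_age["n"]
```

Conditioning on the confounder (age) recovers the expected direction of the effect — exactly the kind of misunderstanding of the data the article warns about.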


In conversation with Dmitry Gordeev: A Data Scientist and a Kaggle Competition Grandmaster

In this series of interviews, I present the stories of established Data Scientists and Kaggle Grandmasters at H2O.ai, who share their journey, inspirations, and accomplishments. These interviews are intended to motivate and encourage others who want to understand what it takes to be a Kaggle Grandmaster.

In this interview, I shall be sharing my interaction with Dmitry Gordeev, also known as dott in the Kaggle world. He is a Kaggle Competitions Grandmaster and a Senior Data Scientist at H2O.ai. Dmitry studied at Moscow State University and graduated as a specialist in applied math/data mining. …


An overview and a tour of the course content

Massive Open Online Courses (MOOCs) are an indispensable part of the life of a self-taught data scientist. If you are in a room full of wanna-be data scientists, the chances are that fifty percent of them have taken the famous Machine Learning course by Andrew Ng. However, here is the twist. Even though many of us get enrolled in various online courses, only a handful complete them. In fact, a study titled Why MOOCs Didn’t Work, in 3 Data Points, claims that the completion and retention rates of online courses are minimal. …


A whirlwind tour of five libraries that could be a great addition to your Data Science stack

Open source is the backbone of machine learning; the two go hand in hand. The rapid advancements in this field wouldn’t have been possible without the contributions of the open-source community. Many of the most widely used tools in machine learning are open source, and every year more libraries get added to this ecosystem. In this article, I present a quick tour of some libraries I recently encountered that could be a great supplement to your machine learning stack.

1️⃣. Hummingbird

Hummingbird is a library for compiling trained traditional machine learning models into tensor computations. This means you…


Important caveats to be kept in mind when encoding data with pandas.get_dummies()

Handling categorical variables forms an essential component of a machine learning pipeline. While machine learning algorithms can naturally handle numerical variables, the same is not true of their categorical counterparts. Although algorithms like LightGBM and CatBoost can inherently handle categorical variables, that is not the case with most others. These categorical variables have to be first converted into numerical quantities before being fed into machine learning algorithms. There are many ways to encode categorical variables, such as one-hot encoding, ordinal encoding, label encoding, etc. …
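One well-known `pd.get_dummies()` caveat is worth a quick sketch: encoding train and test data separately produces mismatched columns whenever the test set contains an unseen category (the toy frames below are made up for illustration):

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "yellow"]})  # "yellow" unseen in train

train_enc = pd.get_dummies(train, columns=["color"])
test_enc = pd.get_dummies(test, columns=["color"])

# Caveat: train_enc has color_blue/color_green/color_red, while test_enc
# has color_red/color_yellow — a model trained on train_enc would break.
# Reindexing against the training columns is one common fix: unseen
# categories are silently dropped, missing ones filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

This is also why a fitted encoder such as scikit-learn's `OneHotEncoder` is often preferred in production pipelines: it remembers the training-time categories.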


Hands-on Tutorials

Streamline your data science code repository and tooling quickly and efficiently

Good Code is its own best documentation

Dr. Rachael Tatman, in one of her presentations, highlighted the importance of code reproducibility in a very subtle way:

“Why should you care about reproducibility? Because the person most likely to need to reproduce your work… is you.”

This is true on so many levels. Have you ever found yourself in a situation where it became difficult to decipher your own codebase? Do you often end up with multiple files like untitled1.py or untitled2.ipynb? Well, if not all, a few of us must undoubtedly have borne the brunt of bad coding practices on…


Building interpretable Boosting Models with InterpretML

As summed up by Miller, interpretability refers to the degree to which a human can understand the cause of a decision. A common notion in the machine learning community is that a trade-off exists between accuracy and interpretability: learning methods that are more accurate offer less interpretability, and vice versa. However, of late, there has been a lot of emphasis on creating inherently interpretable models and doing away with their black-box counterparts. In fact, Cynthia Rudin argues that explainable black boxes should be entirely avoided for high-stakes prediction applications that deeply impact human lives. …


Hands-on Tutorials

An open-source package for decision tree visualization and model interpretation

It is rightly said that a picture is worth a thousand words. This axiom applies equally to machine learning models. If one can visualize and interpret the results, it instills more confidence in the model's predictions. Visualizing how a machine learning model works also makes it possible to explain the results to people with little or no machine learning background. The scikit-learn library comes with built-in plotting capability for decision trees via the sklearn.tree.export_graphviz function. However, there are some inconsistencies with the default options. …
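For reference, this is what the built-in scikit-learn route looks like; the keyword arguments shown override some of the defaults the article questions (unfilled, square nodes), and the iris dataset stands in as an example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Export the fitted tree as Graphviz DOT source (returned as a string
# when out_file is left as None).
dot_source = export_graphviz(
    clf,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,    # color nodes by majority class
    rounded=True,   # rounded node corners
)
```

The DOT string can then be rendered with the graphviz package or an online viewer; dedicated packages such as dtreeviz go further by drawing the feature-space splits themselves.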


Tips and Tricks

Making the most of Google Colab notebooks

Colaboratory, or “Colab” for short, is a hosted Jupyter Notebook service from Google that lets you write and execute Python code in your browser. It is effortless to spin up a Colab since it is directly integrated with your Google account. Colab provides free access to GPUs and TPUs, requires zero configuration, and makes sharing code seamless.

Colab has an interesting history. It started as an internal tool for data analysis at Google. It was later launched publicly, and since then, many people have been using it to accomplish their machine learning tasks. …


Notes from Industry

In conversation with Guanshuo Xu: A Data Scientist, Kaggle Competitions Grandmaster (Rank 1), and a Ph.D. in Electrical Engineering.

In this series of interviews, I present the stories of established Data Scientists and Kaggle Grandmasters at H2O.ai, who share their journey, inspirations, and accomplishments. The intention behind these interviews is to motivate and encourage others who want to understand what it takes to be a Kaggle Grandmaster.

In this article, I shall be sharing my interaction with Guanshuo Xu. He is a Kaggle Competitions Grandmaster and a Data Scientist at H2O.ai. Guanshuo obtained his Ph.D. in Electrical & Electronics Engineering at the New Jersey Institute of Technology, focusing on machine learning-based image forensics and steganalysis.

Guanshuo is a man…

Parul Pandey
