Data Science Evangelist @H2O.ai | Editor @wicds

Parse XML files with Python’s ElementTree package

Real-world data is messy, and we know it. Not only does such data require a lot of cleaning, but the format in which we receive it is often unsuited for analysis. This means that before the analysis even begins, the data has to undergo a series of transformations to get it into a format that is easy to work with. This happens mostly when the data is either scraped from the web or provided in the form of documents. I came across a similar dataset, which was in the form of various XML files. …
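As a rough illustration of the kind of transformation described above, here is a minimal sketch that turns XML into a list of flat records using the standard library's xml.etree.ElementTree. The XML payload, the tag names (catalog, book, title, price), and the helper xml_to_records are hypothetical and are not taken from the article's dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload; element and attribute names are illustrative only.
XML_TEXT = """
<catalog>
  <book id="b1"><title>First</title><price>10.5</price></book>
  <book id="b2"><title>Second</title><price>7.0</price></book>
</catalog>
"""


def xml_to_records(xml_text):
    """Convert each <book> element into a flat dictionary."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for book in root.findall("book"):
        records.append(
            {
                "id": book.get("id"),               # attribute value
                "title": book.findtext("title"),    # text of a child element
                "price": float(book.findtext("price")),
            }
        )
    return records


records = xml_to_records(XML_TEXT)
print(records)
```

A list of dictionaries like this can be passed straight to a tabular tool for analysis; for files on disk, ET.parse(path).getroot() would replace ET.fromstring.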


Rethinking a Visual dataframe workflow based on the user’s intent

“Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those that we believe to be there.” — John W Tukey

The importance of data visualization in data science cannot be emphasized enough. The adage that a picture is worth a thousand words applies to every stage of a data project's life cycle. However, the tools that enable these visualizations often aren’t intelligent enough: while we have hundreds of visualization libraries, most of them require users to write a substantial amount of code to plot even a single graph. …


In conversation with Philipp Singer: A Data Scientist, Kaggle Double Grandmaster, and a Ph.D. in Computer Science.

In this series of interviews, I present the stories of established Data Scientists and Kaggle Grandmasters at H2O.ai, who share their journey, inspirations, and accomplishments. These interviews are intended to motivate and encourage others who want to understand what it takes to be a Kaggle Grandmaster.

In this interview, I share my conversation with Philipp Singer, better known as Psi in the Kaggle world. He is a Kaggle Double Grandmaster and a Senior Data Scientist at H2O.ai. Philipp obtained his Ph.D. …


Data Visualisation is an essential step in any data science pipeline. Exploring your data visually opens your mind to a lot of things that might not be visible otherwise.

There are several useful libraries for doing visualization with Python, like matplotlib or seaborn. These libraries are intuitive and simple to use. There’s also pandas, which is mainly a data analysis tool, but it also provides multiple options for visualization.

Plotting with pandas is pretty straightforward. In this article, we’ll look at how to explore and visualize your data with pandas, and then we’ll dive deeper into some of the advanced capabilities for visualization with pandas. …
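To give a flavor of what plotting with pandas looks like, here is a minimal sketch. It assumes pandas and matplotlib are installed; the DataFrame contents and column names (month, sales, returns) are made up for illustration and do not come from the article.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import pandas as pd

# Hypothetical data, used only to demonstrate the plotting call.
df = pd.DataFrame(
    {
        "month": [1, 2, 3, 4],
        "sales": [10, 12, 9, 15],
        "returns": [1, 2, 1, 3],
    }
)

# DataFrame.plot wraps matplotlib and returns an Axes object.
ax = df.plot(x="month", y=["sales", "returns"], kind="line", title="Monthly units")
ax.set_ylabel("units")
```

Because the call returns a regular matplotlib Axes, any further customization (labels, limits, saving via ax.figure.savefig) uses the usual matplotlib API.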


Office Hours

In conversation with Gábor Fodor: A Data Scientist at H2O.ai and a Kaggle Competitions’ Grandmaster.

In this series of interviews, I present the stories of established Data Scientists and Kaggle Grandmasters at H2O.ai, who share their journey, inspirations, and accomplishments. These interviews are intended to motivate and encourage others who want to understand what it takes to be a Kaggle Grandmaster.

In this interview, I share my conversation with Gábor Fodor, better known as Beluga in the Kaggle world. He is a Kaggle Competitions Grandmaster and a Data Scientist at H2O.ai. Gábor, who hails from Hungary, holds master’s degrees in Mathematics and Computer Engineering and has around ten years of experience in the Data Science domain. He joined Kaggle nine years ago and has made quite a mark there since. …


An introductory guide to getting started with Time Series Analysis in Python

Time series analysis is the backbone of many companies, since most businesses analyze their past data to inform future decisions. Analyzing such data can be tricky, but Python can help. It has both built-in tools and external libraries that make the analysis process easier. Python’s pandas library is frequently used to import, manage, and analyze datasets in various formats. In this article, we’ll use it to analyze stock prices and perform some basic time-series operations.

Time Series Data

Time series data is a sequence of data points listed in time order: a set of observations taken at specified times, usually at equal intervals. Time series data is pretty common in our day-to-day lives, and some common examples…
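A minimal sketch of equally spaced observations and a few basic time-series operations in pandas follows. The prices are invented for illustration and are not real stock data; the variable names are likewise assumptions.

```python
import pandas as pd

# Hypothetical daily closing prices, one observation per day.
index = pd.date_range("2020-01-01", periods=6, freq="D")
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0, 107.0, 106.0], index=index, name="close"
)

# 3-day moving average; the first two entries are NaN (window not yet full).
rolling_mean = prices.rolling(window=3).mean()

# Day-over-day fractional change.
daily_change = prices.pct_change()

# Downsample to 2-day bins, averaging the prices within each bin.
two_day_mean = prices.resample("2D").mean()

print(rolling_mean)
print(daily_change)
print(two_day_mean)
```

The DatetimeIndex is what enables operations such as resample; a plain integer index would not support them.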


Summary of the WiCDS (Women in Coding and Data Science) Online Meetup.

Women in Coding & Data Science (WiCDS) held an online Meetup on 22nd November 2020. Here is a quick summary of the vital points covered in the presentation. For the detailed presentation, refer to the video below.

Disclaimer: All the images used in this article have been taken from the presentation. You can access the presentation from here.

Speaker

Preeti Ravikiran — Assistant Professor & Program Chair — School of Science — B.Sc (Analytics & Applied Statistics) at NMIMS Bangalore.

Preeti comes with extensive experience in both the corporate world and academia. She has worked with domain experts at global banks, NBFCs, and manufacturing companies worldwide. Her areas of interest are Quantitative Techniques, Business Analytics, and Quantitative Finance. …


Create model documentation for Supervised learning models in H2O-3 and Scikit-Learn — in minutes

Proper documentation is essential not only for your own credibility but the credibility of the organization you represent.

The Federal Reserve’s 2011 guidelines state that model risk assessment and management would be ineffective without adequate documentation. A similar requirement is put forward today by many regulatory and corporate governance bodies. This clearly means that model documentation today is more of a necessity than a choice. However, there is still no denying that it is one of the most time-consuming jobs for a data scientist. …


Making Sense of Big Data

A series on Image data augmentation libraries in Python

Deep learning techniques have found great success in computer vision tasks such as image recognition, image segmentation, and object detection. Such techniques depend heavily on large amounts of data to avoid overfitting. So what do you do when you have limited data? You go for data augmentation. Data augmentation techniques increase the size and quality of training datasets so that better deep learning models can be built with them. In this article, we shall look at some of the standard image augmentation techniques and some popular libraries that help achieve these augmentations.

Types of Augmentation Techniques

Augmentations can be broadly classified as offline or online, based on when they are performed in the pipeline. …
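The distinction can be sketched with NumPy, under the common reading that offline augmentation enlarges the stored dataset once before training, while online augmentation transforms each batch on the fly. The function names and the choice of a horizontal flip are illustrative assumptions, not the article's code.

```python
import numpy as np


def horizontal_flip(image):
    """Mirror a single (height, width) image left to right."""
    return image[:, ::-1]


def offline_augment(images):
    """Offline: build the enlarged dataset once, before training starts."""
    flipped = np.stack([horizontal_flip(im) for im in images])
    return np.concatenate([images, flipped])


def online_batches(images, batch_size, rng):
    """Online: randomly flip images per batch; the stored dataset is unchanged."""
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size].copy()
        flip_mask = rng.random(len(batch)) < 0.5  # flip each image with p = 0.5
        batch[flip_mask] = batch[flip_mask][:, :, ::-1]
        yield batch


# Two tiny 2x3 "images" stand in for a real dataset.
images = np.arange(12).reshape(2, 2, 3)

augmented = offline_augment(images)  # stored dataset doubles in size
batches = list(online_batches(images, batch_size=1, rng=np.random.default_rng(0)))
```

The offline variant costs storage but no per-step compute, while the online variant keeps the dataset small and can show the model a different variation in every epoch.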


Learn how to use the H2O Aggregator function to reduce the size of data effectively

Exploratory data analysis is one of the most essential parts of any data processing pipeline. However, when the volume of data is large, visualizations become hard to read. If we were to plot millions of data points, it would become impossible to discern individual points from each other. The visualized output in such a case may be pleasing to the eye but offers no statistical benefit to the analyst. Researchers have devised several methods to tame massive datasets for better analysis. …
