Data Science Evangelist @H2O.ai | Editor @wicds

Hands-on Tutorials

Parse XML files with Python’s ElementTree package

Website vector created by stories — www.freepik.com

Real-world data is messy, and we know it. Not only does such data require a lot of cleaning, but the format in which we receive it is often unsuited for analysis. This means that before the analysis even begins, the data has to undergo a series of transformations to get it into a format that is easy to work with. This happens mostly when the data is either scraped from the web or provided in the form of documents. I came across a pretty similar dataset, which was in the…
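As a rough illustration of the kind of transformation this article is about, here is a minimal sketch that uses the standard library's xml.etree.ElementTree to turn a small XML document into a list of flat records. The XML content, the tag names (catalog, book, title, price), and the variable names are all made up for this example and are not taken from the article's dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document, invented purely for illustration.
xml_text = """
<catalog>
    <book id="b1"><title>First</title><price>10.50</price></book>
    <book id="b2"><title>Second</title><price>7.25</price></book>
</catalog>
""".strip()

# Parse the string and get the root element (<catalog>).
root = ET.fromstring(xml_text)

# Flatten each <book> element into a plain dictionary.
rows = []
for book in root.findall("book"):
    rows.append({
        "id": book.get("id"),                     # attribute on <book>
        "title": book.findtext("title"),          # text of the <title> child
        "price": float(book.findtext("price")),   # text of <price>, as a number
    })

for row in rows:
    print(row)
```

A list of dictionaries like this can be handed to a dataframe library for analysis. For a file on disk, ET.parse(path).getroot() plays the role of ET.fromstring here.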


Rethinking a Visual dataframe workflow based on the user’s intent

Photo by Luis Tosta on Unsplash

“Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those that we believe to be there.” — John W Tukey

The importance of data visualization in data science cannot be emphasized enough. The adage that a picture is worth a thousand words applies to every stage of a data project's life cycle. However, the tools that enable these visualizations often aren’t intelligent enough. This essentially means that while we have hundreds of visualization libraries, most of…


In conversation with Fatih Öztürk: A Data Scientist and a Kaggle Competition Grandmaster.

Image by Author

In this series of interviews, I present the stories of established Data Scientists and Kaggle Grandmasters at H2O.ai, who share their journey, inspirations, and accomplishments. These interviews are intended to motivate and encourage others who want to understand what it takes to be a Kaggle Grandmaster.

In this interview, I share my conversation with Fatih Öztürk. He is a Kaggle Competitions Grandmaster and a Data Scientist at H2O.ai. Fatih obtained a Bachelor’s in industrial engineering with honors at Boğaziçi University, Istanbul. He worked as a Data Scientist at UrbanStat before joining H2O.ai. Fatih joined Kaggle almost four…


Learn how to cluster your data in Tableau easily

Image by Nicky ❕❣️ PLEASE STAY SAFE ❣️❕ from Pixabay

Consider a situation where you have some sales data belonging to your company. Let’s say you want to discover a pattern in your consumers' spending capacity. If you could uncover distinct groups or associations within the data, your company could target the different groups based on their preferences. The basic idea behind this intuition is called clustering, and Tableau has a built-in feature that can automatically cluster similar data points based on certain attributes. In this article, we will explore this functionality of Tableau and see how we can apply the clustering method to a real-world dataset.

What is Clustering?


The best way to learn data science is by doing it

https://www.freepik.com/vectors/data

If you are just getting started in Data Science and looking for some cool datasets to play with, this might be the article for you. A lot of courses and books never really move beyond the classic Titanic and Iris datasets. Not that there is any harm in that, but these datasets have become so familiar that people even know the number of missing values or the number of string columns in them. This article is therefore a chance to learn about some fresh datasets to tinker with.

Palmer Archipelago penguin data


Get notified when your long-running cell finishes execution.

Photo by Manja Vitolic on Unsplash

If you are a Jupyter Notebook user, you have probably had a cell that took a long time to finish executing. This is particularly common during model training in machine learning, hyperparameter optimization, or other lengthy computations. In such cases, a browser notification that informs you once the process is finished could come in really handy. This way, you would be able to navigate to other tabs and only return to your machine learning experiment once you get that completion notification. Well, it turns out that there is a Jupyter extension that…


Managing large datasets on Kaggle without fearing the out-of-memory error

Image by user

Datatable is a Python package for manipulating large dataframes. It has been created to provide big data support and enable high performance. This toolkit resembles pandas very closely but is more focused on speed. It supports out-of-memory datasets, multi-threaded data processing, and has a flexible API. In the past, we have written a couple of articles that explain in detail how to use datatable for reading, processing, and writing tabular datasets at incredible speed:

These two articles compare datatable’s performance with the pandas library on…


Learn how to analyze data in the form of a dynamic quadrant chart in Tableau

Image by Author

If you are a part of the Data Science ecosystem, you must have heard about Gartner’s Magic Quadrant (MQ). These MQs are a series of reports containing the market research and analysis of several technology companies. They are among the most anticipated reports in this space. Here is the 2020 Magic Quadrant for Analytics and Business Intelligence platforms. You can clearly see the four distinct quadrants denoting four different categories.


Learn about some of Python’s built-in methods that can be used on strings

Photo by Paul Volkmer on Unsplash

Before PEP 498 was introduced, Python had three primary ways of formatting strings: %-formatting, str.format(), and string.Template. In 2015, Eric V. Smith proposed a new string formatting mechanism known as Literal String Interpolation, which provided a simpler and more effective way of formatting strings. These strings are referred to as formatted string literals, or f-strings, as they are prefixed with f or F.

To begin, we’ll quickly look at the different string formatting styles in Python and how f-strings outshine the rest. …
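To make the comparison concrete, here is a small sketch that builds the same string with each formatting style side by side. The variable names and values are invented for illustration and do not come from the article.

```python
from string import Template

# Hypothetical values, used only for this comparison.
name, score = "Ada", 0.91234

# 1. %-formatting (printf style)
percent_style = "%s scored %.2f" % (name, score)

# 2. str.format()
format_style = "{} scored {:.2f}".format(name, score)

# 3. string.Template (no format specs, so the rounding is done beforehand)
template_style = Template("$name scored $score").substitute(
    name=name, score=round(score, 2)
)

# 4. f-string (PEP 498, Python 3.6+): expressions sit inline in the literal
fstring_style = f"{name} scored {score:.2f}"

for text in (percent_style, format_style, template_style, fstring_style):
    print(text)
```

All four lines print the same text. The f-string version keeps the values next to the places where they appear, which is a large part of its appeal.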


In conversation with Philipp Singer: A Data Scientist, Kaggle Double Grandmaster, and a Ph.D. in Computer Science.

In this series of interviews, I present the stories of established Data Scientists and Kaggle Grandmasters at H2O.ai, who share their journey, inspirations, and accomplishments. These interviews are intended to motivate and encourage others who want to understand what it takes to be a Kaggle Grandmaster.

In this interview, I share my conversation with Philipp Singer, better known as Psi in the Kaggle world. He is a Kaggle Double Grandmaster and a Senior Data Scientist at H2O.ai. Philipp obtained his Ph.D. …
