Data Science @H2O.ai | Editor @wicds

Hands-on Tutorials

Parse XML files with Python’s ElementTree package

Website vector created by stories — www.freepik.com

Real-world data is messy, and we know it. Not only does such data require a lot of cleaning, but the format in which we receive it is often not suited for analysis either. This means that before the analysis even begins, the data has to undergo a series of transformations to get it into a format that is easy to work with. This happens mostly when the data is either scraped from the web or provided in the form of documents. I came across a pretty similar dataset, which was in the…


Rethinking a Visual dataframe workflow based on the user’s intent

Photo by Luis Tosta on Unsplash

“Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those that we believe to be there.” — John W Tukey

The importance of data visualization in data science cannot be emphasized enough. The adage that a picture is worth a thousand words applies aptly to the life cycle of any data project. However, a lot of times, the tools that enable these visualizations aren’t intelligent enough. This essentially means that while we have hundreds of visualization libraries, most of…


Tips and Tricks

A deep dive into some of the parameters of the read_csv() function in pandas

Time vector created by stories, www.freepik.com

Pandas is one of the most widely used libraries in the data science ecosystem. This versatile library gives us tools to read, explore, and manipulate data in Python. The primary tool used for data import in pandas is read_csv(). This function accepts the file path of a comma-separated values (CSV) file as input and directly returns a pandas DataFrame. A CSV file is a delimited text file that uses a comma to separate values.
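The import step described above can be sketched as follows. This is a minimal illustration, assuming pandas is installed; the CSV content and column names are made up, and an in-memory buffer stands in for a file path so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content; passing a file path to read_csv works the same way.
csv_text = "name,score\nAda,90\nLinus,85\n"

# read_csv parses the comma-separated text and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names taken from the header row
```

By default the first line is treated as the header and the column types are inferred from the values.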


Harnessing the true potential of AI by enabling every employee, customer, and citizen with sophisticated AI technology and easy-to-use AI applications.

Democratization is an essential step in the development of AI, and AutoML technologies lie at the heart of it. AutoML tools have played a pivotal role in transforming the way we consume and understand data. Given the impact that data and predictive analytics can have in addressing critical problems, it becomes imperative to make the power of AI available to a wide variety of users in an organization. This is essential to address day-to-day needs and deliver greater insights…


Ten ways to sort data in pandas

Health photo created by freepik

My tryst with the pandas library continues. Of late, I have been trying to look deeper into this library and consolidate some of its features in bite-sized articles. I have written articles on reducing memory usage while working with pandas, converting XML files into a pandas dataframe easily, getting started with time series in pandas, and many more. In this article, I’ll touch upon a very useful aspect of data analysis: sorting. We’ll begin with a brief introduction and then quickly jump into some ways to perform sorting efficiently in pandas.
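As a small taste of what sorting in pandas looks like, here is a hedged sketch using DataFrame.sort_values. The dataframe and its column names are invented for illustration and are not taken from the article.

```python
import pandas as pd

# Hypothetical sales data used only for illustration.
df = pd.DataFrame(
    {
        "city": ["Pune", "Delhi", "Agra", "Delhi"],
        "sales": [30, 10, 20, 40],
    }
)

# Sort by a single column, largest values first.
by_sales = df.sort_values("sales", ascending=False)

# Sort by city (A to Z), breaking ties by sales (largest first).
by_city_then_sales = df.sort_values(["city", "sales"], ascending=[True, False])

print(by_sales)
print(by_city_then_sales)
```

Both calls return a new dataframe and leave the original untouched.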

Sorting

If you are an excel…


Making Sense of Big Data

Optimizing pandas memory usage by the effective use of datatypes

Photo by Tolga Ulkan on Unsplash

Managing large datasets with pandas is a pretty common issue. As a result, a lot of libraries and tools have been developed to ease that pain. Take, for instance, the pydatatable library mentioned below.

Despite this, there are a few tricks and tips that can help us manage the memory issue with pandas to an extent. They might not offer the best solution, but the tricks can prove to be handy at times. Hence there is no harm in getting to know them. …
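One such trick can be sketched as follows: choosing smaller or more suitable datatypes for columns. This is an illustrative example under assumed data, not necessarily the exact approach the article takes; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: small integers and a string column with few distinct values.
df = pd.DataFrame(
    {
        "age": list(range(100)) * 100,
        "city": ["Pune", "Delhi"] * 5000,
    }
)

before = df.memory_usage(deep=True).sum()

# Downcast integers to the smallest signed type that can hold the values.
df["age"] = pd.to_numeric(df["age"], downcast="integer")

# Store repeated strings as categories instead of individual objects.
df["city"] = df["city"].astype("category")

after = df.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```

The saving depends on the data: downcasting helps when values fit in a narrower type, and the category type helps when a column has few distinct values relative to its length.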


A look at the most used package management system in Python

Image by Author

An in-depth article published in February 2020 by Sebastian Raschka et al. studies the role and importance of Python in the machine learning ecosystem. The paper, titled Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence, puts forward a fascinating observation that I’d like to quote here:

Historically, a wide range of different programming languages and environments have been used to enable machine learning research and application development. …


In conversation with Fatih Öztürk: A Data Scientist and a Kaggle Competition Grandmaster.

Image by Author

In this series of interviews, I present the stories of established Data Scientists and Kaggle Grandmasters at H2O.ai, who share their journey, inspirations, and accomplishments. These interviews are intended to motivate and encourage others who want to understand what it takes to be a Kaggle Grandmaster.

In this interview, I share my interaction with Fatih Öztürk. He is a Kaggle Competitions Grandmaster and a Data Scientist at H2O.ai. Fatih obtained a Bachelor’s in industrial engineering with honors at Boğaziçi University, Istanbul. He worked as a Data Scientist at UrbanStat before joining H2O.ai. Fatih joined Kaggle almost four…


Learn how to cluster your data in Tableau easily

Image by Nicky ❕❣️ PLEASE STAY SAFE ❣️❕ from Pixabay

Consider a situation where you have some sales data belonging to your company. Let’s say you want to discover a pattern in the consumers' spending capacity. If you could uncover distinct groups or associations within the data, your company could target the different groups based on their preferences. The basic idea behind this intuition is called clustering, and Tableau has a built-in feature that can automatically cluster similar data points based on certain attributes. In this article, we will explore this functionality of Tableau and see how we can apply the clustering method to a real-world dataset.

What is Clustering?


The best way to learn data science is by doing it

https://www.freepik.com/vectors/data

If you are just getting started in data science and looking for some cool datasets to play with, this might be the article for you. A lot of courses and books never really move beyond the classic Titanic and Iris datasets. Not that there is any harm in that, but some people are so familiar with these datasets that they even know the number of missing values or the number of string columns in them. Therefore, this article offers a fresh chance to learn about some great datasets to tinker with.

Palmer Archipelago penguin data
