Data Science @H2O.ai | Editor @wicds

Tips and Tricks

Making the most of Google Colab notebooks

Source: https://pixabay.com/images/id-4905013/

Colaboratory, or “Colab” for short, is Google's hosted Jupyter Notebook service. It lets you write and execute Python code in your browser. Spinning up a Colab notebook is effortless since it is directly integrated with your Google account. Colab provides free access to GPUs and TPUs, requires zero configuration, and makes sharing code seamless.

Colab has an interesting history. It started as an internal tool for data analysis at Google. It was later launched publicly, and since then, many people have been using it for their machine learning tasks. …


Statistical tests and analysis can be confounded by a simple misunderstanding of the data

Photo by Brendan Church on Unsplash

“Statistics rarely offers a single ‘right’ way of doing anything.” (Charles Wheelan, Naked Statistics)

In 1996, Appleton, French, and Vanderpump conducted a study of the effect of smoking on a sample of people. The study ran over twenty years and included 1314 English women. Contrary to common belief, it showed that smokers tend to live longer than non-smokers. Even though I am not an expert on the effects of smoking on human health, this finding is disturbing. …


Notes from Industry

In conversation with Guanshuo Xu: A Data Scientist, Kaggle Competitions Grandmaster (Rank 1), and a Ph.D. in Electrical Engineering.

Image by Author

In this series of interviews, I present the stories of established Data Scientists and Kaggle Grandmasters at H2O.ai, who share their journey, inspirations, and accomplishments. The intention behind these interviews is to motivate and encourage others who want to understand what it takes to be a Kaggle Grandmaster.

In this article, I shall be sharing my interaction with Guanshuo Xu. He is a Kaggle Competitions Grandmaster and a Data Scientist at H2O.ai. Guanshuo obtained his Ph.D. in Electrical & Electronics Engineering at the New Jersey Institute of Technology, focusing on machine learning-based image forensics and steganalysis.

Guanshuo is a man…


Hands-on Tutorials

A tutorial on creating Plotly and Bokeh plots directly with Pandas plotting syntax

Infographic vector created by macrovector — www.freepik.com

Data exploration is by far one of the most important aspects of any data analysis task. The initial probing and preliminary checks that we perform, using the vast catalog of visualization tools, give us actionable insights into the nature of data. However, the choice of visualization tool at times is more complicated than the task itself. On the one hand, we have libraries that are easier to use but are not so helpful in showing complex relationships in data. Then there are others that render interactivity but have a considerable learning curve. …


Originally published at https://www.h2o.ai on April 21, 2021.

Photo by Alina Grubnyak on Unsplash

It is impossible to deploy successful AI models without analyzing the risks involved. Model overfitting, perpetuation of historical human bias, and data drift are some of the concerns that need to be addressed before putting models into production. At H2O.ai, Machine Learning Interpretability (MLI) is an integral part of our ML products. This deep commitment to better machine learning has been built directly into our suite of products, enabling data scientists and business users to better understand what their model is thinking.

H2O.ai has built…


Tips and Tricks

A deep dive into some of the parameters of the read_csv function in pandas

Time vector created by stories (www.freepik.com)

Pandas is one of the most widely used libraries in the Data Science ecosystem. This versatile library gives us tools to read, explore, and manipulate data in Python. The primary tool for data import in pandas is read_csv(). This function accepts the file path of a comma-separated values (CSV) file as input and directly returns a pandas DataFrame. A CSV file is a delimited text file that uses a comma to separate values.
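As a minimal sketch of that workflow, the snippet below reads a tiny CSV into a DataFrame. The CSV text and its column names are made up for illustration, and an in-memory buffer stands in for a file path so the example runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would live in a file on disk.
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a file path or a file-like object and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (rows, columns)
print(list(df.columns))  # header row becomes the column names
```

Passing a path such as a local file name in place of the buffer works the same way.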


Harnessing the true potential of AI by enabling every employee, customer, and citizen with sophisticated AI technology and easy-to-use AI applications.

Democratization is an essential step in the development of AI, and AutoML technologies lie at the heart of it. AutoML tools have played a pivotal role in transforming the way we consume and understand data. Given the impact that data and predictive analytics can have in addressing critical problems, it becomes imperative to make the power of AI available to a wide variety of users in an organization. This is essential to address day-to-day needs and deliver greater insights…


Ten ways to sort data in pandas

Health photo created by freepik

My tryst with the pandas library continues. Of late, I have been trying to look deeper into this library and consolidate some of its features in byte-sized articles. I have written articles on reducing memory usage while working with pandas, converting XML files into a pandas dataframe easily, getting started with time series in pandas, and many more. In this article, I’ll touch upon a very useful aspect of data analysis: sorting. We’ll begin with a brief introduction and then quickly jump into some ways to perform sorting efficiently in pandas.

Sorting

If you are an Excel…


Making Sense of Big Data

Optimizing pandas memory usage by the effective use of datatypes

Photo by Tolga Ulkan on Unsplash

Managing large datasets with pandas is a pretty common issue. As a result, a lot of libraries and tools have been developed to ease that pain. Take, for instance, the pydatatable library mentioned below.

Despite this, there are a few tricks and tips that can help us manage the memory issue with pandas to an extent. They might not offer the best solution, but the tricks can prove to be handy at times. Hence there is no harm in getting to know them. …
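One way such a datatype trick might look is sketched below. The DataFrame, its column names, and the choice of downcasting integers and converting repeated strings to the category dtype are assumptions made for illustration, not necessarily the tricks the article goes on to cover.

```python
import numpy as np
import pandas as pd

# Made-up data: small integers stored as int64 and a low-cardinality text column.
df = pd.DataFrame({
    "count": np.arange(1000, dtype="int64") % 100,
    "city": ["Paris", "Delhi"] * 500,
})

before = df.memory_usage(deep=True).sum()

# Values 0..99 fit in the smallest integer type, so let pandas downcast them.
df["count"] = pd.to_numeric(df["count"], downcast="integer")

# Repeated strings are stored once each when the column is categorical.
df["city"] = df["city"].astype("category")

after = df.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```

The savings depend on the data: downcasting helps only when the values fit a narrower type, and the category dtype pays off when a column has few distinct values relative to its length.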


A look at the most used package management system in Python

Image by Author

An in-depth article published in February 2020 by Sebastian Raschka et al. studies the role and importance of Python in the Machine Learning ecosystem. The paper, titled Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence, puts forward a fascinating observation which I’d like to quote here:

Historically, a wide range of different programming languages and environments have been used to enable machine learning research and application development. …
