Real-world data is messy, and we know it. Not only does such data require a lot of cleaning, but the format in which we receive it is often not suited for analysis either. This means that before the analysis even begins, the data has to undergo a series of transformations to get it into a suitable format, one that is easy to work with. This happens mostly when the data is either scraped from the web or provided in the form of documents. I came across a pretty similar dataset, which was in the…
“Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those that we believe to be there.” — John W Tukey
The importance and necessity of data visualization in data science cannot be emphasized enough. The adage that a picture is worth a thousand words applies to every stage of a data project's life cycle. However, the tools that enable these visualizations often aren’t intelligent enough. This essentially means that while we have hundreds of visualization libraries, most of…
In this series of interviews, I present the stories of established Data Scientists and Kaggle Grandmasters at H2O.ai, who share their journey, inspirations, and accomplishments. These interviews are intended to motivate and encourage others who want to understand what it takes to be a Kaggle Grandmaster.
In this interview, I shall be sharing my interaction with Fatih Öztürk. He is a Kaggle Competitions Grandmaster and a Data Scientist at H2O.ai. Fatih obtained a Bachelor’s in industrial engineering with honors at Boğaziçi University, Istanbul. He worked as a Data Scientist at UrbanStat before joining H2O.ai. Fatih joined Kaggle almost four…
Consider a situation where you have some sales data belonging to your company. Let’s say you wanted to discover a pattern in your consumers' spending capacity. If you could uncover distinct groups or associations within the data, your company could target the different groups based on their preferences. The basic idea behind this intuition is called clustering, and Tableau has a built-in feature that can automatically cluster similar data points based on certain attributes. In this article, we will explore this functionality of Tableau and see how we can apply the clustering method to a real-world dataset.
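To make the idea of clustering concrete outside of Tableau, here is a minimal sketch of k-means on one-dimensional data in plain Python. It is not Tableau's implementation; the function name kmeans_1d, the spend figures, and the starting centroids are all made up for illustration.

```python
from statistics import mean


def kmeans_1d(values, centroids, max_iter=100):
    """Cluster 1-D values with plain k-means, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical monthly spend per customer (made-up numbers).
spend = [12, 15, 14, 90, 95, 88, 300, 310]
centroids, clusters = kmeans_1d(spend, centroids=[0, 100, 400])
print(clusters)
```

With these made-up numbers, the customers fall into low, medium, and high spenders, which is the kind of grouping a company could then target separately.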
If you are just getting started in Data Science and looking for some cool datasets to play with, this might be the article for you. A lot of courses and books never really move beyond the classic Titanic and Iris datasets. Not that there is any harm in that, but these datasets have become so familiar that people even know the number of missing values or the number of string columns in them. Therefore, this article is a fresh chance to learn about some great datasets to tinker with.
…
If you are a Jupyter Notebook user, there must have been times when a particular cell took a long time to finish executing. This is particularly common during model training in machine learning, hyperparameter optimization, or other lengthy computations. In such cases, a browser notification that informs you once the process is finished could come in really handy. This way, you would be able to navigate to other tabs and only return to your machine learning experiment once you get that completion notification. Well, it turns out that there is a Jupyter extension that…
Datatable is a Python package for manipulating large dataframes. It has been created to provide big data support and enable high performance. This toolkit resembles pandas very closely but is more focused on speed. It supports out-of-memory datasets, multi-threaded data processing, and has a flexible API. In the past, we have written a couple of articles that explain in detail how to use datatable for reading, processing, and writing tabular datasets at incredible speed:
These two articles compare datatable’s performance with the pandas library on…
If you are a part of the Data Science ecosystem, you must have heard about Gartner’s Magic Quadrant (MQ). These MQs are a series of reports containing market research and analysis of several technology companies. They are among the most anticipated reports in this space. Here is the 2020 Magic Quadrant for Analytics and Business Intelligence platforms. You can clearly see the four distinct quadrants denoting four different categories.
Before PEP 498 was introduced, Python had three primary ways of formatting strings: %-formatting, str.format, and string.Template. In 2015, Eric V. Smith proposed a new string formatting mechanism known as Literal String Interpolation, which provided a simpler and more effective way of formatting strings. These strings are referred to as formatted string literals, or f-strings, as they are prefixed with 'f' or 'F'.
To begin, we’ll quickly look at the different string formatting styles in Python and how f-strings shine above the rest. …
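As a quick illustration of the four styles side by side, here is a minimal sketch that builds the same sentence with each of them. The variable names and values are made up for the example.

```python
from string import Template

# Made-up example values.
name, score = "Ada", 91.456

# 1. %-formatting
old_style = "%s scored %.1f" % (name, score)

# 2. str.format
format_style = "{} scored {:.1f}".format(name, score)

# 3. string.Template (has no format specs, so the number is rounded beforehand)
template_style = Template("$name scored $score").substitute(
    name=name, score=round(score, 1)
)

# 4. f-strings: the prefix may be a lowercase 'f' or an uppercase 'F'
f_style = f"{name} scored {score:.1f}"
F_style = F"{name} scored {score:.1f}"

print(f_style)
```

All five variables end up holding the same text; the f-string version is the only one where the variable names sit directly inside the string being formatted.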
In this series of interviews, I present the stories of established Data Scientists and Kaggle Grandmasters at H2O.ai, who share their journey, inspirations, and accomplishments. These interviews are intended to motivate and encourage others who want to understand what it takes to be a Kaggle Grandmaster.
In this interview, I shall be sharing my interaction with Philipp Singer, better known as Psi in the Kaggle world. He is a Kaggle Double Grandmaster and a Senior Data Scientist at H2O.ai. Philipp obtained his Ph.D. …