Digging For Golden Nuggets
5 min read
Everyone wants to rest easy knowing that they are making the right decision because their decisions are based on data and facts. The problem is, there is too much data for any one person to consume manually, and the problem is getting worse every day. Despite this issue, your bosses, clients, and end users still need to make decisions going forward. How can you help them? This is where filtering comes in. Filtering is what helps you to dig through the mountain of data to find the "diamond in the rough" that can later become a shiny golden nugget.
Filtering is a data transformation method to remove unnecessary or unwanted rows of data. It prevents unnecessary data from affecting your model results later on, or when you start summarizing with counts, sums, and averages. For example, you may need to remove the
nils from your dataset for a model to run properly.
Without filtering properly, you may produce irrelevant or wrong answers due to “dirty“ data. In the same way, most people don’t want a pile of dirt. It doesn’t help them. They need your ability to go dig efficiently to find where the gold lies. Filtering is the preferred tool in your Data Transformation tool-bag to help you find the gems your clients depend on.
Below we'll cover the following sections:
A good starting point would be to find all the flights from January 1st.
df |> DataFrame.filter_with( &(Series.equal(&1["month"], 1) |> Series.and(Series.equal(&1["day"], 1))) )
Let's break down the 3 key functions that are used above:
filter_with-> Filters the rows of the DataFrame based on columns and conditions you provide.
equal-> Searches for the value (
1) in the columns (
day) and return where the instances are true.
and-> Combines both
equalsearches to find all the instances where both the
monthcolumn == 1 and the
daycolumn == 1.
Comparisons are what you use to determine whether two items are equivalent or not. If so, the comparison will return
true, and if not, it will return
Explorer handles equality for you with the
equal function (as demonstrated above), so you don't need to worry about using
=== like you normally would.
For now, refer to the simple example below for a key point to be aware of when making comparisons.
:math.sqrt(2) |> :math.pow(2) == 2 # false
Notice how the result is
false. ButThe square root of 2 raised to the 2nd power should equal 2. Why is that? Looking at the first two operations will give you a hint.
:math.sqrt(2) |> :math.pow(2) # 2.0000000000000004
Although the answer should be 2, you can see that it isn't. This is due to computers having finite memory and needing to limit the number of decimals they can store. Thus, there is some error introduced due to rounding.
Notice how you have the same problem in the example below.
1 / 49 * 49 == 1 # false
Elixir does not have a
near function, like R does, to handle whether two numbers are approximately correct. One solution to this problem would be to use the
# Take the square root :math.sqrt(2) # Raise it to the power of 2 |> :math.pow(2) # Round the number to the nearest whole number |> Float.round() == 2 # true
The same is also true for the second problem
(1 / 49 * 49) |> Float.round() == 1 # true
Notice how, using the
round function, you were able to "undo" the effects of the rounding that the computer made when calculating the formula in the first place.
Logical operators can often be tricky and can get complex quickly when dealing with more than one. In the book R For Data Science, there is an excellent Venn diagram (shown below) to help you remember how and (&), or (|) and not (!) work together to create the right combination of two datasets.
To start, find all the flights that departed in November or December.
df |> DataFrame.filter_with( &(Series.equal(&1["month"], 11) |> Series.or(Series.equal(&1["month"], 12))) )
…but what if you also wanted to include October?
You can create a pipe within a pipe by combining the
or functions within the
Series module to solve this.
df |> DataFrame.filter_with( &(Series.equal(&1["month"], 11) |> Series.or(Series.equal(&1["month"], 12)) |> Series.or(Series.equal(&1["month"], 10))) )
Note: As of this writing, you cannot use the
*!*to create the opposite of the filter you are creating. Simply use the opposite function like
*less_equal*(<=) instead of
*greater*(>). Or, you could use
*greater_equal*(>=) instead of
*&&*will not work in place of
*Series.and*for combining conditions.
Next, try finding the flights that weren't delayed (arrival or departure) by more than 2 hours.
df |> DataFrame.filter_with( &(Series.less_equal(&1["arr_delay"], 120) |> Series.or(Series.less_equal(&1["dep_delay"], 120))) )
Note: 120 minutes is used in place of 2 hours since the
dep_delayare in minutes
Missing values occur in almost every dataset. Thus, you need to have a way to handle and work with them. Below are a couple of functions that you can use with and without DataFrames.
#N/A in R is equivalent to
nil in Elixir. The only difference is, in Elixir, you won't get an error when comparing
nil == nil # true
You can check for
x = nil is_nil(x) # true
Filter in Explorer does not automatically remove
nils from the dataset. You must explicitly do so if you want them removed.
Note: You can drop all rows with
nils by using the
df |> DataFrame.drop_nil()
Now, say you wanted to avoid removing every row with a
nil from your DataFrame, but simply wanted to remove all the
nils from a specific column. How would you go about that? This is where the
is_nil function comes in. It will find all the
nil rows in a column, so you can either keep or remove them.
With that, find all the missing
df |> DataFrame.filter_with(&Series.is_nil(&1["air_time"]))
Although you may have a feel for how filtering works in Explorer, you've just scratched the surface of what can be done with filtering from the examples above. If you'd like more examples, you can dig into the documentation.