Whittling Down Your Data

It is easy to get distracted by the sheer volume of data you will need to wade through. Most datasets include far more columns than you will end up needing for your analysis. A perfect example of this would be an id column. You would rarely want to include this column as part of your modeling effort.

Select gives you the ability to choose a subset of columns. Select works for columns in the same way Filter works for rows. This step becomes more useful when you are dealing with more than 100 columns or 10s of millions of rows. This has a twofold effect:

  1. Being efficient with your computer memory when working with larger datasets.

  2. Focusing on the information you need to answer the question driving your analysis.

How Can You Keep Columns?

Keeping any list of columns is as simple as using the select function. Start by keeping only the year, month, and day columns.

|> DataFrame.select(["year", "month", "day"])

Note: Selecting Single Column – You can pass the name of the column name (e.g., select("year")). Selecting Multiple Columns – You will need to pass a list of column names (e.g., select(\["year", "month", "day"\]))

How Can You Remove Columns?

One useful option for the select function is the keep_or_drop parameter. By default, this parameter is set to :keep, but you can use :drop to remove the columns specified as shown in the example below.

Notice how the columns' year, month, and day do not show up in the results.

|> DataFrame.select(["year", "month", "day"], :drop)

How Can You Select Adjacent Columns?

As of the time of this writing, it is not possible to select multiple adjacent columns with the names of the first and last columns like R does (e.g., year:day). One workaround is to specify the numbers for the first and last column that you want, as shown below.

|> DataFrame.select(0..2)

...and you can also :drop to remove adjacent columns as well.

|> DataFrame.select(0..2, :drop)

How Can You Select With String Functions?

In addition to lists (\["year", "month", "day"\]) and ranges (0..2), you can use standard Elixir String functions for more flexible ways to select columns as shown in the examples below.

  1. String.starts_with searches all the column names to see which ones exactly begin with the letters you supply. Below will find all the column names that start with the letter "d".
|> DataFrame.select(&String.starts_with?(&1, "d"))

2. String.ends_with searches all the column names to see which ones exactly end with the letters you supply. Below will find all the columns names that end with the letter "e".

|> DataFrame.select(&String.ends_with?(&1, "e"))

3. String.contains searches all the column names to see which ones contain anywhere the letters you supply. Below will find any column name that contains the letter "a".

|> DataFrame.select(&String.contains?(&1, "a"))

4. String.match matches the column names based on the regular expression you provide. Below, will show you all the columns that have _arr in their name.

|> DataFrame.select(&String.match?(&1, ~r/_[arr]/))

As you can see, there is a lot of flexibility that Elixir natively gives you for selecting columns. But there's even more options that what you've seen above. Refer to check out the Elixir String docs to see what else you can use.