It is easy to get distracted by the sheer volume of data you will need to wade through. Most datasets include far more columns than you will end up needing for your analysis. A perfect example of this would be an id
column. You would rarely want to include this column as part of your modeling effort.
Select
gives you the ability to choose a subset of columns. Select
works for columns in the same way Filter
works for rows. This step becomes more useful when you are dealing with more than 100 columns or 10s of millions of rows. This has a twofold effect:
Being efficient with your computer memory when working with larger datasets.
Focusing on the information you need to answer the question driving your analysis.
How Can You Keep Columns?
Keeping any list of columns is as simple as using the select
function. Start by keeping only the year
, month
, and day
columns.
df
|> DataFrame.select(["year", "month", "day"])
Note: Selecting Single Column – You can pass the name of the column name (e.g.,
select("year")
). Selecting Multiple Columns – You will need to pass a list of column names (e.g.,select(\["year", "month", "day"\])
)
How Can You Remove Columns?
One useful option for the select
function is the keep_or_drop
parameter. By default, this parameter is set to :keep
, but you can use :drop
to remove the columns specified as shown in the example below.
Notice how the columns' year
, month
, and day
do not show up in the results.
df
|> DataFrame.select(["year", "month", "day"], :drop)
How Can You Select Adjacent Columns?
As of the time of this writing, it is not possible to select multiple adjacent columns with the names of the first and last columns like R does (e.g., year:day
). One workaround is to specify the numbers for the first and last column that you want, as shown below.
df
|> DataFrame.select(0..2)
...and you can also :drop
to remove adjacent columns as well.
df
|> DataFrame.select(0..2, :drop)
How Can You Select With String Functions?
In addition to lists (\["year", "month", "day"\]
) and ranges (0..2
), you can use standard Elixir String functions for more flexible ways to select
columns as shown in the examples below.
String.starts_with
searches all the column names to see which ones exactly begin with the letters you supply. Below will find all the column names that start with the letter "d".
df
|> DataFrame.select(&String.starts_with?(&1, "d"))
2. String.ends_with
searches all the column names to see which ones exactly end with the letters you supply. Below will find all the columns names that end with the letter "e".
df
|> DataFrame.select(&String.ends_with?(&1, "e"))
3. String.contains
searches all the column names to see which ones contain anywhere the letters you supply. Below will find any column name that contains the letter "a".
df
|> DataFrame.select(&String.contains?(&1, "a"))
4. String.match
matches the column names based on the regular expression you provide. Below, will show you all the columns that have _arr
in their name.
df
|> DataFrame.select(&String.match?(&1, ~r/_[arr]/))
As you can see, there is a lot of flexibility that Elixir natively gives you for selecting columns. But there's even more options that what you've seen above. Refer to check out the Elixir String docs to see what else you can use.