Organizing Your Data
Many people live by the adage that "cleanliness is next to godliness" and find that it makes life much less stressful. Data is a lot like the stuff in our lives: it becomes messy once it's out of the box and has been played with for a while.
Organizing your information is an important piece of the Data Transformation workflow, and it takes effort to keep your data tidy. Being organized is not just good practice; it will also make you more productive as you move further into your analysis.
Organized data is essential for building models that forecast from time series data, such as Autoregressive Integrated Moving Average (ARIMA). Such a model's output would be worthless if its input were not ordered by date.
There are three primary scenarios in which you will want to sort:
Date → Show by either the newest or oldest data
Number → Show by either the least or greatest number
Text → Show in alphabetical (A-Z) or reverse alphabetical (Z-A) order
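The three scenarios above can be sketched with plain Elixir lists before any dataframe is involved. This is only an illustration using the standard library's Enum.sort; the lists and their values are made up and stand in for dataframe columns.

```elixir
# Illustrative values only; each list stands in for one dataframe column.
dates = [~D[2013-03-15], ~D[2013-12-31], ~D[2013-01-01]]
delays = [12, -3, 45]
carriers = ["UA", "AA", "DL"]

# Date: pass the Date module so dates are compared as calendar dates.
newest_first = Enum.sort(dates, {:desc, Date})
oldest_first = Enum.sort(dates, Date)

# Number: least to greatest, or greatest to least.
least_first = Enum.sort(delays)
greatest_first = Enum.sort(delays, :desc)

# Text: alphabetical (A-Z) or reverse alphabetical (Z-A).
a_to_z = Enum.sort(carriers)
z_to_a = Enum.sort(carriers, :desc)

IO.inspect(newest_first, label: "newest first")
IO.inspect(greatest_first, label: "greatest first")
IO.inspect(a_to_z, label: "A-Z")
```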
Arranging is how you sort the rows of your dataset into an order you choose. To accomplish this, you'll be using the arrange function. In it, you pass each column you want to sort on along with the direction you would like to sort it in (i.e., descending or ascending). The dataset is sorted by those columns in the order you provide them to the function.
For example, sort the flights dataset based on month and day in descending order.
df |> DataFrame.arrange(desc: "month", desc: "day")
Notice how the dataframe was sorted starting with December 31st and working its way backwards. You could just as easily have shown the data starting with January 1st by using asc instead of desc.
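To see why the order of the sort columns matters, here is a minimal sketch in plain Elixir. It does not use the dataframe library at all: the rows are hypothetical maps standing in for the flights dataset, and Enum.sort_by with a tuple key plays the role of a multi-column sort, since tuples compare element by element from left to right.

```elixir
# Hypothetical rows standing in for the flights dataset.
rows = [
  %{month: 1, day: 1},
  %{month: 12, day: 30},
  %{month: 1, day: 31},
  %{month: 12, day: 31}
]

# Month first, then day, both descending.
by_month_then_day = Enum.sort_by(rows, &{&1.month, &1.day}, :desc)

# Swapping the key order changes the result: day now takes priority.
by_day_then_month = Enum.sort_by(rows, &{&1.day, &1.month}, :desc)

IO.inspect(by_month_then_day, label: "month, then day")
IO.inspect(by_day_then_month, label: "day, then month")
```

With month first, the rows run from December 31st back to January 1st. With day first, January 31st moves ahead of December 30th because the day is compared before the month.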
All missing values (nil) will be sorted at the beginning of the column. You can see this in the example below, which sorts the dep_time column in ascending order.
df |> DataFrame.arrange(asc: "dep_time")
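Where missing values land depends on the tool doing the sorting, so it is worth checking rather than assuming. As a point of comparison, the sketch below uses only plain Elixir lists (not the dataframe library) with made-up departure times. Elixir's default term ordering places nil after numbers, so reproducing a nil-first order like the one described above takes an explicit sort key.

```elixir
# Hypothetical departure times; nil marks a missing value.
dep_times = [533, nil, 517, nil, 542]

# Enum.sort/1 follows Erlang term ordering, where atoms such as nil
# sort after numbers, so missing values end up last.
default_order = Enum.sort(dep_times)

# To place nils first, sort on a {present?, value} tuple:
# false < true, so entries without a value come first.
nils_first = Enum.sort_by(dep_times, &{not is_nil(&1), &1})

IO.inspect(default_order, label: "default")
IO.inspect(nils_first, label: "nils first")
```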
arrange is a simple function with a lot of organizational power. You can come up with some pretty complex ways of sorting by getting clever with the columns you provide and the order in which they are provided. You will find this little function indispensable on your Data Science journey.