If you’ve ever worked on data, reporting, or analytics, chances are you’ve already heard of BigQuery, Amplitude, Google Analytics, Snowflake, Redshift, and a few other data products. You may also have come across Postgres, MySQL, or Oracle while developing applications at some point.
At their core, all these products excel at managing large amounts of data and reporting on it. Have you ever wondered how these technologies can answer complex business reporting queries in a matter of seconds or less? On the surface, every dataset can be represented as a table, but the underlying storage may be entirely different. This post provides a brief overview of the two major data storage formats used by these database systems: Row Oriented and Column Oriented Storage.
The name sounds fancy, but row-oriented storage is what you’ll find in most relational databases: MySQL, PostgreSQL, and many other database systems. They store a table’s records as rows, where each row is a contiguous collection of the columns defined by the schema. These rows are what you query when accessing and manipulating the underlying data.
Let’s take an example to quickly understand what storage and queries look like in a row-oriented system.
The diagram above shows a simple Employee table with three columns: employee_id, salary, and title.
A row storage layout stores each row as a contiguous sequence of bytes on disk, with all of a row’s columns stored next to each other. To read a particular column within a row, the database must load the entire row, seek to the specific column’s offset, and then look up the column value.
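To make this concrete, here is a minimal sketch of a fixed-width row layout in Python’s `struct` module. The layout (4-byte id, 4-byte salary, 24-byte title, for 32 bytes per row) and the sample employees are my own illustrative assumptions matching the numbers used in this post, not the actual on-disk format of any particular database.

```python
import struct

# Hypothetical fixed-width layout: 4-byte id, 4-byte salary, 24-byte
# title -> 32 bytes per row, matching the numbers used in this post.
ROW_FMT = "<ii24s"
ROW_SIZE = struct.calcsize(ROW_FMT)  # 32 bytes

# The whole table is one byte string: rows packed back to back.
table = b"".join(
    struct.pack(ROW_FMT, emp_id, salary, title.encode())
    for emp_id, salary, title in [
        (1, 90000, "Manager"),
        (2, 120000, "Director"),
        (3, 65000, "Engineer"),
    ]
)

def read_title(table_bytes, row_index):
    # To reach one column we still load the whole 32-byte row first.
    row = table_bytes[row_index * ROW_SIZE:(row_index + 1) * ROW_SIZE]
    _, _, raw_title = struct.unpack(ROW_FMT, row)
    return raw_title.rstrip(b"\x00").decode()

print(read_title(table, 1))  # Director
```

Note how `read_title` has to slice out all 32 bytes of the row even though it only cares about the last column; that is the key property of row layouts.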
Considering this, can you figure out how a database would look up an employee with the title ‘Executive Administrator’?
Since we don’t have an index, our database must perform a full scan starting at the first row. It loads the entire row into memory, reads the value of the last column, and checks whether the text matches our query, repeating this for every row. Since no employee with this title exists in our table, the query returns no rows.
If we quantify the amount of data read by our database during this query, it would be 32 bytes per row * 3 rows = 96 bytes. Now imagine a real-world database where we scan a column without an index: it could be hundreds of megabytes read for every query across millions of rows.
Now consider another scenario: find the average salary of employees with the title ‘Director’ or ‘Manager’. Here too, we start our search at the first row. We first check whether the title matches ‘Manager’ or ‘Director’; if it does, we add the salary to a running sum and increment a count of matched rows. We repeat the same operations for the second and third rows. At the end, we compute the average by dividing the sum by the count. And we have our average.
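The scan just described can be sketched in a few lines. The rows here are hypothetical sample data; a real database would be reading packed bytes from disk, but the control flow is the same.

```python
# A minimal sketch of the full scan described above: visit every row,
# filter on title, accumulate a running sum and match count, then divide.
rows = [
    {"employee_id": 1, "salary": 90000, "title": "Manager"},
    {"employee_id": 2, "salary": 120000, "title": "Director"},
    {"employee_id": 3, "salary": 65000, "title": "Engineer"},
]

total, count = 0, 0
for row in rows:                      # full scan: every row is loaded
    if row["title"] in ("Manager", "Director"):
        total += row["salary"]        # salary is already in memory
        count += 1

average = total / count if count else None
print(average)  # 105000.0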
An important thing to note here is that every time we found a match, the salary column value was already loaded in memory. That is one of the benefits of row-oriented storage: once a row is loaded, we can access and modify any column within it. By reading 32 bytes, all columns of a row become accessible. Whether it’s a **select *** or a query selecting a single column, the disk read cost is the same. (Of course, I'm discounting memory, network, and bandwidth costs.)
For the purposes of this post, and logically, a row storage layout always has to access and scan the entire row on disk, no matter how many columns are present in the query. This is one of the reasons why row-based storage is very good at transactional processing: an entire row can be locked, and certain guarantees can be provided for the transaction.
Please note that there may be query optimizations in modern databases that I’m not covering. When I say scan, I strictly mean the disk access that allows the database to physically read and filter data. Disk access patterns are somewhat different from memory access patterns.
To further complicate things, let’s look at another dataset and some queries. Assume the dataset below has 100,000 rows and the 12 columns shown, and that each row is around 250 bytes on average.
Egypt Home Prices ( Borrowed from Kaggle )
Here are the questions we want to answer.
Let’s walk through these queries independently. For the first one, since it’s a sum operation without any filter, our database goes through each row, reads the price column, and keeps summing. In terms of disk cost, it scans 100k rows at 250 bytes each, which comes to approximately 25 MB of data scanned just to compute the sum.
A similar computation happens for the second query: our database scans all rows and picks the maximum value for Bedrooms. That’s another 25 MB of disk data scanned.
If we wanted to get the result of both the queries in one single shot, it would still be 25 MB.
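The disk-cost arithmetic above is simple enough to write down directly, using the row count and average row size assumed earlier in this post:

```python
ROWS = 100_000
BYTES_PER_ROW = 250  # average row size assumed in the text

# Each aggregate forces a scan of every full row.
bytes_scanned = ROWS * BYTES_PER_ROW
print(bytes_scanned)  # 25000000, i.e. ~25 MB

# Computing both aggregates in a single pass still reads each row
# exactly once, so the disk cost stays at ~25 MB.
```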
Absolutely! This is where columnar storage comes into the picture. By changing the way the underlying data is stored, columnar storage aims to speed up queries massively. Let’s take the same examples and see how columnar data is stored and queried.
Imagine organizing the housing table so that each column is an entirely separate file. On disk, instead of storing all columns together in a single row, you split up the table and store each column as its own independent file.
For the Egypt house prices example, that means we would now have 12 files on disk, and the contents of each file are just the column values (as shown below).
Now, let’s try to get the sum of all prices. We know we have 100k rows, and each price value may be 4 bytes long (assuming an integer). To calculate the sum across all rows, we would just read this entire file and sum up the data. That means we scan only about 400 KB of data, way less than what we had before. Imagine the speed of the query if we organized our columns like this and our data size grew 20 times: that would still involve fewer bytes scanned than in the row-oriented case.

This setup and storage layout is especially useful when analytics queries touch only one or a few columns at a time. And analytics data, by its very nature, can span petabytes worth of data points. At its core, this is what columnar storage is all about. Let’s look at our initial example again to see the organization.
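Here is a sketch of that layout, with each column modeled as its own in-memory list standing in for a file. The column values are hypothetical samples; the point is that each query touches only the one column it needs.

```python
# Sketch: each column lives in its own "file" (a list here). Summing
# prices touches only the price column, not the other 11 columns.
columns = {
    "price":    [2_500_000, 1_800_000, 3_200_000],  # one entry per row
    "bedrooms": [3, 2, 4],
    # ...the 10 remaining columns would live in their own files
}

total_price = sum(columns["price"])       # scans just one column file
max_bedrooms = max(columns["bedrooms"])   # likewise for the second query

# With 100k rows and 4-byte integers, the price scan is only ~400 KB:
bytes_scanned = 100_000 * 4
print(total_price, max_bedrooms, bytes_scanned)
```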
Comparing the two, this diagram simply transposes the previous table.
A columnar storage format stores all values of a column together as a single record: every value of a given column is laid out contiguously on disk. Hence, a “row” in columnar storage represents all the values of one column.
The benefits of column-oriented storage mostly show up in analytics and aggregation queries. These typically involve filtering, computations, and aggregates on a single column or a small set of columns. Scanning all records of a single column is extremely fast in columnar storage.
Another reason queries and aggregates run faster in columnar storage is that all entries in a column share the same data type, so performing operations on them is extremely efficient. Optimizations such as SIMD can speed up queries significantly in columnar data storage systems.
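A rough way to see the “one type per column” idea in Python is the standard-library `array` module, which packs same-typed values into a tight, fixed-width buffer. This is only an analogy for the kind of layout that vectorized (SIMD-style) engines stream through; the salary values are hypothetical.

```python
from array import array

# Homogeneous, fixed-width storage: every value in the column shares one
# type, so it can be packed into a tight typed buffer instead of a list
# of boxed objects of mixed types.
salaries = array("i", [90000, 120000, 65000])  # 4-byte signed ints

print(salaries.itemsize)  # bytes per value (4 on typical platforms)
print(sum(salaries))      # 275000
```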
It’s not just compute and aggregates that columnar storage helps with; it also results in better compression, for the same reason. In columnar storage, the values within a column share a data type and are often repeating or otherwise coherent, which gives us an edge in compressing the data. The gains can be huge when dealing with hundreds of gigabytes of data.
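Run-length encoding is one simple scheme that benefits from this layout: repeated adjacent values in a column collapse into (value, count) pairs. The city column below is a hypothetical example; real columnar formats combine several encodings, but the principle is the same.

```python
from itertools import groupby

# Run-length encode a column: adjacent repeats become (value, count).
city_column = ["Cairo", "Cairo", "Cairo", "Giza", "Giza", "Cairo"]

encoded = [(value, len(list(run))) for value, run in groupby(city_column)]
print(encoded)  # [('Cairo', 3), ('Giza', 2), ('Cairo', 1)]

# Decoding restores the original column exactly.
decoded = [value for value, n in encoded for _ in range(n)]
assert decoded == city_column
```

In a row layout, these city values would be interleaved with prices and bedroom counts, so runs like this rarely form.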
It may seem quite simple to use columnar storage for analytics, but it has certain limitations as well. Since each column is now stored independently, we lose the transactional benefits that row storage gives us, where an entire row can be locked and updated in place.
Additionally, the cost of scanning in a columnar database is proportional to the number of columns touched. If your table has 200+ columns, the cost of reconstructing an entire row of 200 columns for a select * is extremely high compared to row-oriented storage. One should always avoid a select * on a columnar database unless the data has very few columns.
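The reconstruction cost is easy to see in the column-per-file sketch from earlier: assembling one full row means one lookup in every column file. The columns and values here are hypothetical.

```python
# Reconstructing one full row from a columnar layout touches every
# column file; the cost grows with the number of columns.
columns = {
    "employee_id": [1, 2, 3],
    "salary":      [90000, 120000, 65000],
    "title":       ["Manager", "Director", "Engineer"],
}

def select_star(cols, row_index):
    # One read per column: with 200 columns, that's 200 lookups per row.
    return {name: values[row_index] for name, values in cols.items()}

print(select_star(columns, 1))
```

With 3 columns this is harmless; with 200 column files on disk, each row assembled this way turns into hundreds of scattered reads.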
This post tried to give an understanding of how data is organized in both row- and column-oriented databases. Given that these two storage techniques bring performance gains for different use cases, it’s important to model data to match the queries you expect. Whether you choose a row-oriented or a column-oriented database, understanding your query patterns and settling on them helps us model data better.
For instance, with row-oriented storage, if I'm given the task of creating a reporting summary without the use of a columnar database (and some other considerations), I may even choose to maintain a separate table with only the required dimensions and aggregate over that instead of doing everything on the base table. That can save a considerable amount of resources when the base table has hundreds of columns versus an ad hoc table with just a couple of dimensions.
On the other hand, the beauty of columnar data stores is that no matter what scale of data we look at, the optimization and gains for reporting can be quite significant. Therefore, it is the preferred choice for a lot of big data systems. Every provider may advertise its own distinctive features, but almost all big data analytics warehouses and systems have columnar storage at their core.
Now that you know what columnar storage is, my next set of articles will focus on understanding Apache Parquet, a popular columnar storage format, and its implementation.