Dremel made simple with Parquet

Dremel made simple with Parquet
By @J_Wednesday, 11 September 2013 [

](https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html#)[

](https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html#)[

](https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html#)[

](https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html#)

Columnar storage is a popular technique to optimize analytical workloads in parallel RDBMs. The performance and compression benefits for storing and processing large amounts of data are well documented in academic literature as well as several commercial analytical databases.
The goal is to keep I/O to a minimum by reading from a disk only the data required for the query. Using Parquet at Twitter, we experienced a reduction in size by one third on our large datasets. Scan times were also reduced to a fraction of the original in the common case of needing only a subset of the columns. The principle is quite simple: instead of a traditional row layout, the data is written one column at a time. While turning rows into columns is straightforward given a flat schema, it is more challenging when dealing with nested data structures.
We recently introduced Parquet, an open source file format for Hadoop that provides columnar storage. Initially a joint effort between Twitter and Cloudera, it now has many other contributors including companies like Criteo. Parquet stores nested data structures in a flat columnar format using a technique outlined in the Dremel paper from Google. Having implemented this model based on the paper, we decided to provide a more accessible explanation. We will first describe the general model used to represent nested data structures. Then we will explain how this model can be represented as a flat list of columns. Finally we’ll discuss why this representation is effective.
To illustrate what columnar storage is all about, here is an example with three columns.

Dremel made simple with Parquet
Dremel made simple with Parquet

In a row-oriented storage, the data is laid out one row at a time as follows:
Dremel made simple with Parquet
Dremel made simple with Parquet

Whereas in a column-oriented storage, it is laid out one column at a time:
Dremel made simple with Parquet
Dremel made simple with Parquet

There are several advantages to columnar formats.
Organizing by column allows for better compression, as data is more homogenous. The space savings are very noticeable at the scale of a Hadoop cluster.
I/O will be reduced as we can efficiently scan only a subset of the columns while reading the data. Better compression also reduces the bandwidth required to read the input.
As we store data of the same type in each column, we can use encodings better suited to the modern processors’ pipeline by making instruction branching more predictable.

The model
To store in a columnar format we first need to describe the data structures using a schema. This is done using a model similar to Protocol buffers. This model is minimalistic in that it represents nesting using groups of fields and repetition using repeated fields. There is no need for any other complex types like Maps, List or Sets as they all can be mapped to a combination of repeated fields and groups.
The root of the schema is a group of fields called a message. Each field has three attributes: a repetition, a type and a name. The type of a field is either a group or a primitive type (e.g., int, float, boolean, string) and the repetition can be one of the three following cases:
required: exactly one occurrence
optional: 0 or 1 occurrence
repeated: 0 or more occurrences

For example, here’s a schema one might use for an address book:
message AddressBook {required string owner;repeated string ownerPhoneNumbers;repeated group contacts {required string name;optional string phoneNumber;}}
Lists (or Sets) can be represented by a repeating field.

Dremel made simple with Parquet
Dremel made simple with Parquet

A Map is equivalent to a repeating field containing groups of key-value pairs where the key is required.
Dremel made simple with Parquet
Dremel made simple with Parquet

Columnar format
A columnar format provides more efficient encoding and decoding by storing together values of the same primitive type. To store nested data structures in columnar format, we need to map the schema to a list of columns in a way that we can write records to flat columns and read them back to their original nested data structure. In Parquet, we create one column per primitive type field in the schema. If we represent the schema as a tree, the primitive types are the leaves of this tree.
AddressBook example as a tree:
Dremel made simple with Parquet
Dremel made simple with Parquet

To represent the data in columnar format we create one column per primitive type cell shown in blue.
Dremel made simple with Parquet
Dremel made simple with Parquet

The structure of the record is captured for each value by two integers called repetition level and definition level. Using definition and repetition levels, we can fully reconstruct the nested structures. This will be explained in detail below.
Definition levels
To support nested records we need to store the level for which the field is null. This is what the definition level is for: from 0 at the root of the schema up to the maximum level for this column. When a field is defined then all its parents are defined too, but when it is null we need to record the level at which it started being null to be able to reconstruct the record.
In a flat schema, an optional field is encoded on a single bit using 0 for null and 1 for defined. In a nested schema, we use an additional value for each level of nesting (as shown in the example), finally if a field is required it does not need a definition level.
For example, consider the simple nested schema below:
message ExampleDefinitionLevel {optional group a {optional group b {optional string c;}}}
It contains one column: a.b.c where all fields are optional and can be null. When c is defined, then necessarily a and b are defined too, but when c is null, we need to save the level of the null value. There are 3 nested optional fields so the maximum definition level is 3.
Here is the definition level for each of the following cases:
Dremel made simple with Parquet
Dremel made simple with Parquet

Dremel made simple with Parquet
Dremel made simple with Parquet

The maximum possible definition level is 3, which indicates that the value is defined. Values 0 to 2 indicate at which level the null field occurs.
A required field is always defined and does not need a definition level. Let’s reuse the same example with the field b now required:
message ExampleDefinitionLevel {optional group a {required group b {optional string c;}}}
The maximum definition level is now 2 as b does not need one. The value of the definition level for the fields below b changes as follows:
Dremel made simple with Parquet
Dremel made simple with Parquet

Making definition levels small is important as the goal is to store the levels in as few bits as possible.
Repetition levels
To support repeated fields we need to store when new lists are starting in a column of values. This is what repetition level is for: it is the level at which we have to create a new list for the current value. In other words, the repetition level can be seen as a marker of when to start a new list and at which level. For example consider the following representation of a list of lists of strings:
Dremel made simple with Parquet
Dremel made simple with Parquet

The column will contain the following repetition levels and values:
Dremel made simple with Parquet
Dremel made simple with Parquet

The repetition level marks the beginning of lists and can be interpreted as follows:
0 marks every new record and implies creating a new level1 and level2 list
1 marks every new level1 list and implies creating a new level2 list as well.
2 marks every new element in a level2 list.

On the following diagram we can visually see that it is the level of nesting at which we insert records:

Dremel made simple with Parquet
Dremel made simple with Parquet

A repetition level of 0 marks the beginning of a new record. In a flat schema there is no repetition and the repetition level is always 0. Only levels that are repeated need a Repetition level: optional or required fields are never repeated and can be skipped while attributing repetition levels.
Striping and assembly
Now using the two notions together, let’s consider the AddressBook example again. This table shows the maximum repetition and definition levels for each column with explanations on why they are smaller than the depth of the column:
Dremel made simple with Parquet
Dremel made simple with Parquet

In particular for the column contacts.phoneNumber, a defined phone number will have the maximum definition level of 2, and a contact without phone number will have a definition level of 1. In the case where contacts are absent, it will be 0.
AddressBook {owner: "Julien Le Dem",ownerPhoneNumbers: "555 123 4567",ownerPhoneNumbers: "555 666 1337",contacts: {name: "Dmitriy Ryaboy",phoneNumber: "555 987 6543",},contacts: {name: "Chris Aniszczyk"}}AddressBook {owner: "A. Nonymous"}
We’ll now focus on the column contacts.phoneNumber to illustrate this.
Once projected the record has the following structure:
AddressBook {contacts: {phoneNumber: "555 987 6543"}contacts: {}}AddressBook {}
The data in the column will be as follows (R = Repetition Level, D = Definition Level)
Dremel made simple with Parquet
Dremel made simple with Parquet

To write the column we iterate through the record data for this column:
contacts.phoneNumber: “555 987 6543”new record: R = 0
value is defined: D = maximum (2)

contacts.phoneNumber: nullrepeated contacts: R = 1
only defined up to contacts: D = 1

contacts: nullnew record: R = 0
only defined up to AddressBook: D = 0

The columns contains the following data:

Dremel made simple with Parquet
Dremel made simple with Parquet

Note that NULL values are represented here for clarity but are not stored at all. A definition level strictly lower than the maximum (here 2) indicates a NULL value.
To reconstruct the records from the column, we iterate through the column:
R=0, D=2, Value = “555 987 6543”:R = 0 means a new record. We recreate the nested records from the root until the definition level (here 2)
D = 2 which is the maximum. The value is defined and is inserted.

R=1, D=1:R = 1 means a new entry in the contacts list at level 1.
D = 1 means contacts is defined but not phoneNumber, so we just create an empty contacts.

R=0, D=0:R = 0 means a new record. we create the nested records from the root until the definition level
D = 0 => contacts is actually null, so we only have an empty AddressBook

Storing definition levels and repetition levels efficiently
In regards to storage, this effectively boils down to creating three sub columns for each primitive type. However, the overhead for storing these sub columns is low thanks to the columnar representation. That’s because levels are bound by the depth of the schema and can be stored efficiently using only a few bits per value (A single bit stores levels up to 1, 2 bits store levels up to 3, 3 bits can store 7 levels of nesting). In the address book example above, the column owner has a depth of one and the column contacts.name has a depth of two. The levels will always have zero as a lower bound and the depth of the column as an upper bound. Even better, fields that are not repeated do not need a repetition level and required fields do not need a definition level, bringing down the upper bound.
In the special case of a flat schema with all fields required (equivalent of NOT NULL in SQL), the repetition levels and definition levels are omitted completely (they would always be zero) and we only store the values of the columns. This is effectively the same representation we would choose if we had to support only flat tables.
These characteristics make for a very compact representation of nesting that can be efficiently encoded using a combination of Run Length Encoding and bit packing. A sparse column with a lot of null values will compress to almost nothing, similarly an optional column which is actually always set will cost very little overhead to store millions of 1s. In practice, space occupied by levels is negligible. This representation is a generalization of how we would represent the simple case of a flat schema: writing all values of a column sequentially and using a bitfield for storing nulls when a field is optional.
Get Involved
Parquet is still a young project; to learn more about the project see our README or look for the “pick me up!” label on GitHub. We do our best to review pull requests in a timely manner and give thorough and constructive reviews.
You can also join our mailing list and tweet at @ApacheParquet to join the discussion.

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 215,923评论 6 498
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 92,154评论 3 392
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 161,775评论 0 351
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,960评论 1 290
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,976评论 6 388
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,972评论 1 295
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,893评论 3 416
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,709评论 0 271
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,159评论 1 308
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,400评论 2 331
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,552评论 1 346
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,265评论 5 341
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,876评论 3 325
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,528评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,701评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,552评论 2 368
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,451评论 2 352

推荐阅读更多精彩内容