How to use the QUERY function in Google Sheets

1. QUERY Function

1.1 Data Manipulation

         Data manipulation involves processes of converting raw data sets to output data where we can easily derive meaningful interpretations and analyses. It can be as simple as rearranging the structure of the data to filtering and summarizing the data (e.g., average, sum, etc.). Aside from these concerns, there’s also a need to automate these processes for convenience, and there’s no better function than Google Sheets’ Query Function

1.2 What does the QUERY do in Google Sheets?

         The QUERY function can do a wide range of data manipulation processes: summarize, sort, filter, organize, extract and format. Aside from these, it can also perform operations and other functions such as SUM, AVERAGE and so on .  It can do all of these processes with just one line, which makes it very powerful and arguably the most useful function in Google Sheets. It’s important to emphasize here the extraction of data because the QUERY function basically extracts the data to a new location. 

The processes are triggered by syntaxes, which are called clauses. Some of the properties and rules of these clauses are interrelated. This makes the QUERY function a bit complicated. We’ll discuss each of these in more detail in the succeeding sections. 

1.3 QUERY Syntax

Below is the syntax for QUERY:

There are three inputs in the QUERY function: data, query and headers. The first input is data, which is the data set that we want to manipulate. The query is the set of clauses or syntaxes that will trigger the processes that we want to perform in the data set. Here’s the list of the clauses (Fig. 1.1):

Figure 1.1 List of Clauses

It’s important to note here that these 10 clauses must follow the above hierarchical structure when they are used together. Another important property of the query input is that it can be omitted from the function. It’s not evident from the syntax above, but it behaves like an optional input. If we omit it from the formula, the QUERY function will just extract the whole data set (Fig. 1.2). Notice that the format of the source data doesn’t carry over to the extracted data.

Figure 1.2 Query with Data only as Input. The data set from A1:C9 is extracted to E1:G9.

The last input is the headers. It is an optional input that defines the header row. The default input is -1, which means that the function will guess the headers of the extracted data based on the source data. This last input is actually a bit confusing, but this will be cleared in the succeeding sections. Let’s just settle first with it’s normal usage, which is omitting it from the formula.

Let’s look at an example to help us build an intuition on how the function works (Fig. 1.3). Let’s define the inputs first.The data is the data set from A1:C9. The function utilizes two of the query clauses: select and where. The headers are omitted from the formula, and therefore, it will take its default value, which is -1. Now, let’s try to make sense of what’s happening with the data. The select clause has two basic functions: filtering and organizing data. As seen in the formula below,  the select clause is followed by the column IDs (i.e., A and C), which correspond to the column headers in the spreadsheet. The data set is filtered based on the defined column IDs. In the case below, only Column A and Column C are extracted. Another thing is the arrangement. The order of column IDs dictates the arrangement of the output  data. Therefore, Column A will be the first column and will be followed by Column C. 

Let’s move on to the next clause, which is the where clause. Notice the space after the column IDs and before the where the clause. The space is the separator between clauses. The where clause filters the rows based on a condition, which in this case is C>150. Therefore, the whole data set is filtered by applying the criterion (>150) in Column C. The extracted data set is shown in E1:F6. 

In summary, the query input defines the data manipulation processes. We can see some important elements that are necessary for each of the clauses such as the column IDs, space separator, conditions and the data types. We’ll discuss each of these in the following sections. 

Figure 1.3 Basic example of QUERY Process

  1. Breaking Down the Formula

Figure 2.1 QUERY Function

In this section, we’ll talk about the rules and criteria for each QUERY input, how to write them, how QUERY processes these inputs and how the inputs affect the output. We’ll build an intuition to make it easier for us to visualize and create an QUERY formula for your problems.

2.1 QUERY Inputs/Arguments

2.1.1 Data

Here are the important rules that you need to consider for the data:

2.1.1.1 Syntax and Location of Data

  1. Cell References

Typically, our data set is stored in a sheet, and we want to be able to use this set of data in our formulas. In order to do that, we write the references of the cells where the data set is located. In the formula above, the reference A1:C9 means that the values start from column A, row 1 to column C, row 9. The reference defines the address of the cell, and the formula uses the reference to access the value within the referenced cell. 

The main advantage of referencing the cells is that the formula becomes dynamic. We can change the values within the referenced cells, and the formula will update automatically in response to the changes.

We can write the reference in the formula or highlight the cell. If we perform the latter, the spreadsheet automatically writes the references for us (Fig. 2.2), so we don’t have to write the references manually.

Figure 2.2 Referencing cells by highlighting them

Also, it’s important to know that we can refer to the cells from any sheet. This means that the cells can be:

  1. in the same sheet with the function
  2. in  another sheet within the same workbook
  3. Or in another sheet in another workbook (using IMPORTRANGE function)
  1. Hard-Coded Values:

The values can be hard-coded as constants, but there’s not much use to this method. 

=QUERY({1,2,3,4,5})

  1. Arrays from other Functions

The data input can also be an array from another function. The most common would be from the result of an IMPORTRANGE formula. There are scenarios where the data will be from a different workbook. The IMPORTRANGE function can only extract the data but can not filter, sort or perform other data manipulation processes. Notice the syntax of the column IDs. If the data input is an array, then the syntax for the column ID must be “Col” + Column Index (e.g., Col1, Col2, Col3 and so on). This syntax is case-sensitive. 

=QUERY(IMPORTRANGE(“https://docs.google.com/spreadsheets/d/1I942QipA0qwijVWMFqyaiEMx0A-LiW1OL6T91y_-bHg”,”Data!A1:C9″),”SELECT Col1, Col3 WHERE Col3>150″)

For now, we can think of an array as a list of values that are not located in any sheet. To write them directly, we need to write the list inside the curly brackets {} (see hard-coded formula).

2.1.1.2 Dimension and Size of Data

In spreadsheets, the dimension is related to rows and the columns. If our data is written in one column but multiple rows or one row but multiple columns, the dimension of our data is one-dimensional (1D). If our data is in a table with multiple rows and columns, then the dimension is two dimensional (2D). 

The rows and columns serve as coordinates of the cells within the spreadsheet. For 1D data, we only need to specify one coordinate (row or column) to know where the value is located. For 2D, we need to define the row and column to locate the value. 

The size of the data is related to the number of rows and columns, which also translate to the number of entries. 

The data can be 1D or 2D:

  1. One-Dimensional (1D):
    1. Horizontal List: The QUERY function in the scenario below only extracts the first row of the data set (Fig. 2.3). 
  • Cells 

Figure 2.3 1D Data: one row, multiple columns

  • Hard-coded Arrays: The array here is the same as the range A2:C2 in Fig. 2.4. (values in a horizontal list are separated by commas).

=QUERY({“Sub 1”,DATEVALUE(“1/1/2021”),100})

  1. Vertical List: 
  • Cells (Fig. 2.4)

Figure 2.4 1D Data: one column, multiple rows

  • Hard-coded Arrays: (values in a vertical list are separated by semicolons)

=QUERY({“Subs”;”Sub 1″; “Sub 2″;”Sub 1”})

  1. Two-Dimensional (2D):
  • Cells: (Fig. 2.5)

Figure 2.5 2D Data: Multiple Rows and Columns

  • Hard-Coded Arrays: (the array below is a 2 (row) by 3 (column) array)

=QUERY({“Sub 1”,DATEVALUE(“1/1/2021″),100;”Sub 2”, DATEVALUE(“1/12/2021”) ,150} )

2.1.1.3 Data Type of Data

The most important property of this input is the consistency of the data type per column. Each of the columns in the data must have one data type. If there are more than one data type, then the QUERY function will assign the data type that has the largest occurrence in a column. Let’s look at the example below what happens to the extracted data. 

Column C consists of two data types: number and text (Fig. 2.6). There’s a single text located in C3. Now, let’s look at the corresponding cell in the extracted data, G3. The value that is returned here is empty. The majority of Column C are numbers, and therefore, the assigned data type is a number. The rest of the values that are not numbers will be ignored in the extracted data. 

Figure 2.6 Data Type Consistency

2.1.2 Query

Unlike normal inputs, the query is not a condition or a value but a series of keywords that execute actions. It seems a bit complicated, but we can look at each of the clauses as functions, and like functions, these clauses require inputs and conditions. One important rule for the query input is that it must be written in text format. Another rule is that the clauses must be separated by spaces.

The most important thing when writing the syntax for the query is to visualize the output data. We must determine the necessary information that we want to display, and it won’t be difficult to construct the formula. Let’s look at the different aspects of this input. 

2.1.2.1 Functions and Formulas

We can nest functions and formulas within the clauses, but it’s important to note that the functions in the query can only be used together with the clauses and can only accept a column ID as an input. Another thing is that the names of the functions are not case-sensitive. Let’s look at each of these functions and formulas. 

  1. Aggregation  Functions

Functions that generate a single value as their output no matter the size of the input are called aggregation functions. Here is the list of aggregation functions that can be used in the clauses:

  1. SUM
  2. AVERAGE
  3. COUNT
  4. MIN
  5. MAX

The aggregation functions are typically used for summarizing the data and can only be used in the following clauses: select, order by, label and format. Let’s look at the simple example below (Fig. 2.7). 

The whole Amount Column is summed into one value, which is 1750. Going back to the definition of the aggregation functions, even if the input is an array, the output will always be a single value. This process of summing doesn’t replace Column C but instead creates a new column to contain the results of the aggregation functions. The header name for the new column is also created automatically as shown below. Another thing about the aggregation functions is that they follow groupings (by group by or pivot clauses).

Figure 2.7 Aggregation Function: SUM

  1. Scalar Functions

Functions that output data corresponding to the size of its input are called scalar functions. Most of these are date and time functions. Here’s the list of the scalar functions that are supported in query: YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, MILLISECOND, QUARTER, DAYOFWEEK, NOW, DATEDIFF, TODATE, UPPER AND LOWER. The last two functions are string functions. 

Scalar functions are also used for summarizing data and can only be used in certain clauses: select, where, group by, pivot, order by, label and format. Let’s look at the simple example below: 

Figure 2.8 Scalar Function:UPPER

The UPPER function is applied to each of the entries in Column A, and just like the aggregation functions, a new column is created to contain the results (Column F).

  1. Arithmetic Formulas

We can also employ arithmetic formulas to summarize the data. Let’s look at the example below: 

Figure 2.9 Arithmetic Formula

The operands can be column IDs, constants and values from functions, but there are combinations of operands that are not valid. We can perform arithmetic operations with a combination of Column IDs, constants and scalar functions or a combination of constants and aggregate functions. If we try to perform arithmetic operations with a column ID and an aggregation function (Fig. 2.10) or with a scalar function and aggregation function, the QUERY function will return an error. 

Figure 2.10 Operands: Column ID and Aggregation Function

2.1.2.2 Clauses

We’ve seen in Fig. 1.1 the list of clauses and the order of their usage in query. The syntax starts with the name of the clause followed by a space that separates the clause and their inputs. If a clause has multiple inputs, the comma will serve as the separator. In terms of the inputs, the input can be column IDs, constants or an array from the nested functions. Let’s discuss each of the clauses, their inputs, rules and properties. Column IDs must be written based on the column headers of the spreadsheet (e.g., A, B, C, etc.). This is case-sensitive. Therefore, we can’t substitute lowercase letters (e.g., a, b, c, etc.). If the data  input is an array, the syntax will be Col1, Col2, Col3 and so on. Again, this is case-sensitive. 

  1. Select

As previously mentioned in section 1.3, the select clause has two basic functions: filtering and reorganizing the data set, but aside from these two, it can also summarize the data. This latter function of the select clause can be a bit complicated because most of the time, we need other clauses to have a meaningful summary (this will be expanded later). Let’s start with the two basic functions first.

The select clause will dictate the columns that will be extracted. If we want to extract all of the columns, then we can use the asterisk (*) or omit the select clause to simplify the syntax  (Fig. 2.11). If we want to extract specific columns, we need to identify the column IDs (Fig. 2.12). 

Figure 2.11 Select: All

Figure 2.12 Select: Column A and B

The arrangement of the input column IDs will dictate the arrangement of the columns in the extracted data. As shown in Fig. 2.12, B is followed by A in the input. Therefore, the output data will start with Column B and then followed by Column A. 

We’ve already seen in section 2.1.2.1 how the select clause can incorporate the functions and formulas to summarize the data. For scalar functions and formulas, there won’t be a problem with summarizing the data since the size of the new columns will be the same as the row size of the data. The problem arises with using the aggregation functions in combination with other column IDs or scalar functions. There will be a size difference, which will cause an error. To utilize the aggregation functions more effectively, we need to apply other data manipulation processes that can output a better summary like the example below (Fig. 2.13). 

The task is to summarize the data set in A1:C9 by calculating the total amount for each of the subs. We added the group by clause to perform the grouping on Column A. This makes the summary more meaningful, and this way, we can work with aggregation functions with column IDs and scalar functions. As mentioned in section 2.1.2.1, the aggregation functions conform to groupings. 

Figure 2.13 Select: Aggregation Function

  1. Where

The where clause has only one function, and that is to filter the rows of the data. We can use the where clause without the other clauses. In terms of its input, we need to define the condition for the filter. Let’s look at an example the below: 

Figure 2.14 Where Clause

The task in the above example is to filter the rows based on the condition applied in Column C (i.e., C>150) (Fig. 2.14). Let’s focus on the conditional statement that we can use for the where clause. The above example uses one of the logical operators (e.g., >, <. >=, <=). This is the most common way to express a conditional statement. For multiple conditions, we can use the AND (Fig. 2.15) and OR to join more than one condition. 

Figure 2.15 Where with multiple conditions

We can also perform partial matching for a conditional statement. We can’t use the normal wildcards in Google Sheets, but there are keywords that work or behave similar to the wildcards. Let’s look at each of them:

  1. Contains

Contains is the same as using the asterisk (*) wildcard to enclose a keyword (i.e., “*keyword*”) (Fig. 2.16). 

Figure 2.16 Where: Contains

  1. Starts With, Ends With

Starts With and Ends With is the same as using the asterisk (*) wildcard to replace the values before (i.e., “*keyword”) and after (i.e., “keyword*”) the keywords, respectively (Fig. 2.17). 

Figure 2.17 Where: Ends With

  1. Matches

The Matches keyword is equivalent to an equal sign (=) (Fig. 2.18). Therefore, it can’t perform a partial match by itself since it behaves like an equal sign, but together with  a special wildcard (i.e., dot asterisk, .*), we can perform a partial match that behaves similar to the asterisk wildcard (*). Therefore:

‘.*keyword’ =    “*keyword” (Fig. 2.19)

‘keyword.*’ =    “keyword*”

=QUERY(A1:C9,”SELECT A,C WHERE A Matches ‘XX.*'”)

 ‘.*keyword.*’ =   “*keyword*”

=QUERY(A1:C9,”SELECT A,C WHERE A Matches ‘.*1.*'”)

Figure 2.18 Where: Matches

Figure 2.19 Where: Matches with Wildcard (.*)

  1. Like

The Like keyword is the same as the Matches (Fig. 2.20), but it can use two wildcards: percent (%) and underscore (_). The % behaves like the asterisk (*) wildcard while the _ is equivalent to the question mark (?) wildcard. 

=QUERY(A1:C9,”SELECT A,C WHERE A Like ‘%1′”) or

=QUERY(A1:C9,”SELECT A,C WHERE A Like ‘___ 1′”)  (3 underscores replacing “SUB”)

Figure 2.20 Where: Like

Most of the aforementioned keywords behave similarly, and therefore, we can substitute one keyword to another. Just use the keywords based on your preferences. If we omit the where clause, the QUERY function will extract all of the rows. 

  1. Group By

The group by clause is used to summarize the data. It has two functions: group and sort. It works really well with the select clause that has aggregation functions as inputs, and as a matter of fact, we can’t use the group by clause without the select clause. Let’s look at the example below: 

Figure 2.21 Group By

Let’s look at the select clause first, and then build an intuition out of this. The select clause filters the column that will be extracted, reorganizes the columns and adds new columns for summary using functions and formulas.  In this case, Column A will be extracted and then followed by a new column from the COUNT function (Fig. 2.21). The COUNT function is an aggregation function, and therefore, it will return a single value. This doesn’t match with the size of Column A with multiple values. It’s important to note here that the columns of the output data of the QUERY function must have the same row sizes. If this rule is not satisfied, then an error is returned instead. Therefore, in the case of the example above, this will return an error, but if we are thinking of a more meaningful summary, we can group the subs on their own categories using the group by clause. The aggregation functions will adjust with the grouping, and instead of applying the COUNT function to the whole of Column A, it will be applied per group. 

If there are multiple column IDs or scalar functions together with aggregation functions in the select clause, then the column IDs and scalar functions must also be in the group by clause (Fig. 2.22), but the aggregation functions can’t be inputs for the group by clause. Column A and the array from the MONTH(B)+1 formula will have the same row sizes but not with the COUNT function. We must take the first two columns and make them as input for the group by clause. Another important rule is that the arrangement of column inputs in the group by clause has an effect on the grouping and sorting. The group by clause automatically sorts the extracted data in an ascending order based on the first column input.

Figure 2.22 Group By: Multiple Column IDs and Scalar Functions

  1. Pivot

The pivot clause has the same functionality with the group by clause, which is to summarize and sort the data, but they differ in presenting the summary.  It’s important to note here that both clauses group values into their own categories and both have the same rules except for the multiple column IDs and scalar functions. The pivot clause can’t also accept aggregation functions as inputs. Let’s look at the example below (Fig. 2.23). 

Figure 2.23 Pivot

If we look at the output data, it has the same information as the result of the group by clause in Fig. 2.21, but the resulting array is transposed to a column. This transposed effect is the main effect if the pivot clause is used only with select, but together with the group by clause, a more powerful manipulation is actually happening. Let’s look at the example below (Fig. 2.24). 

Figure 2.24 Pivot after Group By

With the group clause, the pivot clause can add a new dimension or variable to our data. The column input of the pivot clause becomes the new header row. Therefore, the column inputs of the pivot clause must not be in the select clause. The select clause filters what columns to show while the pivot groups the values of a column into their own categories and converts the resulting group to a new header row.  The group by clause groups the values of the column but the arrangement of the grouping is still a vertical list unlike the pivot clause. Therefore, we can add new variables or dimensions through the group by and pivot clauses. In essence, the pivot clause also reorganizes the data set. It also utilizes the aggregation functions from the select clause. We can input multiple aggregation functions in the select clause, and the pivot clause will distribute the results to their corresponding groups (Fig. 2.25). It’s important to note here that the pivot clause requires at least one aggregation function in the select clause. 

Figure 2.25 Pivot: Multiple Aggregation Functions in the Select Clause

We can also have multiple column inputs for the pivot table, but it will combine the column inputs into combinations of headers in one row (Fig. 2.26). Another important rule is that column inputs of the select clause that are not in the pivot clause must be placed as inputs for the group by clause. This makes sense if we remember the rule for  difference in sizes for column inputs. Therefore,  this is the only valid way in order for the formula to work.  

Figure 2.26 Pivot: Multiple Column Inputs

  1. Order By

The order by is a straightforward clause and can be used without the other clauses. It sorts the data based on its column inputs (Fig. 2.27). It can have multiple column inputs, and it will sort the extracted data following the order of its column inputs. We can also set the order to ascending or descending by using the keywords (i.e., ASC and DESC, respectively) after a column input. By default, the order is ascending. Therefore, if the order keyword is omitted in its input, it will sort the data in an ascending order. If the order by clause is preceded by the pivot clause, then it can’t take any aggregation functions as inputs. 

Figure 2.27 Order By

  1. Limit

The limit clause has only one function, which is to limit the number of rows starting from the first row of the data set excluding the header. It only requires an integer input for the number of rows (Fig. 2.28). Just like the order by clause, it can also work without the other clauses. 

Figure 2.28 Limit

  1. Offset

The offset clause prevents a given number of rows starting from the first row of data excluding the headers to be extracted (Fig. 2.29). Therefore, the starting row for the extracted data is determined by displacing a number of rows from the first row of the given data. The rows above the starting row will be excluded in the extraction. We can also use the offset clause without the other clauses. Remember the keywords “starting row” whenever you’re using the offset clause. We are just defining which row will be the first row for our extracted data. 

If the offset clause is used together with the limit clause, the offset will define the starting row. The limit clause will start counting from the starting row set by the offset clause. See the example in Fig. 2.30.  

Figure 2.29 Offset

Figure 2.30 Offset after Limit

  1. Label

The label clause labels the headers. It basically renames the headers of the columns (Fig. 2.31). We can also use the label clause without the other clauses. Here’s an important rule for it’s syntax: column input+space+label (text format). We need to follow this syntax arrangement in order for this to work properly. We can also label multiple columns as shown in the figure below. 

Figure 2.31 Label

  1. Format

The format clause sets the formatting of the values in a column (Fig. 2.32). The syntax of the inputs for the format clause is the same with the arrangement of the syntax for the label clause. The codes for the other formats can be seen in Format->Number->More Formats, and we can select date and time or numbers. The format clause can also be used without the other clauses, but if the select clause is present, the choice of column inputs for the format clause will be limited based on the column inputs of the select clause. Therefore, we can only input columns that are present in the select clause. Otherwise, the QUERY function will return an error. 

Figure 2.32 Format

  1. Options

The options clause is not well documented. If we try to use it in the function, it doesn’t seem to have an effect. It’s possible that this clause doesn’t work in Google Sheets and only works for the Google Visualization Query Language. We were not successful in showing what it can do. It’s also possible that it’s usage in Google Sheets Query is not yet understood. 

2.1.2.3 Literals

Functions, formulas and clauses require inputs. Aside from the column IDs, some of their inputs are numbers, texts, dates and other data types, which are called literals in the query language. Let’s look at the rules and properties of each of the literals in the query input. 

  1. Numbers and Boolean Values

The limit clause requires an integer input as shown in the formula below. The syntax for numbers doesn’t require any keyword for them to be evaluated. We just write it just like how we write numbers in other Google Sheets functions. This is also true for Boolean values (i.e., TRUE and FALSE). 

=QUERY(A1:C9,”LIMIT 3″)

  1. Texts

Just like numbers, texts in query input don’t require any keywords. We must also enclose it with quotation marks, but the query input is also in text format. In this case, for a text within a text, we need to use single quotations (e.g., ‘text’) instead of the normal quotation marks (e.g., “text”). The format clause requires a formatting code that must be written in text format. Therefore, we need to enclose the formatting code with single quotation marks as shown in the formula below. 

=QUERY(A1:C9,”SELECT B FORMAT B ‘mmmm-dd-yyyy'”)

  1. Dates

Dates in query require a “Date” keyword before the date itself in order for them to be evaluated (Fig. 2.33). The dates must also be written in this format, yyyy-mm-dd, and must be enclosed with single quotations for it to be recognized in the query input (see the formula below). The keyword is not case-sensitive. 

Figure 2.33 Dates

2.1.3 Headers 

The headers is an optional input, which accepts a single integer as input. If this input is omitted, it will take its default value, which is -1. If it’s empty, this will be equivalent to 0. We can store it in a cell and reference it, but it’s normally omitted or hard-coded in the formula. The headers input determine which of the top rows are part of the header row. The -1 input means that the QUERY function will identify it for us (Fig. 2.34). In the case below, it utilizes the values of the top row (A1:C1) as the headers. If we input 0, then it will not add the headers and will treat the first row as part of the data set. If the headers input is more than one, then it will utilize the first number of rows as headers. Their values will be combined, and they will be converted into one row. 

Figure 2.34 Headers: If it is omitted in the formula, it will take its default value, which is -1.

2.2 The QUERY Process

2.2.1 How does the QUERY work?

We already had a glimpse of how it works from section 1.4 and 2.1, and in this section we’ll look at more detailed examples of how the QUERY processes its inputs.

2.2.1.1 More than One Clauses

We’ve already seen more than one clause in the query input from previous examples. Some of them won’t work without the other clauses and will change how they manipulate the data depending on other clauses. The order of execution doesn’t follow the order of the clauses in the syntax, but we can intuitively predict how the QUERY function will process the syntax. Before we try to elaborate the processes, let’s categorize the clauses first (Fig. 2.35). We can see in the categorization that a clause can perform multiple data manipulation processes. Some of the clauses that are in the same category are interrelated, but it doesn’t mean that they have the effects. 

Figure 2.35 QUERY: Data Manipulation

It’s easier to predict the results of nested functions and formulas because we can determine the order of evaluation. For clauses, this is quite difficult because the order of clauses in the syntax doesn’t reflect the order of evaluation. You might ask, why is it important to predict the result? The most important thing in building complex formulas is to have a visualization of the result. It’s not much about the values, but the size, dimension and arrangement of the results. If we can visualize the possible result, it means that we have a good understanding about the processes and their order of evaluation. Let’s try to make sense of how the QUERY function processes its query input. 

Figure 2.36 QUERY:All Clauses except the Options Clause

All of the clauses are present in the formula above except for the options clause (Fig. 2.36). We don’t have an idea how Google Sheets processes all of the clauses in the background, but we can formulate rules based on the results of the QUERY function. Looking at the output data (E1:H4), we can conclude that the where clause was executed first (Fig. 2.37). It’s not a definite conclusion, but we can base our rules or understanding on this as long as it works for us. The corresponding rows with cells in C that are less than or equal to 100 are removed from the data and are not included in the SUM function. Another thing is that the filter function of the select clause was not executed first because we can filter the data through the where clause using columns that are not mentioned in the select clause. Therefore, we can conclude that the select clause was not evaluated first. 

Figure 2.37 QUERY: Filter 

Next is performing the reorganization and summary through select, group by and pivot clauses. We can argue that the scalar functions that are in the select clause can be the first thing that were evaluated before the where clause, but it doesn’t really matter which is first between the where clause and the addition of new columns through the scalar functions because it doesn’t affect the results. It doesn’t matter, but it’s important to keep this possibility in mind. One important thing about scalar functions is that they are applied to each of the cells in the column. It’s important to emphasize it here because the QUERY function assigns a data type for each of the columns based on the data type that makes up the majority of the values for each of the columns. The minority data types are ignored or excluded in the extraction, but the functions will not exclude these values.

Going back to the example, the MONTH formula is evaluated in the select clause. This will be applied to each of the cells in B. We can imagine that a new column that will contain the result of the MONTH formula is added to the data (Fig. 2.38). It’s clear that the addition of new columns is only possible through the select clause. Its summary function enables this addition of columns and new information. 

Figure 2.38 QUERY: Add New Columns through Select with Scalar Function 

The reorganization of the select clause and the pivot clause are performed and followed by the grouping of group by and pivot clauses and the aggregation of the values based on the grouping (i.e., aggregation functions). In this case, the result from the MONTH formula is grouped by the group by clause while Column A is grouped by the pivot clause (Fig. 2.39). The results from the MONTH formula is positioned as the first column based on the order of column inputs from the select clause. This is followed by the new header rows. The aggregation function is evaluated for each pair of the MONTH formula group and the Subs group. 

If we look at the arrangement of the output data, we can observe that the combination of the select, group by and pivot must only require at least three variables. It’s perfect for three dimensions of variables: header row, header column (1st column) and the values that each of the coordinates represent. In this case, the header row is A (Subs) and the header column is the month number. A pair of these two headers is the same as coordinates (or a cell) within a spreadsheet, and each coordinate represents a value, which in this case is the sum of the amounts. The two headers (row vs. column) don’t have to be equal in size. In essence, if we are planning to display relationships between three or more variables, then using the combination of select, group by and pivot clauses is the best method. Although this is perfect for three variables, we can add more. The select and group by clauses can extract or add more columns that are grouped together while the pivot clause can combine multiple column inputs into one header row. The pivot headers are always positioned after the columns that are set by the select clause. After the grouping and reorganization, both the pivot headers and the grouped columns are sorted in an ascending order. 

It’s also clear now why the aggregation functions can’t be inputs for other clauses. It can’t work with group by and pivot because the aggregations functions aggregate multiple values into one. Therefore, there’s nothing to group. This argument can be used with the other clauses that are not able to accept aggregation functions as their inputs.

Figure 2.39 QUERY: Summarize, Group and Reorganization

The next process is the sorting process by the order by clause (Fig. 2.40). It was discussed in section 2.1.2.2 that if the order clause is together with the pivot clause, then it can’t accept the aggregation functions as its inputs. We can use our intuition to formulate a reason behind this rule. If we look at Fig. 2.39 and 2.40, the sum function is now distributed to multiple columns. It’s no longer a 1D list. Clauses that accept column inputs can only accept 1D inputs except for the format clause. Therefore, 2D inputs won’t work. 

Figure 2.40 QUERY: Sort using Order By

Next, let’s talk about the limit and offset clauses. These two clauses also filter the rows (Fig. 2.41). The offset clause is executed first before the limit clause. The offset clause determines the starting row of the extracted data by removing a number of rows starting from the first row of the data set. The limit clause will limit the number of rows that will be extracted. The limit is referenced from the first row, which in this case is a new first row set by the offset clause. The remaining rows after executing the offset clause in Fig. 2.41 is three. Therefore, there are no changes after applying the limit, but since the data set is dynamic, any changes from it will update the extracted data.  

Figure 2.41 QUERY: Limit and Offset

The last two processes are about the display of the data. Let’s start with the label first (Fig. 2.42). We can change the label of the headers using the label clause, but it can only accept 1D input. Therefore, changing the labels for the pivot headers won’t work. Unlike the other clauses, the format clause can accept 2D inputs like the distributed SUM function (Fig. 2.43). Note that the format clause can’t format the pivot headers. 

Figure 2.42 QUERY: Label

Figure 2.43 QUERY: Format

2.2.1.2 Headers

The extracted data can either have one header row or no headers at all. Let’s try to understand how the QUERY function processes the headers input. 

The values for the header row is determined from the headers input. We can think of headers as row numbers. If the headers is 1, then it will take the values from the cells of the first row. If it’s greater than 1, then the values per each column are concatenated with space as separator (Fig. 2.44).  The headers in the example below is 2. Therefore, the header for the first column is equivalent to: A1&” “&A2 = “Subs Sub1.” It takes the cells for the first two rows and concatenates their values and separates them using space. This process is repeated for the other columns. 

Figure 2.44 Headers > 1

If the headers input is omitted or if the headers is -1, then the QUERY function will determine the headers. How does the QUERY function process this? Again, we don’t know what’s happening in the background, but we can formulate hypotheses and test them. Before we delve in deeper, let’s start with a rule where we can base our hypotheses. If we remember the rule for the data types of the data input, each of the columns are assigned with a data type that has the highest occurrence within a column. The minority of data types within a column will be ignored and not extracted. Therefore, based on this rule, we can say that the headers are determined by comparing the data types of the top rows to the data types of the columns. Most of the time, the headers and the data are different in terms of data types. Headers are typically texts. Therefore, this difference can be one of the factors used by the QUERY function. Here’s the hypothesis: if the majority of the cells in the top row are different from their corresponding column data type, then that row is a header row. Let’s look at examples to see if this is the case. 

There two header rows in the data set (A1:C10) (Fig. 2.45). The 2nd and 3rd columns of header rows have different data types than the data in their corresponding columns. Therefore, based on our hypothesis, the first two rows will be headers. This aligns with the results of the QUERY function, which identified that the 1st and 2nd row are the headers. Let’s try to manipulate the values of the headers.

Figure 2.45 Headers = -1: Example 1 – Headers = 2

We changed the C2 to a boolean value (i.e., TRUE) (Fig. 2.46). Notice the headers in the extracted data. In the 1st row, the majority of the cells have a different data type from their column data type.  In the 2nd row, only one cell is different from the data type of its column, B2. Although C2 is boolean, it has an equivalent number value, and this seems to be a factor in selecting the headers. Similar data types are considered the same for selecting headers. It’s not strict as the rule for the column data type. A2 is text, which is the same to the data type of Column A. 

Using our hypothesis, only row 1 should be the headers, and this agrees with the results below. B2 and C2 are ignored in the extracted data (F2 and G2, respectively). Column B contains dates. Therefore, B2 is ignored or excluded. Column C are numbers, but C2 is boolean. Although the value of C2 has an equivalent number, it’s data type is still different from the data type number. As mentioned before, the rules for the data types for each column are strict unlike the rules for selecting the headers. 

If we change the first row instead of the 2nd row (Fig. 2.47), the same hypothesis still applies. In this case, row 1 won’t be considered as a header row. The question is how about row 2? Row 2 satisfied the condition of our hypothesis, but the results showed that the QUERY function selected no headers. Therefore, we can conclude that the rows succeeding the first row can only be part of the header row if the first row also met the condition for the headers. If the first row is not a header row, then the next rows will not be considered for the header row. 

Figure 2.46 Headers = -1: Example 2 – Headers = 1

Figure 2.47 Headers = -1: Example 3 – Headers = 0

2.2.1.3 Formatting

As we’ve seen from the examples, the formatting of the cells of the data set is not copied to the extracted data, but the formatting of the values are retained (Fig. 2.48). If a column contains more than one type of value formatting, the formatting of the first value in the column is applied to the whole column (Column F). Also, value formatting is not applied to the results of the functions (Column G). 

Figure 2.48 Formatting

2.2.1.4 Hidden Clause: Skipping 

There’s one hidden clause that is rarely mentioned in the query clauses, and that is the skipping clause. If we look at a certain parse error from the QUERY function, we can see a list of clauses, and after the order clause, the next clause is skipping and not limit. Therefore, there’s a clause that is not included in the known list of clauses. What is the function of the skipping clause? Let’s look at an example. 

Figure 2.49 Parse Error with a List of Clauses

The data set below contains three sets of data of different categories: name, address and number (Fig. 2.50). This arrangement is common in scenarios where the data is copied from pdfs to spreadsheets. The task is to clean and organize the data set by separating the three sets of data into their own columns. 

Figure 2.50 Transpose every N rows

One of the known solutions for the problem above is to use the QUERY function to select and extract the data with the same categories (Fig. 2.50). The skipping clause will filter the data by skipping some number of rows, which in this case, is based on the number of categories (i.e., 3). The way it selects and extracts the data is shown in Fig. 2.50.  

Figure 2.51 QUERY: Skipping Clause

The QUERY function successfully extracted the names into one column. The next process is to use the TRANSPOSE function to convert the vertical list to a horizontal list (Fig. 2.52). Now, the data are separated into their own columns. Next, we copy or drag the formula from B1 up to nth number of rows, which in this case is 3 (Fig. 2.53). 

Figure 2.52 Applying the TRANSPOSE function

The data input of the QUERY function is a relative reference. Therefore, if we copy or drag the formula, it will adjust based on its position. In B2, the reference A2:A10 becomes A3:A11. It also moves one row. Therefore, the skipping will now start at A3 and will follow the count from it. The same process also applies to the QUERY formula in  B3. 

Figure 2.53 Copy the formula down Nth number of times 

The example above showcases the function and application of the skipping clause. It’s just another way to filter the data. 

2.3 QUERY Output

2.3.1 Dimension and Size of the Output

We’ve seen in section 2.2.1.2 that the dimension, size and arrangement of the output data of the QUERY function will be highly dependent on the query input. Just make sure that the destination cells of the outputs are vacant cells. If there are values that are occupying the cells where the results should be displayed, then an error will be returned instead of the results (Fig. 2.54). 

Figure 2.54 The QUERY function returns a #REF error due to G1 blocking the results

2.3.2 Data Type of the Output

The data type of the columns of the output will be the assigned data type for each of the columns. 

  1. Applications of QUERY

Most of the examples from the previous chapters showcased the power of the QUERY function. We’ve seen most of its important applications. Therefore, in this chapter, we’ll explore more applications that require a little bit of tricks to be able to perform the required tasks. 

3.1 How to Change the Pivot Headers in QUERY

The task below is to perform a summary that shows the sum of the amounts for each pair of subs and months (Fig. 3.1). The month should be the header row while the subs should be the header column. We used the formula below:

=ArrayFormula(TRANSPOSE(QUERY({A2:A9,EOMONTH(B2:B9,0),C2:C9},”SELECT Col2,SUM(Col3) GROUP BY Col2 PIVOT Col1 LABEL Col2 ‘Subs’ FORMAT Col2 ‘MMMM'”)))

Figure 3.1 Changing Pivot Headers

Let’s break down the formula above. The important keypoint here is to convert the dates to their corresponding months. We’ll do this before applying the QUERY function. The data set is dissected and recombined using the curly brackets {}. We normally use curly brackets to write arrays in formulas. The curly brackets combine and arrange the constants into an array depending on how we set the arrangement (using commas for columns and semicolons for rows). In Google Sheets, the curly brackets can go beyond constants and can use ranges of cells. This makes it more powerful because we can use this to combine ranges of cells into one list. Therefore, we can combine two or more vertical lists or horizontal lists to produce new tables. 

We use the EOMONTH function to return the date of the last day of the month. It has two inputs: the start_date and months. The start_date is the reference date while the months is an integer that will determine the number of months from the start_date. It can be negative or positive, which means past or future months, respectively. The months input determines the month that will be returned by the EOMONTH function. If start_date is 1/1/2021 and the months input is 1, then the month that will be returned is February, and the EOMONTH function will return the last day of February. The goal here is to make the dates uniform so that we can group them in the QUERY function. 

Next, we reorganize the table using the curly brackets. Notice that we did not include the header of the data set since it will interfere with the usage of the EOMONTH function, and the joining of the columns won’t work if the lists have different sizes. It’s important to emphasize here that when we are joining ranges of cells in this manner, the row sizes of the columns must match. If both columns have different no. of rows, then the joining will result to an error. This is also true if we add rows of data. The column sizes of the data sets that will be joined must be the same. The result of the joining by the curly brackets is shown in Fig. 3.2. Notice the ArrayFormula function. The EOMONTH function normally accepts single value input, and if the input is an array or a range of cells, then we need to apply the ArrayFormula function. 

Figure 3.2 Applying the Curly Brackets

Next is to use the results of the curly brackets as the data input for the QUERY function. Since the data now is an array, we can’t use the column letters. Normally, we’ll arrange the column inputs in the select clause in this order: Col1 (Subs), SUM(Col3). Then, we’ll group the Col1 with the group by clause and use the pivot clause to group and transpose the Col2 (Dates). The problem is we need to format the pivot headers to month names. We need to use the format clause to do this, but the problem is that the format clause can only format columns that are in the select clause. Since the dates are in the pivot clause, the format clause won’t be able to accept it as input. We need to use a little trick here. We’ll use Col2, SUM(Col3) as the inputs for the select clause (Fig. 3.3). Then, we group Col2 with the group by clause and use pivot clause to group and transpose Col1 instead. With this setting, we can use the format clause to format the dates to month. 

Figure 3.3 Applying the Query Function (Select, Group By and Pivot)

We can also use the label clause to change the header of the dates to “Subs.” The reason for this is because we will transpose the result of the QUERY function using the TRANSPOSE function.  The TRANSPOSE function will convert the columns to rows and the rows to columns. The final result is shown in Fig. 3.1. Note that the ArrayFormula function must always enclose the nested formulas. 

3.2 How to Nest Other Google Sheets Functions in QUERY

We’ve discussed in section 2.1.2.1 that the combination of aggregation functions and scalar functions or column IDs as operands won’t work, but we can work our way around this problem by concatenating other Google Sheets functions with the query input. Let’s look at the example below.

The task is to extract the Subs Column and the percentage of each of the amounts from the total amount (Fig. 3.4). We must total the amount first to calculate the percentage. The problem is that the combination of aggregation functions and scalar functions or column IDs as operands doesn’t work. To do this, we’ll calculate the total using the SUM function of Google Sheets and not the SUM function of the QUERY function. We’ll concatenate the result of the SUM function to the select clause as shown in the formula in Fig. 3.4. Since the query input is in text format, this concatenation is valid. The SUM function is evaluated first, and its value is concatenated to the select clause. Also, we must use this concatenation if we want to perform other data manipulation on this new column like the label clause in the example below. This solution shows us that we can nest other Google Sheets Functions within the QUERY function by concatenating them in the query input.  

Figure 3.4 Nesting Google Sheet Functions in Query

3.3 How to do QUERY with Dynamic Columns (Nested Query)

If the data set is dynamic in terms of the columns, where there are column additions or deletions, then we must convert the Query formula into a dynamic formula where it can adjust with the changes. Let’s look at an example (Fig. 3.5).

The task in the example below is to extract the Subs Column and the amount column of the current month. Let’s break down the formula. Let’s start with the first QUERY function. The objective of the first QUERY function is to organize and extract the required columns. Notice that the data input of the first QUERY is enclosed by the curly brackets. This will convert the range of cells to arrays, and therefore, we’ll use the Col+column index as the syntax instead of the column letters. With this syntax, we can convert the QUERY function to a more dynamic formula. There are two inputs for the select clause of the first QUERY: Col1 and Col + the last column index. The last column index is determined by the COUNTA function. The COUNTA function counts the number of cells that contain a value. The input used in the COUNTA function is the whole first row (1:1). The COUNTA function is performed first, and then, its result is joined to the select clause. The array result of the first QUERY is used as the data input for the second QUERY function. The purpose of the second QUERY is to apply more data manipulations using other clauses such as where clause and so on. (We can also use the MATCH function in other dynamic scenarios.)

Figure 3.5 Nested Query functions

3.4 How to do QUERY with Inputs in Another Sheet

We can have inputs of QUERY from another sheet. The process of constructing the formula is still the same. We just need to reference the inputs from another sheet. 

3.4.1 QUERY with Inputs in Another Sheet (same Workbook)

The data set is stored in the Data Sheet (Fig. 3.6) while the extracted data is in the Query Sheet (Fig. 3.7). Here are the steps:

  1. Write = QUERY(
  1. Highlight (or write) the data:

We can’t simply write A1:C9 because the data is now from a different sheet. We need to reference the cells together with the reference sheet. The fastest way to do this is to highlight the cells. While the cursor is inside the QUERY function, we go to the Data Sheet tab. Then, we highlight the cells A1:C9, the spreadsheet automatically writes the cell reference including its sheet reference (Fig. 3.8). The sheet reference always starts with the sheet name followed by an exclamation mark (!). Then, this is followed by the cell reference.

  1. Write a comma, and then, go back to the destination sheet and write the query(Fig. 3.7).

Make sure that the query is enclosed with quotation marks. We can omit the headers or write its value after the query. Then, press ENTER. 

Figure 3.6 The data set is located in a different sheet (Data Sheet) 

Figure 3.7 QUERY Output in Another Sheet

Figure 3.8 Highlighting the Cells from a different Sheet

3.4.2 QUERY with Inputs from Another Sheet (in another Workbook)

If the inputs are located from another workbook, then we need to use the IMPORTRANGE function in order to pull the data from another workbook. The IMPORTRANGE requires two inputs: the URL of the workbook and the sheet and cell references. Both inputs for the IMPORTRANGE must be in text format. Therefore, we need to enclose the URL and reference with quotation marks. IMPORTRANGE requires permission to pull the data from another workbook. Once access is granted, it will automatically pull the data. 

Below is an example on how to apply the IMPORTRANGE to QUERY:

=QUERY(IMPORTRANGE(“https://docs.google.com/spreadsheets/Example”,”Data!A1:C9″),”SELECT Col1,Col3″)
The IMPORTRANGE function also returns an array of data, and therefore, we must use the syntax: Col+column index.