Wednesday, 29 March 2017

Query Processing & Optimization



1.   What do you mean by query processing? What are the various steps involved in query processing? Explain with the help of a block diagram.
Ans: Query processing includes translation of high-level queries into low-level expressions that can be used at the physical level of the file system, query optimization and actual execution of the query to get the result. It is a three-step process that consists of parsing and translation, optimization and execution of the query submitted by the user (see Figure 8.1). These steps are discussed below:


Figure 8.1 Query-processing Steps

1. Parsing and translation: Whenever a user submits a query in high-level language (such as SQL) for execution, it is first translated into its internal representation suitable to the system. The internal representation of the query is based on the extended relational algebra. Thus, an SQL query is first translated into an equivalent extended relational algebra expression. During translation, the parser checks the syntax of the user's query according to the rules of the query language. It also verifies that all the attributes and relation names specified in the query are the valid names in the schema of the database being queried.

2. Optimization: Generally, there are several possible ways of executing a query (known as execution strategies), and different execution strategies can have different costs. It is the responsibility of the query optimizer, a component of the DBMS, to choose the least costly execution strategy. This process of choosing a suitable execution strategy for processing a query is known as query optimization. The metadata stored in special tables, called the DBMS catalog, is used to find the best way of evaluating a query.
3. Execution: The chosen execution strategy is finally submitted to the query-evaluation engine for actual execution of the query to get the desired result.

2.   What is a query tree? How is it different from a query evaluation plan? Explain with the help of an example.
Ans: The relational-algebra expressions are typically represented as query trees or parse trees. A query tree (also known as operator tree or expression tree) is a tree data structure in which the input relations are represented as the leaf nodes and the relational-algebra operations as the internal nodes. When the query tree is executed, first the internal node operation is executed whenever its operands are available. Then the internal node is replaced by the relation resulted after the execution of the operation. The execution terminates when the root node is executed and the result of the query is produced. For example, consider the following relational-algebra expression:

images
The query tree representation of this expression is shown in Figure 8.2.
Figure 8.2 Query Tree

Each operation in the query (such as select, project, join, etc.) can be evaluated using one of the several different algorithms, which are discussed later in this chapter. The query tree representation does not specify how to evaluate each operation in the query. To fully specify how to evaluate a query, each operation in the query tree is annotated with the instructions that specify the algorithm or the index to be used to evaluate that operation. The resultant tree structure is known as the query-evaluation plan or query-execution plan or simply plan. One possible query-evaluation plan for our example query tree is given in Figure 8.3.

In this figure, the select operator σ is annotated with the instruction Linear Scan, which specifies a complete scan of the relation.
Figure 8.3 Query-evaluation Plan

3.   Discuss the external sort-merge algorithm.

Ans: If a relation entirely fits in the main memory, in-memory sorting (called internal sorting) can be performed using any standard sorting techniques such as quick sort. If the relation does not fit entirely in the main memory, external sorting is required. In external sorting, some part of the relation is read in the main memory, sorted and written back to the disk. This process continues until the entire relation is sorted. The most commonly used external sorting algorithm is the external sort-merge algorithm. The external sort-merge algorithm is divided into two phases, namely sort phase and merge phase.
• Sort phase: In this phase, the records in the file to be sorted are divided into several groups, called runs, such that each run can fit in the available buffer space. An internal sort is applied to each run and the resulting sorted runs are written back to the disk as temporary sorted runs. The number of initial runs (ni) is computed as ⌈bR/M⌉, where bR is the number of blocks containing records of relation R and M is the size of the available buffer in blocks.

• Merge phase: In this phase, the sorted runs created during the sort phase are merged into larger runs of sorted records. The merge continues until all records are in one large run. The output of the merge phase is the sorted relation.
Figure 8.4 External Sort-merge Algorithm
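The two phases can be sketched in Python. This is a minimal in-memory illustration, in which M counts records per run rather than blocks of buffer space, and the disk subfiles are simply lists:

```python
import heapq

def external_sort_merge(records, M):
    # Sort phase: split the input into runs of at most M records
    # and sort each run internally.
    runs = [sorted(records[i:i + M]) for i in range(0, len(records), M)]
    # Merge phase: merge the sorted runs into one sorted output.
    # (A real implementation merges at most M-1 runs per pass,
    # writing intermediate runs back to disk.)
    return list(heapq.merge(*runs))
```

With M = 3, the input [5, 3, 8, 1, 9, 2, 7] forms the runs [3, 5, 8], [1, 2, 9] and [7], which merge into the fully sorted relation.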

4.   Define degree of merging. Explain with the help of an example.
Ans: In an external sort-merge algorithm, if the number of initial runs (ni) is greater than the size of the available buffer space in blocks (M), then it is not possible to perform the merge operation in one pass; rather, multiple passes are required. In each pass, M-1 buffer blocks are used to hold one block from each of the M-1 input runs, and one buffer block is kept for holding one block of the merge result. The number of runs that can be merged together in each pass is known as the degree of merging (dM). Thus, the value of dM is the smaller of M-1 and ni. If two runs are merged in each pass, it is known as two-way merging. In general, if N runs are merged in each pass, it is known as N-way merging. The number of passes required can be calculated as ⌈log_dM(ni)⌉.
Figure 8.5 shows two-way merging, where we have assumed that the main memory holds at most three page frames, of which two are used for input and one for output during the merge phase. In this figure, dM = 2 and ni = 4; therefore, ⌈log2(4)⌉ = 2 passes are required. In simple terms, with dM = 2, 4 initial sorted runs are merged into 2 at the end of the first pass, which are then merged into 1 sorted run at the end of the second pass.

Figure 8.5 Example of Two-way Merging
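The pass count above can be computed directly. The sketch below (plain Python, assuming M >= 3 so that dM >= 2) counts merge passes by repeatedly reducing the number of runs by a factor of dM, which is equivalent to ⌈log_dM(ni)⌉:

```python
import math

def merge_passes(ni, M):
    # Degree of merging: at most M-1 runs merged per pass.
    dM = min(M - 1, ni)
    passes, runs = 0, ni
    while runs > 1:
        runs = math.ceil(runs / dM)   # one merge pass
        passes += 1
    return passes
```

For the two-way merging example, merge_passes(4, 3) gives 2, matching ⌈log2(4)⌉.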

5.   Compare the linear search and binary search algorithms for select operation. Which one is better?

Ans: Algorithm 1 (Linear search algorithm): In linear search (also called brute force) algorithm, every file block is scanned and all the records are tested to see whether their attribute values satisfy the selection condition. It is the simplest algorithm and can be applied to any file, regardless of the ordering of the file or the availability of indexes. However, it may be slower than the other algorithms that are used for implementing selection. Since the linear search algorithm searches all the blocks of the file, it requires bR block transfers, where bR is the total number of blocks of relation R. If the selection condition involves a key attribute, the scan is terminated when the required record is found, without looking further at other records of the relation. The average cost in this case is equal to bR/2 block transfers if the record is found. In the worst case, the select operations involving the key attribute may also require the scanning of all the records (bR block transfers) if no record satisfies the condition or the record is at the last location.

Algorithm 2 (Binary search): If the file is ordered on an attribute and the selection condition involves an equality comparison on that attribute, binary search can be used to retrieve the desired tuples. Binary search is more efficient than linear search as it requires scanning fewer file blocks. For example, for the relational-algebra operation σP_ID=“P002” (PUBLISHER), the binary search algorithm can be applied if the PUBLISHER relation is ordered on the basis of the attribute P_ID. The cost of this algorithm is equal to ⌈log2(bR)⌉ + 1 block transfers. If the selection condition involves a non-key attribute, it is possible that more than one block contains the required records.
In general, the binary search algorithm is not well suited for searching in a database, since sorted files are not used unless they also have a corresponding primary index file.
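The two cost formulas can be compared with a small helper (a sketch; all costs are in block transfers, following the formulas above):

```python
import math

def linear_search_cost(bR, key_equality=False):
    # Full scan in general; on average half the blocks when the
    # condition is an equality on a key and the record is found.
    return bR / 2 if key_equality else bR

def binary_search_cost(bR):
    # Locate the block with binary search, plus one transfer
    # to retrieve the record's block.
    return math.ceil(math.log2(bR)) + 1
```

For a file of 1000 blocks, a linear scan costs 1000 transfers (500 on average for a key equality), while binary search needs only ⌈log2(1000)⌉ + 1 = 11.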

6.   What are file scans and index scans? Discuss the search algorithms for select operations that involve the use of indexes.

Ans: The search algorithms that scan the records of file to locate and retrieve the records satisfying the selection condition are known as file scans. The search algorithms that involve the use of indexes are known as index scans. An index is also referred to as access path as it provides a path through which the data can be found and retrieved. Though indexes provide a fast, direct and ordered access, they also add an extra overhead of accessing those blocks containing the index. The algorithms that use an index are described here.

• Algorithm 3 (use of primary index, equality on key attribute): If the selection condition involves an equality comparison on a key attribute with a primary index, the index can be used to retrieve the record satisfying the equality condition. For example, for the relational-algebra operation σP_ID=“P002” (PUBLISHER), the primary index on the key attribute P_ID (if available) can be used to directly retrieve the desired record. If a B+-tree is used, the cost of this algorithm is equal to the height of the tree plus one block transfer to retrieve the desired record.

• Algorithm 4 (use of clustering index, equality on non-key attribute): If the selection condition involves an equality comparison on a non-key attribute with a clustering index, the index can be used to retrieve all the records satisfying the selection condition. For example, for the relational-algebra expression σP_ID=“P002” (BOOK), the clustering index on the non-key attribute P_ID of the BOOK relation (if available) can be used to retrieve the desired records. The cost of this algorithm is equal to the height of the tree plus the number of blocks containing the desired records, say b.

• Algorithm 5 (use of secondary index, equality on key or non-key attribute): If the selection condition involves an equality condition on an attribute with a secondary index, the index can be used to retrieve the desired records. This search method can retrieve a single record if the indexing field is a key (candidate key) and multiple records if the indexing field is a non-key attribute. In the first case, only one record is retrieved; thus, the cost is equal to the height of the tree plus one block transfer to retrieve the desired record. In the second case, multiple records may be retrieved and each record may reside on a different disk block; thus, the cost is equal to the height of the tree plus n block transfers, where n is the number of records fetched. For example, for the relational-algebra expression σPname=“Hills Publications” (PUBLISHER), the secondary index on the candidate key Pname (if available) can be used to retrieve the desired record.

• Algorithm 6 (use of primary index, comparison): If the selection condition involves comparisons of the form A>v, A>=v, A<v or A<=v, where A is an attribute with a primary index and v is the attribute value, the primary ordered index can be used to retrieve the desired records. The cost of this algorithm is equal to the height of the tree plus bR/2 block transfers.
– For comparison conditions of the form A>=v, the record satisfying the corresponding equality condition A=v is first searched using the primary index. A file scan starting from that record up to the end of the file returns all the records satisfying the selection condition. For A>v, the file scan starts from the first tuple that satisfies the condition A>v; the record satisfying the condition A=v is not included.
– For the conditions of the form A<v, the index is not searched; rather, the file is simply scanned starting from the first record until the record satisfying the condition A=v is encountered. For A<=v, the file is scanned until the first record satisfying the condition A>v is encountered. In both cases, the index is not useful.

• Algorithm 7 (use of secondary index, comparison): The secondary ordered index can also be used for retrievals based on comparison conditions such as <, <=, > or >=. For the conditions of the form A<v or A<=v, the lowest-level index blocks are scanned from the smallest value up to v, and for the conditions of the form A>v or A>=v, the index blocks are scanned from v up to the end of the file. The secondary index provides the pointers to the records and not the actual records. The actual records are fetched using the pointers. Thus, fetching each record may require a disk access, since records may be on different blocks. This implies that using the secondary index may even be more expensive than using linear search if the number of fetched records is large.

7.   What are the various algorithms for implementing the select operations involving complex conditions?
Ans: If the select operation involves complex conditions, made up of several simple conditions connected with logical operators AND (conjunction) or OR (disjunction), some additional algorithms are used, which are discussed here.
• Algorithm 8 (conjunctive selection using one index): If an attribute involved in one of the simple conditions specified in the conjunctive condition has an access path (such as an index), any one of the selection algorithms 2 to 7 can be used to retrieve the records satisfying that condition. Each retrieved record is then checked to determine whether it satisfies the remaining simple conditions. For example, for the relational-algebra operation σP_ID=“P001” ^ Category=“Textbook” (BOOK), if a clustering index is available on the attribute P_ID (algorithm 4), that index can be used to retrieve the records with P_ID="P001". These records can then be checked for the other condition. The records that do not satisfy the condition Category="Textbook" are discarded.

• Algorithm 9 (conjunctive selection using composite index): If a select operation specifies an equality condition on two or more attributes, the composite index (if available) on these combined attributes can be used to directly retrieve the desired records. For example, consider a relational-algebra operation σR_ID=“A001” ^ ISBN=“002-678-980-4” (REVIEW). If an index is available on the composite key (ISBN, R_ID) of the REVIEW relation, it can be used to directly retrieve the desired records.

• Algorithm 10 (conjunctive selection by intersection of record pointers): If indexes or other access paths with record pointers (and not block pointers) are available on the attributes involved in the individual conditions, then each index can be used to retrieve the set of record pointers that satisfy an individual condition. The intersection of all the retrieved record pointers gives the pointers to the tuples that satisfy the conjunctive condition. For example, consider the relational-algebra expression σP_ID=“P001” ^ Category=“Textbook” (BOOK). If secondary indexes are available on the attributes P_ID and Category of the BOOK relation, then these indexes can be used to retrieve the pointers to the tuples satisfying the individual conditions. The intersection of these record pointers yields the record pointers of the desired tuples. These record pointers are then used to retrieve the actual tuples (see Figure 8.6).

• Algorithm 11 (disjunctive selection by union of record pointers): If indexes or other access paths are available on the attributes involved in the individual conditions, each index can be used to retrieve the set of record pointers that satisfy an individual condition. The union of all retrieved pointers gives the pointers to the tuples that satisfy the disjunctive condition. For example, consider the relational-algebra expression σP_ID=“P001” v Category=“Textbook” (BOOK). If secondary indexes are available on the attributes P_ID and Category of the BOOK relation, then these indexes can be used to retrieve the pointers to the tuples satisfying an individual condition. The union of these record pointers yields the pointers to the desired tuples. These record pointers are then used to retrieve the actual tuples (see Figure 8.6).

Figure 8.6 Example of Search Algorithms 10 and 11
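Algorithms 10 and 11 amount to set intersection and union over record-pointer (RID) sets. A minimal sketch, with each secondary index modeled as a dict mapping an attribute value to a set of RIDs (the index contents below are hypothetical):

```python
def conjunctive_select(idx_a, idx_b, va, vb):
    # Algorithm 10: intersect the RID sets retrieved from the two indexes.
    return idx_a.get(va, set()) & idx_b.get(vb, set())

def disjunctive_select(idx_a, idx_b, va, vb):
    # Algorithm 11: take the union of the RID sets.
    return idx_a.get(va, set()) | idx_b.get(vb, set())

# Hypothetical secondary indexes on P_ID and Category of BOOK:
idx_pid = {"P001": {1, 2, 5}}
idx_cat = {"Textbook": {2, 5, 7}}
```

Here conjunctive_select(idx_pid, idx_cat, "P001", "Textbook") yields the RIDs {2, 5}, while the disjunctive version yields {1, 2, 5, 7}; the actual tuples are then fetched through these pointers.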

8.   Discuss the nested-loop join and its several variations. Also analyze their respective costs in terms of block transfers.
Ans: Nested-loop join: The nested-loop join algorithm consists of a pair of nested for loops, one for each relation, say R and S. One of these relations, say R, is chosen as the outer relation and the other relation S as the inner relation. The outer relation is scanned row by row, and for each row in the outer relation the inner relation is scanned and searched for matching rows. In other words, for each tuple r in R, every tuple s in S is retrieved and then checked to see whether the two tuples satisfy the join condition r[A]=s[B]. If the join condition is satisfied, the values of these two tuples are concatenated and added to the result. The tuple constructed by concatenating the values of the tuples r and s is denoted by r.s. Since this algorithm scans the entire inner relation for each outer tuple, it is also called naïve nested-loop join.

For natural join, this algorithm can be extended to eliminate the repeated attributes using a project operation. The nested-loop algorithms for equijoin and natural join operations are given in Figure 8.7. This algorithm is similar to the linear search algorithm for the select operation in that it does not require any indexes and can be used with any join condition. Thus, it is also called the brute force algorithm.

An extra buffer block is required to hold the tuples resulted from the join operation. When this block is filled, the tuples are appended to the disk file containing the join result (known as result file). This buffer block can then be reused to hold additional result records.


Figure 8.7 Nested-loop Join
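A minimal Python sketch of the naïve nested-loop equijoin, with tuples modeled as dicts and the concatenation r.s as a merged dict (it assumes the non-join attribute names of R and S are distinct):

```python
def nested_loop_join(R, S, A, B):
    result = []
    for r in R:               # outer relation scanned row by row
        for s in S:           # inner relation rescanned per outer tuple
            if r[A] == s[B]:  # join condition r[A] = s[B]
                result.append({**r, **s})   # concatenated tuple r.s
    return result
```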

The two variations of nested-loop join algorithm are block nested-loop join and indexed nested-loop join algorithm, which are discussed here.
• Block nested-loop join: In this algorithm, the relations are processed on a per-block basis rather than on a per-tuple basis. Thus, each block in the inner relation S is read only once for each block in the outer relation R (instead of once for each tuple in R). The algorithm for the block nested-loop join operation is given in Figure 8.8.

Figure 8.8 Block Nested-loop Join
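The same join on a per-block basis can be sketched by modeling each relation as a list of blocks (lists of tuples), so that each inner block is paired with each outer block rather than with each outer tuple:

```python
def block_nested_loop_join(R_blocks, S_blocks, A, B):
    result = []
    for r_block in R_blocks:          # read one block of the outer relation
        for s_block in S_blocks:      # one full inner pass per outer block
            for r in r_block:
                for s in s_block:
                    if r[A] == s[B]:
                        result.append({**r, **s})
    return result
```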

• Indexed nested-loop join: If an index is available on the join attribute of one relation, say S, it is taken as the inner relation, and an index scan (instead of a file scan) is used to retrieve the matching tuples. In this algorithm, each tuple r in R is retrieved, and the index available on the join attribute of relation S is used to directly retrieve all the matching tuples. Indexed nested-loop join can be used with an existing as well as with a temporary index—an index created by the query optimizer and destroyed when the query is completed. If a temporary index is used to retrieve the matching records, the algorithm is called temporary indexed nested-loop join. If indexes are available on both the relations, the relation with fewer tuples can be used as the outer relation for better efficiency. The algorithm for the indexed nested-loop join operation is given in Figure 8.9.

Figure 8.9 Indexed Nested-loop Join
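A sketch of indexed nested-loop join in which the optimizer's temporary index is modeled as an in-memory dict built on the join attribute of the inner relation S:

```python
from collections import defaultdict

def indexed_nested_loop_join(R, S, A, B):
    # Build a temporary index on S.B (destroyed when the query ends).
    index = defaultdict(list)
    for s in S:
        index[s[B]].append(s)
    # Probe the index once per tuple of the outer relation R.
    return [{**r, **s} for r in R for s in index.get(r[A], [])]
```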

Cost analysis
To analyze the cost of these join algorithms, consider the following join operation on the PUBLISHER and BOOK relations of the Online Book database:

BOOK ⋈ PUBLISHER
Further, assume the following information about these two relations. This information will be used to estimate the cost of each algorithm.
number of tuples of PUBLISHER = nPUBLISHER = 2000
number of blocks of PUBLISHER = bPUBLISHER = 100
number of tuples of BOOK = nBOOK = 5000
number of blocks of BOOK = bBOOK = 500

• Nested-loop join: The nested-loop join algorithm is expensive, as it requires scanning every tuple in S for each tuple in R. The number of pairs of tuples that need to be scanned is nR*nS, where nR and nS denote the number of tuples in R and S, respectively. For example, if PUBLISHER is taken as the outer relation and BOOK as the inner relation, 2000*5000 = 10^7 pairs of tuples need to be scanned.
In terms of the number of block transfers, the cost of this algorithm is equal to bR + (nR*bS), where bR and bS denote the number of blocks of R and S, respectively. For example, if PUBLISHER is taken as the outer relation and BOOK as the inner relation, 100+(2000*500) = 1,000,100 block transfers are required, which is very high. On the other hand, if BOOK is taken as the outer relation and PUBLISHER as the inner relation, then 500+(5000*100) = 500,500 block transfers are required. Thus, by using the smaller relation as the inner relation, the overall cost of the join operation can be reduced.
If both or one of the relations fit entirely in the main memory, the cost of the join operation will be bR+bS block transfers. This is the best-case scenario. For example, if either of the relations PUBLISHER and BOOK (or both) fits entirely in the memory, 100+500=600 block transfers are required, which is optimal.

• Block nested-loop join: In the worst case, when neither of the relations fits entirely in the memory, this algorithm requires bR + (bR*bS) block transfers. It is efficient to use the smaller relation as the outer relation. For example, if PUBLISHER is taken as the outer relation and BOOK as the inner relation, 100+(100*500) = 50,100 block transfers are required (which is much less than the 1,000,100 or 500,500 block transfers of the nested-loop join). In the best case, if one of the relations fits entirely in the main memory, the algorithm requires bR + bS block transfers, which is the same as that of the nested-loop join. In this case, the smaller relation is chosen as the inner relation.

• Indexed nested-loop join: The cost of this algorithm can be computed as bR + nR*c, where c is the cost of traversing the index and retrieving all the matching tuples of S.
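The formulas above can be checked against the PUBLISHER/BOOK numbers with a few one-line helpers (a sketch; all costs in block transfers):

```python
def nested_loop_cost(bR, nR, bS):
    return bR + nR * bS          # outer blocks + one inner scan per outer tuple

def block_nested_loop_cost(bR, bS):
    return bR + bR * bS          # outer blocks + one inner scan per outer block

# PUBLISHER outer: 100 + 2000*500 = 1,000,100
# BOOK outer:      500 + 5000*100 =   500,500
# Block nested-loop, PUBLISHER outer: 100 + 100*500 = 50,100
```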

9.   If the tuples of the relations (on which the join operation is to be performed) are physically sorted by the value of their respective join attributes, which algorithm can be used to perform the join operation? Explain with the help of an example. Also discuss its cost in terms of block transfers.

Ans: If the tuples of two relations, say R and S, are physically sorted by the value of their respective join attributes, the sort-merge join algorithm can be used to compute equijoins and natural joins. Both relations are scanned concurrently in the order of the join attributes to match the tuples that have the same values for the join attributes. Since the relations are in sorted order, each tuple needs to be scanned only once and hence, each block is also read only once. Thus, if both relations are in sorted order, the sort-merge join requires only a single pass through both relations. The sort-merge join algorithm is given in Figure 8.10(a) and an example of the sort-merge join is given in Figure 8.10(b). In this figure, the relations R and S are sorted on the join attribute a1.

images

Figure 8.10 Implementing Sort-merge Join
The number of block transfers is equal to the sum of the number of blocks in both the relations R and S, that is, bR + bS. Thus, the join operation BOOK ⋈ PUBLISHER requires 100+500=600 block transfers. If the relations are not sorted, they can first be sorted using external sorting.
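The merge step can be sketched for relations already sorted on the join attribute `key`; groups of equal values on both sides are paired so that duplicate join values are handled:

```python
def sort_merge_join(R, S, key):
    result, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][key] < S[j][key]:
            i += 1
        elif R[i][key] > S[j][key]:
            j += 1
        else:
            v = R[i][key]
            gi, gj = i, j
            while gi < len(R) and R[gi][key] == v:   # group of v in R
                gi += 1
            while gj < len(S) and S[gj][key] == v:   # group of v in S
                gj += 1
            for r in R[i:gi]:                        # pair the two groups
                for s in S[j:gj]:
                    result.append({**r, **s})
            i, j = gi, gj
    return result
```

Since both cursors only move forward, each tuple (and hence each block) is read once, giving the bR + bS cost stated above.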

10. Explain the hash join algorithm. Analyze its cost in terms of block transfers. What is the additional optimization in hybrid hash join?
Ans: In a hash join algorithm, a hash function h is used to partition the tuples of each relation into sets of tuples, with each set containing tuples that have the same hash value on the join attributes A of relation R and B of relation S. The algorithm is divided into two phases, namely, the partitioning phase and the probing phase. In the partitioning (also called building) phase, the tuples of relations R and S are partitioned into hash file buckets using the same hash function h. Let Pr1, Pr2, …, Prn be the partitions of tuples in relation R and Ps1, Ps2, …, Psn be the partitions of tuples in relation S, where n is the number of partitions. During the partitioning phase, a single in-memory buffer of size one disk block is allocated for each partition to store the tuples that hash to that partition. Whenever the in-memory buffer gets filled, its contents are appended to a disk subfile that stores the partition. For a simple two-way join, the partitioning phase has two iterations. In the first iteration, the relation R is partitioned into n partitions, and in the second iteration, the relation S is partitioned into n partitions.

In the probing (also called matching or joining) phase, n iterations are required. In the ith iteration, the tuples in the partition Pri are joined with the tuples in the corresponding partition Psi.
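Both phases can be sketched in Python; the hash function, partition count n, and the dict-based in-memory table below are illustrative stand-ins for the disk-based buckets described above:

```python
def hash_join(R, S, A, B, n=4):
    h = lambda v: hash(v) % n     # same hash function h for both relations
    pr = [[] for _ in range(n)]   # partitions Pr1..Prn of R
    ps = [[] for _ in range(n)]   # partitions Ps1..Psn of S
    for r in R:                   # partitioning phase, first iteration
        pr[h(r[A])].append(r)
    for s in S:                   # partitioning phase, second iteration
        ps[h(s[B])].append(s)
    result = []
    for i in range(n):            # probing phase: n iterations
        table = {}                # in-memory table built on partition Pr_i
        for r in pr[i]:
            table.setdefault(r[A], []).append(r)
        for s in ps[i]:           # join Ps_i against Pr_i
            for r in table.get(s[B], []):
                result.append({**r, **s})
    return result
```

Because both relations are partitioned with the same h, matching tuples always land in corresponding partitions, so each pair of partitions can be joined independently.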
Cost analysis
The hash join algorithm requires 3(bR + bS) + 4n block transfers. The first term, 3(bR + bS), reflects that each block is transferred three times: once when the relations are read into the main memory for partitioning, a second time when the partitions are written back to the disk and a third time when each partition is read again during the joining phase. The second term, 4n, represents an extra overhead (2n for each relation) of reading and writing each partition to and from the disk.
Variation of the hash join technique

If the memory size is relatively large, a variation of hash join called hybrid hash join can be used for better performance. In hybrid hash join technique, the probing phase for one of the partitions is combined with the partitioning phase. The main objective of this technique is to join as many tuples during the partitioning phase as possible in order to save the cost of storing these tuples back to the disk and accessing them again during the joining (probing) phase.
To better understand the hybrid hash join algorithm, consider a hash function h(K)=K mod n that creates n partitions of the relation, where n<M, and consider the join operation BOOK ⋈ PUBLISHER. In the first iteration, the algorithm divides the buffer space among the n partitions such that all the blocks of the first partition of the smaller relation PUBLISHER completely reside in the main memory. A single input buffer (of size one disk block) is kept for each of the other partitions, and they are written back to the disk as in the regular hash join. Thus, at the end of the first iteration of the partitioning phase, the first partition of the relation PUBLISHER resides completely in the main memory and the other partitions reside in disk subfiles.

In the second iteration of the partitioning phase, the tuples of the second relation BOOK are partitioned. If a tuple hashes to the first partition, it is immediately joined with the matching tuples of the PUBLISHER relation. If the tuple does not hash to the first partition, it is partitioned normally. Thus, at the end of the second iteration, all the tuples of the BOOK relation that hash to the first partition have been joined with the tuples of the PUBLISHER relation. Now, there are n-1 pairs of partitions on the disk. Therefore, during the joining phase, n-1 iterations are needed instead of n.

11. What do you mean by skewed partitioning? How can it be handled?
Ans: The main difficulty of the hash join algorithm is to ensure that the partitioning hash function is uniform, that is, it creates partitions of almost the same size. If the partitioning hash function is non-uniform, then some partitions may have more tuples than the average, and they might not fit in the available main memory. This type of partitioning is known as skewed partitioning. This problem can be handled by further partitioning the partition into smaller partitions using a different hash function.

12. Describe recursive partitioning.

Ans: If the number of partitions (n) is less than the number of page frames (M) in the main memory, the hash join algorithm is very simple and efficient to execute, as both relations can be partitioned in one pass. However, if n is greater than or equal to M, the relations cannot be partitioned in a single pass; rather, partitioning has to be done in repeated passes. In the first pass, the input relation is partitioned into M-1 partitions (one page frame is kept to be used as an input buffer) using a hash function h. In the next pass, each partition created in the previous pass is partitioned again to create smaller partitions using a different hash function. This process is repeated until each partition of the input relation can fit entirely in the memory. This technique is called recursive partitioning.

13. What are the two techniques used to eliminate duplicates during project operation? Discuss the algorithm to implement project operation by eliminating duplicates.

Ans: A project operation of the form π<attribute_list>(R) is easy to implement if the <attribute_list> includes the key attribute of the relation R. However, if the <attribute_list> does not contain the key attribute, duplicate tuples may exist and must be eliminated. Duplicate tuples can be eliminated by using either sorting or hashing.

• Sorting: The result of the operation is first sorted, and then the duplicate tuples that appear adjacent to each other are removed. If the external sort-merge technique is used, the duplicate tuples can be found while creating runs, and they can be removed before writing the runs on the disk. The remaining duplicate tuples can be removed during merging. Thus, the final sorted run contains no duplicates.

• Hashing: As soon as a tuple is hashed and inserted into an in-memory hash file bucket, it is first checked against those tuples that are already present in the bucket. The tuple is inserted in the bucket if it is not already present; otherwise, it is discarded. All the tuples are processed in the same way.
The algorithm to implement project operation by eliminating duplicates using sorting is given in Figure 8.11.

Figure 8.11 Algorithm for Project Operation
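The hashing approach to duplicate elimination can be sketched with a Python set standing in for the in-memory hash file buckets (the figure's algorithm uses sorting; this sketch shows the hashing alternative):

```python
def project(R, attribute_list):
    seen = set()                 # in-memory hash table of emitted tuples
    result = []
    for t in R:
        key = tuple(t[a] for a in attribute_list)
        if key not in seen:      # insert only if not already in the bucket
            seen.add(key)
            result.append(dict(zip(attribute_list, key)))
    return result
```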

14. Discuss the algorithms for implementing union, intersection and set difference operations.

Ans: The set operations, namely, union, intersection and set difference, can be implemented by first sorting both relations on the same attributes and then scanning each sorted relation once to produce the result. For example, the operation R ∪ S can be implemented by concurrently scanning and merging both the sorted relations—whenever the same tuple exists in both relations, only one copy is kept in the merged result. The operation R ∩ S can be implemented by adding to the merged result only those tuples that exist in both relations. Finally, the operation R - S is implemented by retaining only those tuples that exist in R but are not present in S.
If the relations are initially sorted in the same order, all these operations require bR + bS block transfers. If the relations are not sorted initially, the cost of sorting the relations has to be included. The algorithms for the operations R ∪ S, R ∩ S and R - S using sorting are given in Figure 8.12.
Figure 8.12 Implementing Set Operations
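The three merge-based set operations can be sketched as a single concurrent scan over two sorted lists of tuples (assuming each relation is itself duplicate-free):

```python
def merge_union(R, S):
    # R ∪ S: keep one copy when the same tuple exists in both relations.
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i] < S[j]:
            out.append(R[i]); i += 1
        elif R[i] > S[j]:
            out.append(S[j]); j += 1
        else:
            out.append(R[i]); i += 1; j += 1
    return out + R[i:] + S[j:]

def merge_intersection(R, S):
    # R ∩ S: keep only tuples present in both relations.
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i] < S[j]:
            i += 1
        elif R[i] > S[j]:
            j += 1
        else:
            out.append(R[i]); i += 1; j += 1
    return out

def merge_difference(R, S):
    # R - S: keep tuples of R that do not appear in S.
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i] < S[j]:
            out.append(R[i]); i += 1
        elif R[i] > S[j]:
            j += 1
        else:
            i += 1; j += 1
    return out + R[i:]
```

Each cursor only moves forward, so every tuple of R and S is examined once, matching the bR + bS cost stated above.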

15. Discuss the two ways of evaluating a query. Which one is better and why? Does the available buffer space affect the query evaluation speed? If yes, how?

Ans: There are two ways of evaluating a query, namely, materialized evaluation and pipelined evaluation.
images Materialized evaluation: In this approach, each operation in the expression is evaluated one by one in an appropriate order, and the result of each operation is materialized (stored) in a temporary relation, which becomes the input for the subsequent operations. For example, consider the following relational-algebra expression:
images
This expression consists of three relational operations, join, select and project. The query tree of this expression is given in Figure 8.13.
In materialized evaluation, the lowest-level operations (operations at the bottom of operator tree) are evaluated first. The inputs to the lowest-level operations are the relations in the database. For example, in Figure 8.13, the operation σCategory=“Novel” (BOOK) is evaluated first and the result is stored in a temporary relation. The temporary relation and the relation PUBLISHER are then given as inputs to the next-level operation, that is, the join operation. The join operation then performs a join on these two relations and stores the result in another temporary relation, which is given as input to the project operation, which is at the root of the tree.
images
Figure 8.13 Operator Tree for the Expression
images Pipelined evaluation: In this type of evaluation, the operations form a pipeline, and results are passed from one operation to another as they are computed rather than being stored in temporary relations. For example, the operations given in Figure 8.13 can be placed in a pipeline. As soon as a tuple is generated by the select operation, it is immediately passed to the join operation for processing. Similarly, a tuple generated by the join operation is passed immediately to the project operation. Thus, evaluating these three operations in a pipelined manner directly generates the final result without creating temporary relations.
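The select–join–project pipeline of the running example can be sketched with Python generators; the relation contents and helper names here are made-up assumptions:

```python
# Each operator is a generator: tuples flow one at a time, no temporary relations.
BOOK = [
    {"Book_title": "C++", "Category": "Language Book", "P_ID": 1},
    {"Book_title": "Ulysses", "Category": "Novel", "P_ID": 2},
]
PUBLISHER = [{"P_ID": 1, "Pname": "Bright"}, {"P_ID": 2, "Pname": "Paramount"}]

def select(rows, pred):
    for r in rows:
        if pred(r):
            yield r                       # passed on as soon as it is produced

def join(rows, other, attr):
    lookup = {o[attr]: o for o in other}  # other side assumed small enough to index
    for r in rows:
        if r[attr] in lookup:
            yield {**r, **lookup[r[attr]]}

def project(rows, attrs):
    for r in rows:
        yield {a: r[a] for a in attrs}

pipeline = project(
    join(select(BOOK, lambda b: b["Category"] == "Novel"), PUBLISHER, "P_ID"),
    ["Book_title", "Pname"],
)
print(list(pipeline))   # [{'Book_title': 'Ulysses', 'Pname': 'Paramount'}]
```

No intermediate relation is ever materialized: each tuple travels through all three operators before the next tuple is read.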
It is clear from our discussion that pipelined evaluation is better, as it increases query-evaluation efficiency by reducing the number of temporary relations. The system maintains one buffer for each pair of adjacent operations to hold the tuples being passed from one operation to the next. Moreover, since the results of the operations are not stored for long, less memory is required in the pipelining approach. However, queries that involve aggregate or grouping functions cannot be evaluated using pipelined evaluation, because such operations need all of their input tuples before they can produce any output.
The available buffer space definitely affects the query evaluation speed. If a single buffer is used to hold the intermediate result, it forces the CPU to wait for an I/O operation to complete: while the filled buffer is being written back to the disk, the CPU has nothing to do. With double buffering, on the other hand, two buffers are used—one receives the output of the algorithm while the other is being written to the disk. This allows CPU activity to proceed in parallel with I/O activity, which in turn results in faster query evaluation.

16. Illustrate the difference between demand-driven and producer-driven pipelining.
Ans: There are two ways of executing pipelines, namely, demand-driven pipeline and producer-driven pipeline.
images Demand-driven pipeline: In this approach, the parent operation requests the next tuple from its child operations as required, in order to output its own next tuple. Whenever an operation receives a request for the next set of tuples, it computes those tuples and then returns them. Since the tuples are generated lazily, on demand, this technique is also known as lazy evaluation.
images Producer-driven pipeline: In this approach, each operation at the bottom of the pipeline produces tuples without waiting for a request from the next higher-level operation. Each operation puts its output tuples in the output buffer associated with it, and this output buffer is used as the input buffer by the operation at the next higher level. The operation at any other level consumes the tuples from its input buffer to produce output tuples, which it puts in its own output buffer. If an output buffer is full, the operation has to wait until tuples are consumed by the operation at the next higher level; as soon as the buffer has space for more tuples, the operation resumes producing them. This process continues until all the tuples are generated. Since the tuples are generated eagerly, this technique is also known as eager pipelining.
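The contrast between the two pipelining styles can be sketched briefly; the relation contents, the None end-of-stream marker and the buffer size are assumptions of this sketch:

```python
import queue, threading

# Demand-driven (pull): nothing is computed until the consumer asks for the
# next tuple with next(), hence "lazy evaluation".
def scan_lazy(relation):
    for t in relation:
        yield t

it = scan_lazy([10, 20, 30])
first = next(it)        # only one tuple has been produced so far

# Producer-driven (push): the producer runs eagerly and blocks only while
# its bounded output buffer is full, hence "eager pipelining".
def scan_eager(relation, out):
    for t in relation:
        out.put(t)      # blocks when the buffer is full
    out.put(None)       # end-of-stream marker (an assumption of this sketch)

buf = queue.Queue(maxsize=2)              # the operator's output buffer
producer = threading.Thread(target=scan_eager, args=([10, 20, 30], buf))
producer.start()
consumed = []
while (t := buf.get()) is not None:       # next-level operator consumes tuples
    consumed.append(t)
producer.join()
# consumed == [10, 20, 30]
```

The bounded `queue.Queue` plays the role of the inter-operator buffer described above: `put` blocking on a full queue is exactly the producer waiting for the higher-level operation to consume tuples.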

17. What are the two techniques of implementing query optimization? How would you estimate the cost of the query?
Ans: There are two main techniques of implementing query optimization, namely, cost-based optimization and heuristics-based optimization.
images Cost-based optimization: A cost-based query optimizer generates a number of query-evaluation plans from a given query using a number of equivalence rules, and then chooses the one with the minimum cost. The cost of executing a query includes:
images cost of accessing secondary storage—cost of searching, reading and writing disk blocks.
images storage cost—cost of storing intermediate results.
images computation cost—cost of performing in-memory operations.
images memory usage cost—cost related to the number of occupied memory buffers.
images communication cost—cost of communicating the query result from one site to another.
images Heuristic query optimization: In cost-based optimization, generating a large number of query-evaluation plans for a given query and finding the cost of each plan can be very time-consuming and expensive. Thus, finding the optimal plan from such a large number of plans requires a lot of computational effort. To reduce this cost and computational effort, optimizers apply some heuristics (rules) to find an optimized query-evaluation plan. These rules are known as heuristics because they usually (but not always) help in reducing the cost of executing a query. Heuristic query optimizers simply apply these rules without finding out whether the cost is actually reduced by each transformation.
The cost of each query can be estimated by collecting statistical information about the relations, such as relation sizes and available primary and secondary indexes, from the DBMS catalog. Some of the statistics about database relations stored in the DBMS catalog include:
images the number of tuples in relation R, denoted by nR.
images the number of blocks containing the tuples of relation R, denoted by bR.
images the blocking factor of relation R, denoted by fR, which is the number of tuples that fit in one block.
images the size of each tuple of relation R in bytes, denoted by SR.
images the number of distinct values appearing in relation R for the attribute A, denoted by V (A, R). This value is the same as the size of π A (R). If A is a key attribute, then V (A, R) = nR.
images height of index i, that is, number of levels in i, denoted by Hi.
Assuming that the tuples of relation R are stored together physically in a file, then
bR = ⌈nR / fR⌉
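The relationship between nR, fR and bR stated here can be checked numerically; the figures below are made up for illustration:

```python
import math

def blocks_needed(n_r, f_r):
    """b_R = ceil(n_R / f_R): the number of blocks holding relation R when
    its tuples are stored together physically in a file."""
    return math.ceil(n_r / f_r)

# e.g. 10,000 tuples at 40 tuples per block -> 250 blocks
print(blocks_needed(10_000, 40))   # 250
print(blocks_needed(10_001, 40))   # 251 (one extra, partly filled block)
```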
In addition to relation sizes and index heights, real-world query optimizers also maintain other statistical information to increase the accuracy of cost estimates. For example, most databases maintain histograms representing the distribution of values for each attribute. The values of the attribute are divided into a number of ranges, and each range in the histogram is associated with the number of tuples whose attribute value falls within it. For example, the histograms for the attributes Price and Page_count of the BOOK relation are shown in Figure 8.14.

A histogram takes very little space, so histograms on various attributes can be maintained in the system catalog. Histograms are of two types, namely, equi-width histograms and equi-depth histograms. In an equi-width histogram, the range of values is divided into equal-sized sub-ranges. In an equi-depth histogram, on the other hand, the sub-ranges are chosen in such a way that the number of tuples within each sub-range is the same.
images

Figure 8.14 Histograms on Price and Page_count
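The two histogram styles can be sketched as follows; the Price values and function names are made-up assumptions standing in for Figure 8.14:

```python
# Hypothetical Price values for the BOOK relation of the running example.
prices = [12, 15, 18, 22, 25, 27, 30, 34, 35, 38, 41, 55]

def equi_width(values, lo, hi, nbuckets):
    """Equal-sized sub-ranges; the tuple count per bucket may differ."""
    width = (hi - lo) / nbuckets
    counts = [0] * nbuckets
    for v in values:
        counts[min(int((v - lo) / width), nbuckets - 1)] += 1
    return counts

def equi_depth(values, nbuckets):
    """Sub-range boundaries chosen so each bucket holds ~equal tuples."""
    vals = sorted(values)
    per = len(vals) // nbuckets
    return [vals[i * per] for i in range(1, nbuckets)]  # boundary values

print(equi_width(prices, 10, 60, 5))   # [3, 3, 4, 1, 1]
print(equi_depth(prices, 4))           # [22, 30, 38]
```

Note how the equi-width buckets have uneven counts, while the equi-depth boundaries shift so that every bucket covers three tuples.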
18. When are two relational-algebra expressions said to be equivalent? Explain with the help of an example.
Ans: An expression is said to be equivalent to a given relational-algebra expression if, on every legal database instance, it generates the same set of tuples as that of original expression irrespective of the ordering of the tuples. For example, consider a relational-algebra expression for the query "Retrieve the publisher name and category of the book C++".
images
In this expression, first the join operation between the BOOK and PUBLISHER relation is performed, and then the tuples with book title C++ are retrieved from the resultant relation, and finally, the project operation is applied to retrieve the attributes Pname and Category. When this sequence of operations is followed, it produces a large intermediate relation, which results from the join operation. However, the user is only interested in those tuples having Book_title="C++". Thus, another way to execute this query is to first select only those tuples from the BOOK relation satisfying the condition Book_title="C++", and then join it with the PUBLISHER relation. This reduces the size of the intermediate relation. The query can now be represented by the following expression:
images
This expression is equivalent to the original expression; however, it produces fewer tuples in the intermediate relation and is thus more efficient.
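The effect of the two orderings can be illustrated with a toy example; the relation contents and helper names are made up for this sketch:

```python
BOOK = [
    {"Book_title": "C++", "P_ID": 1},
    {"Book_title": "DBMS", "P_ID": 2},
    {"Book_title": "Java", "P_ID": 1},
]
PUBLISHER = [{"P_ID": 1, "Pname": "Bright"}, {"P_ID": 2, "Pname": "Paramount"}]

def nl_join(r, s, attr):
    """Naive nested-loop join on a common attribute."""
    return [{**a, **b} for a in r for b in s if a[attr] == b[attr]]

# Plan 1: join first, then select -- the intermediate join result is large.
plan1 = [t for t in nl_join(BOOK, PUBLISHER, "P_ID") if t["Book_title"] == "C++"]

# Plan 2: select first, then join -- the intermediate relation has one tuple.
plan2 = nl_join([b for b in BOOK if b["Book_title"] == "C++"], PUBLISHER, "P_ID")

assert plan1 == plan2   # equivalent expressions, different evaluation costs
```

Both plans return the same tuples, which is exactly what equivalence means; only the size of the intermediate relation differs.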

19. What is an equivalence rule? Discuss all equivalence rules.
Ans: An equivalence rule states that expressions of two forms are equivalent if an expression of one form can be replaced by an expression of the other form, and vice versa. While discussing the equivalence rules, it is assumed that c, c1, c2, …, cn are the predicates, A, A1, A2, …, An are the sets of attributes and R, R1, R2, …, Rn are the relation names.
Rule 1: The selection condition involving the conjunction of two or more predicates can be deconstructed into a sequence of individual select operations. That is,
images
This transformation is called cascading of select operator.
Rule 2: Select operations are commutative. That is,
images
Rule 3: If a query contains a sequence of project operations, only the final operation is needed, the others can be omitted. That is,
images
This transformation is called cascading of project operator.
Rule 4: If the selection condition c involves only the attributes A1, A2, …, An that are present in the projection list, the two operations can be commuted. That is,
images
Rule 5: Selections can be combined with Cartesian products and joins. That is,
images
Rule 6: Cartesian product and join operations are commutative. That is,
images
Rule 7: Cartesian product and join operations are associative. That is,
images
Rule 8: The select operation distributes over the join operation in these two conditions:
a. If all the attributes in the selection condition c1 involve only the attributes of one of the relations (say, R1), then the select operation distributes. That is,
images
b. If the selection condition c1 involves only the attributes of R1 and c2 involves the attributes of R2, then the select operation distributes. That is,
images
Rule 9: The project operation distributes over the join operation in these two conditions:
a. If the join condition c involves only the attributes in A1 U A2, where A1 and A2 are the attributes of R1 and R2 respectively, the project operation distributes. That is,
images
b. If the attributes A3 and A4 of R1 and R2, respectively, are involved in the join condition c, but not in A1 images A2, then,
images
Rule 10: The set operations union and intersection are commutative and associative. That is,
a. Commutative
images
b. Associative
images
Rule 11: The select operation distributes over the union, intersection and set difference operations. That is,
images
Rule 12: The project operation distributes over the union operation. That is,
images
20. Explain with the help of an example the importance of optimal join ordering in query optimization.
Ans: An optimal join ordering of join operations results in reducing the size of intermediate results. It is the responsibility of the query optimizer to choose an optimal join order with the minimal cost. Consider the query "Retrieve the publisher name, rating and the name of the authors who have rated the C++ book." The corresponding relational-algebra expression for this query is
images
In this query, first the selection is performed on the BOOK relation. This will generate a set of tuples with the book title C++. If the operation PUBLISHER ⋈ REVIEW is performed first, it results in the Cartesian product of these two relations, as no common attributes exist between PUBLISHER and REVIEW, which produces a large intermediate relation. As a result, this strategy is rejected. Similarly, the operation PUBLISHER ⋈ AUTHOR is also rejected for the same reason. If the operation REVIEW ⋈ AUTHOR is performed, it also results in a large intermediate relation, but the user is only interested in the authors who have reviewed the C++ book. Thus, this order of evaluating the join operations is also rejected.
However, if the result obtained after the select operation on BOOK relation is joined with the PUBLISHER, which in turn joined with REVIEW relation, and finally with the AUTHOR relation, the number of tuples is reduced. Thus, the equivalent final expression is
images

21. Discuss the steps followed by heuristic query optimizer to transform an initial query tree into an optimal query tree. Give a suitable example in support of your answer.
Ans: The heuristic query optimizer follows these steps to transform initial query tree into an optimal query tree.
images Conjunctive selection operations are transformed into cascading selection operations (equivalence rule 1—refer Question 19).

images Since the select operation is commutative with other operations according to the equivalence rules 2, 4, 8, 11, it is moved as far down the query tree as permitted by the attributes involved in the selection condition. That is, select operation is performed as early as possible to reduce the number of tuples returned.
images The select and join operations that are more restrictive (that result in relations with the fewest tuples) are executed first. That is, the select and join operations are rearranged so that they are accomplished with the least amount of system overhead. This is done by reordering the leaf nodes of the tree among themselves by applying rules 6, 7 and 10 (commutativity and associativity of binary operations).

images Any Cartesian product operation followed by a select operation is combined into a single join operation according to the equivalence rule 5(a).
images The cascaded project operation is broken down and the attributes in the projection list are moved down the query tree as far as possible. New project operations can also be created as required. Only the attributes that are required in the query result and in subsequent operations should be kept after each project operation. This eliminates the return of unnecessary attributes (rules 3, 4, 9 and 12).

images The subtrees that represent groups of operations that can be executed by a single algorithm are identified.
For example, consider the following relational-algebra expression:
images
The initial query tree of this expression is given in Figure 8.15(a). By applying the select operation on the BOOK relation early, the number of tuples can be reduced. Similarly, by applying the project operation to extract the desired attributes P_ID, Pname and Address from the PUBLISHER relation, and P_ID from the result of the select operation on the BOOK relation (P_ID is the join attribute), the size of the intermediate result can be reduced further. The final expression now becomes
images
The transformed query tree of the final expression after applying the select and project operations earlier is given in Figure 8.15(b).
images
Figure 8.15 Transformation from Initial Query Tree to Final Query Tree

22. Consider a file with 10,000 pages and three available buffer pages. Assume that the external sort-merge algorithm is used to sort the file. Calculate the initial number of runs produced in the first pass. How many passes will it take to sort the file completely? How many buffer pages are required to sort the file completely in two passes?

Ans: In the first pass (Pass 0), the number of initial runs (ni) can be computed as ⌈bR/M⌉, that is, ⌈10000/3⌉ = 3334 initial sorted runs, each containing 3 pages except the last, which will have only one page.
The number of passes required to sort the file completely, including the initial sorting pass, can be calculated as ⌈logdM(ni)⌉ + 1, where the merge degree dM is M − 1. Here, dM is 2 and ni is 3334. Therefore, ⌈log2(3334)⌉ + 1 = 12 + 1 = 13 passes are required.
In Pass 0, ⌈bR/M⌉ runs are produced. To sort the file in two passes, we must be able to merge all of these runs in Pass 1, that is, M − 1 ≥ ⌈bR/M⌉. This implies that M must be large enough to satisfy M × (M − 1) ≥ bR. Thus, with bR = 10,000, M = 101, so we need 101 buffer pages.
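The three calculations above can be reproduced directly; variable names here are this sketch's own:

```python
import math

b_r, M = 10_000, 3                              # file pages, buffer pages

runs = math.ceil(b_r / M)                       # initial sorted runs in Pass 0
passes = math.ceil(math.log(runs, M - 1)) + 1   # merge fan-in is M - 1

print(runs)    # 3334
print(passes)  # 13

# A two-pass sort needs M - 1 >= ceil(b_r / M), i.e. M * (M - 1) >= b_r.
M2 = next(m for m in range(2, b_r) if m * (m - 1) >= b_r)
print(M2)      # 101
```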

23. Consider three relations R (A, B, C), S (C, D, E) and T (E, F), where C and E are the primary keys of S and T, respectively. If R has 500 tuples, S has 1000 tuples, and T has 900 tuples, compute the size of R ⋈ S ⋈ T, and give an efficient way of computing the join.
Ans: Since the join operation is commutative and associative, the relations R, S and T can be joined in any order. Therefore, we will first perform the join of R and S on the common attribute C, and then join the result with T on the common attribute E. Thus, we will compute the size based on the strategy ((R ⋈ S) ⋈ T).
The join of R and S will yield a relation of at most 500 tuples, because the number of tuples returned by R ⋈ S is at most the number of tuples in R when the join attribute C is the primary key of S and a foreign key in R.
Similarly, the join of T with the result of R ⋈ S will yield a relation of at most 500 tuples, because the common attribute E is the primary key of T and a foreign key in the result obtained from R ⋈ S. Thus, the operation (R ⋈ S ⋈ T) results in at most 500 tuples.
An efficient way of computing (R ⋈ S ⋈ T) is to create indexes on attribute C of S and attribute E of T. For each tuple of R, the index on C can be used to directly retrieve the matching tuple of S that satisfies the join condition; similarly, for each tuple of R ⋈ S, the index on E can be used to directly retrieve the matching tuple of T.

24. How many different join orders are there for a query that joins 5 relations?
Ans: In general, for n relations, there are (2(n − 1))!/(n − 1)! different join orderings. Therefore, for n = 5, the number of join orderings is 8!/4! = 1680.
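The formula can be evaluated directly; the helper name below is this sketch's own:

```python
from math import factorial

def join_orders(n):
    """Number of different join orderings of n relations: (2(n-1))! / (n-1)!"""
    return factorial(2 * (n - 1)) // factorial(n - 1)

print(join_orders(5))   # 1680
print(join_orders(4))   # 120 (cf. multiple-choice question 12 below)
```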
25. Consider the following relational-algebra queries on the Online Book database. Draw the initial query trees for each of these queries.
images
Ans:
images
Ans:
images
Ans:
images

26. Consider the following SQL queries. Which of these queries is faster to execute and why? Draw the query tree also.

(a) SELECT Category, COUNT (*) FROM BOOK
    WHERE Category != ‘Novel’
    GROUP BY Category;
 
Ans: Query tree for this SQL query is given below:
images
 
(b) SELECT Category, COUNT (*) FROM BOOK
    GROUP BY Category
    HAVING Category != ‘Novel’;
 
Ans: Query tree for this SQL query is given below:
images

Query (a) is faster to execute than query (b), because the WHERE clause in query (a) filters out the tuples of the ‘Novel’ category before grouping, whereas the HAVING clause in query (b) filters the groups only after all the tuples have been fetched and grouped. Therefore, query (b) will take more time to execute.

27. Consider the following SQL queries on Online Book database and transform these SQL queries into relational algebra expressions. Draw the initial query trees for each of these expressions, and then derive their optimized query trees after applying heuristics on them.

(a) SELECT P.P_ID, Pname, Address, Phone
    FROM BOOK B, PUBLISHER P
    WHERE P.P_ID = B.P_ID AND Category = ‘Language Book’;
images
Initial query tree of this expression is given below:
images
The optimized query tree is given below:
images
(b) SELECT ISBN, Book_title, Category, Price
    FROM BOOK B, PUBLISHER P
    WHERE P.P_ID = B.P_ID AND Pname=‘Bright Publications’
    AND Price>30;
images
The initial query tree of this expression is given below:
images
The optimized query tree is given below:
images
(c) SELECT ISBN, Book_title, Year, Page_count, Price
    FROM BOOK B, AUTHOR A, AUTHOR_BOOK AB
    WHERE B.ISBN = AB.ISBN AND AB.A_ID = A.A_ID
    AND Aname = ‘Charles Smith’;
images
The initial query tree of this expression is given below:
images
The optimized query tree is given below:
images
(d) SELECT Book_title, Pname, Aname
    FROM BOOK B, PUBLISHER P, AUTHOR A, AUTHOR_BOOK AB
    WHERE P.P_ID = B.P_ID AND AB.A_ID=A.A_ID
    AND B.ISBN=AB.ISBN AND Category= ‘Language book’;
images
The initial query tree of this expression is given below:
images
The optimized query tree is given below:
images
images
Multiple-choice Questions

1.   A tree data structure in which the input relations are represented as the leaf nodes and the relational algebra operations as the internal nodes is known as _________.
(a) Query tree
(b) Left-deep join tree
(c) Right deep join tree
(d) Operator tree

2.   Which of the following DBMS component is responsible for generating a least costly plan?
(a) Parser
(b) Translator
(c) Query optimizer
(d) Query-evaluation engine

3.   Which of these is a phase in external sort-merge algorithm?
(a) Merge phase
(b) Sort phase
(c) Both (a) and (b)
(d) None of these

4.   If more than two relations are involved in a join, it is termed as_______.
(a) Two-way join
(b) Multi-way join
(c) Nested-loop join
(d) Naïve nested-loop join

5.   If there are two relations R and S, with n and m number of tuples, respectively, then nested-loop join requires how many pairs of tuples to be scanned?
(a) n + m
(b) n * m
(c) m
(d) n

6.   If there are two relations R and S containing 700 and 900 tuples respectively, then which relation should be taken as outer relation in block nested-loop join algorithm?
(a) S
(b) R
(c) Any of these
(d) None of these

7.   An index created by the query optimizer and destroyed when the query is completed is known as________.
(a) Secondary index
(b) Optimized index
(c) Temporary index
(d) Permanent index

8.   Which of the following algorithms combines sort-merge join with indexes?
(a) Hybrid merge-join
(b) Nested merge-join
(c) External merge-join
(d) Indexed merge-join

9.   Which of these evaluations is also known as lazy evaluation?
(a) Materialized evaluation
(b) Demand-driven pipeline
(c) Producer-driven pipeline
(d) Both (a) and (b)

10. Which of the following operations is not commutative?
(a) Select
(b) Union
(c) Intersection
(d) Set difference

11. Which of these equivalence rules is known as cascading of project operator?
(a) πA1, A2,…,An (σc (R)) = σc (πA1, A2,…,An (R))
(b) σc1 (σc2 (R)) = σc2 (σc1 (R))
(c) πA1 (πA2 (… (πAn (R)) …)) = πA1 (R)
(d) σc1 ∧ c2 (R) = σc1 (σc2 (R))

12. Given four relations P, Q, R and S, in how many ways can these relations be joined?
(a) 240
(b) 4
(c) 120
(d) 16

Answers
  (1) (a)
  (2) (c)
  (3) (c)
  (4) (b)
  (5) (b)
  (6) (b)
  (7) (c)
  (8) (a)
  (9) (b)
(10) (d)
(11) (c)
(12) (c)
