Query Processing and Optimization
1. What do you mean by query processing? What are the various steps involved in query processing? Explain with the help of a block diagram.
Ans: Query processing includes translation of
high-level queries into low-level expressions that can be used at the
physical level of the file system, query optimization and actual
execution of the query to get the result. It is a three-step process
that consists of parsing and translation, optimization and execution of
the query submitted by the user (see Figure 8.1). These steps are discussed below:
Figure 8.1 Query-processing Steps
1. Parsing and translation: Whenever a user
submits a query in high-level language (such as SQL) for execution, it
is first translated into its internal representation suitable to the
system. The internal representation of the query is based on the
extended relational algebra. Thus, an SQL query is first translated into
an equivalent extended relational algebra expression. During
translation, the parser checks the syntax of the user's query according
to the rules of the query language. It also verifies that all the
attributes and relation names specified in the query are the valid names
in the schema of the database being queried.
2. Optimization: Generally, there are several
possible ways of executing a query (known as execution strategies), and
different execution strategies can have different costs. It is the
responsibility of the query optimizer, a component of the DBMS, to
choose the least costly execution strategy. This process of choosing a
suitable execution strategy for processing a query is known as query
optimization. The metadata stored in special tables, collectively called the DBMS catalog, is used to find the best way of evaluating a query.
3. Execution: The chosen execution strategy is finally submitted to the query-evaluation engine for actual execution of the query to get the desired result.
2. What is a query tree? How is it different from a query evaluation plan? Explain with the help of an example.
Ans: The relational-algebra expressions are typically represented as query trees or parse trees. A query tree (also known as operator tree or expression tree)
is a tree data structure in which the input relations are represented
as the leaf nodes and the relational-algebra operations as the internal
nodes. When the query tree is executed, an internal node operation is
executed whenever its operands are available; the internal node is then
replaced by the relation that results from the execution of the
operation. The execution terminates when the root node is executed and
the result of the query is produced. For example, consider
the following relational-algebra expression:
The query tree representation of this expression is shown in Figure 8.2.

Figure 8.2 Query Tree
Each operation in the query (such as select, project,
join, etc.) can be evaluated using one of the several different
algorithms, which are discussed later in this chapter. The query tree
representation does not specify how to evaluate each operation in the
query. To fully specify how to evaluate a query, each operation in the
query tree is annotated with the instructions that specify the algorithm
or the index to be used to evaluate that operation. The resultant tree structure is known as the query-evaluation plan or query-execution plan or simply plan. One possible query-evaluation plan for our example query tree is given in Figure 8.3.
In this figure, the select operator σ is annotated with the instruction
Linear Scan, which specifies a complete scan of the relation.
Figure 8.3 Query-evaluation Plan
3. Discuss the external sort-merge algorithm.
Ans: If a relation entirely fits in the main memory, in-memory sorting (called internal sorting)
can be performed using any standard sorting techniques such as quick
sort. If the relation does not fit entirely in the main memory, external sorting
is required. In external sorting, some part of the relation is read in
the main memory, sorted and written back to the disk. This process
continues until the entire relation is sorted. The most commonly used
external sorting algorithm is the external sort-merge algorithm. The
external sort-merge algorithm is divided into two phases, namely sort phase and merge phase.
In the sort phase, portions of the relation that fit in the available buffer space are read into the main memory, sorted using an internal sorting technique, and written back to the disk as temporary sorted subfiles called runs. The number of initial runs (ni) is computed as ⌈bR/M⌉, where bR is the number of blocks containing records of relation R and M is the size of the available buffer in blocks. In the merge phase, these runs are merged in one or more passes to produce the single sorted relation.
Figure 8.4 External Sort-merge Algorithm
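To make the two phases concrete, here is a minimal Python sketch (an illustration assumed for this answer, not the book's own code); it simulates disk blocks with plain lists and assumes M >= 3 so that at least two runs can be merged per pass:

import heapq

def external_sort_merge(blocks, M):
    # Sort phase: read M blocks at a time, sort them in memory, and
    # write each sorted chunk back as one run (ceil(bR/M) initial runs).
    runs = []
    for i in range(0, len(blocks), M):
        chunk = [rec for blk in blocks[i:i + M] for rec in blk]
        runs.append(sorted(chunk))
    # Merge phase: with M buffer blocks, dM = M - 1 runs are merged in
    # each pass (one buffer block is reserved for the merge output).
    dM = M - 1
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + dM]))
                for i in range(0, len(runs), dM)]
    return runs[0] if runs else []

# Example: a relation of 4 blocks, 2 records each, with M = 3 buffers.
relation = [[27, 3], [8, 15], [20, 1], [9, 12]]
print(external_sort_merge(relation, 3))   # [1, 3, 8, 9, 12, 15, 20, 27]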
4. Define degree of merging. Explain with the help of an example.
Ans: In an external sort-merge algorithm, if the number of initial runs (ni) is greater than the size of the available buffer space in blocks (M), then it is not possible to perform the merge operation in one pass; rather, multiple passes are required. In each pass, M-1 buffer blocks are used to hold one block from each of the M-1
input runs, and one buffer block is kept for holding one block of the
merge result. The number of runs that can be merged together in each
pass is known as the degree of merging (dM). Thus, the value of dM is the smaller of M-1 and ni. If two runs are merged in each pass, it is known as two-way merging. In general, if N runs are merged in each pass, it is known as N-way merging. The number of merge passes required can be calculated as ⌈logdM (ni)⌉.
Figure 8.5 shows two-way merging where we have assumed
that the main memory holds at most three page frames out of which two
are used as input and one for output during the merge phase. In this
figure, dM = 2 and ni = 4; therefore, ⌈log2 (4)⌉ = 2 merge passes are required. In simple terms, the 4 initial sorted runs are merged into 2 runs at the end of the first pass, which are then merged into 1 sorted run at the end of the second pass.
Figure 8.5 Example of Two-way Merging
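The pass count can be checked with a small calculation (a sketch of the formulas above; the function name is ours, and M >= 3 is assumed so that dM >= 2):

import math

def merge_passes(b_R, M):
    # ni = ceil(bR/M) initial runs; degree of merging dM = min(M-1, ni).
    n_i = math.ceil(b_R / M)
    d_M = min(M - 1, n_i)
    return math.ceil(math.log(n_i, d_M)) if n_i > 1 else 0

# The Figure 8.5 scenario: M = 3 buffers and 4 initial runs (e.g., bR = 12).
print(merge_passes(12, 3))   # 2 merge passes (two-way merging)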
5. Compare the linear search and binary search algorithms for select operation. Which one is better?
Ans: Algorithm 1 (Linear search algorithm): In linear search (also called brute force)
algorithm, every file block is scanned and all the records are tested
to see whether their attribute values satisfy the selection condition.
It is the simplest algorithm and can be applied to any file, regardless
of the ordering of the file or the availability of indexes. However, it
may be slower than the other algorithms that are used for implementing
selection. Since the linear search algorithm searches all the blocks of
the file, it requires
bR block transfers, where bR is the total number of blocks of relation R.
If the selection condition involves a key attribute, the scan is
terminated when the required record is found, without looking further at
other records of the relation. The average cost in this case is equal
to bR/2 block transfers if the record is found. In the worst case, the select operations involving the key attribute may also require the scanning of all the records (bR block transfers) if no record satisfies the condition or the record is at the last location.
Algorithm 2 (Binary search): If the file is
ordered on an attribute and the selection condition involves an equality
comparison on that attribute, binary search can be used to retrieve the
desired tuples. Binary search is more efficient than the linear search
as it requires scanning of lesser number of file blocks. For example,
for the relational-algebra operation σP_ID=“P002” (PUBLISHER), the binary search algorithm can be applied if the PUBLISHER relation is ordered on the basis of the attribute P_ID. The cost of this algorithm is equal to ⌈log2 (bR)⌉ block transfers. If the selection condition involves a non-key attribute, it is possible that more than one block contains the required records.
In general, the binary search algorithm is not well suited for searching in databases, since sorted files are rarely used unless they also have a corresponding primary index file.
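The following sketch (our illustration, with a made-up PUBLISHER file) shows binary search at the block level; it handles an equality condition whose matching records fall within a single block:

def binary_search_select(blocks, key):
    # blocks: list of disk blocks, each a sorted list of (key, record)
    # pairs; needs about ceil(log2(bR)) block transfers versus bR for a
    # linear scan.
    lo, hi = 0, len(blocks) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]                  # one block transfer
        if key < block[0][0]:
            hi = mid - 1
        elif key > block[-1][0]:
            lo = mid + 1
        else:
            return [rec for k, rec in block if k == key]
    return []

publisher = [[("P001", "Bright"), ("P002", "Hills")],
             [("P003", "Paramount"), ("P004", "Wesley")]]
print(binary_search_select(publisher, "P002"))   # ['Hills']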
6. What are file scans and index scans? Discuss the search algorithms for select operations that involve the use of indexes.
Ans: The search algorithms that scan the
records of file to locate and retrieve the records satisfying the
selection condition are known as file scans. The search algorithms that involve the use of indexes are known as index scans. An index is also referred to as access path
as it provides a path through which the data can be found and
retrieved. Though indexes provide a fast, direct and ordered access,
they also add an extra overhead of accessing those blocks containing the
index. The algorithms that use an index are described here.
Algorithm 3 (Primary index, equality on a key attribute): For the operation σP_ID=“P002” (PUBLISHER), the primary index on the key attribute P_ID (if available) can be used to directly retrieve the desired record. If a B+-tree is used, the cost of this algorithm is equal to the height of the tree plus one block transfer to retrieve the desired record.
Algorithm 4 (Clustering index, equality on a non-key attribute): For the operation σP_ID=“P002” (BOOK), the clustering index on the non-key attribute P_ID of the BOOK relation (if available) can be used to retrieve the desired records. The cost of this algorithm is equal to the height of the tree plus the number of blocks containing the desired records, say b.
Algorithm 5 (Secondary index, equality): The cost of this algorithm is equal to the height of the tree plus n block transfers, where n is the number of records fetched. For example, for the relational-algebra expression σPname=“Hills Publications” (PUBLISHER), the secondary index on the candidate key Pname (if available) can be used to retrieve the desired record.
Algorithm 6 (Primary index, comparison): For conditions of the form A>v, A>=v, A<v or A<=v, where A is an attribute with a primary index and v is the attribute value, the primary ordered index can be used to retrieve the desired records. The cost of this algorithm is equal to the height of the tree plus bR/2 block transfers on average. For A>=v, the record satisfying the corresponding equality condition A=v is first searched using the primary index; a file scan starting from that record up to the end of the file returns all the records satisfying the selection condition. For A>v, the file scan starts from the first tuple that satisfies the condition A>v (the record satisfying the condition A=v is not included). For A<v, the index is not searched; rather, the file is simply scanned starting from the first record until the record satisfying the condition A=v is encountered. For A<=v, the file is scanned until the first record satisfying the condition A>v is encountered. In both these cases, the index is not useful.
Algorithm 7 (Secondary index, comparison): For conditions of the form A<v or A<=v, the lowest-level index blocks are scanned from the smallest value up to v, and for conditions of the form A>v or A>=v, the index blocks are scanned from v up to the end of the file. The secondary index provides pointers to the records rather than the actual records, which are fetched using those pointers. Thus, fetching each record may require a disk access, since records may be on different blocks. This implies that using the secondary index may even be more expensive than using the linear search if the number of fetched records is large.
7. What are the various algorithms for implementing the select operations involving complex conditions?
Ans: If the select operation involves a complex condition, made up of several simple conditions connected with the logical operators AND (conjunction) or OR (disjunction), some additional algorithms are used, which are discussed here.
Algorithm 8 (Conjunctive selection using an individual index): For the operation σP_ID=“P001” ^ Category=“Textbook” (BOOK), if a clustering index is available on the attribute P_ID (algorithm 4), that index can be used to retrieve the records with P_ID="P001". These records can then be checked for the other condition; the records that do not satisfy the condition Category="Textbook" are discarded.
Algorithm 9 (Conjunctive selection using a composite index): Consider the operation σR_ID=“A001” ^ ISBN=“002-678-980-4” (REVIEW). If an index is available on the composite key (ISBN, R_ID) of the REVIEW relation, it can be used to directly retrieve the desired records.
Algorithm 10 (Conjunctive selection by intersection of record pointers): Consider the operation σP_ID=“P001” ^ Category=“Textbook” (BOOK). If secondary indexes are available on the attributes P_ID and Category of the BOOK relation, these indexes can be used to retrieve the pointers to the tuples satisfying each individual condition. The intersection of these record pointers yields the record pointers of the desired tuples, which are then used to retrieve the actual tuples (see Figure 8.6).
Algorithm 11 (Disjunctive selection by union of record pointers): Consider the operation σP_ID=“P001” v Category=“Textbook” (BOOK). If secondary indexes are available on the attributes P_ID and Category of the BOOK relation, these indexes can be used to retrieve the pointers to the tuples satisfying an individual condition. The union of these record pointers yields the pointers to the desired tuples, which are then used to retrieve the actual tuples (see Figure 8.6).
Figure 8.6 Example of Search Algorithms 10 and 11
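Algorithms 10 and 11 reduce to set operations on record pointers (RIDs). A tiny sketch with made-up RID sets:

# Hypothetical RID sets returned by secondary indexes on P_ID and
# Category of BOOK (the values are invented for illustration):
rids_p_id = {101, 102, 103}        # tuples with P_ID = "P001"
rids_category = {102, 103, 104}    # tuples with Category = "Textbook"

print(rids_p_id & rids_category)   # algorithm 10 (AND): {102, 103}
print(rids_p_id | rids_category)   # algorithm 11 (OR): {101, 102, 103, 104}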
8. Discuss the nested-loop join and its several variations. Also analyze their respective costs in terms of block transfers.
Ans: Nested-loop join: The nested-loop join algorithm consists of a pair of nested for loops, one for each relation, say R and S. One of these relations, say R, is chosen as the outer relation and the other relation S as the inner relation. The outer relation is scanned row by row, and for each row in the outer relation, the inner relation is scanned and searched for matching rows. In other words, for each tuple r in R, every tuple s in S is retrieved and checked to determine whether the two tuples satisfy the join condition r[A]=s[B]. If the join condition is satisfied, the values of these two tuples are concatenated and the concatenated tuple is added to the result. The tuple constructed by concatenating the values of the tuples r and s is denoted by r.s. Since this algorithm scans the entire inner relation for each tuple of the outer relation, it is also called naïve nested-loop join.
For natural join, this algorithm can be extended to
eliminate the repeated attributes using a project operation. The
nested-loop algorithms for equijoin and natural join operations are
given in Figure 8.7.
This algorithm is similar to the linear search algorithm for the select
operation in a way that it does not require any indexes and it can be
used with any join condition. Thus, it is also called the brute force algorithm.
An extra buffer block is required to hold the tuples
resulted from the join operation. When this block is filled, the tuples
are appended to the disk file containing the join result (known as result file). This buffer block can then be reused to hold additional result records.

Figure 8.7 Nested-loop Join
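A minimal Python rendering of the naive nested-loop join (our sketch; tuples are modeled as dictionaries keyed by attribute name):

def nested_loop_join(R, S, attr_r, attr_s):
    result = []
    for r in R:              # scan the outer relation once
        for s in S:          # scan the entire inner relation per outer tuple
            if r[attr_r] == s[attr_s]:
                result.append({**r, **s})   # the concatenated tuple r.s
    return result

# Hypothetical miniature relations:
book = [{"ISBN": "001", "P_ID": "P001"}, {"ISBN": "002", "P_ID": "P002"}]
publisher = [{"P_ID": "P001", "Pname": "Bright Publications"}]
print(nested_loop_join(book, publisher, "P_ID", "P_ID"))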
The two variations of nested-loop join algorithm are
block nested-loop join and indexed nested-loop join algorithm, which are
discussed here.
In block nested-loop join, the inner relation S is read only once for each block in the outer relation R (instead of once for each tuple in R). The algorithm for the block nested-loop join operation is given in Figure 8.8.
Figure 8.8 Block Nested-loop Join
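A sketch of the same join at block granularity (assuming the relations arrive as lists of blocks); the inner relation is now scanned once per outer block, giving the bR + bR * bS cost discussed below:

def block_nested_loop_join(R_blocks, S_blocks, attr_r, attr_s):
    result = []
    for r_block in R_blocks:
        for s_block in S_blocks:       # one pass over S per outer block
            for r in r_block:
                for s in s_block:
                    if r[attr_r] == s[attr_s]:
                        result.append({**r, **s})
    return result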
In indexed nested-loop join, if an index is available on the join attribute of one of the relations, say S, that relation is taken as the inner relation, and an index scan (instead of a file scan) is used to retrieve the matching tuples. In this algorithm, each tuple r in R is retrieved, and the index available on the join attribute of relation S is used to directly retrieve all the matching tuples. Indexed nested-loop join can be used with an existing as well as with a temporary index, that is, an index created by the query optimizer and destroyed when the query is completed. If a temporary index is used to retrieve the matching records, the algorithm is called temporary indexed nested-loop join.
If indexes are available on both the relations, the relation with fewer
tuples can be used as the outer relation for better efficiency. The
algorithm for indexed nested-loop join operation is given in Figure 8.9.
Figure 8.9 Indexed Nested-loop Join
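A sketch of indexed nested-loop join using a temporary in-memory hash index on S's join attribute (the algorithm works equally with any existing index):

from collections import defaultdict

def indexed_nested_loop_join(R, S, attr_r, attr_s):
    index = defaultdict(list)
    for s in S:
        index[s[attr_s]].append(s)     # build the temporary index on S
    result = []
    for r in R:                        # one index lookup per outer tuple
        for s in index.get(r[attr_r], []):
            result.append({**r, **s})
    return result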
Cost analysis
To analyze the cost of these join algorithms, consider the join operation BOOK ⋈ PUBLISHER on the Online Book database.
Further, assume the following information about these
two relations. This information will be used to estimate the cost of
each algorithm.
number of tuples of PUBLISHER = nPUBLISHER = 2000
number of blocks of PUBLISHER = bPUBLISHER = 100
number of tuples of BOOK = nBOOK = 5000
number of blocks of BOOK = bBOOK = 500
In nested-loop join, the entire inner relation S is scanned once for each tuple in the outer relation R. The pairs of tuples that need to be scanned are nR * nS, where nR and nS denote the number of tuples in R and S, respectively. For example, if PUBLISHER is taken as the outer relation and BOOK as the inner relation, 2000 * 5000 = 10,000,000 pairs of tuples need to be scanned.
In terms of the number of block transfers, the cost of this algorithm is equal to bR + (nR * bS), where bR and bS denote the number of blocks of R and S, respectively. For example, if PUBLISHER is taken as the outer relation and BOOK as the inner relation, 100 + (2000 * 500) = 1,000,100 block transfers are required, which is very high. On the other hand, if BOOK is taken as the outer relation and PUBLISHER as the inner relation, then 500 + (5000 * 100) = 500,500 block transfers are required. Thus, by making the smaller relation the inner relation, the overall cost of the join operation can be reduced.
If one or both of the relations fit entirely in the main memory, the cost of the join operation is bR + bS block transfers. This is the best-case scenario. For example, if either of the relations PUBLISHER and BOOK (or both) fits entirely in the memory, 100 + 500 = 600 block transfers are required, which is optimal.
In block nested-loop join, the cost is bR + (bR * bS) block transfers. Here, it is efficient to use the smaller relation as the outer relation. For example, if PUBLISHER is taken as the outer relation and BOOK as the inner relation, 100 + (100 * 500) = 50,100 block transfers are required (much less than the 1,000,100 or 500,500 block transfers of the nested-loop join). In the best case, if one of the relations fits entirely in the main memory, the algorithm requires bR + bS block transfers, the same as the nested-loop join; in this case, the smaller relation is chosen as the inner relation.
In indexed nested-loop join, the cost is bR + (nR * c) block transfers, where c is the cost of traversing the index and retrieving all the matching tuples of S for one tuple of R.
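The arithmetic above can be reproduced directly (block transfers only, ignoring seeks):

n_pub, b_pub = 2000, 100       # PUBLISHER: tuples, blocks
n_book, b_book = 5000, 500     # BOOK: tuples, blocks

print(b_pub + n_pub * b_book)   # nested-loop, PUBLISHER outer: 1,000,100
print(b_book + n_book * b_pub)  # nested-loop, BOOK outer: 500,500
print(b_pub + b_pub * b_book)   # block nested-loop, PUBLISHER outer: 50,100
print(b_pub + b_book)           # best case, one relation in memory: 600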
9. If the tuples of the relations (on
which the join operation is to be performed) are physically sorted by
the value of their respective join attributes, which algorithm can be
used to perform the join operation? Explain with the help of an example.
Also discuss its cost in terms of block transfers.
Ans: If the tuples of two relations say,
R and S are physically sorted by the value of their respective join attributes, the sort-merge join algorithm
can be used to compute equijoins and natural joins. Both the relations
are scanned concurrently in the order of the join attributes to match
the tuples that have the same values for the join attributes. Since the
relations are in a sorted order, each tuple needs to be scanned only
once and hence, each block is also read only once. Thus, if both the
relations are in sorted order, the sort-merge join requires only a
single pass through both the relations. The sort-merge join algorithm is
given in Figure 8.10(a), and an example of the sort-merge join algorithm is given in Figure 8.10(b). In this figure, the relations R and S are sorted on the join attribute a1.
Figure 8.10 Implementing Sort-merge Join
The number of block transfers is equal to the sum of the number of blocks in both the relations
R and S, that is, bR + bS. Thus, the join operation BOOK ⋈ PUBLISHER requires 100 + 500 = 600 block transfers. If the relations are not sorted, they can first be sorted using external sorting.
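A compact sketch of the merge step for an equijoin (our illustration; it sorts the inputs first in case they are not already sorted, and handles repeated join values on both sides):

def sort_merge_join(R, S, attr_r, attr_s):
    R = sorted(R, key=lambda t: t[attr_r])
    S = sorted(S, key=lambda t: t[attr_s])
    result, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][attr_r] < S[j][attr_s]:
            i += 1
        elif R[i][attr_r] > S[j][attr_s]:
            j += 1
        else:
            # Join the current R tuple with every S tuple carrying the
            # same join value, then advance R.
            k = j
            while k < len(S) and S[k][attr_s] == R[i][attr_r]:
                result.append({**R[i], **S[k]})
                k += 1
            i += 1
    return result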
10. Explain the hash join algorithm.
Analyze its cost in terms of block transfers. What is the additional
optimization in hybrid hash join?
Ans: In a hash join algorithm, a hash function h is used to partition the tuples of each relation into sets of tuples such that each set contains tuples that have the same hash value on the join attributes A of relation R and B of relation S. The algorithm is divided into two phases, namely, the partitioning phase and the probing phase. In the partitioning (also called building) phase, the tuples of relations R and S are partitioned into hash file buckets using the same hash function h. Let Pr1, Pr2, …, Prn be the partitions of tuples in relation R and Ps1, Ps2, …, Psn be the partitions of tuples in relation S. Here, n is the number of partitions of relations R and S.
During the partitioning phase, a single in-memory buffer with size one
disk block is allocated for each partition to store the tuples that hash
to this partition. Whenever the in-memory buffer gets filled, its
contents are appended to a disk subfile that stores this partition. For a
simple two-way join, the partitioning phase has two iterations. In the
first iteration, the relation R is partitioned into n partitions, and in the second iteration, the relation S is partitioned in n partitions.
In the probing (also called matching or joining) phase,
n iterations are required. In the ith iteration, the tuples in the partition Pri are joined with the tuples in the corresponding partition Psi.
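An in-memory sketch of the two phases (assumed for illustration; a real implementation writes the partitions to disk and typically builds the probing-phase hash table with a second hash function):

from collections import defaultdict

def hash_join(R, S, attr_r, attr_s, n=4):
    h = lambda v: hash(v) % n
    Pr, Ps = defaultdict(list), defaultdict(list)
    for r in R:                       # partitioning phase, iteration 1
        Pr[h(r[attr_r])].append(r)
    for s in S:                       # partitioning phase, iteration 2
        Ps[h(s[attr_s])].append(s)
    result = []
    for i in range(n):                # probing phase: n iterations
        table = defaultdict(list)     # in-memory table on partition Pri
        for r in Pr[i]:
            table[r[attr_r]].append(r)
        for s in Ps[i]:               # probe with the tuples of Psi
            for r in table.get(s[attr_s], []):
                result.append({**r, **s})
    return result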
Cost analysis
The hash join algorithm requires
3(bR + bS) + 4n block transfers. The first term 3(bR + bS) reflects that each block is transferred three times: once when the relations are read into the main memory for partitioning, a second time when the partitions are written back to the disk, and a third time when each partition is read again during the joining phase. The second term 4n represents an extra overhead (2n for each relation) of reading and writing partly filled buffer blocks of each partition to and from the disk.
Variation of the hash join technique
If the memory size is relatively large, a variation of hash join called hybrid hash join can be used for better performance. In hybrid hash join
technique, the probing phase for one of the partitions is combined with
the partitioning phase. The main objective of this technique is to join
as many tuples during the partitioning phase as possible in order to
save the cost of storing these tuples back to the disk and accessing
them again during the joining (probing) phase.
To better understand the hybrid hash join algorithm, consider a hash function
h(K) = K mod n that creates n partitions of the relation, where n < M, and consider the join operation BOOK ⋈ PUBLISHER. In the first iteration, the algorithm divides the buffer space among the n partitions such that all the blocks of the first partition of the smaller relation PUBLISHER
completely reside in the main memory. A single input buffer (with size
one disk block) is kept for each of the other partitions, and they are
written back to the disk as in the regular hash join. Thus, at the end
of the first iteration of the partitioning phase, the first partition of
the relation PUBLISHER resides completely in the main memory and other partitions reside in a disk subfile.
In the second iteration of the partitioning phase, the tuples of the second relation BOOK are partitioned. If a tuple hashes to the first partition, it is immediately joined with the matching tuples of the PUBLISHER
relation. If the tuple does not hash to the first partition, it is
partitioned normally. Thus, at the end of second iteration, all the
tuples of BOOK relation that hash to the first partition have been joined with the tuples of PUBLISHER relation. Now, there are n-1 pairs of partitions on the disk. Therefore, during the joining phase, n-1 iterations are needed instead of n.
11. What do you mean by skewed partitioning? How can it be handled?
Ans: The main difficulty of the hash join
algorithm is to ensure that the partitioning hash function is uniform,
that is, it creates partitions of almost the same size. If the
partitioning hash function is non-uniform, then some partitions may have
more tuples than the average, and they might not fit in the available
main memory. This type of partitioning is known as skewed partitioning. This problem can be handled by further partitioning the partition into smaller partitions using a different hash function.
12. Describe recursive partitioning.
Ans: If the number of partitions
(n) is less than the number of page frames (M)
in the main memory, the hash join algorithm is very simple and
efficient to execute as both the relations can be partitioned in one
pass. However, if n is greater than or equal to M,
then the relations cannot be partitioned in a single pass; rather,
the partitioning has to be done in multiple passes. In the first pass, the
input relation is partitioned into M-1 partitions (one page frame is kept to be used as input buffer) using a hash function h.
In the next pass, each partition created in the previous pass is
partitioned again to create smaller partitions using a different hash
function. This process is repeated until each partition of the input
relation can fit entirely in the memory. This technique is called recursive partitioning.
13. What are the two techniques used to
eliminate duplicates during project operation? Discuss the algorithm to
implement project operation by eliminating duplicates.
Ans: A project operation of the form π<attribute_list> (R) is easy to implement if the <attribute_list> includes a key attribute of the relation R. However, if the <attribute_list> does not contain a key attribute, duplicate tuples may exist and must be eliminated. Duplicate tuples can be eliminated by using either sorting or hashing.
The algorithm to implement project operation by eliminating duplicates using sorting is given in Figure 8.11.

Figure 8.11 Algorithm for Project Operation
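A sketch of the sorting-based approach (our illustration): project each tuple, sort the projected tuples so duplicates become adjacent, then keep one copy of each:

def project_distinct(R, attrs):
    projected = sorted(tuple(t[a] for a in attrs) for t in R)
    result = []
    for row in projected:
        if not result or row != result[-1]:   # duplicates are adjacent
            result.append(row)
    return result

book = [{"ISBN": "001", "Category": "Novel"},
        {"ISBN": "002", "Category": "Novel"}]
print(project_distinct(book, ["Category"]))   # [('Novel',)]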
14. Discuss the algorithms for implementing union, intersection and set difference operations.
Ans: The set operations, namely, union, intersection, and set difference
can be implemented by first sorting the relations on the basis of the same attributes and then scanning each sorted relation once to produce the
result. For example, the operation
R ∪ S can be implemented by concurrently scanning and merging both the sorted relations; whenever the same tuple exists in both the relations, only one of them is kept in the merged result. The operation R ∩ S can be implemented by adding to the merged result only those tuples that exist in both the relations. Finally, the operation R - S is implemented by retaining only those tuples that exist in R but are not present in S.
If the relations are initially sorted in the same order, all these operations require
bR +bS
block transfers. If the relations are not sorted initially, the cost of
sorting the relations has to be included. The algorithms for the
operations R ∪ S, R ∩ S, and R - S using sorting are given in Figure 8.12.

Figure 8.12 Implementing Set Operations
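A single merge pass can produce all three results at once, as in this sketch (assuming each relation is a duplicate-free list of comparable tuples sorted on the same attributes):

def merge_set_ops(R, S):
    union, inter, diff, i, j = [], [], [], 0, 0
    while i < len(R) and j < len(S):
        if R[i] < S[j]:
            union.append(R[i]); diff.append(R[i]); i += 1
        elif R[i] > S[j]:
            union.append(S[j]); j += 1
        else:                          # same tuple in both: keep one copy
            union.append(R[i]); inter.append(R[i]); i += 1; j += 1
    union += R[i:] + S[j:]             # remaining tuples belong to the union
    diff += R[i:]                      # remaining R tuples belong to R - S
    return union, inter, diff

print(merge_set_ops([1, 3, 5], [3, 4]))   # ([1, 3, 4, 5], [3], [1, 5])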
15. Discuss the two ways of evaluating a
query. Which one is better and why? Does the available buffer space
affect the query evaluation speed? If yes, how?
Ans: There are two ways of evaluating a query, namely, materialized evaluation and pipelined evaluation.
Consider, for example, an expression consisting of three relational operations: join, select and project. The query tree of this expression is given in Figure 8.13.
In materialized evaluation, the lowest-level
operations (operations at the bottom of operator tree) are evaluated
first. The inputs to the lowest-level operations are the relations in
the database. For example, in Figure 8.13, the operation σ
Category=“Novel” (BOOK) is evaluated first and the result is stored in a temporary relation. The temporary relation and the relation PUBLISHER
are then given as inputs to the next-level operation, that is, the join
operation. The join operation then performs a join on these two
relations and stores the result in another temporary relation, which is
given as input to the project operation, which is at the root of the
tree.
Figure 8.13 Operator Tree for the Expression
In pipelined evaluation, on the other hand, the result tuples of one operation are passed to the next operation in the pipeline as they are produced, without storing them in temporary relations. It is clear from this discussion that pipelined evaluation is better, as it increases the query-evaluation efficiency by reducing the number of temporary relations. The system maintains one buffer for each pair of adjacent operations to hold the tuples being passed from one operation to the next. Moreover, since the results of the operations are not stored for a long time, a smaller amount of memory is required in the pipelining approach. However, queries that involve aggregate or grouping functions cannot be evaluated using pipelined evaluation, because in pipelining the inputs are not available to those operations all at once.
The available buffer space definitely affects the
query evaluation speed. If a single buffer is used to hold the
intermediate result, it forces the CPU to wait for an I/O operation to
complete. This is because while the filled buffer is being written back
to the disk, the CPU has nothing to do. On the other hand, in the case of double buffering, two buffers are used; the algorithm continues execution with one buffer while the other is being written to the disk. This allows CPU activity to proceed in parallel with I/O activity, which in turn results in faster query evaluation.
16. Illustrate the difference between demand-driven and producer-driven pipelining.
Ans: There are two ways of executing pipelines, namely, demand-driven pipeline and producer-driven pipeline. In a demand-driven (or lazy) pipeline, an operation produces tuples only when its parent operation requests them, so tuples are pulled up the operator tree starting from a request at the root. In a producer-driven (or eager) pipeline, each operation eagerly generates tuples and pushes them to its parent operation through a buffer between the two, without waiting for requests.
17. What are the two techniques of implementing query optimization? How would you estimate the cost of the query?
Ans: There are two main techniques of implementing query optimization, namely, cost-based optimization and heuristics-based optimization.
The cost of each query can be estimated by collecting statistical information about the relations, such as relation sizes and available primary and secondary indexes, from the DBMS catalog. Some of the statistics about database relations that are stored in the DBMS catalog include:
1. The number of tuples in relation R, denoted by nR.
2. The number of blocks containing tuples of relation R, denoted by bR.
3. The blocking factor of relation R, denoted by fR, which is the number of tuples that fit in one block.
4. The size of a tuple of relation R in bytes, denoted by SR.
5. The number of distinct values of an attribute A in relation R, denoted by V (A, R). This value is the same as the size of πA (R). If A is a key attribute, then V (A, R) = nR.
6. The height of an index i, that is, the number of levels in i, denoted by Hi.
Assuming that the tuples of relation R are stored together physically in a file, bR = ⌈nR/fR⌉.
In addition to the relation sizes and indexes height,
real-world query optimizers also maintain some other statistical
information to increase the accuracy of cost estimates. For example,
most databases maintain histograms representing the distribution
of values for each attribute. The values for the attribute are divided
into a number of ranges, and the number of tuples whose attribute value
falls in a particular range is associated with that range in the
histogram. For example, the histograms for the attributes
Price and Page_count of the BOOK relation are shown in Figure 8.14.
A histogram takes very little space so histograms on
various attributes can be maintained in the system catalog. A histogram
can be of two types, namely equi-width histogram and equi-depth histogram. In equi-width histograms, the range of values is divided into equal-sized sub-ranges. On the other hand, in equi-depth histograms, the sub-ranges are chosen in such a way that the number of tuples within each sub-range is equal.

Figure 8.14 Histograms on
Price and Page_count
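Both histogram types are straightforward to build, as this sketch with invented Price values shows:

def equi_width_histogram(values, nbuckets, lo, hi):
    # Equal-sized value sub-ranges; the tuple count per bucket varies.
    width = (hi - lo) / nbuckets
    counts = [0] * nbuckets
    for v in values:
        counts[min(int((v - lo) / width), nbuckets - 1)] += 1
    return counts

def equi_depth_histogram(values, nbuckets):
    # Bucket boundaries chosen so each sub-range holds (roughly) the
    # same number of tuples; returns the boundary values.
    values = sorted(values)
    size = len(values) // nbuckets
    return [values[i * size] for i in range(nbuckets)] + [values[-1]]

prices = [12, 15, 18, 22, 25, 27, 30, 45, 60, 95]  # hypothetical Price values
print(equi_width_histogram(prices, 4, 0, 100))     # [4, 4, 1, 1]
print(equi_depth_histogram(prices, 5))             # [12, 18, 25, 30, 60, 95]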
18. When are two relational-algebra expressions said to be equivalent? Explain with the help of an example.
Ans: An expression is said to be equivalent
to a given relational-algebra expression if, on every legal database
instance, it generates the same set of tuples as that of original
expression irrespective of the ordering of the tuples. For example,
consider a relational-algebra expression for the query "Retrieve the
publisher name and category of the book C++".
In this expression, first the join operation between the
BOOK and PUBLISHER relation is performed, and then the tuples with book title C++ are retrieved from the resultant relation, and finally, the project operation is applied to retrieve the attributes Pname and Category.
When this sequence of operations is followed, it produces a large
intermediate relation, which results from the join operation. However,
the user is only interested in those tuples having Book_title="C++". Thus, another way to execute this query is to first select only those tuples from the BOOK relation satisfying the condition Book_title="C++", and then join it with the PUBLISHER relation. This reduces the size of the intermediate relation. The query can now be represented by the following expression:
This expression is equivalent to the original expression but produces far fewer tuples in the intermediate relation, and thus it is more efficient.
19. What is an equivalence rule? Discuss all equivalence rules.
Ans: An equivalence rule states that
expressions of two forms are equivalent if an expression of one form can
be replaced by expression of the other form, and vice versa. While
discussing the equivalence rules, it is assumed that
c, c1, c2, …, cn are the predicates, A, A1, A2, …, An are the set of attributes and R, R1, R2, …, Rn are the relation names.
Rule 1: A selection condition involving the conjunction of two or more predicates can be deconstructed into a sequence of individual select operations. That is,
σc1 ^ c2 ^ … ^ cn (R) = σc1 (σc2 (… (σcn (R)) …))
This transformation is called cascading of select operator.
Rule 2: Select operations are commutative. That is,
σc1 (σc2 (R)) = σc2 (σc1 (R))
Rule 3: If a query contains a sequence of project operations, only the final operation is needed; the others can be omitted. That is,
πA1 (πA2 (… (πAn (R)) …)) = πA1 (R)
This transformation is called cascading of project operator.
Rule 4: If the selection condition c involves only the attributes A1, A2, …, An that are present in the projection list, the two operations can be commuted. That is,
πA1, A2, …, An (σc (R)) = σc (πA1, A2, …, An (R))
Rule 5: Selections can be combined with Cartesian products and joins. That is,
σc (R1 × R2) = R1 ⋈c R2
σc1 (R1 ⋈c2 R2) = R1 ⋈c1 ^ c2 R2
Rule 6: Cartesian product and join operations are commutative. That is,
R1 × R2 = R2 × R1
R1 ⋈c R2 = R2 ⋈c R1
Rule 7: Cartesian product and join operations are associative. That is,
(R1 × R2) × R3 = R1 × (R2 × R3)
(R1 ⋈ R2) ⋈ R3 = R1 ⋈ (R2 ⋈ R3)
Rule 8: The select operation distributes over the join operation in these two conditions:
a. If all the attributes in the selection condition c1 involve only the attributes of one of the relations (say, R1), then the select operation distributes. That is,
σc1 (R1 ⋈c R2) = (σc1 (R1)) ⋈c R2
b. If the selection condition c1 involves only the attributes of R1 and c2 involves only the attributes of R2, then the select operation distributes. That is,
σc1 ^ c2 (R1 ⋈c R2) = (σc1 (R1)) ⋈c (σc2 (R2))
Rule 9: The project operation distributes over the join operation in these two conditions:
a. If the join condition c involves only the attributes in A1 ∪ A2, where A1 and A2 are the attributes of R1 and R2, respectively, the project operation distributes. That is,
πA1 ∪ A2 (R1 ⋈c R2) = (πA1 (R1)) ⋈c (πA2 (R2))
b. If the attributes A3 and A4 of R1 and R2, respectively, are involved in the join condition c but are not in A1 ∪ A2, then,
πA1 ∪ A2 (R1 ⋈c R2) = πA1 ∪ A2 ((πA1 ∪ A3 (R1)) ⋈c (πA2 ∪ A4 (R2)))
Rule 10: The set operations union and intersection are commutative and associative. That is,
a. Commutative
R ∪ S = S ∪ R
R ∩ S = S ∩ R
b. Associative
(R ∪ S) ∪ T = R ∪ (S ∪ T)
(R ∩ S) ∩ T = R ∩ (S ∩ T)
Rule 11: The select operation distributes over the union, intersection and set difference operations. That is,
σc (R ∪ S) = σc (R) ∪ σc (S)
σc (R ∩ S) = σc (R) ∩ σc (S)
σc (R - S) = σc (R) - σc (S)
Rule 12: The project operation distributes over the union operation. That is,
πA (R ∪ S) = πA (R) ∪ πA (S)
20. Explain with the help of an example the importance of optimal join ordering in query optimization.
Ans: An optimal join ordering of join
operations results in reducing the size of intermediate results. It is
the responsibility of the query optimizer to choose an optimal join
order with the minimal cost. Consider the query "Retrieve the publisher
name, rating and the name of the authors who have rated the C++ book."
The corresponding relational-algebra expression for this query is
In this query, first the selection is performed on the BOOK relation, which generates the set of tuples with the book title C++. If the operation PUBLISHER ⋈ REVIEW is performed first, it results in the Cartesian product of these two relations, as no common attributes exist between PUBLISHER and REVIEW, which produces a large intermediate relation. As a result, this strategy is rejected. Similarly, the operation PUBLISHER ⋈ AUTHOR is rejected for the same reason. If the operation REVIEW ⋈ AUTHOR is performed first, it also results in a large intermediate relation, although the user is only interested in the authors who have reviewed the C++ book. Thus, this order of evaluating the join operations is also rejected.
However, if the result obtained after the select operation on the BOOK relation is joined with PUBLISHER, which in turn is joined with the REVIEW relation, and finally with the AUTHOR relation, the number of tuples in the intermediate results is reduced. Thus, the equivalent final expression is
21. Discuss the steps followed by
heuristic query optimizer to transform an initial query tree into an
optimal query tree. Give a suitable example in support of your answer.
Ans: The heuristic query optimizer follows these steps to transform an initial query tree into an optimal query tree:
1. Break up select operations with conjunctive conditions into a cascade of select operations (Rule 1).
2. Move each select operation as far down the query tree as possible, using the commutativity and distributivity of the select operation (Rules 2, 4, 8 and 11).
3. Rearrange the leaf nodes of the tree so that the most restrictive select operations are executed first.
4. Combine each Cartesian product operation with a subsequent select operation into a join operation (Rule 5).
5. Break up and move lists of projection attributes as far down the tree as possible, creating new project operations as needed (Rules 3, 4, 9 and 12).
6. Identify subtrees whose groups of operations can be executed by a single algorithm.
For example, consider the following relational-algebra expression:
The initial query tree of this expression is given in Figure 8.15(a). By applying the select operation on the BOOK relation early, the number of tuples to be processed can be reduced. Similarly, by applying the project operation to extract the desired attributes P_ID, Pname and Address from the PUBLISHER relation, and P_ID from the result of the select operation on the BOOK relation (P_ID is the join attribute), the size of the intermediate result can be reduced further. The final expression now becomes
The transformed query tree of the final expression after applying the select and project operations earlier is given in Figure 8.15(b).

Figure 8.15 Transformation from Initial Query Tree to Final Query Tree
22. Consider a file with 10,000 pages and
three available buffer pages. Assume that the external sort-merge
algorithm is used to sort the file. Calculate the initial number of runs
produced in the first pass. How many passes will it take to sort the
file completely? How many buffer pages are required to sort the file
completely in two passes?
Ans: In the first pass (Pass 0), the number of initial runs (ni) can be computed as ⌈bR/M⌉, that is, ⌈10000/3⌉ = 3334 initial sorted runs, each containing 3 pages except the last, which has only one page.
The number of passes required to sort the file completely, including the initial sorting pass, can be calculated as ⌈logdM (ni)⌉ + 1. Here, dM is 2 and ni is 3334; therefore, ⌈log2 (3334)⌉ + 1 = 12 + 1 = 13 passes are required.
To sort the file in two passes, all the ⌈bR/M⌉ runs produced in Pass 0 must be merged in Pass 1, that is, M - 1 ≥ ⌈bR/M⌉. This implies that M must be at least large enough to satisfy M * (M - 1) ≥ bR. Thus, with bR = 10000, M = 101, so we need 101 buffer pages.
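These calculations can be verified directly (a small sketch of the formulas used above):

import math

b_R, M = 10000, 3
n_i = math.ceil(b_R / M)                        # 3334 initial runs in Pass 0
passes = math.ceil(math.log(n_i, M - 1)) + 1    # merge passes + sort pass
print(n_i, passes)                              # 3334 13

# Smallest M that sorts the file in two passes: M * (M - 1) >= b_R.
print(next(m for m in range(2, b_R) if m * (m - 1) >= b_R))   # 101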
23. Consider three relations R (A, B, C), S (C, D, E) and T (E, F), where R has 500 tuples, S has 1000 tuples and T has 900 tuples. Compute the size of R ⋈ S ⋈ T, and give an efficient way of computing the join.
Ans: Since the join operation is commutative and associative in nature, the relations R, S and T can be joined in any order. We will first perform the join operation on R and S based on the common attribute C, and then join the relation T with the result obtained from the previous join operation based on the common attribute E. Thus, we will compute the size based on the strategy ((R ⋈ S) ⋈ T).
The join operation on R and S will yield a relation of at most 500 tuples, because the number of tuples returned by the operation R ⋈ S is exactly the same as the number of tuples in R if the common attribute (on the basis of which the join is performed) is the primary key in S and a foreign key in R. Similarly, the join operation on T and the result of R ⋈ S will yield a relation of at most 500 tuples, because the common attribute E is the primary key in T and a foreign key in the result obtained from R ⋈ S. Thus, the operation R ⋈ S ⋈ T results in at most 500 tuples.
An efficient way of computing R ⋈ S ⋈ T is to create an index on the attribute C of relation S and an index on the attribute E of relation T, and then use indexed nested-loop joins. The index on attribute C can be used to directly retrieve the matching tuples of S for each tuple of R, and the index on attribute E can then be used to directly retrieve the matching tuples of T for each tuple in the result of R ⋈ S.
24. How many different join orders are there for a query that joins 5 relations?
Ans: In general, for n relations, there are (2(n-1))!/(n-1)! different join orderings. Therefore, for n = 5, there are 8!/4! = 1680 join orderings.
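The count follows directly from the formula:

from math import factorial

def join_orderings(n):
    # (2(n-1))! / (n-1)! different join orders for n relations
    return factorial(2 * (n - 1)) // factorial(n - 1)

print(join_orderings(5))   # 8!/4! = 1680
print(join_orderings(4))   # 6!/3! = 120 (cf. multiple-choice question 12)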
25. Consider the following relational-algebra queries on the Online Book database. Draw the initial query trees for each of these queries.
Ans:

Ans:

Ans:

26. Consider the following SQL queries. Which of these queries is faster to execute and why? Draw the query tree also.
(a) SELECT Category, COUNT(*) FROM BOOK
WHERE Category != ‘Novel’
GROUP BY Category;
Ans: Query tree for this SQL query is given below:

(b) SELECT Category, COUNT(*) FROM BOOK
GROUP BY Category
HAVING Category != ‘Novel’;
Ans: Query tree for this SQL query is given below:

Query (a) is faster to execute than query (b) because the HAVING clause in query (b) filters the tuples only after all the tuples have been fetched and grouped, whereas the WHERE clause in query (a) filters the tuples before grouping. Therefore, query (b) takes more time to execute.
27. Consider the following SQL queries on
Online Book database and transform these SQL queries into relational
algebra expressions. Draw the initial query trees for each of these
expressions, and then derive their optimized query trees after applying
heuristics on them.
(a) SELECT P.P_ID, Pname, Address, Phone
FROM BOOK B, PUBLISHER P
WHERE P.P_ID = B.P_ID AND Category = ‘Language Book’;
The initial query tree of this expression is given below:

The optimized query tree is given below:

(b) SELECT ISBN, Book_title, Category, Price
FROM BOOK B, PUBLISHER P
WHERE P.P_ID = B.P_ID AND Pname=‘Bright Publications’
AND Price>30;
The initial query tree of this expression is given below:

The optimized query tree is given below:

(c) SELECT ISBN, Book_title, Year, Page_count, Price
FROM BOOK B, AUTHOR A, AUTHOR_BOOK AB
WHERE B.ISBN = AB.ISBN AND AB.A_ID = A.A_ID
AND Aname = ‘Charles Smith’;

The initial query tree of this expression is given below:

The optimized query tree is given below:

(d) SELECT Book_title, Pname, Aname
FROM BOOK B, PUBLISHER P, AUTHOR A, AUTHOR_BOOK AB
WHERE P.P_ID = B.P_ID AND AB.A_ID=A.A_ID
AND B.ISBN = AB.ISBN AND Category = ‘Language Book’;

The initial query tree of this expression is given below:

The optimized query tree is given below:


Multiple-choice Questions
1. A tree data structure in which the input relations are represented as the leaf nodes and the relational algebra operations as the internal nodes is known as a _________.
(a) Query tree
(b) Left-deep join tree
(c) Right deep join tree
(d) Operator tree
2. Which of the following DBMS components is responsible for generating the least costly plan?
(a) Parser
(b) Translator
(c) Query optimizer
(d) Query-evaluation engine
3. Which of these is a phase in external sort-merge algorithm?
(a) Merge phase
(b) Sort phase
(c) Both (a) and (b)
(d) None of these
4. If more than two relations are involved in a join, it is termed as_______.
(a) Two-way join
(b) Multi-way join
(c) Nested-loop join
(d) Naïve nested-loop join
5. If there are two relations R and S containing n and m tuples, respectively, how many pairs of tuples need to be scanned in nested-loop join?
(a)
n + m
(b)
n * m
(c)
m
(d)
n
6. If there are two relations
R and S
containing 700 and 900 tuples respectively, then which relation should
be taken as outer relation in block nested-loop join algorithm?
(a) S
(b) R
(c) Any of these
(d) None of these
7. An index created by the query optimizer and destroyed when the query is completed is known as________.
(a) Secondary index
(b) Optimized index
(c) Temporary index
(d) Permanent index
8. Which of the following algorithms combines sort-merge join with indexes?
(a) Hybrid merge-join
(b) Nested merge-join
(c) External merge-join
(d) Indexed merge-join
9. Which of these evaluations is also known as lazy evaluation?
(a) Materialized evaluation
(b) Demand-driven pipeline
(c) Producer-driven pipeline
(d) Both (a) and (b)
10. Which of the following operations is not commutative?
(a) Select
(b) Union
(c) Intersection
(d) Set difference
11. Which of these equivalence rules is known as cascading of project operator?
(a) πA1, A2, …, An (σc (R)) = σc (πA1, A2, …, An (R))
(b) σc1 (σc2 (R)) = σc2 (σc1 (R))
(c) πA1 (πA2 (… (πAn (R)) …)) = πA1 (R)
(d) σc1 ^ c2 (R) = σc1 (σc2 (R))
12. Given four relations P, Q, R and S, in how many ways can these relations be joined?
(a) 240
(b) 4
(c) 120
(d) 16
Answers
(1) (a)
(2) (c)
(3) (c)
(4) (b)
(5) (b)
(6) (b)
(7) (c)
(8) (a)
(9) (b)
(10) (d)
(11) (c)
(12) (c)