Assigned: Wednesday, Mar 17, 2021
Due: Tuesday, Apr 6, 2021
In this lab, you will implement a query optimizer on top of SimpleDB. The main tasks include implementing a selectivity estimation framework and a cost-based optimizer. You have freedom as to exactly what you implement, but we recommend using something similar to the Selinger cost-based optimizer discussed in class (Lecture 9).
The remainder of this document describes what is involved in adding optimizer support and provides a basic outline of how you do so.
As with the previous lab, we recommend that you start as early as possible.
You should begin with the code you submitted for Lab 2. (If you did not submit code for Lab 2, or your solution didn't work properly, contact us to discuss options.)
We have provided you with extra test cases as well as source code files for this lab that are not in the original code distribution you received. We again encourage you to develop your own test suite in addition to the ones we have provided.
You will need to add these new files to your release. The easiest way to do this is to change to your project directory (probably called simple-db-hw) and pull from the master GitHub repository:
$ cd simple-db-hw
$ git pull upstream master
We suggest exercises along this document to guide your implementation, but you may find that a different order makes more sense for you. As before, we will grade your assignment by looking at your code and verifying that you have passed the test for the ant targets test and systemtest. See Section 3.4 for a complete discussion of grading and the tests you will need to pass.
Here's a rough outline of one way you might proceed with this lab. More details on these steps are given in Section 2 below.
Recall that the main idea of a cost-based optimizer is to:
In this lab, you will implement code to perform both of these functions.
The optimizer will be invoked from simpledb/Parser.java. You may wish to review the lab 2 parser exercise before starting this lab. Briefly, if you have a catalog file catalog.txt describing your tables, you can run the parser by typing:
java -jar dist/simpledb.jar parser catalog.txt
When the Parser is invoked, it will compute statistics over all of the tables (using statistics code you provide). When a query is issued, the parser will convert the query into a logical plan representation and then call your query optimizer to generate an optimal plan.
Before getting started with the implementation, you need to understand the overall structure of the SimpleDB optimizer. The overall control flow of the SimpleDB modules of the parser and optimizer is shown in Figure 1.
Figure 1: Diagram illustrating classes, methods, and objects used in the parser
The key at the bottom explains the symbols; you will implement the components with double-borders. The classes and methods will be explained in more detail in the text that follows (you may wish to refer back to this diagram), but the basic operation is as follows:
In the exercises to come, you will implement the methods that help physicalPlan devise an optimal plan.
Accurately estimating plan cost is quite tricky. In this lab, we will focus only on the cost of sequences of joins and base table accesses. We won't worry about access method selection (since we only have one access method, table scans) or the costs of additional operators (like aggregates).
You are only required to consider left-deep plans for this lab. See Section 2.3 for a description of additional "bonus" optimizer features you might implement, including an approach for handling bushy plans.
We will write join plans of the form p=t1 join t2 join ... tn
,
which signifies a left deep join where t1 is the left-most
join (deepest in the tree).
Given a plan like p
, its cost
can be expressed as:
scancost(t1) + scancost(t2) + joincost(t1 join t2) +
scancost(t3) + joincost((t1 join t2) join t3) +
...
Here, scancost(t1)
is the I/O cost of scanning table t1,
joincost(t1,t2)
is the CPU cost to join t1 to t2. To
make I/O and CPU cost comparable, typically a constant scaling factor
is used, e.g.:
cost(predicate application) = 1
cost(pageScan) = SCALING_FACTOR x cost(predicate application)
For this lab, you can ignore the effects of caching (e.g., assume that
every access to a table incurs the full cost of a scan) -- again, this
is something you may add as an optional bonus extension to your lab
in Section 2.3. Therefore, scancost(t1)
is simply the
number of pages in t1 x SCALING_FACTOR
.
When using nested loops joins, recall that the cost of a join between two tables t1 and t2 (where t1 is the outer) is simply:
joincost(t1 join t2) = scancost(t1) + ntups(t1) x scancost(t2) //IO cost
+ ntups(t1) x ntups(t2) //CPU cost
Here, ntups(t1)
is the number of tuples in table t1.
ntups
can be directly computed for a base table by
scanning that table. Estimating ntups
for a table with
one or more selection predicates over it can be trickier --
this is the filter selectivity estimation problem. Here's one
approach that you might use, based on computing a histogram over the
values in the table:
Figure 2: Diagram illustrating the histograms you will implement in Lab 5
In the next two exercises, you will code to perform selectivity estimation of joins and filters.
Exercise 1: IntHistogram.java
You will need to implement some way to record table statistics for selectivity estimation. We have provided a skeleton class, IntHistogram that will do this. Our intent is that you calculate histograms using the bucket-based method described above, but you are free to use some other method so long as it provides reasonable selectivity estimates.
We have provided a class StringHistogram that uses IntHistogram to compute selecitivites for String predicates. You may modify StringHistogram if you want to implement a better estimator, though you should not need to in order to complete this lab.
After completing this exercise, you should be able to pass the IntHistogramTest unit test (you are not required to pass this test if you choose not to implement histogram-based selectivity estimation).
Exercise 2: TableStats.java
The class TableStats contains methods that compute the number of tuples and pages in a table and that estimate the selectivity of predicates over the fields of that table. The query parser we have created creates one instance of TableStats per table, and passes these structures into your query optimizer (which you will need in later exercises).
You should fill in the following methods and classes in TableStats:
You may wish to modify the constructor of TableStats.java to, for example, compute histograms over the fields as described above for purposes of selectivity estimation.
After completing these tasks you should be able to pass the unit tests in TableStatsTest.
Finally, observe that the cost for the join plan p above includes expressions of the form joincost((t1 join t2) join t3). To evaluate this expression, you need some way to estimate the size (ntups) of t1 join t2. This join cardinality estimation problem is harder than the filter selectivity estimation problem. In this lab, you aren't required to do anything fancy for this, though one of the optional excercises in Section 2.4 includes a histogram-based method for join selectivity estimation.
While implementing your simple solution, you should keep in mind the following:
Exercise 3: Join Cost Estimation
The class JoinOptimizer.java includes all of the methods for ordering and computing costs of joins. In this exercise, you will write the methods for estimating the selectivity and cost of a join, specifically:
After implementing these methods, you should be able to pass the unit tests estimateJoinCostTest and estimateJoinCardinality in JoinOptimizerTest.java.
Now that you have implemented methods for estimating costs, you will implement the Selinger optimizer. For these methods, joins are expressed as a list of join nodes (e.g., predicates over two tables) as opposed to a list of relations to join as described in class.
Translating the algorithm given in lecture to the join node list form mentioned above, an outline in pseudocode would be:
1. j = set of join nodes
2. for (i in 1...|j|):
3. for s in {all length i subsets of j}
4. bestPlan = {}
5. for s' in {all length d-1 subsets of s}
6. subplan = optjoin(s')
7. plan = best way to join (s-s') to subplan
8. if (cost(plan) < cost(bestPlan))
9. bestPlan = plan
10. optjoin(s) = bestPlan
11. return optjoin(j)
To help you implement this algorithm, we have provided several classes and methods to assist you. First, the method enumerateSubsets(List v, int size) in JoinOptimizer.java will return a set of all of the subsets of v of size size. This method is VERY inefficient for large sets; you can earn extra credit by implementing a more efficient enumerator (hint: consider using an in-place generation algorithm and a lazy iterator (or stream) interface to avoid materializing the entire power set).
Second, we have provided the method:
private CostCard computeCostAndCardOfSubplan(Map<String, TableStats> stats,
Map<String, Double> filterSelectivities,
LogicalJoinNode joinToRemove,
Set<LogicalJoinNode> joinSet,
double bestCostSoFar,
PlanCache pc)
Given a subset of joins (joinSet), and a join to remove from this set (joinToRemove), this method computes the best way to join joinToRemove to joinSet - {joinToRemove}. It returns this best method in a CostCard object, which includes the cost, cardinality, and best join ordering (as a list). computeCostAndCardOfSubplan may return null, if no plan can be found (because, for example, there is no left-deep join that is possible), or if the cost of all plans is greater than the bestCostSoFar argument. The method uses a cache of previous joins called pc (optjoin in the psuedocode above) to quickly lookup the fastest way to join joinSet - {joinToRemove}. The other arguments (stats and filterSelectivities) are passed into the orderJoins method that you must implement as a part of Exercise 4, and are explained below. This method essentially performs lines 6--8 of the psuedocode described earlier.
Third, we have provided the method:
private void printJoins(List<LogicalJoinNode> js,
PlanCache pc,
Map<String, TableStats> stats,
Map<String, Double> selectivities)
This method can be used to display a graphical representation of a join plan (when the "explain" flag is set via the "-explain" option to the optimizer, for example).
Fourth, we have provided a class PlanCache that can be used to cache the best way to join a subset of the joins considered so far in your implementation of Selinger (an instance of this class is needed to use computeCostAndCardOfSubplan).
Exercise 4: Join Ordering
In JoinOptimizer.java, implement the method:
List<LogicalJoinNode> orderJoins(Map<String, TableStats> stats,
Map<String, Double> filterSelectivities,
boolean explain)
This method should operate on the joins class member, returning a new List that specifies the order in which joins should be done. Item 0 of this list indicates the left-most, bottom-most join in a left-deep plan. Adjacent joins in the returned list should share at least one field to ensure the plan is left-deep. Here stats is an object that lets you find the TableStats for a given table name that appears in the FROM list of the query. filterSelectivities allows you to find the selectivity of any predicates over a table; it is guaranteed to have one entry per table name in the FROM list. Finally, explain specifies that you should output a representation of the join order for informational purposes.
You may wish to use the helper methods and classes described above to assist in your implementation. Roughly, your implementation should follow the psuedocode above, looping through subset sizes, subsets, and sub-plans of subsets, calling computeCostAndCardOfSubplan and building a PlanCache object that stores the minimal-cost way to perform each subset join.
After implementing this method, you should be able to pass all the unit tests in JoinOptimizerTest. You should also pass the system test QueryTest.
In this section, we describe several optional excercises that you may implement for extra credit. These are less well defined than the previous exercises but give you a chance to show off your mastery of query optimization! Please clearly mark which ones you have chosen to complete in your report, and briefly explain your implementation and present your results (benchmark numbers, experience reports, etc.)
Bonus Exercises. Each of these bonuses is worth up to 5% extra credit:
You have now completed this lab. Good work!
You must submit your code (see below) as well as a short (2 pages, maximum) writeup describing your approach. This writeup should:
This lab should be manageable for a single person, but if you prefer to work with a partner, this is also OK. Larger groups are not allowed. Please indicate clearly who you worked with, if anyone, on your writeup.
We will be using gradescope to autograde all programming assignments. You should have all been invited to the class instance; if not, please let us know and we can help you set up. You may submit your code multiple times before the deadline; we will use the latest version as determined by gradescope. Place the write-up in a file called lab3-writeup.txt with your submission. You also need to explicitly add any other files you create, such as new *.java files.
The easiest way to submit to gradescope is with .zip
files containing your code. On Linux/MacOS, you can do so by
running the following command:
$ zip -r submission.zip src/ lab3-writeup.txt
SimpleDB is a relatively complex piece of code. It is very possible you are going to find bugs, inconsistencies, and bad, outdated, or incorrect documentation, etc.
We ask you, therefore, to do this lab with an adventurous mindset. Don't get mad if something is not clear, or even wrong; rather, try to figure it out yourself or send us a friendly email.
Please submit (friendly!) bug reports to 6.830-staff@mit.edu. When you do, please try to include:
test/simpledb
directory, compile, and run.HeapFileEncoder
.You can also post on the class page on Piazza if you feel you have run into a bug.
75% of your grade will be based on whether or not your code passes the system test suite we will run over it. These tests will be a superset of the tests we have provided. Before handing in your code, you should make sure it produces no errors (passes all of the tests) from both ant test and ant systemtest.
Important: before testing, gradescope will replace your build.xml, HeapFileEncoder.java and the entire contents of the test directory with our version of these files. This means you cannot change the format of .dat files! You should also be careful changing our APIs. You should test that your code compiles the unmodified tests.
You should get immediate feedback and error outputs for failed tests (if any) from gradescope after submission. The score given will be your grade for the autograded portion of the assignment. An additional 25% of your grade will be based on the quality of your writeup and our subjective evaluation of your code. This part will also be published on gradescope after we finish grading your assignment.
We had a lot of fun designing this assignment, and we hope you enjoy hacking on it!
此处可能存在不合适展示的内容,页面不予展示。您可通过相关编辑功能自查并修改。
如您确认内容无涉及 不当用语 / 纯广告导流 / 暴力 / 低俗色情 / 侵权 / 盗版 / 虚假 / 无价值内容或违法国家有关法律法规的内容,可点击提交进行申诉,我们将尽快为您处理。