cs465-project

Specialised Query Executor for TPC-H Q18

Prerequisites

Python 3.8+
duckdb
pandas
pyarrow

Install dependencies with:

pip install -r requirements.txt

Directory Structure

project_root/
├── benchmarks/            # Benchmark strategies
├── data/                  # TPC-H tables in Parquet format
├── engine/
│   ├── duckdb_engine.py
│   └── custom_engine.py
├── results/
│   ├── benchmark/         # Custom engine query results (CSV)
│   └── target/            # DuckDB query results (CSV)
├── init.py                # Data generation script
├── main.py                # Main entry point
└── summary.csv            # Benchmark results (default --out file)

1. Initialize Project (Generate TPC-H Data)

Run the init command to generate the TPC-H tables in Parquet format and initialize the result directories:

python main.py init

The following files are created for each of the scale factors 0.5, 1, 2, and 5:

data/sf{0.5, 1, 2, 5}/
├── customer.parquet
├── lineitem.parquet
├── nation.parquet
├── orders.parquet
├── part.parquet
├── partsupp.parquet
├── region.parquet
└── supplier.parquet
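
For readers who want to see roughly what the generation step involves, here is a minimal sketch. It is not the code in init.py: it assumes the data comes from DuckDB's tpch extension (dbgen) and is exported with COPY ... TO, and the names parquet_path, export_statements and generate are hypothetical.

```python
from pathlib import Path

SCALE_FACTORS = [0.5, 1, 2, 5]
TABLES = ["customer", "lineitem", "nation", "orders",
          "part", "partsupp", "region", "supplier"]


def parquet_path(root, scale_factor, table):
    """Return <root>/sf<scale_factor>/<table>.parquet."""
    return Path(root) / f"sf{scale_factor}" / f"{table}.parquet"


def export_statements(root, scale_factor):
    """Build one COPY ... TO statement per TPC-H table."""
    return [
        f"COPY {table} TO "
        f"'{parquet_path(root, scale_factor, table).as_posix()}' (FORMAT PARQUET)"
        for table in TABLES
    ]


def generate(root="data", scale_factors=SCALE_FACTORS):
    """Generate all tables at every scale factor.

    Assumption: the DuckDB tpch extension is available; init.py may work differently.
    """
    import duckdb  # imported lazily so the helpers above work without it

    for sf in scale_factors:
        Path(root, f"sf{sf}").mkdir(parents=True, exist_ok=True)
        con = duckdb.connect()  # fresh in-memory database per scale factor
        con.execute("INSTALL tpch")
        con.execute("LOAD tpch")
        con.execute(f"CALL dbgen(sf = {sf})")
        for statement in export_statements(root, sf):
            con.execute(statement)
        con.close()
```

Calling generate() would write the eight Parquet files into each data/sf... directory. Using a fresh connection per scale factor keeps tables from one run from colliding with the next.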

2. Run the Benchmark

python main.py benchmark

Optional arguments:

--out summary.csv: Output file for the benchmark results (default: summary.csv)
--benchmark 5: Number of timed repetitions per scale factor after one warm-up run (default: 5)
--strategy <strategy>: Benchmark execution strategy:
- interweave (default)
- duckdb_first
- custom_engine_first
--enable_profiling: Enable detailed profiling for the custom engine
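
The options above could be declared with argparse along the following lines. This is only a sketch of one plausible layout, assuming init, benchmark and check are subcommands of main.py; build_parser is a hypothetical name and the real main.py may be organised differently.

```python
import argparse

STRATEGIES = ["interweave", "duckdb_first", "custom_engine_first"]


def build_parser():
    """Build a parser with the three commands and the benchmark options."""
    parser = argparse.ArgumentParser(prog="main.py")
    commands = parser.add_subparsers(dest="command", required=True)

    commands.add_parser("init", help="generate TPC-H data")
    commands.add_parser("check", help="compare result CSVs")

    bench = commands.add_parser("benchmark", help="run both engines")
    bench.add_argument("--out", default="summary.csv",
                       help="output file for the benchmark results")
    bench.add_argument("--benchmark", type=int, default=5,
                       help="timed repetitions per scale factor")
    bench.add_argument("--strategy", choices=STRATEGIES, default="interweave",
                       help="benchmark execution strategy")
    bench.add_argument("--enable_profiling", action="store_true",
                       help="detailed profiling for the custom engine")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this layout, an invocation such as python main.py benchmark --benchmark 10 --strategy duckdb_first parses to benchmark=10 and strategy="duckdb_first", with the other options left at their defaults.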

This will:

Clear previous results in results/benchmark and results/target
Run both engines for scale factors 0.5, 1, 2, 5
Run one warm-up iteration first, then average the next --benchmark timed runs
Save query results as CSVs in the results folders
Write benchmark results to summary.csv
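
The warm-up-then-average procedure can be sketched as below. This illustrates the idea and is not the project's benchmark code: schedule and benchmark are hypothetical names, and the reading of interweave as alternating DuckDB and custom-engine runs is an assumption.

```python
import time
from statistics import mean


def schedule(strategy, runs):
    """Return the order in which the engines are timed.

    Assumption: "interweave" alternates the two engines run by run.
    """
    if strategy == "interweave":
        return ["duckdb", "custom"] * runs
    if strategy == "duckdb_first":
        return ["duckdb"] * runs + ["custom"] * runs
    if strategy == "custom_engine_first":
        return ["custom"] * runs + ["duckdb"] * runs
    raise ValueError(f"unknown strategy: {strategy}")


def benchmark(engines, strategy="interweave", runs=5, clock=time.perf_counter):
    """Time each engine `runs` times after one untimed warm-up run.

    `engines` maps "duckdb" and "custom" to zero-argument callables.
    Returns the mean wall-clock time per engine.
    """
    for run in engines.values():
        run()  # warm-up iteration, excluded from the average

    timings = {name: [] for name in engines}
    for name in schedule(strategy, runs):
        start = clock()
        engines[name]()
        timings[name].append(clock() - start)

    return {name: mean(values) for name, values in timings.items()}
```

Interleaving the engines spreads effects such as cache warming or thermal throttling across both, whereas the *_first strategies run each engine's repetitions back to back.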

3. Data Correctness Check

python main.py check

This compares the CSV files that share a file name in results/benchmark and results/target and prints any mismatches.
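
A check of this kind might look like the following sketch. It assumes an exact, order-sensitive, row-by-row comparison using only the standard library; the actual check command may compare differently (for example with a floating-point tolerance), and read_rows and compare_dirs are hypothetical names.

```python
import csv
from pathlib import Path


def read_rows(path):
    """Read a CSV file into a list of rows (each row a list of strings)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def compare_dirs(benchmark_dir="results/benchmark", target_dir="results/target"):
    """Return (file name, reason) for every pair of CSVs that differ."""
    mismatches = []
    target = Path(target_dir)
    for bench_file in sorted(Path(benchmark_dir).glob("*.csv")):
        target_file = target / bench_file.name
        if not target_file.exists():
            continue  # only files present in both directories are compared
        got, want = read_rows(bench_file), read_rows(target_file)
        if len(got) != len(want):
            mismatches.append(
                (bench_file.name, f"row count {len(got)} != {len(want)}"))
            continue
        for index, (got_row, want_row) in enumerate(zip(got, want)):
            if got_row != want_row:
                mismatches.append(
                    (bench_file.name, f"first difference at row {index}"))
                break
    return mismatches


if __name__ == "__main__":
    for name, reason in compare_dirs():
        print(f"MISMATCH {name}: {reason}")
```

Because the cells are compared as strings, two engines that format the same number differently (say 1.0 and 1.00) would be reported as a mismatch; a tolerant numeric comparison would need extra parsing.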
