The objective of this benchmark is to assess performance of these two databases under conditions mirroring real world scenarios. Our approach involves running both databases on identical hardware configurations, ingesting identical dataset and running the same set of queries. For this reason we opted for the NYC Taxi Dataset.
The NYC Taxi Dataset is a massive dataset of over 1 Billion trips made by NYC Taxis and Uber, all originating in New York City since 2009.
The queries used were as follows:
SigLens is a column oriented database built from scratch for observability data. It offers dynamic compressions over each column requiring zero configuration.
SigLens uses columnar micro indices that are 1/100th size of an actual database index. This reduces the number of machines needed. SigLens uses AgileAggregationTree algorithm that makes aggregations queries run at sub-second speeds.
Setup the server, mount the storage, configure aws s3 cli by following these steps:
Step 1: Let’s find out if the NVM storage device is present on this server. To find that, ssh into your server and run lsblk. You should see something like the following, with the nvme1n1 item having 3.4 TB of storage.
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
loop0 7:0 0 21.3M 1 loop /snap/amazon-ssm-agent/7529
loop1 7:1 0 49.1M 1 loop /snap/core18/2794
loop2 7:2 0 59.3M 1 loop /snap/core20/2019
loop3 7:3 0 109.6M 1 loop /snap/lxd/24326
loop4 7:4 0 35.5M 1 loop /snap/snapd/20102
nvme0n1 259:0 0 8G 0 disk
├─nvme0n1p1 259:2 0 7.9G 0 part /
└─nvme0n1p15 259:3 0 99M 0 part /boot/efi
nvme1n1 259:1 0 3.4T 0 disk
Step 2: Let’s mount the storage device and create directories.
sudo mkfs.xfs /dev/nvme1n1
sudo mkdir /mnt/nvme1n1
sudo mount /dev/nvme1n1 /mnt/nvme1n1
Step 3: Verify Storage
You can check that it's mounted by running df -h and you should see something like this:
Filesystem Size Used Avail Use% Mounted on
/dev/root 7.6G 1.5G 6.2G 20% /
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 6.2G 948K 6.2G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/nvme0n1p15 98M 6.3M 92M 7% /boot/efi
tmpfs 3.1G 4.0K 3.1G 1% /run/user/1000
/dev/nvme1n1 3.5T 25G 3.4T 1% /mnt/nvme1n1
Step 4: Update Permissions
sudo chmod 777 /mnt/nvme1n1
Step 5: Configure AWS CLI
Note: This AWS config step is only needed if you are using a private S3 bucket, if you use our public S3 bucket you can skip this step.
sudo apt-get install awscli -y
aws configure
Note: You can skip this step and use our s3 bucket. It has the dataset in the correct format. If you want to prepare the dataset yourself then follow these steps else jump directly to the benchmark step to use our s3 bucket.
mkdir -p /mnt/nvme1n1/data
cd /mnt/nvme1n1/data
git clone git@github.com:siglens/siglens
python -m venv taxis
source taxis/bin/activate
pip install -r siglens/tools/nyc-taxi-benchmark/requirements.txt
python siglens/tools/nyc-taxi-benchmark/parquet_to_tsv.py . .
for year in {2011..2017}; do
for month in {01..12}; do
basefile="yellow_tripdata_$year-$month"
go run siglens/tools/sigclient/cmd/utils/converter.go --input
"$basefile.tsv" --output "$basefile.json" &
done
done
wait
cd /mnt/nvme1n1
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.11.4-linux-aarch64.tar.gz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.11.4-linux-aarch64.tar.gz.sha512
shasum -a 512 -c elasticsearch-8.11.4-linux-aarch64.tar.gz.sha512
tar -xzf elasticsearch-8.11.4-linux-aarch64.tar.gz
cd elasticsearch-8.11.4/
network.host: 0.0.0.0
discovery.type: single-node
Let’s give more memory to the elasticsearch process.
echo -e "-Xms24g\n-Xmx24g" > config/jvm.options.d/heap.options
sudo sysctl -w vm.max_map_count=262144
curl -k -u "elastic:" \
--location \
--request PUT 'https://:9200/_template/temp1' \
--header 'Content-Type: application/json' \
--data-raw '{
"index_patterns": "trips",
"settings": {
"number_of_shards": 6,
"number_of_replicas": 0
},
"mappings": {
"_source": {
"enabled": true
},
"properties": {
"timestamp": {
"type": "date",
"format": "epoch_millis"
}
}
}
}'
bin/elasticsearch-reset-password -u elastic
The new password will be printed. Save it for later.
./bin/elasticsearch > elastic.log 2>&1
cd /mnt nvme1n1
git clone https://github.com/siglens/siglens.git
cd siglens/tools/nyc-taxi-benchmark
In ingester.py, make these changes:
mkdir dataset
for year in {2011..2017}; do
for month in {01..12}; do
{
basefile="yellow_tripdata_$year-$month"
aws s3 cp s3://siglens-benchmark-datasets/nyc-taxi-benchmark-data/json/$basefile.json.gz dataset/
gunzip dataset/$basefile.json.gz
python3 ingester.py dataset/$basefile.json
} &
done
wait
done
curl -k -u "elastic:<your-elastic-password>"-X POST "https://<server-ip>:9200/trips/_cache/clear"
The following should indicate 265 unique values.
curl -k -u "elastic:<your-elastic-password>" "https://<server-ip>:9200/trips/_search"
-H 'Content-Type: application/json'
-d '{
"size": 0,
"aggs": {
"distinct_PULocationID": {
"cardinality": {
"field": "PULocationID"
}
}
}
}' | python3 -m json.tool
Now let’s run the actual queries.
The responses will have a took field, indicating how long the query took in milliseconds.
# Query 1
curl -k -u "elastic:<your-elastic-password>" "https://<server-ip>:9200/trips/_search"
-H 'Content-Type: application/json'
-d '{
"size": 0,
"aggs": {
"improvement_surcharge_groups": {
"terms": {
"field": "improvement_surcharge",
"size": 10
},
"aggs": {
"count": {
"value_count": {
"field": "improvement_surcharge"
}
}
}
}
}
}' | python3 -m json.tool | less
# Query 2
curl -k -u "elastic:<your-elastic-password>" "https://<server-ip>:9200/trips/_search"
-H 'Content-Type: application/json'
-d '{
"size": 0,
"aggs": {
"passenger_count_groups": {
"terms": {
"field": "passenger_count",
"size": 36
},
"aggs": {
"average_total_amount": {
"avg": {
"field": "total_amount"
}
}
}
}
}
}' | python3 -m json.tool | less
# Query 3
curl -k -u "elastic:<your-elastic-password>" "https://<server-ip>:9200/trips/_search"
-H 'Content-Type: application/json'
-d '{
"size": 0,
"aggs": {
"passenger_count_groups": {
"terms": {
"field": "passenger_count",
"size": 36
},
"aggs": {
"PULocationID_groups": {
"terms": {
"field": "PULocationID",
"size": 265
}
}
}
}
}
}' | python3 -m json.tool | less
# Query 4
# For this query we don't want more than 10,000 buckets, so we'll reduce the "size" parameters.
curl -k -u "elastic:<your-elastic-password>" "https://<server-ip>:9200/trips/_search"
-H 'Content-Type: application/json'
-d '{
"size": 0,
"aggs": {
"passenger_count_groups": {
"terms": {
"field": "passenger_count",
"size": 10
},
"aggs": {
"PULocationID_groups": {
"terms": {
"field": "PULocationID",
"size": 10
},
"aggs": {
"trip_distance_groups": {
"terms": {
"field": "trip_distance",
"size": 100
},
"aggs": {
"count": {
"value_count": {
"field":
"trip_distance"
}
}
}
}
}
}
}
}
}
}' | python3 -m json.tool | less
Try 1 | Try 2 | Try 3 | Try 4 | Try 5 | Lowest | |
Q1 | 27,780 ms | 27,559 ms | 27,575 ms | 27,646 ms | 27,588 ms | 27,559 ms |
Q2 | 48,278 ms | 49,253 ms | 47,315 ms | 46,925 ms | 44,075 ms | 44,075 ms |
Q3 | 50,417 ms | 50,274 ms | 50,110 ms | 50,380 ms | 50,522 ms | 50,110 ms |
Q4 | 1,26,616 ms | 1,26,728 ms | 1,25,167 ms | 1,27,522 ms | 1,26,263 ms | 1,25,167 ms |
sudo apt update
sudo apt install golang -y
git clone https://github.com/siglens/siglens.git
cd siglens
## Start SigLens
sudo go run cmd/siglens/main.go --config server.yaml
INFO[2023-12-06 18:10:38] Extracting config from configFile: server.yaml
INFO[2023-12-06 18:10:38] Defaulting to 2160hrs (90 days) of retention...
INFO[2023-12-06 18:10:38] ----- Siglens server type SingleNode starting up -----
INFO[2023-12-06 18:10:38] ----- Siglens Ingestion server starting on 0.0.0.0:8081 -----
INFO[2023-12-06 18:10:38] ----- Siglens Query server starting on 0.0.0.0:5122 -----
INFO[2023-12-06 18:10:38] ----- Siglens UI starting on 0.0.0.0:5122 -----
curl -X POST -d '{
"tableName": "trips",
"groupByColumns": ["airport_fee", "passenger_count", "PULocationID", "trip_distance"],
"measureColumns": ["total_amount"]
}' http://localhost:5122/api/pqs/aggs
echo ""
{"message":"All OK","status":200}
cd /mnt/nvme1n1/siglens/tools/sigclient
mkdir dataset
for year in {2011..2017}; do
for month in {01..12}; do
{
basefile="yellow_tripdata_$year-$month"
aws s3 cp s3://siglens-benchmark-datasets/nyc-taxi-benchmark-data/json/$basefile.json.gz dataset/
gunzip dataset/$basefile.json.gz
python3 ../nyc-taxi-benchmark/ingester.py dataset/$basefile.json
} &
done
wait
done
curl -X POST -d '{
"searchText": "SELECT airport_fee, count(*) FROM trips GROUP BY airport_fee",
"indexName": "trips",
"startEpoch": "now-24h",
"endEpoch": "now",
"queryLanguage": "SQL"
}' http://localhost:5122/api/search | python3 -m json.tool
curl -X POST -d '{
"searchText": "SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count",
"indexName": "trips",
"startEpoch": "now-24h",
"endEpoch": "now",
"queryLanguage": "SQL"
}' http://localhost:5122/api/search | python3 -m json.tool
curl -X POST -d '{
"searchText": "SELECT passenger_count, PULocationID, count(*) FROM trips GROUP BY passenger_count, PULocationID",
"indexName": "trips",
"startEpoch": "now-24h",
"endEpoch": "now",
"queryLanguage": "SQL"
}' http://localhost:5122/api/search | python3 -m json.tool
curl -X POST -d '{
"searchText": "SELECT passenger_count, PULocationID, trip_distance, count(*) FROM trips GROUP BY passenger_count, PULocationID, trip_distance",
"indexName": "trips",
"startEpoch": "now-24h",
"endEpoch": "now",
"queryLanguage": "SQL"
}' http://localhost:5122/api/search | python3 -m json.tool
Try 1 | Try 2 | Try 3 | Try 4 | Try 5 | Lowest | |
Q1 | 60 ms | 72 ms | 63 ms | 41 ms | 42 ms | 41 ms |
Q2 | 43 ms | 43 ms | 44 ms | 43 ms | 43 ms | 43 ms |
Q3 | 116 ms | 304 ms | 442 ms | 509 ms | 471 ms | 116 ms |
Q4 | 9,815 ms | 9,229 ms | 10,316 ms | 9,093 ms | 9,092 ms | 9,092 ms |
Query Type | Elasticsearch | SigLens | Comparison |
---|---|---|---|
Query 1 | 27,559 ms | 55 ms | SigLens is 501x faster than Elasticsearch |
Query 2 | 44,075 ms | 43 ms | SigLens is 1025x faster than Elasticsearch |
Query 3 | 50,110 ms | 368 ms | SigLens is 136x faster than Elasticsearch |
Query 4 | 1,25,167 ms | 9509 ms | SigLens is 13x faster than Elasticsearch |
* Query 1 is a little different than Query 1 for SigLens. This is because Elasticsearch was unable to perform an aggregation on the `airport_fee` column because it was ingested as a text field. So instead, this Query 1 aggregates on the `improvement_surcharge` field. This should be comparable because the `airport_fee` column only has one bucket, while the `improvement_surcharge` column has only 2, and one of those only accounts for 360 rows of the more than 1 billion rows in the dataset.