SigLens beats Elasticsearch by an astounding 1025x in the NYC Taxi Dataset performance test. This test resulted in processing 1 billion records and SigLens performed magnitudes faster than Elasticsearch.

In this blog, we outlined the detailed steps of this benchmark. You will learn ingestion and querying of a massive dataset along with tuning systems settings. Let’s explore it.

Overview of the test, dataset

The objective of this benchmark is to assess performance of these two databases under conditions mirroring real world scenarios. Our approach involves running both databases on identical hardware configurations, ingesting identical dataset and running the same set of queries. For this reason we opted for the NYC Taxi Dataset.

The NYC Taxi Dataset is a massive dataset of over 1 Billion trips made by NYC Taxis and Uber, all originating in New York City since 2009.

The queries used were as follows:

Q1: Find the total number of rides.
Q2: Find the average ride cost grouped by the number of passengers.
Q3: Find the number of rides grouped by the number of passengers and the pick-up location.
Q4: Find the number of rides grouped by the number of passengers, pick-up location, and trip distance.

Introductions of the databases

Elasticsearch:

Elasticsearch is a distributed, open-source search and analytics engine. It offers a schema-free JSON document store for diverse data types. It supports complex queries and aggregations. Widely used for log and event data analysis by observability companies. It offers powerful full-text search capabilities and integrates with various data sources.

SigLens:

SigLens is a column oriented database built from scratch for observability data. It offers dynamic compressions over each column requiring zero configuration.

SigLens uses columnar micro indices that are 1/100th size of an actual database index. This reduces the number of machines needed. SigLens uses AgileAggregationTree algorithm that makes aggregations queries run at sub-second speeds.

Test Setup

We used an AWS im4gn.2xlarge instance for both SigLens and ClickHouse, it had the following hardware configuration:

8 vCPUs
32 GB of RAM
3.5 TB of storage
10 Gigabit networking

Common steps for both:

Setup the server, mount the storage, configure aws s3 cli by following these steps:

Step 1: Let’s find out if the NVM storage device is present on this server. To find that, ssh into your server and run lsblk. You should see something like the following, with the nvme1n1 item having 3.4 TB of storage.

NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0          7:0    0  21.3M  1 loop /snap/amazon-ssm-agent/7529
loop1          7:1    0  49.1M  1 loop /snap/core18/2794
loop2          7:2    0  59.3M  1 loop /snap/core20/2019
loop3          7:3    0 109.6M  1 loop /snap/lxd/24326
loop4          7:4    0  35.5M  1 loop /snap/snapd/20102
nvme0n1      259:0    0     8G  0 disk
├─nvme0n1p1  259:2    0   7.9G  0 part /
└─nvme0n1p15 259:3    0    99M  0 part /boot/efi
nvme1n1      259:1    0   3.4T  0 disk

Step 2: Let’s mount the storage device and create directories.

sudo mkfs.xfs /dev/nvme1n1
sudo mkdir /mnt/nvme1n1
sudo mount /dev/nvme1n1 /mnt/nvme1n1

Step 3: Verify Storage

You can check that it's mounted by running df -h and you should see something like this:

Filesystem       Size  Used Avail Use% Mounted on
/dev/root        7.6G  1.5G  6.2G  20% /
tmpfs             16G     0   16G   0% /dev/shm
tmpfs            6.2G  948K  6.2G   1% /run
tmpfs            5.0M     0  5.0M   0% /run/lock
/dev/nvme0n1p15   98M  6.3M   92M   7% /boot/efi
tmpfs            3.1G  4.0K  3.1G   1% /run/user/1000
/dev/nvme1n1     3.5T   25G  3.4T   1% /mnt/nvme1n1

Step 4: Update Permissions

sudo chmod 777 /mnt/nvme1n1

Step 5: Configure AWS CLI

Note: This AWS config step is only needed if you are using a private S3 bucket, if you use our public S3 bucket you can skip this step.

sudo apt-get install awscli -y
aws configure

Cleanup and Prepare Dataset:

Note: You can skip this step and use our s3 bucket. It has the dataset in the correct format. If you want to prepare the dataset yourself then follow these steps else jump directly to the benchmark step to use our s3 bucket.

mkdir -p /mnt/nvme1n1/data
cd /mnt/nvme1n1/data

Download the dataset from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page into above dir. We used the 2011-2017 yellow taxi trip parquet files

Next, you need to convert the parquet files to TSV.

git clone git@github.com:siglens/siglens
python -m venv taxis
source taxis/bin/activate
pip install -r siglens/tools/nyc-taxi-benchmark/requirements.txt
python siglens/tools/nyc-taxi-benchmark/parquet_to_tsv.py . .

The above cmd takes two parameters: input_dir , output_dir. It will read *.parquet files from the input_dir and convert them to tsv files and put them in output_dir.

Elasticsearch and SigLens both will use JSON. Let’s convert the tsv files to json now. Put the below content in some bash file and run it or run the below commands in a terminal.

for year in {2011..2017}; do
    for month in {01..12}; do
        basefile="yellow_tripdata_$year-$month"
        go run siglens/tools/sigclient/cmd/utils/converter.go --input 
"$basefile.tsv" --output "$basefile.json" &
    done
done
wait

Finally, compress the TSV and JSON files with gzip and upload them to your AWS S3 bucket. We ingested the TSV and JSON files into separate directories to make it easier to download all of one type.

Benchmark Elasticsearch

You'll want two terminals in your Elasticsearch server; the first will run Elasticsearch and the second will ingest data. Start with Terminal 1.

Download Elasticsearch

cd /mnt/nvme1n1
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.11.4-linux-aarch64.tar.gz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.11.4-linux-aarch64.tar.gz.sha512
shasum -a 512 -c elasticsearch-8.11.4-linux-aarch64.tar.gz.sha512

This should output elasticsearch-8.11.4-linux-aarch64.tar.gz: OK.

tar -xzf elasticsearch-8.11.4-linux-aarch64.tar.gz
cd elasticsearch-8.11.4/

Configure Elasticsearch

Let’s bind to all ports and run elasticsearch in a single node configuration.

Add the following to config/elasticsearch.yml:

network.host: 0.0.0.0
discovery.type: single-node

Also delete the line cluster.initial_master_nodes: ["ip-172-31-24-1"] from that file.

Let’s give more memory to the elasticsearch process.

echo -e "-Xms24g\n-Xmx24g" > config/jvm.options.d/heap.options

Elasticsearch uses mmapfs for storing indices. Let’s increase this setting to a larger number for this test.

sudo sysctl -w vm.max_map_count=262144

Add a template

We will use 6 shards so that we get better write and read performance. We will also disable replication to increase performance. Additionally let’s setup the timestamp field.

curl -k -u "elastic:" \
    --location \
    --request PUT 'https://:9200/_template/temp1' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "index_patterns": "trips",
        "settings": {
            "number_of_shards": 6,
            "number_of_replicas": 0
        },
        "mappings": {
            "_source": {
                "enabled": true
            },
            "properties": {
                "timestamp": {
                    "type": "date",
                    "format": "epoch_millis"
                }
            }
        }
    }'

Save the password for later

If you don't know the password for the elastic user, reset it with

bin/elasticsearch-reset-password -u elastic

The new password will be printed. Save it for later.

Run Elasticsearch

./bin/elasticsearch > elastic.log 2>&1

Ingest the Data

In Terminal 2, clone siglens to use an ingester script from it.

cd /mnt nvme1n1
git clone https://github.com/siglens/siglens.git
cd siglens/tools/nyc-taxi-benchmark

In ingester.py, make these changes:

Change index_data = '{"index": {"_index": "trips", "_type": "_doc"}}' to index_data = '{"index": {"_index": "trips"}}'
In the curl section, add the "-k", option, which allows faking SSL checks
In the curl section, add "-u", "elastic:<your-elastic-password>"
In the curl section, change "http://localhost:8081/elastic/_bulk", to "https://localhost:9200/elastic/_bulk"

Now download and ingest the data:

mkdir dataset

for year in {2011..2017}; do
    for month in {01..12}; do
        {
            basefile="yellow_tripdata_$year-$month"
            aws s3 cp s3://siglens-benchmark-datasets/nyc-taxi-benchmark-data/json/$basefile.json.gz dataset/
            gunzip dataset/$basefile.json.gz
            python3 ingester.py dataset/$basefile.json
        } &
    done
    wait
done

Run the Queries in Elasticsearch

To get benchmark results, we had to clear the Elasticsearch cache after every query. Otherwise, if we ran a query multiple times then the first time would take a while but every subsequent invocation would return much faster than the original search because it was returning the cached response. To clear the cache, run:

curl -k -u "elastic:<your-elastic-password>"-X POST "https://<server-ip>:9200/trips/_cache/clear"

The following should indicate 265 unique values.

curl -k -u "elastic:<your-elastic-password>" "https://<server-ip>:9200/trips/_search" 
    -H 'Content-Type: application/json'
    -d '{
        "size": 0,
        "aggs": {
            "distinct_PULocationID": {
                "cardinality": {
                    "field": "PULocationID"
                }
            }
        }
    }' | python3 -m json.tool

Now let’s run the actual queries.

The responses will have a took field, indicating how long the query took in milliseconds.

# Query 1
curl -k -u "elastic:<your-elastic-password>" "https://<server-ip>:9200/trips/_search" 
    -H 'Content-Type: application/json' 
    -d '{
        "size": 0,
        "aggs": {
            "improvement_surcharge_groups": {
                "terms": {
                    "field": "improvement_surcharge",
                    "size": 10
                },
                "aggs": {
                    "count": {
                        "value_count": {
                            "field": "improvement_surcharge"
                        }
                    }
                }
            }
        }
    }' | python3 -m json.tool | less
                               
# Query 2
curl -k -u "elastic:<your-elastic-password>" "https://<server-ip>:9200/trips/_search" 
    -H 'Content-Type: application/json' 
    -d '{
        "size": 0,
        "aggs": {
            "passenger_count_groups": {
                "terms": {
                    "field": "passenger_count",
                    "size": 36
            },
            "aggs": {
                "average_total_amount": {
                    "avg": {
                        "field": "total_amount"
                    }
                }
            }
        }
    }
}' | python3 -m json.tool | less
                                
# Query 3
curl -k -u "elastic:<your-elastic-password>" "https://<server-ip>:9200/trips/_search" 
    -H 'Content-Type: application/json' 
    -d '{
        "size": 0,
        "aggs": {
            "passenger_count_groups": {
                "terms": {
                    "field": "passenger_count",
                    "size": 36
                },
                "aggs": {
                    "PULocationID_groups": {
                        "terms": {
                            "field": "PULocationID",
                            "size": 265
                        }
                    }
                }
            }
        }
    }' | python3 -m json.tool | less
                                
# Query 4
# For this query we don't want more than 10,000 buckets, so we'll reduce the "size" parameters.
curl -k -u "elastic:<your-elastic-password>" "https://<server-ip>:9200/trips/_search" 
    -H 'Content-Type: application/json' 
    -d '{
        "size": 0,
        "aggs": {
            "passenger_count_groups": {
                "terms": {
                    "field": "passenger_count",
                    "size": 10
                },
            "aggs": {
                "PULocationID_groups": {
                    "terms": {
                        "field": "PULocationID",
                        "size": 10
                    },
                    "aggs": {
                        "trip_distance_groups": {
                            "terms": {
                                "field": "trip_distance",
                                "size": 100
                            },
                            "aggs": {
                                "count": {
                                    "value_count": {
                                        "field":
                                        "trip_distance"
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }' | python3 -m json.tool | less

Elasticsearch Query Results

We ran each query for 5 times, and took the lowest number from those runs.

	Try 1	Try 2	Try 3	Try 4	Try 5	Lowest
Q1	27,780 ms	27,559 ms	27,575 ms	27,646 ms	27,588 ms	27,559 ms
Q2	48,278 ms	49,253 ms	47,315 ms	46,925 ms	44,075 ms	44,075 ms
Q3	50,417 ms	50,274 ms	50,110 ms	50,380 ms	50,522 ms	50,110 ms
Q4	1,26,616 ms	1,26,728 ms	1,25,167 ms	1,27,522 ms	1,26,263 ms	1,25,167 ms

Benchmark SigLens

You'll want three terminals. Terminal 1 will run SigLens, Terminal 2 will do some setup and view the logs, and Terminal 3 will send the queries. Terminal 3 can run on your local machine if you set up the server to accept HTTP traffic, but Terminals 1 and 2 should be on the server. Start with Terminal 1.

Install Go and SigLens

SigLens is built in golang. In this step we will install the go dependencies, install SigLens and start it.

sudo apt update
sudo apt install golang -y
git clone https://github.com/siglens/siglens.git
cd siglens

## Start SigLens
sudo go run cmd/siglens/main.go --config server.yaml

Wait until SigLens is running. You'll see these lines in the terminal once it's up:

INFO[2023-12-06 18:10:38] Extracting config from configFile: server.yaml
INFO[2023-12-06 18:10:38] Defaulting to 2160hrs (90 days) of retention...
INFO[2023-12-06 18:10:38] ----- Siglens server type SingleNode starting up -----
INFO[2023-12-06 18:10:38] ----- Siglens Ingestion server starting on 0.0.0.0:8081 -----
INFO[2023-12-06 18:10:38] ----- Siglens Query server starting on 0.0.0.0:5122 -----
INFO[2023-12-06 18:10:38] ----- Siglens UI starting on 0.0.0.0:5122 -----

Prepare Agile Aggregations

In a continuous running ingest, you don’t need to do this step. However, since this is a static dataset ingestion, we need to set up the aggregations columns. SigLens creates aggregation trees during ingestion using these columns. These trees are used during query processing

Switch to Terminal 2 and run the following:

curl -X POST -d '{
    "tableName": "trips",
    "groupByColumns": ["airport_fee", "passenger_count", "PULocationID", "trip_distance"],
    "measureColumns": ["total_amount"]
}' http://localhost:5122/api/pqs/aggs
echo ""

You should get this response:

{"message":"All OK","status":200}

Ingest data into SigLens

Now let's ingest some data into SigLens. In Terminal 2 run

cd /mnt/nvme1n1/siglens/tools/sigclient 
mkdir dataset

Run the following script to download, decompress, and ingest the data into SigLens.

for year in {2011..2017}; do
    for month in {01..12}; do
        {
            basefile="yellow_tripdata_$year-$month"
            aws s3 cp s3://siglens-benchmark-datasets/nyc-taxi-benchmark-data/json/$basefile.json.gz dataset/
            gunzip dataset/$basefile.json.gz
            python3 ../nyc-taxi-benchmark/ingester.py dataset/$basefile.json
        } &
    done
    wait
done

Run Queries against SigLens

Run the following in Terminal 3. If Terminal 3 is on your local machine, make sure to replace localhost with the IP of the server. You can remove the | python3 -m json.tool if you want, it just formats the JSON response. Check the log file siglens.log for the query times.

curl -X POST -d '{
    "searchText": "SELECT airport_fee, count(*) FROM trips GROUP BY airport_fee",
    "indexName": "trips",
    "startEpoch": "now-24h",
    "endEpoch": "now",
    "queryLanguage": "SQL"
}' http://localhost:5122/api/search | python3 -m json.tool

curl -X POST -d '{
    "searchText": "SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count",
    "indexName": "trips",
    "startEpoch": "now-24h",
    "endEpoch": "now",
    "queryLanguage": "SQL"
}' http://localhost:5122/api/search | python3 -m json.tool

curl -X POST -d '{
    "searchText": "SELECT passenger_count, PULocationID, count(*) FROM trips GROUP BY passenger_count, PULocationID",
    "indexName": "trips",
    "startEpoch": "now-24h",
    "endEpoch": "now",
    "queryLanguage": "SQL"
}' http://localhost:5122/api/search | python3 -m json.tool

curl -X POST -d '{
    "searchText": "SELECT passenger_count, PULocationID, trip_distance, count(*) FROM trips GROUP BY passenger_count, PULocationID, trip_distance",
    "indexName": "trips",
    "startEpoch": "now-24h",
    "endEpoch": "now",
    "queryLanguage": "SQL"
}' http://localhost:5122/api/search | python3 -m json.tool

SigLens Query Results:

We ran each query for 5 times, and took the lowest number from those runs.

	Try 1	Try 2	Try 3	Try 4	Try 5	Lowest
Q1	60 ms	72 ms	63 ms	41 ms	42 ms	41 ms
Q2	43 ms	43 ms	44 ms	43 ms	43 ms	43 ms
Q3	116 ms	304 ms	442 ms	509 ms	471 ms	116 ms
Q4	9,815 ms	9,229 ms	10,316 ms	9,093 ms	9,092 ms	9,092 ms

Result Comparison

As you have seen, we used the exact same hardware, exact same dataset, exact same queries. We took the lowest of the 5 runs for each query type for both of these databases and compiled that result in this table:

Query Type	Elasticsearch	SigLens	Comparison
Query 1	27,559 ms	55 ms	SigLens is 501x faster than Elasticsearch
Query 2	44,075 ms	43 ms	SigLens is 1025x faster than Elasticsearch
Query 3	50,110 ms	368 ms	SigLens is 136x faster than Elasticsearch
Query 4	1,25,167 ms	9509 ms	SigLens is 13x faster than Elasticsearch

SigLens proves to be 13x-1025x faster on all 4 query types. These speed advantages of SigLens are due to its dynamic columnar compressions, dynamic columnar microindexing, AgileAggsTree. These innovative techniques help run your filter and aggregation queries on complex datasets at lightning fast speeds.

Try SigLens in less than 1 minute

SigLens gets installed in less than a minute with docker or helm chart. Give it a try and experience the difference.

Try SigLens

GitHub Repo

Benchmark Questions?

Both of these databases were given the exact same treatment, there was no unfair advantage given to SigLens.

You can verify these test results yourself by following the detailed steps we have outlined above. If you feel something is incorrect please submit an issue or a pull request at https://github.com/siglens/siglens.

* Query 1 is a little different than Query 1 for SigLens. This is because Elasticsearch was unable to perform an aggregation on the `airport_fee` column because it was ingested as a text field. So instead, this Query 1 aggregates on the `improvement_surcharge` field. This should be comparable because the `airport_fee` column only has one bucket, while the `improvement_surcharge` column has only 2, and one of those only accounts for 360 rows of the more than 1 billion rows in the dataset.