ExpandoDB

A JSON document database with super-fast search


features

ok, so what is this?

Well, as it says on the tin, ExpandoDB is a JSON document database.

It is meant to store any data that can be represented as JSON: text extracted from crawled HTML files, log files, emails, PDF files, MS-Office files, etc., or data imported from other databases.
More importantly, it is meant to enable super-fast search of the stored data. ExpandoDB’s search engine is powered by Lucene, the de facto standard for search engines. Unlike other Lucene-based search engines, ExpandoDB doesn’t require you to create Lucene index schemas - it auto-generates the schemas for you!
It supports a full range of query operations: keyword search, range search, wildcard search, regex search, proximity search, fuzzy search, etc. See this article for an overview of Lucene query syntax.
It highlights the part of the text that contains the search term(s).
It is easy to set up - you can download and install it in under 5 minutes! It’s packaged as a self-contained microservice, so it doesn’t have any dependencies other than the .NET runtime v4.5.1.
It has an easy-to-use REST API, and it comes bundled with a Swagger API spec and viewer.
It’s open-source, released under the Apache v2 license.

Sounds good? Why not download and try it out!

download

up and running in 5 minutes

Download the latest version from the ExpandoDB GitHub releases page.
- ExpandoDB.Service.zip - this contains the main ExpandoDB binaries.
- (Optional) Loader.zip - this is an app that populates ExpandoDB with sample data (the Reuters-21578 dataset).

IMPORTANT: You need to unblock the zip files in Windows before unzipping them. How? Just right-click and choose ‘Unblock’, as explained in this article. ExpandoDB will not launch if you skip this step.

Unzip ExpandoDB.Service.zip to a suitable directory, e.g. d:\ExpandoDB. Likewise, unzip Loader.zip to a suitable directory, e.g. d:\Loader.
Open an Admin command prompt and cd to the ExpandoDB directory.
Enter the command ExpandoDB.Service.exe install. This will install ExpandoDB as a Windows service. On the same command prompt, enter the command net start ExpandoDB.Service. This will start the ExpandoDB service; it will start listening on port 9000 (you can change the port number in the application config file).
If you also downloaded and unzipped Loader.zip, cd to the Loader directory, and run Loader.exe. This will load a subset of the Reuters-21578 dataset into the system. The Reuters-21578 dataset is a collection of Reuters news articles from the 1980s.
Open a web browser and go to this URL: http://localhost:9000/db/reuters. If you’re using Chrome and you have an extension for viewing JSON (such as JSONView), you should see something like the screenshot below. If you don’t have a JSON viewer, you’ll see the raw JSON.
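The same check can be scripted instead of done in a browser. The following is a minimal sketch, assuming the service is listening on localhost at the default port 9000; the helper names `collection_url` and `fetch_json` are hypothetical and not part of ExpandoDB.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:9000"  # default port from the install step


def collection_url(collection, base_url=BASE_URL):
    """Build the URL of a Document Collection, e.g. /db/reuters."""
    return f"{base_url}/db/{quote(collection, safe='')}"


def fetch_json(url, timeout=10):
    """Send a GET request and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running and the sample data loaded, `fetch_json(collection_url("reuters"))` should return the same JSON that the browser displays.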

Now that ExpandoDB is up and running, let’s go over to the REST API overview.

gone in 5 minutes

To uninstall, open an Admin command prompt and cd to the ExpandoDB directory.
Enter the command net stop ExpandoDB.Service. This will stop the ExpandoDB service if it is running.
Enter the command ExpandoDB.Service.exe uninstall. This will remove the ExpandoDB service from Windows.
Delete the ExpandoDB directory.
Delete the Loader directory.

rest api

it’s got an easy to use REST API

To insert a new Document, use the POST /db/{collection} endpoint. ExpandoDB will auto-create the target Document Collection if it doesn’t exist.
To find out about the schema of a specific Document Collection, use the GET /db/_schemas/{collection} endpoint.
To find out what Document Collections are in the Database and what their schemas are, use the GET /db/_schemas endpoint.
To search a Document Collection, use the GET /db/{collection} endpoint. This is the API endpoint you’ll be using the most, so do take time to read the documentation below.
To count items in a Document Collection, use the GET /db/{collection}/count endpoint.
To retrieve a single Document from a Document Collection, use the GET /db/{collection}/{id} endpoint.
To update an existing Document, use the PUT /db/{collection}/{id} endpoint. The Document that you send will replace the existing one.
To partially update an existing Document, use the PATCH /db/{collection}/{id} endpoint. This endpoint implements the JSON Patch standard defined in RFC 6902. Note that ExpandoDB only supports the following PATCH operations: add, remove, and replace.
To remove an existing Document, use the DELETE /db/{collection}/{id} endpoint.
To remove an entire Document Collection, use the DELETE /db/{collection} endpoint.
If ExpandoDB is set up and running locally on your machine, do load the ExpandoDB Swagger API spec into your browser and try out the endpoints.
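The endpoints listed above can be summarized as a small lookup table. This is only a sketch: the operation names (`insert`, `drop`, and so on) and the `build_request` helper are hypothetical conveniences, while the HTTP methods and paths are the ones described above. The base URL assumes the default local install.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:9000"  # assumes the default local install

# (HTTP method, path template) for each operation described above.
ENDPOINTS = {
    "insert": ("POST", "/db/{collection}"),
    "schema": ("GET", "/db/_schemas/{collection}"),
    "schemas": ("GET", "/db/_schemas"),
    "search": ("GET", "/db/{collection}"),
    "count": ("GET", "/db/{collection}/count"),
    "get": ("GET", "/db/{collection}/{id}"),
    "replace": ("PUT", "/db/{collection}/{id}"),
    "patch": ("PATCH", "/db/{collection}/{id}"),
    "delete": ("DELETE", "/db/{collection}/{id}"),
    "drop": ("DELETE", "/db/{collection}"),
}


def build_request(operation, collection=None, doc_id=None, body=None):
    """Build (but do not send) the HTTP request for one operation."""
    method, template = ENDPOINTS[operation]
    if "{collection}" in template and not collection:
        raise ValueError(f"'{operation}' needs a collection name")
    if "{id}" in template and not doc_id:
        raise ValueError(f"'{operation}' needs a Document id")
    path = template.format(
        collection=quote(collection or "", safe=""),
        id=quote(doc_id or "", safe=""),
    )
    return urllib.request.Request(BASE_URL + path, data=body, method=method)
```

A request built this way can be sent with `urllib.request.urlopen(request)` once the service is running.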

show me

how do I …

… insert a new Document?

Using your favorite HTTP command line tool (e.g. curl), library (e.g. RestSharp), or Chrome app (e.g. Postman), POST a JSON Document to the /db/{collection} endpoint. The target Document Collection will be auto-created if it doesn’t exist yet.
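As an illustration, here is a minimal Python sketch of such a POST using only the standard library. The sample book and its values are made up for this example; the field names (title, author, description) follow the books Collection used later on this page. The `Content-Type` header is an assumption about what the service expects.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9000"  # assumes the default local install


def build_insert_request(collection, document):
    """Build a POST request that inserts one JSON Document."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/db/{collection}",
        data=body,
        method="POST",
        # Assumption: the service accepts a plain JSON content type.
        headers={"Content-Type": "application/json"},
    )


# Made-up sample Document, for illustration only.
book = {
    "title": "Moby-Dick",
    "author": "Herman Melville",
    "description": "A whaling voyage told by a sailor named Ishmael.",
}
request = build_insert_request("books", book)
```

Calling `urllib.request.urlopen(request)` sends the Document to a running service.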
ExpandoDB will return a response like so:

… search for Documents?

Let’s go back to our reuters Document Collection. For search, we only need to send GET requests, so we can simply use Chrome (with JSONView).
Say we want to search for the top 10 news articles with the word petroleum in the title, sorted by the date in descending order; plus we only want to see the title, date, and text fields.
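One way to build such a request is sketched below. Note the assumptions: the query-string parameter names (`where`, `select`, `orderBy`, `topN`) and the leading `-` for descending order are placeholders chosen for this sketch, so check them against the bundled Swagger API spec before relying on them. Only the `title:petroleum` clause is standard Lucene query syntax.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9000"  # assumes the default local install


def search_url(collection, where, select=None, order_by=None, top_n=None):
    """Build a search URL for GET /db/{collection}.

    ASSUMPTION: the parameter names used here are illustrative only;
    verify the real names in the Swagger API spec.
    """
    params = {"where": where}
    if select:
        params["select"] = ",".join(select)
    if order_by:
        params["orderBy"] = order_by
    if top_n is not None:
        params["topN"] = str(top_n)
    return f"{BASE_URL}/db/{collection}?{urlencode(params)}"


url = search_url(
    "reuters",
    where="title:petroleum",          # Lucene field query
    select=["title", "date", "text"],  # fields to return
    order_by="-date",                  # assumed syntax for descending order
    top_n=10,
)
```

`urlencode` percent-encodes the colon and commas, so the resulting URL can be pasted directly into a browser.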
Now let’s search for the top 10 news articles that mention the words OPEC, petroleum, and price in any part of the article (title, text). We want to sort the matching articles by date in descending order. We want to see the title and date fields. We also want to see the _highlight field, which displays, for each Document, a text fragment that contains the search term(s); the matching search terms are enclosed in HTML tags that will render as highlights in a web browser. The _highlight field is typically used when the Documents are large (e.g. extracted from whole PDF or MS-Word documents) and it’s not practical to retrieve and display all Document fields in the search results UI.
Now let’s search for news articles with any word that starts with petrol* in any part of the article; we only want articles published in February 1987 (i.e. between 1987-02-01 and 1987-02-28).
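The query text for this search combines a wildcard term with an inclusive date range, both of which are standard Lucene query syntax. The sketch below composes that text; the helper names are hypothetical, and it assumes that an unqualified term such as `petrol*` is matched against the whole article and that the date field accepts `YYYY-MM-DD` values in range queries.

```python
import calendar
from datetime import date


def month_range(field, year, month):
    """Lucene inclusive range clause covering one calendar month."""
    last_day = calendar.monthrange(year, month)[1]
    start = date(year, month, 1).isoformat()
    end = date(year, month, last_day).isoformat()
    return f"{field}:[{start} TO {end}]"


def all_of(*clauses):
    """Require every clause to match (Lucene AND operator)."""
    return " AND ".join(clauses)


query = all_of("petrol*", month_range("date", 1987, 2))
print(query)  # petrol* AND date:[1987-02-01 TO 1987-02-28]
```

Computing the last day with `calendar.monthrange` avoids hard-coding month lengths and handles leap years.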
To allow for minor misspellings in the query terms, we can do a fuzzy search by appending ~ to the search term. For example, let’s search for news articles with indonisea (note the misspelling) in the title. Fuzzy search is based on the Damerau–Levenshtein distance. Note that the max edit distance supported is 2, which is the default; so we could have written the query below as indonisea~.
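To see why indonisea still matches indonesia, it helps to compute the edit distance by hand. The sketch below implements the optimal string alignment variant of the Damerau–Levenshtein distance (insertions, deletions, substitutions, and swaps of adjacent characters). It is an illustration of the metric only, not the code Lucene uses internally.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between two strings."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def within_fuzzy_range(term, candidate, max_edits=2):
    """True if candidate is within max_edits of the search term."""
    return osa_distance(term, candidate) <= max_edits


print(osa_distance("indonisea", "indonesia"))  # 2
```

The two words differ by two substitutions (i/e and e/i, separated by the s), so the distance is 2, which is exactly the maximum the fuzzy search allows.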
Finally let’s search for news articles with no title. ExpandoDB uses a special token to denote missing (i.e. null) values: _null_. The token can be modified in the application config file.
See this article to learn more about the query syntax.

… update an existing Document?

Let’s go back to the books Document Collection and update all the fields of a specific book - let’s add the word ‘UPDATED’ to the title, author, and description. We do this using the PUT API.
Let’s check if the book was updated.
Now let’s make a few partial changes to our book. Let’s add a new array field called reviews. In addition, let’s update the title field and remove the word UPDATED from it. We do both of these modifications in one transaction using the PATCH API. To learn more about the syntax of the PATCH API, see the JSON Patch standard defined in RFC 6902. Note that ExpandoDB only supports the following PATCH operations: add, remove, and replace.
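A JSON Patch body is simply an array of operation objects, each with an `op`, a `path`, and (for add and replace) a `value`. The sketch below builds such a request. The Document id, the review contents, and the book title are made up for illustration, and the `application/json-patch+json` content type is the media type defined by RFC 6902; whether ExpandoDB requires it is an assumption.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9000"  # assumes the default local install
SUPPORTED_OPS = {"add", "remove", "replace"}  # per the note above


def build_patch_request(collection, doc_id, operations):
    """Build a PATCH request carrying a JSON Patch (RFC 6902) body."""
    for operation in operations:
        if operation["op"] not in SUPPORTED_OPS:
            raise ValueError(f"unsupported PATCH operation: {operation['op']}")
    body = json.dumps(operations).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/db/{collection}/{doc_id}",
        data=body,
        method="PATCH",
        # Assumption: the service accepts the RFC 6902 media type.
        headers={"Content-Type": "application/json-patch+json"},
    )


# Both changes travel in a single request.
operations = [
    # Hypothetical review structure, for illustration only.
    {"op": "add", "path": "/reviews", "value": [{"rating": 5, "comment": "A classic."}]},
    {"op": "replace", "path": "/title", "value": "Moby-Dick"},
]
request = build_patch_request("books", "hypothetical-document-id", operations)
```

Validating the operation names on the client side gives a clearer error than waiting for the server to reject an unsupported operation such as move or copy.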
Let’s see if the book was updated.

… delete an existing Document?

Again using the books Document Collection, let’s delete a specific book. We do this using the DELETE API, specifying the book’s Document ID as parameter.
If we try to retrieve the book, we will get a 404.
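In a script, the delete-then-verify sequence might look like the sketch below. The helper names are hypothetical, and the Document id is whatever id the book was stored under. The 404 check relies on `urllib` raising `HTTPError` for error status codes.

```python
import urllib.error
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:9000"  # assumes the default local install


def document_url(collection, doc_id):
    """URL of a single Document: /db/{collection}/{id}."""
    return f"{BASE_URL}/db/{quote(collection, safe='')}/{quote(doc_id, safe='')}"


def build_delete_request(collection, doc_id):
    """Build (but do not send) the DELETE request for one Document."""
    return urllib.request.Request(document_url(collection, doc_id), method="DELETE")


def document_exists(collection, doc_id, timeout=10):
    """Return False if the service answers 404, True if the GET succeeds."""
    try:
        with urllib.request.urlopen(document_url(collection, doc_id), timeout=timeout):
            return True
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise  # any other HTTP error is unexpected here
```

After sending the DELETE with `urllib.request.urlopen(build_delete_request(...))`, `document_exists(...)` should return False for the same id.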

… drop an entire Document Collection?

Let’s drop the entire books Document Collection. We do this using the DELETE API, specifying just the Collection name.
If we try to view the books Document Collection, we will get a 404.

… see what Document Collections are available in the Database, and what their fields are?

Simply send a GET request to the /db/_schemas endpoint.
You will notice that ExpandoDB creates additional fields for each Document:
- _id is the unique identifier for the Document
- _createdTimestamp is the date/time (UTC) the Document was created
- _modifiedTimestamp is the date/time (UTC) the Document was last modified
- _full_text is the full-text representation of the Document (i.e. concatenation of all the fields of the Document).
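When processing a schema in code, it can be handy to separate these generated fields from the fields you supplied yourself. The sketch below works on a plain list of field names, since the exact shape of the /db/_schemas response is not shown here; the helper name is hypothetical and the sample names are the reuters fields used earlier plus the four generated ones.

```python
# Fields that ExpandoDB adds to every Document, as listed above.
SYSTEM_FIELDS = ("_id", "_createdTimestamp", "_modifiedTimestamp", "_full_text")


def split_field_names(field_names):
    """Separate generated field names from user-supplied ones."""
    system = [name for name in field_names if name in SYSTEM_FIELDS]
    user = [name for name in field_names if name not in SYSTEM_FIELDS]
    return system, user


names = [
    "_id", "title", "date", "text",
    "_createdTimestamp", "_modifiedTimestamp", "_full_text",
]
system, user = split_field_names(names)
print(user)  # ['title', 'date', 'text']
```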

At this point, we can now work with the full Reuters-21578 dataset.

Drop the reuters Collection.
Go to the Loader\reuters directory and extract the contents of reuters21578.tar.gz to the same directory.
Run Loader.exe to load the contents of all the files (this will take a few minutes to complete).
Try out the REST API endpoints against the full Reuters-21578 dataset.

metrics

To monitor and improve ExpandoDB’s performance, we need to collect runtime performance metrics. Why collect metrics? Coda Hale explains it beautifully in this presentation.

ExpandoDB uses the Metrics.NET library to provide the following metrics:
- Request execution times for each REST endpoint
- Error rates
- Number of active requests
- Sizes (in bytes) of POST and PUT requests
- Process metrics such as CPU and memory usage, GC heap sizes, etc.

The following screenshot shows some of these metrics in action. To access the metrics dashboard, go to http://localhost:9000/metrics.

code

about the code

It’s open source (Apache v2 license) - feel free to fork the code on GitHub.
Huge thanks to the following excellent open source projects:
- Lucene and FlexLucene
- Lightning MDB and Lightning.NET
- Wire
- Jil
- lz4net
- NancyFx
- Metrics.NET
- TopShelf
- log4net and Common Logging
- SinglePaged
- Swagger
Contributors are welcome!