ExpandoDB

A JSON document database with super-fast search

features

ok, so what is this?

Well, as it says on the tin, ExpandoDB is a JSON document database.

  • It is meant to store any data that can be represented as JSON - this means text extracted from crawled html files, log files, emails, PDF files, MS-Office files, etc., or data imported from other databases.
  • More importantly, it is meant to enable super-fast search of the stored data. ExpandoDB’s search engine is powered by Lucene, the de facto standard for search engines. Unlike other Lucene-based search engines, ExpandoDB doesn’t require you to create Lucene index schemas - it auto-generates the schemas for you!
  • It supports a full range of query operations: keyword search, range search, wildcard search, regex search, proximity search, fuzzy search, etc. See this article for an overview of Lucene query syntax.
  • It highlights the part of the text that contains the search term(s).
  • It is easy to set up - you can download and install it in under 5 minutes! It’s packaged as a self-contained microservice, so it doesn’t have any dependencies other than the .NET runtime v4.5.1.
  • It has an easy-to-use REST API, and it comes bundled with a Swagger API spec and viewer.
  • It’s open-source, released under the Apache v2 license.

Sounds good? Why not download and try it out!

download

up and running in 5 minutes

  • Download the latest version from the ExpandoDB GitHub releases page.
    • ExpandoDB.Service.zip - this contains the main ExpandoDB binaries.
    • (Optional) Loader.zip - this is an app that populates ExpandoDB with sample data (the Reuters-21578 dataset).
  • IMPORTANT: You need to unblock the zip files in Windows before unzipping them. How? Just right-click each file and choose ‘Unblock’, as explained in this article. ExpandoDB will not launch if you skip this step.
  • Unzip ExpandoDB.Service.zip to a suitable directory, e.g. d:\ExpandoDB. Likewise, unzip Loader.zip to a suitable directory, e.g. d:\Loader.
  • Open an Admin command prompt and cd to the ExpandoDB directory.
  • Enter the command ExpandoDB.Service.exe install. This will install ExpandoDB as a Windows service. On the same command prompt, enter the command net start ExpandoDB.Service. This will start the ExpandoDB service; it will start listening on port 9000 (you can change the port number in the application config file).
  • If you also downloaded and unzipped Loader.zip, cd to the Loader directory, and run Loader.exe. This will load a subset of the Reuters-21578 dataset into the system. The Reuters-21578 dataset is a collection of Reuters news articles from the 1980s.
  • Open a web browser and go to this URL: http://localhost:9000/db/reuters. If you’re using Chrome and you have an extension for viewing JSON (such as JSONView), you should see the response as formatted JSON. If you don’t have a JSON viewer, you’ll see the raw JSON.
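
The same first look can be scripted instead of done in a browser. The following is a minimal sketch that uses only the Python standard library; the helper names collection_url and fetch_json are made up for this illustration, and the port assumes the default of 9000 mentioned above.

```python
import json
import urllib.error
import urllib.request


def collection_url(collection, host="localhost", port=9000):
    """Build the URL of a Document Collection, e.g. /db/reuters."""
    return f"http://{host}:{port}/db/{collection}"


def fetch_json(url, timeout=5.0):
    """Return the parsed JSON body, or None if the request fails.

    Connection errors, HTTP errors, timeouts, and invalid JSON all map to None.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        return None
```

Calling fetch_json(collection_url("reuters")) should return the parsed response when the service is running, and None when it is not reachable.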

Now that ExpandoDB is up and running, let’s go over to the REST API overview.


gone in 5 minutes

  • To uninstall, open an Admin command prompt and cd to the ExpandoDB directory.
  • Enter the command net stop ExpandoDB.Service. This will stop the ExpandoDB service if it is running.
  • Enter the command ExpandoDB.Service.exe uninstall. This will remove the ExpandoDB service from Windows.
  • Delete the ExpandoDB directory.
  • Delete the Loader directory.

rest api

it’s got an easy-to-use REST API

  • To insert a new Document, use the POST /db/{collection} endpoint. ExpandoDB will auto-create the target Document Collection if it doesn’t exist.
  • To find out about the schema of a specific Document Collection, use the GET /db/_schemas/{collection} endpoint.
  • To find out what Document Collections are in the Database and what their schemas are, use the GET /db/_schemas endpoint.
  • To search a Document Collection, use the GET /db/{collection} endpoint. This is the API endpoint you’ll be using the most, so do take time to read the documentation below.
  • To count items in a Document Collection, use the GET /db/{collection}/count endpoint.
  • To retrieve a single Document from a Document Collection, use the GET /db/{collection}/{id} endpoint.
  • To update an existing Document, use the PUT /db/{collection}/{id} endpoint. The Document that you send will replace the existing one.
  • To partially update an existing Document, use the PATCH /db/{collection}/{id} endpoint. This endpoint implements the JSON-Patch standard defined in RFC 6902. Note that ExpandoDB only supports the following PATCH operations: add, remove, and replace.
  • To remove an existing Document, use the DELETE /db/{collection}/{id} endpoint.
  • To remove an entire Document Collection, use the DELETE /db/{collection} endpoint.
  • If ExpandoDB is set up and running locally on your machine, do load the ExpandoDB Swagger API spec into your browser and try out the endpoints.
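
For quick reference, the endpoints listed above can be collected into a small lookup table. This is only a sketch: the HTTP methods and paths come from the list, while the operation names and the endpoint helper are invented for this illustration, and the base URL assumes the default port.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:9000"  # default port; change it if you edited the config

# (HTTP method, path template) for each operation in the list above.
# The dictionary keys are illustrative names, not part of ExpandoDB.
ENDPOINTS = {
    "insert_document": ("POST", "/db/{collection}"),
    "get_schema": ("GET", "/db/_schemas/{collection}"),
    "get_schemas": ("GET", "/db/_schemas"),
    "search": ("GET", "/db/{collection}"),
    "count": ("GET", "/db/{collection}/count"),
    "get_document": ("GET", "/db/{collection}/{id}"),
    "replace_document": ("PUT", "/db/{collection}/{id}"),
    "patch_document": ("PATCH", "/db/{collection}/{id}"),
    "delete_document": ("DELETE", "/db/{collection}/{id}"),
    "drop_collection": ("DELETE", "/db/{collection}"),
}


def endpoint(operation, **params):
    """Return (method, url) for an operation, URL-escaping each path parameter."""
    method, template = ENDPOINTS[operation]
    escaped = {name: quote(str(value), safe="") for name, value in params.items()}
    return method, BASE_URL + template.format(**escaped)
```

For example, endpoint("count", collection="reuters") yields the method GET and the URL of the reuters count endpoint.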

show me

how do I …

… insert a new Document?

  • Using your favorite HTTP command line tool (e.g. curl), library (e.g. RestSharp), or Chrome app (e.g. Postman), POST a JSON Document to the /db/{collection} endpoint. The target Document Collection will be auto-created if it doesn’t exist yet.
  • ExpandoDB will return a response acknowledging the insert.
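
As a rough illustration of the insert step, here is a sketch that builds the POST request with the Python standard library. The book fields and values are made-up sample data, build_insert_request is a hypothetical helper, and the Content-Type header is an assumption about what the service expects.

```python
import json
import urllib.request


def build_insert_request(collection, document, base_url="http://localhost:9000"):
    """Build (but do not send) a POST /db/{collection} request carrying a JSON body."""
    return urllib.request.Request(
        url=f"{base_url}/db/{collection}",
        data=json.dumps(document).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},  # assumed, not confirmed
    )


# Hypothetical sample Document for a books collection.
book = {
    "title": "The Hobbit",
    "author": "J. R. R. Tolkien",
    "description": "A hobbit goes on an unexpected journey.",
}
insert_request = build_insert_request("books", book)

# To send it against a running ExpandoDB instance:
#   with urllib.request.urlopen(insert_request) as response:
#       print(response.read().decode("utf-8"))
```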

… search for Documents?

  • Let’s go back to our reuters Document Collection. For search, we only need to send GET requests, so we can simply use Chrome (with JSONView).
  • Say we want to search for the top 10 news articles with the word petroleum in the title, sorted by the date in descending order; plus we only want to see the title, date, and text fields.
  • Now let’s search for the top 10 news articles that mention the words OPEC, petroleum, and price in any part of the article (title, text). We want to sort the matching articles by date in descending order. We want to see the title and date fields. We also want to see the _highlight field - which displays, for each Document, a text fragment that contains the search term(s); the matching search terms are enclosed in HTML tags that will render as highlights in a web browser. The _highlight field is typically used when the Documents are large (e.g. extracted from whole PDF or MS-Word documents) and it’s not practical to retrieve and display all Document fields in the search results UI.
  • Now let’s search for news articles with any word that starts with petrol* in any part of the article; we only want articles published in February 1987 (i.e. between 1987-02-01 and 1987-02-28).
  • To allow for minor misspellings in the query terms, we can do a fuzzy search by appending ~ to the search term. For example, let’s search for news articles with indonisea (note the misspelling) in the title. Fuzzy search is based on the Damerau–Levenshtein distance. Note that the maximum edit distance supported is 2, which is also the default, so the distance can be omitted and the term written simply as indonisea~.
  • Finally let’s search for news articles with no title. ExpandoDB uses a special token to denote missing (i.e. null) values: _null_. The token can be modified in the application config file.
  • See this article to learn more about the query syntax.
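
The searches described above can be sketched as Lucene query strings plus a small URL builder. Treat this as an illustration under stated assumptions: the query parameter names (where, select, sortBy, topN, highlight) and the "-date" sort notation are hypothetical placeholders, so check the bundled Swagger spec for the real ones. The query strings use standard Lucene syntax; whether unqualified terms are matched against the whole article is also an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9000"  # default port from the install steps


def search_url(collection, where, select=None, sort_by=None, top_n=None, highlight=False):
    """Build a GET /db/{collection} search URL.

    NOTE: every query parameter name used here is a hypothetical placeholder.
    """
    params = {"where": where}
    if select:
        params["select"] = ",".join(select)
    if sort_by:
        params["sortBy"] = sort_by
    if top_n is not None:
        params["topN"] = str(top_n)
    if highlight:
        params["highlight"] = "true"
    return f"{BASE_URL}/db/{collection}?{urlencode(params)}"


# Lucene query strings for the five searches described above.
QUERIES = {
    "keyword": "title:petroleum",
    "all_words": "OPEC AND petroleum AND price",  # assumes a full-text default field
    "wildcard_and_range": "petrol* AND date:[1987-02-01 TO 1987-02-28]",
    "fuzzy": "title:indonisea~",
    "missing": "title:_null_",
}

first_search = search_url(
    "reuters",
    QUERIES["keyword"],
    select=["title", "date", "text"],
    sort_by="-date",  # hypothetical notation for descending order
    top_n=10,
)
```

Because these are plain GET requests, the resulting URLs can be pasted straight into a browser, which is how the walkthrough above runs its searches.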

… update an existing Document?

  • Let’s go back to the books Document Collection and update all the fields of a specific book - let’s add the word ‘UPDATED’ to the title, author, and description. We do this using the PUT API.
  • Let’s check if the book was updated.
  • Now let’s make a few partial changes to our book. Let’s add a new array field called reviews. In addition, let’s update the title field and remove the word UPDATED from it. We do both of these modifications in one transaction using the PATCH API. To learn more about the syntax of the PATCH API, see the JSON-Patch standard defined in RFC 6902. Note that ExpandoDB only supports the following PATCH operations: add, remove, and replace.
  • Let’s see if the book was updated.
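
A sketch of the two update styles follows. The operation objects use the JSON-Patch format from RFC 6902 (op, path, value); the Document ID, the book values, the helper names, and the Content-Type header are all assumptions made for this illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9000"  # default port from the install steps
SUPPORTED_PATCH_OPS = {"add", "remove", "replace"}  # the only ops ExpandoDB supports


def _json_request(method, path, payload):
    """Build (but do not send) a request with a JSON body."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},  # assumed, not confirmed
    )


def build_replace_request(collection, doc_id, document):
    """PUT: the Document sent replaces the stored one entirely."""
    return _json_request("PUT", f"/db/{collection}/{doc_id}", document)


def build_patch_request(collection, doc_id, operations):
    """PATCH: apply a list of JSON-Patch operations, rejecting unsupported ones."""
    for operation in operations:
        if operation["op"] not in SUPPORTED_PATCH_OPS:
            raise ValueError(f"unsupported PATCH operation: {operation['op']}")
    return _json_request("PATCH", f"/db/{collection}/{doc_id}", operations)


BOOK_ID = "hypothetical-book-id"  # placeholder; use the _id of a real Document
patch_operations = [
    {"op": "add", "path": "/reviews", "value": [{"rating": 5, "comment": "A classic"}]},
    {"op": "replace", "path": "/title", "value": "The Hobbit"},
]
patch_request = build_patch_request("books", BOOK_ID, patch_operations)
```

Checking the operation names on the client side is a design choice for this sketch: it fails fast on operations such as move or copy, which RFC 6902 defines but ExpandoDB does not support.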

… delete an existing Document?

  • Again using the books Document Collection, let’s delete a specific book. We do this using the DELETE API, specifying the book’s Document ID as parameter.
  • If we try to retrieve the book, we will get a 404.

… drop an entire Document Collection?

  • Let’s drop the entire books Document Collection. We do this using the DELETE API, specifying just the Collection name.
  • If we try to view the books Document Collection, we will get a 404.
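
Deleting a single Document and dropping a whole Collection differ only in the URL, so one sketch can cover both. The helper names and the Document ID below are hypothetical; the 404 check mirrors the behavior described above.

```python
import urllib.error
import urllib.request

BASE_URL = "http://localhost:9000"  # default port from the install steps


def build_delete_request(collection, doc_id=None):
    """Build a DELETE request: one Document if doc_id is given, else the whole Collection."""
    path = f"/db/{collection}" if doc_id is None else f"/db/{collection}/{doc_id}"
    return urllib.request.Request(url=BASE_URL + path, method="DELETE")


def is_gone(url, timeout=5.0):
    """Return True if a GET on the URL answers 404, False if it succeeds.

    Other HTTP errors also return False; connection errors are not caught.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return False
    except urllib.error.HTTPError as error:
        return error.code == 404


delete_book = build_delete_request("books", "hypothetical-book-id")
drop_books = build_delete_request("books")
```

After sending either request with urllib.request.urlopen, is_gone on the same URL should return True.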

… see what Document Collections are available in the Database, and what their fields are?

  • Simply send a GET request to the /db/_schemas endpoint.
  • You will notice that ExpandoDB creates additional fields for each Document:
    • _id is the unique identifier for the Document
    • _createdTimestamp is the date/time (UTC) the Document was created
    • _modifiedTimestamp is the date/time (UTC) the Document was last modified
    • _full_text is the full-text representation of the Document (i.e. concatenation of all the fields of the Document).
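
When processing Documents on the client side, it can be handy to separate these auto-generated fields from your own. The sketch below does that for a Document that has already been parsed into a Python dict; the sample Document and its values are made up, and split_fields is a hypothetical helper.

```python
# Field names that ExpandoDB adds to every Document, as listed above.
SYSTEM_FIELDS = ("_id", "_createdTimestamp", "_modifiedTimestamp", "_full_text")


def split_fields(document):
    """Split a Document into (system_fields, user_fields) dictionaries."""
    system = {name: value for name, value in document.items() if name in SYSTEM_FIELDS}
    user = {name: value for name, value in document.items() if name not in SYSTEM_FIELDS}
    return system, user


# Hypothetical Document, shaped like one returned by GET /db/{collection}/{id}.
sample = {
    "_id": "hypothetical-book-id",
    "_createdTimestamp": "2016-01-01T00:00:00Z",
    "_modifiedTimestamp": "2016-01-02T00:00:00Z",
    "title": "The Hobbit",
    "author": "J. R. R. Tolkien",
}
system_fields, user_fields = split_fields(sample)
```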

At this point, we can now work with the full Reuters-21578 dataset.

  • Drop the reuters Collection.
  • Go to the Loader\reuters directory and extract the contents of reuters21578.tar.gz to the same directory.
  • Run Loader.exe to load the contents of all the files (this will take a few minutes to complete).
  • Try out the REST API endpoints against the full Reuters-21578 dataset.

metrics

metrics

In order to monitor and improve ExpandoDB’s performance, we need to collect runtime performance metrics. Why collect metrics? Coda Hale explains it beautifully in this presentation.

ExpandoDB uses the Metrics.NET library to provide the following metrics.

  • Request execution times for each REST endpoint
  • Error rates
  • Number of active requests
  • Sizes (in bytes) of POST and PUT requests
  • Process metrics such as CPU and memory usage, GC heap sizes, etc.

To see these metrics in action, go to the metrics dashboard at http://localhost:9000/metrics.

code

about the code