BerryDB is a database that natively handles unstructured data such as images, video, audio, and text, as well as semi-structured data such as JSON. Unstructured data makes up the vast majority of the data found in an organization; some estimates run as high as 80%. A single item can be anything from a few bytes (for example, a temperature reading from a sensor, emitted as text or JSON) to terabytes in size. Managing data at this scale with traditional file-based approaches becomes highly complex. There is a need for a native unstructured database to manage these assets.
With traditional approaches, sharing massive sets of unstructured data across geographies, corporate entities, and so on has required moving compressed files via FTP or even email, at which point governance becomes next to impossible. There is a need for a core technology that manages unstructured data natively and provides governance across these data sets. BerryDB was invented to solve this complexity.
BerryDB’s native management of unstructured data consists of the following main components:
In-place processing of unstructured data
- BerryDB has built-in support for processing unstructured data. Users can process image, audio, video, text and JSON data. The resulting metadata automatically becomes part of the data set. Some examples:
  - Audio/video data: Supports transcription; the resulting transcript is automatically indexed for search.
  - Images: Supports automatic labeling of images using computer vision models.
  - Text: Supports NLP-based labeling of text data.
- BerryDB also supports writing extensions for custom processing
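The idea of in-place processing, where a processor runs on a document and its output becomes metadata on that same document, can be sketched in plain Python. This is only an illustration of the pattern, not BerryDB's API: the names `PROCESSORS`, `register`, and `process_in_place` are hypothetical, and the "models" are trivial stand-ins.

```python
from typing import Any, Callable, Dict

Document = Dict[str, Any]
Processor = Callable[[Document], Dict[str, Any]]

# Hypothetical registry mapping a data type to its processor.
# Custom extensions would register themselves the same way.
PROCESSORS: Dict[str, Processor] = {}


def register(data_type: str) -> Callable[[Processor], Processor]:
    """Register a processor function for one data type."""
    def decorator(fn: Processor) -> Processor:
        PROCESSORS[data_type] = fn
        return fn
    return decorator


@register("text")
def label_text(doc: Document) -> Dict[str, Any]:
    # Stand-in for an NLP model: naive keyword labeling.
    text = doc["content"].lower()
    return {"labels": [kw for kw in ("invoice", "contract") if kw in text]}


@register("audio")
def transcribe_audio(doc: Document) -> Dict[str, Any]:
    # Stand-in for a speech-to-text model; it just echoes a field.
    return {"transcript": doc.get("spoken_words", "")}


def process_in_place(doc: Document) -> Document:
    """Run the matching processor and merge its output into the document."""
    processor = PROCESSORS.get(doc["type"])
    if processor is not None:
        doc.setdefault("metadata", {}).update(processor(doc))
    return doc


doc = process_in_place({"type": "text", "content": "Signed contract for Q3"})
print(doc["metadata"])
```

Because the metadata is stored on the document itself, any later search or indexing step sees it without a separate join against an external results file.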
Annotation interface
- BerryDB provides an annotation UI to label every data type. Users can build a labeling UI for their data set and share the UI widget with labelers. The annotated data is then stored in the database and automatically indexed for search.
- BerryDB has support for automatic labeling using a built-in ML model.
Document store for embeddings and model training
- Users can store the multi-modal documents used for model training and embeddings.
- Users can label, enrich, and version-control the data sets used for training the model.
Universal search
- BerryDB can search across unstructured data. It supports database search on fields in the database, full-text search over large text in the data, vector (similarity) search for images, audio, and video, and annotation search over labeled data. All of these search types can be combined.
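To make "combined" concrete, here is a minimal in-memory sketch of one query that applies a field filter, a text match, an annotation filter, and a vector similarity threshold together. It is a toy illustration, not BerryDB's query interface: the `search` function, its parameters, and the document layout are all assumptions, and a simple substring match stands in for real full-text search.

```python
import math
from typing import Any, Dict, List, Optional, Sequence

Document = Dict[str, Any]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is empty or zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(
    docs: List[Document],
    *,
    fields: Optional[Dict[str, Any]] = None,   # exact-match field filter
    text: Optional[str] = None,                # substring stand-in for full text
    label: Optional[str] = None,               # annotation filter
    vector: Optional[Sequence[float]] = None,  # similarity query
    min_similarity: float = 0.8,
) -> List[Document]:
    """Return documents that satisfy every criterion that was supplied."""
    results = []
    for doc in docs:
        if fields and any(doc.get(k) != v for k, v in fields.items()):
            continue
        if text and text.lower() not in doc.get("text", "").lower():
            continue
        if label and label not in doc.get("annotations", []):
            continue
        if vector is not None and cosine(vector, doc.get("embedding", [])) < min_similarity:
            continue
        results.append(doc)
    return results


DOCS = [
    {"id": 1, "kind": "image", "text": "red bicycle on a street",
     "annotations": ["bicycle"], "embedding": [1.0, 0.0]},
    {"id": 2, "kind": "image", "text": "blue car in a garage",
     "annotations": ["car"], "embedding": [0.0, 1.0]},
    {"id": 3, "kind": "video", "text": "a red bicycle race",
     "annotations": ["bicycle"], "embedding": [0.9, 0.1]},
]

hits = search(DOCS, fields={"kind": "image"}, text="bicycle",
              label="bicycle", vector=[1.0, 0.0])
print([d["id"] for d in hits])  # prints [1]
```

Each criterion is optional, so the same function covers a pure field lookup, a pure similarity query, or any mix of the four. A production system would answer these from indexes rather than a linear scan.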
Scalable and millisecond response time
- BerryDB is an in-memory distributed JSON database that can scale to billions of objects and queries.
- BerryDB is designed for response times in the tens of milliseconds for most queries, so that it can be used in critical production systems. It automatically generates indexes based on the schema.
- The BerryDB team is experienced in building large-scale data systems. Members of the team led the delivery of the first version of Membase and designed it for petabyte-scale data.