NoSQL (Non-Relational Databases) – Foundation

1. Key-Value Stores

Key-value stores are the simplest type of NoSQL database. Data is stored as pairs of keys and their associated values, which makes lookups by key extremely fast and suits use cases like caching and storing session data.

Key: This is the unique identifier for a piece of data. Think of it as a “name” or “index” for the data.
Value: This is the actual data associated with the key. It can be any type of data: a string, number, JSON object, or even more complex data structures.

For example:

Key: userID_12345

Value: {"name": "Alice", "age": 30, "email": "alice@example.com"}

Here, the key is userID_12345 and the value is a JSON object representing the user’s data.
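
The key/value pairing above can be sketched in a few lines of Python. This is a toy in-memory model, not how any particular product is implemented; the class name `KeyValueStore` and its `put`/`get`/`delete` methods are assumptions made for illustration.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps each key to a JSON string; the store never looks inside values.
        self._data = {}

    def put(self, key, value):
        # Serialize the value so it is stored as an opaque blob.
        self._data[key] = json.dumps(value)

    def get(self, key):
        raw = self._data.get(key)
        return None if raw is None else json.loads(raw)

    def delete(self, key):
        # Returns True if the key existed, False otherwise.
        return self._data.pop(key, None) is not None


store = KeyValueStore()
store.put("userID_12345", {"name": "Alice", "age": 30, "email": "alice@example.com"})
user = store.get("userID_12345")
print(user["name"])  # prints: Alice
```

Every operation goes through the key, which is what keeps the model simple and fast.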

Characteristics of Key-Value Stores

Simplicity: The data is stored in a simple and efficient format, making it easy to implement and scale.
Speed: Extremely fast data access and retrieval by key, great for applications needing high performance and low latency (e.g., web sessions, real-time data).
Scalability: Most key-value stores support horizontal scaling across multiple servers or data centers.
No Schema: The store does not enforce a structure on values, allowing flexibility in the data types stored.

Popular Key-Value Stores

Redis:
- Use Cases: Redis is often used for caching, message brokering, and real-time analytics. It supports more advanced data structures like lists, sets, and hashes, in addition to simple key-value pairs.
- Example: Storing user sessions or frequently accessed data.
DynamoDB (AWS):
- Use Cases: DynamoDB is a fully managed, highly scalable key-value and document database. It is used in scenarios that require high availability, such as web applications or mobile apps, and is well suited to low-latency data access.
- Example: Storing product catalogs, user profiles, or mobile app data.

Disadvantages of Key-Value Stores

Limited Querying Capabilities: While you can efficiently retrieve data based on the key, performing complex queries or operations on the values is not as straightforward.
Data Structure Restrictions: Since data is stored as simple key-value pairs, there’s no inherent support for relationships, which can limit the database’s usefulness for certain types of applications that need complex data relationships.
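
To make the querying limitation concrete, here is a minimal sketch in which values are opaque JSON strings. The second record (`userID_67890`, Bob) and the helper names are hypothetical. A lookup by key is a single dictionary access, while a lookup by a field inside the value has to decode and check every entry.

```python
import json

# Values are stored as opaque JSON strings, as a simple key-value store would hold them.
store = {
    "userID_12345": json.dumps({"name": "Alice", "age": 30}),
    "userID_67890": json.dumps({"name": "Bob", "age": 40}),
}


def get_by_key(key):
    # Direct lookup: one hash-table access, independent of store size.
    raw = store.get(key)
    return json.loads(raw) if raw is not None else None


def find_by_field(field, value):
    # No index on value contents: every entry is decoded and inspected.
    matches = []
    for key, raw in store.items():
        record = json.loads(raw)
        if record.get(field) == value:
            matches.append(key)
    return matches


print(get_by_key("userID_67890"))  # prints: {'name': 'Bob', 'age': 40}
print(find_by_field("age", 30))    # prints: ['userID_12345']
```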

When to Use Key-Value Stores

Caching: Storing frequently accessed data to speed up access times (e.g., session data, product details).
Session Management: Handling user session data in web applications.
Real-time Analytics: Storing and processing real-time events or logs, such as IoT data.
User Preferences: Storing settings or configurations for individual users (e.g., in e-commerce sites or mobile apps).
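
Caching and session management usually rely on entries that expire. The sketch below models that with a small time-to-live (TTL) cache in plain Python. The class `TTLCache`, the 1800-second lifetime, and the key `session:abc123` are assumptions for illustration; stores such as Redis offer key expiry natively, so in practice you would use the store's own feature rather than code like this.

```python
import time


class TTLCache:
    """Toy cache whose entries expire after ttl_seconds (hypothetical sketch)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._entries = {}    # key -> (expires_at, value)

    def set(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._entries[key]
            return None
        return value


sessions = TTLCache(ttl_seconds=1800)
sessions.set("session:abc123", {"user_id": "userID_12345"})
print(sessions.get("session:abc123"))  # prints: {'user_id': 'userID_12345'}
```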

2. Document-Based Databases

Document-based databases (e.g., MongoDB, CouchDB) are ideal for flexible data with varying structures.

The data in these databases is organized as collections of documents, where each document is a self-contained unit that can contain a wide variety of data types, including strings, numbers, arrays, and even nested documents.

Document: Documents are typically stored in structured formats like JSON (JavaScript Object Notation) or BSON (Binary JSON). Each document is identified by a unique ID (often a string or numeric value).
Collection: Documents are grouped into collections, which can be thought of as analogous to tables in relational databases, but with more flexibility.

- The example document at the end of this section represents a single user record with various attributes, including nested objects (the address) and arrays (the emails).
- The document’s ID (in that example, 12345) uniquely identifies it within the collection.

Characteristics of Document-Based Databases

Flexible Schema: Unlike relational databases, document databases do not require a predefined schema, meaning the structure of the data can vary between documents within a collection. This flexibility allows for rapid development and easy handling of complex or evolving data.
Hierarchical Structure: Data within documents can be nested, meaning it can have sub-documents or arrays, making document databases suitable for handling complex and semi-structured data.
Efficient Querying: Documents are indexed by their unique ID, and many document databases support rich querying capabilities on the contents of documents, including range queries, text search, and more.
Scalability: Designed for horizontal scaling across multiple servers.

Advantages

Flexibility: You can store a wide variety of data types (e.g., strings, numbers, arrays, objects), and different documents within a collection can have different structures. This allows for rapid iteration and adaptation as application requirements change.
Complex Data Modeling: Since documents can store hierarchical data (including nested arrays and sub-documents), they are well-suited for representing complex, real-world entities such as users, orders, and products.
Scalability: Document databases can scale across many servers to accommodate increasing amounts of data or traffic.
Faster Development: The lack of a rigid schema and support for flexible, nested data structures allows for faster development, especially in agile environments where data models evolve quickly.

Disadvantages

Limited Relationships: Document databases may not handle complex relationships between different data entities as efficiently as relational databases. However, techniques like embedding or referencing can sometimes address this issue.
Data Duplication: In document-based databases, data can sometimes be duplicated across documents (especially if the documents contain large amounts of nested data), leading to potential inefficiencies and challenges in maintaining consistency.
Consistency Concerns: Some document databases may sacrifice consistency in favor of availability and partition tolerance, which can be a challenge in applications where strict consistency is required.
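
The trade-off between embedding and referencing, and the duplication it can cause, can be sketched with plain Python dictionaries standing in for documents. The order documents, their fields, and the helper functions here are hypothetical.

```python
# Embedding: the customer's data is copied into each order document.
orders_embedded = [
    {"_id": "o1", "total": 500, "customer": {"name": "Alice", "city": "Wonderland"}},
    {"_id": "o2", "total": 120, "customer": {"name": "Alice", "city": "Wonderland"}},
]

# Referencing: orders store only the customer's ID.
customers = {"12345": {"name": "Alice", "city": "Wonderland"}}
orders_referenced = [
    {"_id": "o1", "total": 500, "customer_id": "12345"},
    {"_id": "o2", "total": 120, "customer_id": "12345"},
]


def move_customer_embedded(orders, name, new_city):
    # Every embedded copy must be updated; returns the number of documents touched.
    touched = 0
    for order in orders:
        if order["customer"]["name"] == name:
            order["customer"]["city"] = new_city
            touched += 1
    return touched


def order_with_customer(order):
    # Referencing needs a second lookup to resolve the customer.
    return {**order, "customer": customers[order["customer_id"]]}


print(move_customer_embedded(orders_embedded, "Alice", "Dreamworld"))  # prints: 2
customers["12345"]["city"] = "Dreamworld"  # a single update when referencing
print(order_with_customer(orders_referenced[0])["customer"]["city"])  # prints: Dreamworld
```

Embedding gives one-read access to an order at the cost of updating every copy; referencing keeps one copy but needs an extra lookup per read.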

When to Use Document-Based Databases

Content Management Systems (CMS)
E-commerce Platforms: Product catalogs in e-commerce platforms often have variable attributes (e.g., electronics vs. clothing), and document databases are well-suited for storing and querying these types of dynamic, diverse records.
Real-Time Applications: Applications like messaging platforms or social networks that need to handle frequent, rapid updates to user data or store rich user profiles benefit from the flexible nature of document databases.
Mobile Applications: With features like offline support and synchronization, document databases like Firestore make it easier to build responsive mobile apps that require real-time data sync.

Example


{
  "_id": "12345",
  "name": "Alice",
  "age": 30,
  "address": {
    "street": "123 Main St",
    "city": "Wonderland",
    "zip": "12345"
  },
  "emails": ["alice@example.com", "alice@work.com"]
}
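
As a rough illustration of flexible schemas and querying on document contents, the sketch below keeps a collection as a Python list of dictionaries and filters on a nested field. The helpers `get_path` and `find`, the dotted-path notation, and the second document (`67890`, with a `nickname` field) are assumptions for this example, not the API of any specific database.

```python
def get_path(document, dotted_path):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection, dotted_path, expected):
    """Return documents whose field equals expected, or whose array field contains it."""
    results = []
    for doc in collection:
        value = get_path(doc, dotted_path)
        if value == expected or (isinstance(value, list) and expected in value):
            results.append(doc)
    return results


users = [
    {
        "_id": "12345",
        "name": "Alice",
        "age": 30,
        "address": {"street": "123 Main St", "city": "Wonderland", "zip": "12345"},
        "emails": ["alice@example.com", "alice@work.com"],
    },
    # A second, hypothetical document with a different shape in the same collection.
    {"_id": "67890", "name": "Bob", "nickname": "bobby"},
]

print([d["_id"] for d in find(users, "address.city", "Wonderland")])  # prints: ['12345']
print([d["_id"] for d in find(users, "emails", "alice@work.com")])    # prints: ['12345']
```

A real document database would answer such queries from indexes rather than scanning every document.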

3. Column-Based Stores or Columnar Databases

Data Organization: In a columnar database, data is stored in a column-oriented fashion. Instead of storing all data for a record (row) together, the database stores data for each column separately.
Column Families: In wide-column stores such as Cassandra, HBase, and Bigtable, related columns are grouped into column families, which keep columns that are frequently accessed together stored together.
Efficient Column Access: Since all values of a particular column are stored together, reading data in bulk from specific columns is much faster. This is especially beneficial for analytical queries that focus on a limited number of columns. Writing a single complete record, by contrast, touches every column.

Customer Table (Row-Oriented):

Customer ID	Name	Age	Address	Purchase Amount
1	Alice	30	Wonderland	500
2	Bob	40	Dreamworld	600

In a Columnar Database, the data for each column is stored separately:

Customer ID Column: [1, 2]
Name Column: [“Alice”, “Bob”]
Age Column: [30, 40]
Address Column: [“Wonderland”, “Dreamworld”]
Purchase Amount Column: [500, 600]
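
The same pivot can be sketched in Python. This is only a model of the two layouts in memory, not of how a columnar engine stores data on disk; the snake_case field names and the `to_columns` helper are assumptions.

```python
rows = [
    {"customer_id": 1, "name": "Alice", "age": 30, "address": "Wonderland", "purchase_amount": 500},
    {"customer_id": 2, "name": "Bob", "age": 40, "address": "Dreamworld", "purchase_amount": 600},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column (assumes all rows share the same keys)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited to total one field.
total_from_rows = sum(row["purchase_amount"] for row in rows)
# Column layout: only the one column is read.
total_from_columns = sum(columns["purchase_amount"])

print(columns["age"])                       # prints: [30, 40]
print(total_from_rows, total_from_columns)  # prints: 1100 1100
```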

Characteristics of Column-Based Databases

Optimized for Read-Heavy Operations: Column stores are particularly efficient for reading specific columns across large datasets. Analytical queries that focus on a subset of columns (e.g., aggregation or filtering) can be processed much faster.
Efficient Compression: Since data in each column is often of the same type, column-based stores can achieve higher compression rates compared to row-based databases. This reduces storage costs and improves performance.
Scalability: Designed for horizontal scaling.
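Write Throughput in Wide-Column Stores: Wide-column stores such as Cassandra and HBase, which organize data by row key and column family rather than storing every column fully separately, are designed for high write throughput on large volumes of data (e.g., time-series data, IoT data, logs). Analytics-oriented column stores, by contrast, favor bulk loads and are generally slower than row-based stores for simple transactional writes.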
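
One reason columns compress well is that similar values sit next to each other. Run-length encoding (RLE) is a simple scheme that shows the effect; it is used here purely as an illustration, and real systems combine several encodings. The sorted address column below is hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    return [[value, len(list(group))] for value, group in groupby(values)]


def rle_decode(pairs):
    """Expand [value, run_length] pairs back into the original list."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# A hypothetical address column, sorted so equal values are adjacent.
address_column = ["Dreamworld"] * 3 + ["Wonderland"] * 4
encoded = rle_encode(address_column)
print(encoded)  # prints: [['Dreamworld', 3], ['Wonderland', 4]]
```

Seven stored values shrink to two pairs. In a row layout the same addresses would be interleaved with names, ages, and amounts, leaving no runs to collapse.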

Popular Column-Based Databases

Apache Cassandra:
- Use Cases: Cassandra is highly scalable and provides continuous availability, making it ideal for applications that require fault tolerance and high write throughput. It is commonly used in industries like e-commerce, social media, and IoT.
- Example: Storing sensor data, logs, and real-time analytics in distributed systems.
HBase (Apache HBase):
- Use Cases: HBase is designed for use with large amounts of unstructured or semi-structured data, offering fast read and write capabilities. It’s built on top of Hadoop and is typically used in data lakes or as a part of big data infrastructure.
- Example: Storing and processing web logs, clickstream data, or large-scale data from distributed systems.
Google Bigtable:
- Use Cases: Google Bigtable is a highly scalable, fully managed wide-column store.
- Example: Real-time data analysis for applications in advertising, analytics, and IoT.
Apache Hudi (Hadoop Upserts, Deletes, and Incrementals):
- Use Cases: Hudi is a storage layer for large-scale data lakes rather than a standalone database. It works with engines such as Apache Spark on Hadoop-compatible storage, adds transactional capabilities (like updates and deletes), and is used in big data processing.
- Example: Managing large data sets for real-time streaming analytics or time-series data.

Advantages of Column-Based Databases

Performance for Analytical Queries: By storing data in columns, columnar databases allow for more efficient querying when analyzing a subset of columns in large datasets. This is beneficial for OLAP (Online Analytical Processing) operations.
Data Compression: Storing similar data types in the same column allows for better compression, reducing storage costs and improving the speed of data retrieval.
Efficient Data Retrieval: Column-based databases are particularly efficient for read-heavy operations and can speed up aggregations, filtering, and reporting queries, which are common in analytics workloads.
Scalability: Column-based databases scale horizontally and can handle petabytes of data in distributed systems, making them ideal for big data applications.

Disadvantages of Column-Based Databases

Not Ideal for Transactional Workloads: Column-oriented storage is not well-suited for OLTP (Online Transaction Processing), where operations involve frequent updates and small, random reads and writes across many columns (e.g., traditional banking applications).
Data Duplication: Wide-column stores are usually modeled around queries, so the same data is often written to several tables to serve different access patterns. This increases storage use and makes updates harder to keep consistent.
Complexity in Data Modeling: Column-based databases require more careful data modeling than relational databases, which can make them harder to design and manage for certain use cases.

When to Use Column-Based Databases

Big Data Analytics: Column-based databases excel at handling large volumes of data and performing efficient analytics on them. They are ideal for environments where you need to perform aggregation, filtering, and real-time analysis on big data.
Time-Series Data: They are well-suited for applications that collect large volumes of time-stamped data, such as logs, sensor data, or event tracking.
Data Warehousing and OLAP: Column-based databases are used for data warehousing, where large amounts of historical data are stored and analyzed. They perform particularly well for OLAP operations where large datasets are aggregated or analyzed across a few columns.
IoT Data: IoT data needs to be stored and processed efficiently, and column-based stores can handle the high write throughput and large-scale data processing involved.

4. Graph-Based Databases

Graph-based databases are designed to store and manage connected data. In these databases, data is represented as a graph, consisting of nodes, edges, and properties. They are well-suited for applications that involve complex relationships between entities, where traditional relational databases may struggle to represent or process the data efficiently.

How Graph-Based Databases Work

Nodes: Each node represents an individual entity, such as a person, product, or event. Nodes typically have attributes (also called properties) that store relevant information about the entity.
Edges: An edge connects two nodes and represents the relationship between them (e.g., “Person A is friends with Person B”). Edges also have properties, which describe the relationship in more detail (e.g., “Friendship started in 2015”).
Properties: Both nodes and edges can have properties. A property is a key-value pair that stores additional information. For example, a node representing a person might have properties like “name”, “age”, or “location”. Similarly, an edge might have properties like “since”, representing the year the relationship was established.
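
The three building blocks above map naturally onto small data classes. This is a minimal in-memory sketch of a property graph; the class and field names (`Node`, `Edge`, `rel_type`) and the `FRIENDS_WITH` label are assumptions, not the data model of any specific product.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str                                       # kind of entity, e.g. "Person"
    properties: dict = field(default_factory=dict)   # key-value pairs about the entity


@dataclass
class Edge:
    source: str                                      # node_id where the relationship starts
    target: str                                      # node_id where it ends
    rel_type: str                                    # kind of relationship
    properties: dict = field(default_factory=dict)   # key-value pairs about the relationship


nodes = {
    "a": Node("a", "Person", {"name": "Person A"}),
    "b": Node("b", "Person", {"name": "Person B"}),
}
edges = [Edge("a", "b", "FRIENDS_WITH", {"since": 2015})]

for e in edges:
    print(nodes[e.source].properties["name"], e.rel_type,
          nodes[e.target].properties["name"], e.properties)
# prints: Person A FRIENDS_WITH Person B {'since': 2015}
```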

Key Characteristics of Graph-Based Databases

Optimized for Relationships: Unlike relational databases, where relationships are stored in separate tables and linked by foreign keys, graph databases represent relationships as first-class citizens. This makes them ideal for applications with complex, highly interconnected data.
Efficient for Traversals: Graph databases are optimized for graph traversals, that is, following relationships between nodes. For example, to find friends of friends in a social network, a graph database follows edges directly instead of joining tables, which is typically much faster than the equivalent relational query.
Schema Flexibility: Graph databases tend to be schema-agnostic. This means that the structure of the graph (nodes, edges, and properties) can evolve over time without requiring complex database migrations.
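
A friends-of-friends query is a two-hop traversal. The sketch below runs a breadth-first search over an adjacency map in plain Python; the people and friendships are made up, and a real graph database would express this in its query language rather than in application code.

```python
from collections import deque

# Hypothetical, symmetric friendship graph stored as an adjacency map.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}


def within_hops(graph, start, max_hops):
    """Breadth-first traversal returning {node: hop_count} for nodes up to max_hops away."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(current, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return distance


def friends_of_friends(graph, person):
    # Exactly two hops away: excludes the person and their direct friends.
    hops = within_hops(graph, person, 2)
    return sorted(name for name, d in hops.items() if d == 2)


print(friends_of_friends(friends, "alice"))  # prints: ['dave', 'erin']
```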

Popular Graph-Based Databases

Neo4j:
- Known for its powerful querying capabilities using Cypher Query Language. It’s often used for social networks, recommendation engines, fraud detection, and network analysis.
- Example: Analyzing social connections or recommending products based on user preferences and behaviors.
Azure Cosmos DB (API for Apache Gremlin):
- Offers a graph API based on Apache Gremlin to build scalable graph applications. It’s suitable for real-time applications that need global distribution and low-latency data retrieval.
- Example: Building social media platforms or systems that require global graph-based relationships across multiple regions.
ArangoDB:
- ArangoDB is a multi-model database that supports graph data along with document and key-value data models. It allows developers to leverage both graph-based queries and traditional document queries.
- Example: A flexible platform for applications requiring a combination of graph analytics and document storage.
OrientDB:
- OrientDB is another multi-model database that supports both graph and document models. It is designed for high-performance and scalability.
- Example: Used in applications for connected data, such as fraud detection, content recommendation systems, or network topology analysis.
Amazon Neptune:
- Fully-managed graph database service by AWS, optimized for storing and querying highly connected data. It supports both property graph and RDF graph models.
- Example: Real-time recommendations, social media data analysis, and knowledge graphs.

Advantages of Graph-Based Databases

Relationship Modeling: Graph databases excel at representing and querying complex relationships, especially when the relationships themselves are rich and need to be traversed to gain insights.
- Example: Social networks, fraud detection, recommendation engines.
Performance on Relationship Queries: They are highly efficient for queries that follow relationships (traversals), which can be cumbersome in relational databases because each hop requires a JOIN operation.
Flexibility: The schema is flexible, so new types of nodes or relationships can be added without redesigning the entire database. This is ideal for rapidly evolving applications.
Natural Data Representation: Many real-world systems (social networks, supply chains, etc.) are inherently graph-like, making graph databases a good fit.
Real-Time Data Processing: Graph databases support real-time data analysis and decision-making, especially in scenarios involving large volumes of interconnected data.

Disadvantages of Graph-Based Databases

Not Ideal for Tabular Data: Graph databases offer little benefit for applications that don’t require complex relationships or deep traversals.
Complexity for Simple Applications: For simpler, tabular data, relational databases or key-value stores are usually more appropriate.
Less Mature: Graph databases are generally less mature than relational databases in terms of features, tooling, and community support.

Use Cases for Graph-Based Databases

Social Networks: Graph databases model relationships such as “friends”, “followers”, or “likes”. Complex queries like “friends of friends” or “mutual connections” are easily handled.
Recommendation Engines: In e-commerce, graph databases model relationships between users, products, or content to provide personalized recommendations (e.g., “customers who bought this also bought…”).
Fraud Detection: Graph databases help detect fraud, such as money laundering, by analyzing complex relationships between entities (e.g., accounts, transactions, and users) and identifying suspicious patterns.
Network and IT Operations: Graph databases are used to model network topology, where devices (nodes) are connected by communication links (edges).
Knowledge Graphs: Graph databases are used to build knowledge graphs, which represent the relationships between concepts and entities and provide rich context for search and recommendation systems.
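
The recommendation use case above ("customers who bought this also bought…") boils down to walking from a product to its buyers and then to their other purchases. Here is a minimal sketch of that idea with made-up purchase data; the `also_bought` helper and its ranking by co-purchase count are assumptions, and production recommenders are considerably more involved.

```python
from collections import Counter

# Hypothetical user -> purchased products edges.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse"},
    "carol": {"laptop", "monitor"},
    "dave": {"mouse", "mousepad"},
}


def also_bought(purchases, product, top_n=3):
    """Rank other products by how many buyers of `product` also bought them."""
    counts = Counter()
    for items in purchases.values():
        if product in items:
            # Hop from the product to a buyer, then to that buyer's other items.
            counts.update(items - {product})
    # Sort by count (descending), then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


print(also_bought(purchases, "laptop"))
# prints: [('mouse', 2), ('keyboard', 1), ('monitor', 1)]
```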