Index Table Pattern: Secondary Indexes in NoSQL
After this topic, you will be able to:
- Implement the index table pattern for efficient non-primary-key queries
- Design index maintenance strategies for write-heavy workloads
- Calculate storage and consistency trade-offs for index tables
TL;DR
The Index Table pattern creates separate lookup tables optimized for non-primary-key queries, trading storage space and write complexity for fast read performance. Instead of scanning entire tables, queries use dedicated index tables that map search attributes to primary keys. Essential for NoSQL systems without native secondary indexes and for high-throughput, read-heavy workloads.
Cheat Sheet: Index table = separate table storing (search_key → primary_key) mappings. Write to both main + index tables. Read from index table first, then fetch from main table. Cost: 2x writes, 2x storage. Benefit: O(1) lookups instead of full scans.
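The cheat sheet can be sketched in a few lines of Python. This is an in-memory stand-in for two real tables; `UserStore` and its field names are illustrative, not a production API:

```python
class UserStore:
    """Toy model of the index table pattern: one main table plus one index."""

    def __init__(self):
        self.users = {}           # main table: user_id -> full record
        self.users_by_email = {}  # index table: email -> user_id

    def put_user(self, user_id, email, name):
        # One logical write becomes two physical writes (the 2x write cost).
        self.users[user_id] = {"user_id": user_id, "email": email, "name": name}
        self.users_by_email[email] = user_id

    def get_user_by_email(self, email):
        # Two-hop read: index lookup, then main-table fetch. O(1) each.
        user_id = self.users_by_email.get(email)
        return self.users.get(user_id) if user_id is not None else None
```

The dicts stand in for separate physical tables; in DynamoDB or Cassandra each would be its own table with the dict key as partition key.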
The Problem It Solves
Modern applications need to query data by multiple attributes, but most databases optimize for primary key lookups. Consider Instagram’s user profile system: users are stored with user_id as the primary key, but the app needs to find users by username, email, or phone number. Without index tables, these queries require full table scans—reading every record to find matches.
In a table with 2 billion users, scanning for email='user@example.com' might examine millions of records before finding the match. At 1ms per 1000 records scanned, a full scan could take 30+ minutes. This is unacceptable for login flows that need sub-100ms response times.
The problem intensifies with NoSQL databases like DynamoDB or Cassandra, where efficient queries are limited to the partition key (or partition key + sort key) and native secondary-index features carry significant restrictions and costs. If your access patterns don’t align with your primary key structure, you’re stuck with expensive scans or forced to duplicate data across multiple tables with different key structures.
Traditional relational databases offer secondary indexes, but these come with hidden costs in distributed systems: index updates must be synchronous, creating write bottlenecks. When Twitter writes a tweet, updating indexes on user_id, timestamp, hashtags, and mentions can triple write latency. Index tables give you control over this trade-off.
Full Table Scan vs Index Table Lookup
graph LR
subgraph Without Index Table
A1["Query: Find user<br/>by email"] --> B1["Scan users table<br/>2B records"]
B1 --> C1["Check each record<br/>email == target?"]
C1 --> D1["Found after<br/>1.5B comparisons<br/>⏱️ 30+ minutes"]
end
subgraph With Index Table
A2["Query: Find user<br/>by email"] --> B2["Lookup in<br/>users_by_email<br/>O(1) operation"]
B2 --> C2["Get user_id<br/>⏱️ 1-2ms"]
C2 --> D2["Fetch from users<br/>by user_id<br/>⏱️ 1-2ms"]
D2 --> E2["Total: 2-5ms"]
end
Without index tables, finding a user by email requires scanning billions of records (30+ minutes). With index tables, the same query becomes two fast O(1) lookups (2-5ms total).
Solution Overview
The Index Table pattern solves non-primary-key queries by maintaining separate tables that map search attributes to primary keys. Each index table is optimized for a specific query pattern, with the search attribute as its primary key.
When you write data, you write to both the main table and all relevant index tables. When you read by a non-primary attribute, you query the index table first to get the primary key, then fetch the full record from the main table. This two-hop lookup is far faster than scanning millions of records.
For Instagram’s user system, you’d maintain three index tables: users_by_username, users_by_email, and users_by_phone. Each maps its attribute to user_id. When a user logs in with email, you query users_by_email to get the user_id, then fetch the full profile from the main users table.
The pattern trades storage (you’re duplicating keys) and write complexity (multiple writes per operation) for read performance. It’s essentially manual denormalization—you’re pre-computing the answer to “which user_id has this email?” and storing it for instant retrieval. This is the same principle behind database indexes, but you control the implementation, consistency model, and update strategy.
How It Works
Step 1: Design Your Index Tables
Identify your query patterns and create an index table for each non-primary-key attribute you need to search by. Each index table has the search attribute as its primary key and stores the corresponding main table primary key.
For Instagram users: Main table users has primary key user_id. Create index tables: users_by_username(username → user_id), users_by_email(email → user_id), users_by_phone(phone → user_id). Each index table is a simple key-value mapping optimized for lookups on its specific attribute.
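One way to capture this design as data, so the write path can derive every index row mechanically. The table names come from the example above; the config format and `index_entries` helper are illustrative:

```python
# Map each index table to the main-table attribute it indexes.
INDEX_TABLES = {
    "users_by_username": "username",
    "users_by_email": "email",
    "users_by_phone": "phone",
}

def index_entries(user):
    """Derive the (table, search_key, primary_key) rows implied by one record."""
    return [
        (table, user[attr], user["user_id"])
        for table, attr in INDEX_TABLES.items()
    ]
```

Driving all index writes from one config keeps the main table and indexes from drifting as new indexes are added.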
Step 2: Write to All Tables Atomically
When inserting or updating data, write to the main table and all affected index tables. This requires transaction coordination or eventual consistency handling.
Example: User registration writes to four tables: users, users_by_username, users_by_email, users_by_phone. If using DynamoDB, wrap these in a transaction. If using Cassandra, use batch statements with appropriate consistency levels. If eventual consistency is acceptable, write to main table first, then asynchronously update indexes.
For updates that change indexed attributes (user changes email), you must delete the old index entry and create a new one. This is where complexity grows: changing email requires three writes (main table update, old email index delete, new email index insert).
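A minimal sketch of the email-change sequence using plain dicts in place of real tables. The ordering shown (insert new entry, update main table, delete old entry last) is one defensible choice that avoids a window where neither email resolves; other orderings are valid under different failure models:

```python
def change_email(users, users_by_email, user_id, new_email):
    """Three writes: new index insert, main-table update, old index delete."""
    old_email = users[user_id]["email"]
    users_by_email[new_email] = user_id   # 1. new lookups work immediately
    users[user_id]["email"] = new_email   # 2. main table is source of truth
    del users_by_email[old_email]         # 3. release the old key last
```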
Step 3: Query via Index Table
For non-primary-key queries, first query the index table to get the primary key, then fetch the full record from the main table.
Login flow: Query users_by_email with email → get user_id → query users table with user_id → return full profile. This two-hop lookup typically takes 2-5ms total (1-2ms per query), compared to 30+ minutes for a full table scan.
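The two-hop read can be written as a function over two generic `get` callables (hypothetical signatures, standing in for whatever client your store provides). Note the explicit handling of a stale index entry whose main-table record is gone:

```python
def login_lookup(index_get, main_get, email):
    """Two-hop lookup: index table -> primary key -> main table."""
    # Hop 1: the index table yields the primary key.
    user_id = index_get("users_by_email", email)
    if user_id is None:
        return None  # no such email registered
    # Hop 2: the main table yields the full record.
    user = main_get("users", user_id)
    if user is None:
        return None  # stale index entry; flag for reconciliation
    return user
```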
Step 4: Handle Index Maintenance
Index tables can become stale if writes fail partially. Implement reconciliation strategies: periodic scans to verify index consistency, tombstone markers for deletes, or version numbers to detect drift. Netflix runs nightly jobs to rebuild indexes from the main table, catching any inconsistencies from failed writes.
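A rebuild-and-diff pass in the spirit of those nightly jobs might look like this sketch, treating the main table as the source of truth (`reconcile` and its return shape are illustrative):

```python
def reconcile(users, users_by_email):
    """Derive the expected index from the main table and report drift."""
    expected = {u["email"]: uid for uid, u in users.items()}
    # Index entries that are wrong or orphaned (point at nothing current).
    stale = {k for k in users_by_email if users_by_email[k] != expected.get(k)}
    # Main-table records with no index entry (a lost index write).
    missing = {k for k in expected if k not in users_by_email}
    return stale, missing
```

A real job would scan both tables in pages and repair entries rather than just report them.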
Write Flow: Maintaining Main and Index Tables
sequenceDiagram
participant Client
participant API
participant MainTable as users table
participant Idx1 as users_by_username
participant Idx2 as users_by_email
participant Idx3 as users_by_phone
Client->>API: POST /register<br/>{username, email, phone}
Note over API: Start Transaction
API->>MainTable: 1. INSERT user_id=123<br/>{username, email, phone, ...}
MainTable-->>API: ✓ Success
API->>Idx1: 2. INSERT username='alice'<br/>→ user_id=123
Idx1-->>API: ✓ Success
API->>Idx2: 3. INSERT email='alice@ex.com'<br/>→ user_id=123
Idx2-->>API: ✓ Success
API->>Idx3: 4. INSERT phone='+1234567890'<br/>→ user_id=123
Idx3-->>API: ✓ Success
Note over API: Commit Transaction
API-->>Client: 201 Created<br/>{user_id: 123}
Note over API,Idx3: Write Amplification: 1 logical write → 4 physical writes<br/>Latency: 5ms → 15-20ms
User registration writes to four tables atomically: the main users table plus three index tables. This write amplification (1→4 writes) increases latency but enables fast lookups on username, email, and phone.
Read Flow: Two-Hop Lookup via Index Table
graph LR
A["Client Request<br/>Login with email"] --> B["1. Query users_by_email<br/>email='alice@ex.com'"]
B --> C["Index Table Returns<br/>user_id=123<br/>⏱️ 1-2ms"]
C --> D["2. Query users table<br/>user_id=123"]
D --> E["Main Table Returns<br/>Full User Profile<br/>⏱️ 1-2ms"]
E --> F["Response to Client<br/>Total: 2-5ms"]
subgraph Index Table
B
C
end
subgraph Main Table
D
E
end
Login queries use a two-hop lookup: first query the index table to get the primary key (user_id), then fetch the full record from the main table. Each hop takes 1-2ms for a total of 2-5ms.
Variants
Composite Index Tables
Store multiple attributes in the index table’s sort key for range queries. Instead of username → user_id, use (country, signup_date) → user_id to support queries like “all users in US who signed up in January 2024.”
When to use: When you need to filter by multiple attributes together. Pros: Enables complex queries without full scans. Cons: Index table grows larger, more complex write logic.
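A toy version of a composite index using sorted key tuples and binary search; a real store gets this ordering for free from its sort key. The dates and ids are made up:

```python
import bisect

# Composite index: (country, signup_date) -> user_id, kept sorted so a
# range scan answers "US users who signed up in January 2024".
index = sorted([
    (("US", "2024-01-05"), "u1"),
    (("US", "2024-01-20"), "u2"),
    (("US", "2024-02-03"), "u3"),
    (("DE", "2024-01-10"), "u4"),
])

# Binary-search the boundaries of the (country, date-range) slice.
lo = bisect.bisect_left(index, (("US", "2024-01-01"), ""))
hi = bisect.bisect_left(index, (("US", "2024-02-01"), ""))
jan_us = [uid for _, uid in index[lo:hi]]  # ["u1", "u2"]
```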
Partial Index Tables
Only index records matching certain criteria. For example, index only active users or recent orders, not historical data.
When to use: When most queries target a subset of data (80/20 rule). Pros: Smaller index tables, faster queries, lower storage costs. Cons: Must fall back to full scans for non-indexed data, adds conditional logic.
Denormalized Index Tables
Store frequently accessed attributes directly in the index table instead of just the primary key. users_by_email might store (email → user_id, username, profile_pic_url) to avoid the second hop.
When to use: When the second lookup is a bottleneck or you need to minimize latency. Pros: Single query instead of two, lower latency. Cons: More storage, more complex updates (must update index when any denormalized field changes).
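A sketch of the denormalized variant, including the extra update cost it introduces. Field names follow the example above; the URL is fake:

```python
# Denormalized index: the row carries the fields the read path needs,
# so login skips the main-table hop entirely.
users_by_email = {
    "alice@ex.com": {
        "user_id": 123,
        "username": "alice",
        "profile_pic_url": "https://cdn.example/alice.jpg",
    },
}

def login_profile(email):
    return users_by_email.get(email)  # one hop, no second fetch

def rename(old_username, new_username):
    # The cost: every denormalized row holding the field must be rewritten.
    for row in users_by_email.values():
        if row["username"] == old_username:
            row["username"] = new_username
```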
Inverted Index Tables
For many-to-many relationships, store (attribute_value → list of primary_keys). Used for tags, categories, or full-text search.
When to use: When one attribute value maps to many records (hashtags, product categories). Pros: Efficient for “find all records with tag X” queries. Cons: Large lists require pagination, updates are more complex.
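An inverted index reduces to a map from attribute value to a set of primary keys; this sketch uses made-up post ids and hashtags:

```python
from collections import defaultdict

# Inverted index: one hashtag maps to many post ids.
posts_by_hashtag = defaultdict(set)

def index_post(post_id, hashtags):
    # One post write fans out to one index write per tag.
    for tag in hashtags:
        posts_by_hashtag[tag].add(post_id)

index_post("p1", ["#sunset", "#beach"])
index_post("p2", ["#sunset"])

# Large lists must be paginated; a fixed page size stands in for a cursor.
page = sorted(posts_by_hashtag["#sunset"])[:50]
```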
Index Table Variants Comparison
graph TB
subgraph Simple Index
S1["users_by_email<br/>email → user_id"]
S2["Storage: 100 bytes/record<br/>Query: Single attribute"]
end
subgraph Composite Index
C1["users_by_country_date<br/>(country, signup_date) → user_id"]
C2["Storage: 150 bytes/record<br/>Query: Range queries on multiple attributes"]
end
subgraph Denormalized Index
D1["users_by_email<br/>email → (user_id, username, profile_pic)"]
D2["Storage: 300 bytes/record<br/>Query: Single hop, no main table lookup"]
end
subgraph Inverted Index
I1["posts_by_hashtag<br/>hashtag → [post_id1, post_id2, ...]"]
I2["Storage: Variable (list size)<br/>Query: Many-to-many relationships"]
end
Four index table variants trade storage and complexity for different query capabilities: simple (basic lookups), composite (range queries), denormalized (single-hop reads), and inverted (many-to-many relationships).
Trade-offs
Storage vs Query Performance
Index tables duplicate data. For Instagram’s 2B users with 3 index tables, you’re storing 6B additional key mappings. At ~100 bytes per mapping, that’s 600GB of index data.
Decision criteria: If query latency matters more than storage costs, use index tables. Storage is cheap (a few cents per GB-month on cloud object storage), but slow queries lose users. If storage is constrained (embedded systems, edge devices), use selective indexing or accept slower queries.
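The 600GB figure above is straightforward arithmetic; as a back-of-envelope script:

```python
# Back-of-envelope storage estimate for the Instagram-scale example.
users = 2_000_000_000       # main-table records
index_tables = 3            # username, email, phone
bytes_per_mapping = 100     # rough per-entry size, per the text

mappings = users * index_tables                 # 6,000,000,000 entries
index_gb = mappings * bytes_per_mapping / 1e9   # 600.0 GB
```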
Write Complexity vs Read Simplicity
Every write hits multiple tables. User registration writes to 4 tables; email change writes to 3. This increases write latency (roughly 5ms → 15-20ms) and failure surface area.
Decision criteria: For read-heavy workloads (100:1 read/write ratio), the trade-off favors index tables. For write-heavy systems (logging, time-series), consider append-only designs or batch index updates. Measure your read/write ratio: if reads > 10x writes, index tables win.
Consistency vs Availability
Synchronous index updates guarantee consistency but reduce availability (if index write fails, entire operation fails). Asynchronous updates improve availability but risk stale indexes.
Decision criteria: For critical operations (payments, authentication), use synchronous updates with transactions. For non-critical features (search, recommendations), eventual consistency is acceptable. Instagram uses synchronous updates for username changes (must be unique) but asynchronous for bio updates.
Maintenance Overhead vs Operational Simplicity
Index tables require monitoring, reconciliation jobs, and careful update logic. Native database indexes are maintained automatically.
Decision criteria: Use index tables when your database lacks secondary indexes (DynamoDB, Cassandra) or when you need custom consistency models. Use native indexes when available and sufficient for your needs.
When to Use (and When Not To)
Use Index Tables When:
- Your database lacks flexible native secondary indexes (DynamoDB, Cassandra, HBase) and you need non-primary-key queries
- Read/write ratio exceeds 10:1 and query latency is critical (user-facing features, real-time dashboards)
- You need custom consistency models (eventual consistency for some indexes, strong for others)
- Query patterns are predictable and stable (you know which attributes to index)
- Storage costs are acceptable relative to query performance gains
Avoid Index Tables When:
- Your database has efficient native secondary indexes (PostgreSQL, MySQL) that meet your needs
- Write volume is extremely high and write latency is critical (logging systems, metrics ingestion)
- Query patterns are unpredictable or change frequently (ad-hoc analytics, exploratory queries)
- Data changes frequently and index maintenance overhead exceeds query performance gains
- You can achieve acceptable performance with caching or query optimization
Anti-Patterns:
Creating index tables for every attribute “just in case”—this multiplies write complexity without proven benefit. Start with primary key queries, add indexes only when measurements show query performance problems. Over-indexing is worse than under-indexing because it slows every write.
Using index tables for analytics queries that scan large datasets—these queries should use data warehouses or batch processing, not operational index tables. Index tables optimize for point lookups, not aggregations.
Real-World Examples
Instagram: User Lookup System
Instagram maintains index tables for username, email, and phone number lookups. When you log in, the app queries users_by_email to get your user_id, then fetches your profile from the main users table stored in Cassandra.
Interesting detail: Username changes require careful coordination. Instagram uses a two-phase commit: reserve the new username in users_by_username (preventing others from taking it), update the main table, then release the old username. This prevents username collisions during the transition period. They also maintain a 14-day grace period where old usernames redirect to new ones, requiring a temporary reverse index.
Uber: Driver Location Indexing
Uber’s dispatch system needs to find nearby drivers quickly. They maintain geospatial index tables that map (lat, lng, geohash) → driver_id. When a rider requests a ride, the system queries the geohash index to find drivers within the same geohash cell, then filters by precise distance.
Interesting detail: Driver locations update every 4 seconds, generating billions of index writes per day. Uber uses eventual consistency for location indexes—stale data (driver appears 50 meters away from actual position) is acceptable for 4 seconds. They batch index updates and use probabilistic data structures (Bloom filters) to reduce write amplification.
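A heavily simplified stand-in for a geospatial index: a coarse lat/lng grid key in place of a real geohash or S2 cell, with made-up coordinates and cell size:

```python
import math
from collections import defaultdict

CELL = 0.01  # grid size in degrees (~1 km); illustrative, not Uber's value

def cell_key(lat, lng):
    # Coarse grid cell standing in for a geohash/S2 cell id.
    return (math.floor(lat / CELL), math.floor(lng / CELL))

drivers_by_cell = defaultdict(set)

def update_location(driver_id, lat, lng, old_key=None):
    # Crossing a cell boundary costs one index delete plus one insert.
    if old_key is not None:
        drivers_by_cell[old_key].discard(driver_id)
    key = cell_key(lat, lng)
    drivers_by_cell[key].add(driver_id)
    return key

k = update_location("d1", 37.7749, -122.4194)
# A rider a few meters away hashes to the same cell and finds the driver.
nearby = drivers_by_cell[cell_key(37.7750, -122.4195)]
```

Real systems query the target cell plus its neighbors, then filter by precise distance, since a nearby driver can sit just across a cell boundary.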
Netflix: Content Recommendation Indexes
Netflix maintains index tables mapping (genre, release_year, rating) → content_id to power browse pages and recommendations. These composite indexes enable queries like “show me action movies from 2023 with rating > 4.0.”
Interesting detail: Netflix rebuilds these indexes nightly from the source of truth (main content catalog) rather than maintaining them transactionally. This eventual consistency model (up to 24 hours stale) is acceptable because content metadata changes infrequently. Nightly rebuilds also catch any inconsistencies from failed writes, providing automatic reconciliation.
Instagram Username Change: Two-Phase Commit
sequenceDiagram
participant User
participant API
participant UsersTable as users table
participant UsernameIdx as users_by_username
participant TempIdx as username_redirects<br/>(14-day grace period)
User->>API: Change username<br/>old='alice' → new='alice_new'
Note over API: Phase 1: Reserve New Username
API->>UsernameIdx: 1. INSERT 'alice_new' → user_id=123<br/>(reserve, prevent collisions)
UsernameIdx-->>API: ✓ Reserved
Note over API: Phase 2: Update Main Table
API->>UsersTable: 2. UPDATE user_id=123<br/>SET username='alice_new'
UsersTable-->>API: ✓ Updated
Note over API: Phase 3: Setup Redirect
API->>TempIdx: 3. INSERT 'alice' → 'alice_new'<br/>(14-day redirect)
TempIdx-->>API: ✓ Redirect active
Note over API: Phase 4: Release Old Username
API->>UsernameIdx: 4. DELETE 'alice' → user_id=123<br/>(after 14 days)
UsernameIdx-->>API: ✓ Deleted
API-->>User: Username changed successfully
Note over TempIdx: Old username redirects to new<br/>for 14 days, then becomes available
Instagram’s username change uses a two-phase commit to prevent collisions: reserve the new username, update the main table, create a temporary redirect for 14 days, then release the old username. This ensures no other user can claim the new username during the transition.
Interview Essentials
Mid-Level
Explain the two-hop lookup process: query index table for primary key, then query main table. Discuss why this is faster than full table scans (O(1) vs O(n)). Describe basic write flow: write to main table and index tables. Calculate storage overhead: if main table has N records and you create M index tables, you store N + (N × M) total records. Understand when eventual consistency is acceptable vs when you need synchronous updates.
Senior
Design index maintenance strategies for different consistency requirements. Explain how to handle partial write failures (what if main table write succeeds but index write fails?). Discuss trade-offs between synchronous and asynchronous index updates with concrete latency numbers. Calculate write amplification: if you have 5 index tables, each write becomes 6 writes (1 main + 5 indexes). Propose reconciliation strategies: periodic scans, version numbers, or event sourcing. Compare index tables to materialized views (see Materialized View)—both denormalize data but with different consistency models.
Staff+
Architect index table systems for multi-region deployments with different consistency requirements per region. Design index evolution strategies: how do you add new indexes to a live system with billions of records? Propose solutions for index table hotspots (popular usernames, trending hashtags). Calculate cost-benefit analysis: storage costs vs query performance gains vs operational complexity. Design hybrid approaches: use index tables for hot data, fall back to scans for cold data. Discuss how index tables interact with caching layers—should you cache index lookups or main table lookups?
Common Interview Questions
How do you handle index updates when the indexed attribute changes? (Delete old entry, insert new entry, handle race conditions)
What happens if index table write fails but main table write succeeds? (Reconciliation strategies, idempotent updates, event sourcing)
How do you decide which attributes to index? (Query patterns, read/write ratio, cardinality analysis)
Can you use index tables with eventual consistency? (Yes, for non-critical features; discuss staleness tolerance)
How do index tables differ from database secondary indexes? (Manual control, custom consistency, works with NoSQL)
Red Flags to Avoid
Saying ‘just add an index’ without discussing write amplification or storage costs
Not considering partial failure scenarios (what if one index write fails?)
Proposing synchronous updates for all indexes without discussing latency impact
Creating indexes for every attribute without analyzing query patterns
Not mentioning reconciliation or consistency verification strategies
Key Takeaways
Index tables trade storage and write complexity for fast non-primary-key queries by maintaining separate lookup tables that map search attributes to primary keys
Every write hits multiple tables (main + all indexes), increasing write latency and failure surface area. For read-heavy workloads (>10:1 ratio), this trade-off is worthwhile
Two-hop lookup (index table → primary key → main table) is far faster than full table scans: 2-5ms vs minutes for large datasets
Choose synchronous updates for critical operations requiring consistency (authentication, payments) and asynchronous updates for non-critical features (search, recommendations)
Index tables are essential for NoSQL databases lacking native secondary indexes (DynamoDB, Cassandra) but may be unnecessary if your database has efficient built-in indexing