Index Table Pattern: Secondary Indexes in NoSQL
After this topic, you will be able to:
- Implement the index table pattern for efficient non-primary-key queries
- Design index maintenance strategies for write-heavy workloads
- Calculate storage and consistency trade-offs for index tables
TL;DR
The Index Table pattern creates separate lookup tables optimized for non-primary-key queries, trading storage space and write complexity for fast read performance. Instead of scanning entire tables, queries use dedicated index tables that map search attributes to primary keys. Essential for NoSQL systems without native secondary indexes and for high-throughput, read-heavy workloads.
Cheat Sheet: Index table = separate table storing (search_key → primary_key) mappings. Write to both main + index tables. Read from index table first, then fetch from main table. Cost: 2x writes, 2x storage. Benefit: O(1) lookups instead of full scans.
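The cheat sheet can be sketched in a few lines of Python. This is an in-memory stand-in for two real tables; `UserStore` and its field names are illustrative, not a production API:

```python
class UserStore:
    """Toy model of the index table pattern: one main table plus one index."""

    def __init__(self):
        self.users = {}           # main table: user_id -> full record
        self.users_by_email = {}  # index table: email -> user_id

    def put_user(self, user_id, email, name):
        # One logical write becomes two physical writes (the 2x write cost).
        self.users[user_id] = {"user_id": user_id, "email": email, "name": name}
        self.users_by_email[email] = user_id

    def get_user_by_email(self, email):
        # Two-hop read: index lookup, then main-table fetch. O(1) each.
        user_id = self.users_by_email.get(email)
        return self.users.get(user_id) if user_id is not None else None
```

The dicts stand in for separate physical tables; in DynamoDB or Cassandra each would be its own table with the dict key as partition key.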
The Problem It Solves
Modern applications need to query data by multiple attributes, but most databases optimize for primary key lookups. Consider Instagram’s user profile system: users are stored with user_id as the primary key, but the app needs to find users by username, email, or phone number. Without index tables, these queries require full table scans—reading every record to find matches.
In a table with 2 billion users, scanning for email='user@example.com' might examine millions of records before finding the match. At 1ms per 1000 records scanned, a full scan could take 30+ minutes. This is unacceptable for login flows that need sub-100ms response times.
The problem intensifies with NoSQL databases like DynamoDB or Cassandra, where efficient queries are limited to the partition key (or partition key + sort key) and native secondary-index features carry significant restrictions and costs. If your access patterns don’t align with your primary key structure, you’re stuck with expensive scans or forced to duplicate data across multiple tables with different key structures.
Traditional relational databases offer secondary indexes, but these come with hidden costs in distributed systems: index updates must be synchronous, creating write bottlenecks. When Twitter writes a tweet, updating indexes on user_id, timestamp, hashtags, and mentions can triple write latency. Index tables give you control over this trade-off.
Full Table Scan vs Index Table Lookup
graph LR
subgraph Without Index Table
A1["Query: Find user<br/>by email"] --> B1["Scan users table<br/>2B records"]
B1 --> C1["Check each record<br/>email == target?"]
C1 --> D1["Found after<br/>1.5B comparisons<br/>⏱️ 30+ minutes"]
end
subgraph With Index Table
A2["Query: Find user<br/>by email"] --> B2["Lookup in<br/>users_by_email<br/>O(1) operation"]
B2 --> C2["Get user_id<br/>⏱️ 1-2ms"]
C2 --> D2["Fetch from users<br/>by user_id<br/>⏱️ 1-2ms"]
D2 --> E2["Total: 2-5ms"]
end
Without index tables, finding a user by email requires scanning billions of records (30+ minutes). With index tables, the same query becomes two fast O(1) lookups (2-5ms total).
Solution Overview
The Index Table pattern solves non-primary-key queries by maintaining separate tables that map search attributes to primary keys. Each index table is optimized for a specific query pattern, with the search attribute as its primary key.
When you write data, you write to both the main table and all relevant index tables. When you read by a non-primary attribute, you query the index table first to get the primary key, then fetch the full record from the main table. This two-hop lookup is far faster than scanning millions of records.
For Instagram’s user system, you’d maintain three index tables: users_by_username, users_by_email, and users_by_phone. Each maps its attribute to user_id. When a user logs in with email, you query users_by_email to get the user_id, then fetch the full profile from the main users table.
The pattern trades storage (you’re duplicating keys) and write complexity (multiple writes per operation) for read performance. It’s essentially manual denormalization—you’re pre-computing the answer to “which user_id has this email?” and storing it for instant retrieval. This is the same principle behind database indexes, but you control the implementation, consistency model, and update strategy.
How It Works
Step 1: Design Your Index Tables
Identify your query patterns and create an index table for each non-primary-key attribute you need to search by. Each index table has the search attribute as its primary key and stores the corresponding main table primary key.
For Instagram users: Main table users has primary key user_id. Create index tables: users_by_username(username → user_id), users_by_email(email → user_id), users_by_phone(phone → user_id). Each index table is a simple key-value mapping optimized for lookups on its specific attribute.
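One way to capture this design as data, so the write path can derive every index row mechanically. The table names come from the example above; the config format and `index_entries` helper are illustrative:

```python
# Map each index table to the main-table attribute it indexes.
INDEX_TABLES = {
    "users_by_username": "username",
    "users_by_email": "email",
    "users_by_phone": "phone",
}

def index_entries(user):
    """Derive the (table, search_key, primary_key) rows implied by one record."""
    return [
        (table, user[attr], user["user_id"])
        for table, attr in INDEX_TABLES.items()
    ]
```

Driving all index writes from one config keeps the main table and indexes from drifting as new indexes are added.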
Step 2: Write to All Tables Atomically
When inserting or updating data, write to the main table and all affected index tables. This requires transaction coordination or eventual consistency handling.
Example: User registration writes to four tables: users, users_by_username, users_by_email, users_by_phone. If using DynamoDB, wrap these in a transaction. If using Cassandra, use batch statements with appropriate consistency levels. If eventual consistency is acceptable, write to main table first, then asynchronously update indexes.
For updates that change indexed attributes (user changes email), you must delete the old index entry and create a new one. This is where complexity grows: changing email requires three writes (main table update, old email index delete, new email index insert).
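A minimal sketch of the email-change sequence using plain dicts in place of real tables. The ordering shown (insert new entry, update main table, delete old entry last) is one defensible choice that avoids a window where neither email resolves; other orderings are valid under different failure models:

```python
def change_email(users, users_by_email, user_id, new_email):
    """Three writes: new index insert, main-table update, old index delete."""
    old_email = users[user_id]["email"]
    users_by_email[new_email] = user_id   # 1. new lookups work immediately
    users[user_id]["email"] = new_email   # 2. main table is source of truth
    del users_by_email[old_email]         # 3. release the old key last
```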
Step 3: Query via Index Table
For non-primary-key queries, first query the index table to get the primary key, then fetch the full record from the main table.
Login flow: Query users_by_email with email → get user_id → query users table with user_id → return full profile. This two-hop lookup typically takes 2-5ms total (1-2ms per query), compared to 30+ minutes for a full table scan.
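The two-hop read can be written as a function over two generic `get` callables (hypothetical signatures, standing in for whatever client your store provides). Note the explicit handling of a stale index entry whose main-table record is gone:

```python
def login_lookup(index_get, main_get, email):
    """Two-hop lookup: index table -> primary key -> main table."""
    # Hop 1: the index table yields the primary key.
    user_id = index_get("users_by_email", email)
    if user_id is None:
        return None  # no such email registered
    # Hop 2: the main table yields the full record.
    user = main_get("users", user_id)
    if user is None:
        return None  # stale index entry; flag for reconciliation
    return user
```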
Step 4: Handle Index Maintenance
Index tables can become stale if writes fail partially. Implement reconciliation strategies: periodic scans to verify index consistency, tombstone markers for deletes, or version numbers to detect drift. Netflix runs nightly jobs to rebuild indexes from the main table, catching any inconsistencies from failed writes.
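A rebuild-and-diff pass in the spirit of those nightly jobs might look like this sketch, treating the main table as the source of truth (`reconcile` and its return shape are illustrative):

```python
def reconcile(users, users_by_email):
    """Derive the expected index from the main table and report drift."""
    expected = {u["email"]: uid for uid, u in users.items()}
    # Index entries that are wrong or orphaned (point at nothing current).
    stale = {k for k in users_by_email if users_by_email[k] != expected.get(k)}
    # Main-table records with no index entry (a lost index write).
    missing = {k for k in expected if k not in users_by_email}
    return stale, missing
```

A real job would scan both tables in pages and repair entries rather than just report them.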
Write Flow: Maintaining Main and Index Tables
sequenceDiagram
participant Client
participant API
participant MainTable as users table
participant Idx1 as users_by_username
participant Idx2 as users_by_email
participant Idx3 as users_by_phone
Client->>API: POST /register<br/>{username, email, phone}
Note over API: Start Transaction
API->>MainTable: 1. INSERT user_id=123<br/>{username, email, phone, ...}
MainTable-->>API: ✓ Success
API->>Idx1: 2. INSERT username='alice'<br/>→ user_id=123
Idx1-->>API: ✓ Success
API->>Idx2: 3. INSERT email='alice@ex.com'<br/>→ user_id=123
Idx2-->>API: ✓ Success
API->>Idx3: 4. INSERT phone='+1234567890'<br/>→ user_id=123
Idx3-->>API: ✓ Success
Note over API: Commit Transaction
API-->>Client: 201 Created<br/>{user_id: 123}
Note over API,Idx3: Write Amplification: 1 logical write → 4 physical writes<br/>Latency: 5ms → 15-20ms
User registration writes to four tables atomically: the main users table plus three index tables. This write amplification (1→4 writes) increases latency but enables fast lookups on username, email, and phone.
Read Flow: Two-Hop Lookup via Index Table
graph LR
A["Client Request<br/>Login with email"] --> B["1. Query users_by_email<br/>email='alice@ex.com'"]
B --> C["Index Table Returns<br/>user_id=123<br/>⏱️ 1-2ms"]
C --> D["2. Query users table<br/>user_id=123"]
D --> E["Main Table Returns<br/>Full User Profile<br/>⏱️ 1-2ms"]
E --> F["Response to Client<br/>Total: 2-5ms"]
subgraph Index Table
B
C
end
subgraph Main Table
D
E
end
Login queries use a two-hop lookup: first query the index table to get the primary key (user_id), then fetch the full record from the main table. Each hop takes 1-2ms for a total of 2-5ms.
Variants
Composite Index Tables
Store multiple attributes in the index table’s sort key for range queries. Instead of username → user_id, use (country, signup_date) → user_id to support queries like “all users in US who signed up in January 2024.”
When to use: When you need to filter by multiple attributes together. Pros: Enables complex queries without full scans. Cons: Index table grows larger, more complex write logic.
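A toy version of a composite index using sorted key tuples and binary search; a real store gets this ordering for free from its sort key. The dates and ids are made up:

```python
import bisect

# Composite index: (country, signup_date) -> user_id, kept sorted so a
# range scan answers "US users who signed up in January 2024".
index = sorted([
    (("US", "2024-01-05"), "u1"),
    (("US", "2024-01-20"), "u2"),
    (("US", "2024-02-03"), "u3"),
    (("DE", "2024-01-10"), "u4"),
])

# Binary-search the boundaries of the (country, date-range) slice.
lo = bisect.bisect_left(index, (("US", "2024-01-01"), ""))
hi = bisect.bisect_left(index, (("US", "2024-02-01"), ""))
jan_us = [uid for _, uid in index[lo:hi]]  # ["u1", "u2"]
```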
Partial Index Tables
Only index records matching certain criteria. For example, index only active users or recent orders, not historical data.
When to use: When most queries target a subset of data (80/20 rule). Pros: Smaller index tables, faster queries, lower storage costs. Cons: Must fall back to full scans for non-indexed data, adds conditional logic.
Denormalized Index Tables
Store frequently accessed attributes directly in the index table instead of just the primary key. users_by_email might store (email → user_id, username, profile_pic_url) to avoid the second hop.
When to use: When the second lookup is a bottleneck or you need to minimize latency. Pros: Single query instead of two, lower latency. Cons: More storage, more complex updates (must update index when any denormalized field changes).
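A sketch of the denormalized variant, including the extra update cost it introduces. Field names follow the example above; the URL is fake:

```python
# Denormalized index: the row carries the fields the read path needs,
# so login skips the main-table hop entirely.
users_by_email = {
    "alice@ex.com": {
        "user_id": 123,
        "username": "alice",
        "profile_pic_url": "https://cdn.example/alice.jpg",
    },
}

def login_profile(email):
    return users_by_email.get(email)  # one hop, no second fetch

def rename(old_username, new_username):
    # The cost: every denormalized row holding the field must be rewritten.
    for row in users_by_email.values():
        if row["username"] == old_username:
            row["username"] = new_username
```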
Inverted Index Tables
For many-to-many relationships, store (attribute_value → list of primary_keys). Used for tags, categories, or full-text search.
When to use: When one attribute value maps to many records (hashtags, product categories). Pros: Efficient for “find all records with tag X” queries. Cons: Large lists require pagination, updates are more complex.
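An inverted index reduces to a map from attribute value to a set of primary keys; this sketch uses made-up post ids and hashtags:

```python
from collections import defaultdict

# Inverted index: one hashtag maps to many post ids.
posts_by_hashtag = defaultdict(set)

def index_post(post_id, hashtags):
    # One post write fans out to one index write per tag.
    for tag in hashtags:
        posts_by_hashtag[tag].add(post_id)

index_post("p1", ["#sunset", "#beach"])
index_post("p2", ["#sunset"])

# Large lists must be paginated; a fixed page size stands in for a cursor.
page = sorted(posts_by_hashtag["#sunset"])[:50]
```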
Index Table Variants Comparison
graph TB
subgraph Simple Index
S1["users_by_email<br/>email → user_id"]
S2["Storage: 100 bytes/record<br/>Query: Single attribute"]
end
subgraph Composite Index
C1["users_by_country_date<br/>(country, signup_date) → user_id"]
C2["Storage: 150 bytes/record<br/>Query: Range queries on multiple attributes"]
end
subgraph Denormalized Index
D1["users_by_email<br/>email → (user_id, username, profile_pic)"]
D2["Storage: 300 bytes/record<br/>Query: Single hop, no main table lookup"]
end
subgraph Inverted Index
I1["posts_by_hashtag<br/>hashtag → [post_id1, post_id2, ...]"]
I2["Storage: Variable (list size)<br/>Query: Many-to-many relationships"]
end
Four index table variants trade storage and complexity for different query capabilities: simple (basic lookups), composite (range queries), denormalized (single-hop reads), and inverted (many-to-many relationships).
Trade-offs
Storage vs Query Performance
Index tables duplicate data. For Instagram’s 2B users with 3 index tables, you’re storing 6B additional key mappings. At ~100 bytes per mapping, that’s 600GB of index data.
Decision criteria: If query latency matters more than storage costs, use index tables. Storage is cheap (a few cents per GB-month on cloud object storage), but slow queries lose users. If storage is constrained (embedded systems, edge devices), use selective indexing or accept slower queries.
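The 600GB figure above is straightforward arithmetic; as a back-of-envelope script:

```python
# Back-of-envelope storage estimate for the Instagram-scale example.
users = 2_000_000_000       # main-table records
index_tables = 3            # username, email, phone
bytes_per_mapping = 100     # rough per-entry size, per the text

mappings = users * index_tables                 # 6,000,000,000 entries
index_gb = mappings * bytes_per_mapping / 1e9   # 600.0 GB
```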
Write Complexity vs Read Simplicity
Every write hits multiple tables. User registration writes to 4 tables; email change writes to 3. This increases write latency (roughly 5ms → 15-20ms) and failure surface area.
Decision criteria: For read-heavy workloads (100:1 read/write ratio), the trade-off favors index tables. For write-heavy systems (logging, time-series), consider append-only designs or batch index updates. Measure your read/write ratio: if reads > 10x writes, index tables win.
Consistency vs Availability
Synchronous index updates guarantee consistency but reduce availability (if index write fails, entire operation fails). Asynchronous updates improve availability but risk stale indexes.
Decision criteria: For critical operations (payments, authentication), use synchronous updates with transactions. For non-critical features (search, recommendations), eventual consistency is acceptable. Instagram uses synchronous updates for username changes (must be unique) but asynchronous for bio updates.
Maintenance Overhead vs Operational Simplicity
Index tables require monitoring, reconciliation jobs, and careful update logic. Native database indexes are maintained automatically.
Decision criteria: Use index tables when your database lacks secondary indexes (DynamoDB, Cassandra) or when you need custom consistency models. Use native indexes when available and sufficient for your needs.
When to Use (and When Not To)
Use Index Tables When:
- Your database lacks flexible native secondary indexes (DynamoDB, Cassandra, HBase) and you need non-primary-key queries
- Read/write ratio exceeds 10:1 and query latency is critical (user-facing features, real-time dashboards)
- You need custom consistency models (eventual consistency for some indexes, strong for others)
- Query patterns are predictable and stable (you know which attributes to index)
- Storage costs are acceptable relative to query performance gains
Avoid Index Tables When:
- Your database has efficient native secondary indexes (PostgreSQL, MySQL) that meet your needs
- Write volume is extremely high and write latency is critical (logging systems, metrics ingestion)
- Query patterns are unpredictable or change frequently (ad-hoc analytics, exploratory queries)
- Data changes frequently and index maintenance overhead exceeds query performance gains
- You can achieve acceptable performance with caching or query optimization
Anti-Patterns:
Creating index tables for every attribute “just in case”—this multiplies write complexity without proven benefit. Start with primary key queries, add indexes only when measurements show query performance problems. Over-indexing is worse than under-indexing because it slows every write.
Using index tables for analytics queries that scan large datasets—these queries should use data warehouses or batch processing, not operational index tables. Index tables optimize for point lookups, not aggregations.
Real-World Examples
Instagram: User Lookup System
Instagram maintains index tables for username, email, and phone number lookups. When you log in, the app queries users_by_email to get your user_id, then fetches your profile from the main users table stored in Cassandra.
Interesting detail: Username changes require careful coordination. Instagram uses a two-phase commit: reserve the new username in users_by_username (preventing others from taking it), update the main table, then release the old username. This prevents username collisions during the transition period. They also maintain a 14-day grace period where old usernames redirect to new ones, requiring a temporary reverse index.
Uber: Driver Location Indexing
Uber’s dispatch system needs to find nearby drivers quickly. They maintain geospatial index tables that map (lat, lng, geohash) → driver_id. When a rider requests a ride, the system queries the geohash index to find drivers within the same geohash cell, then filters by precise distance.
Interesting detail: Driver locations update every 4 seconds, generating billions of index writes per day. Uber uses eventual consistency for location indexes—stale data (driver appears 50 meters away from actual position) is acceptable for 4 seconds. They batch index updates and use probabilistic data structures (Bloom filters) to reduce write amplification.
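A heavily simplified stand-in for a geospatial index: a coarse lat/lng grid key in place of a real geohash or S2 cell, with made-up coordinates and cell size:

```python
import math
from collections import defaultdict

CELL = 0.01  # grid size in degrees (~1 km); illustrative, not Uber's value

def cell_key(lat, lng):
    # Coarse grid cell standing in for a geohash/S2 cell id.
    return (math.floor(lat / CELL), math.floor(lng / CELL))

drivers_by_cell = defaultdict(set)

def update_location(driver_id, lat, lng, old_key=None):
    # Crossing a cell boundary costs one index delete plus one insert.
    if old_key is not None:
        drivers_by_cell[old_key].discard(driver_id)
    key = cell_key(lat, lng)
    drivers_by_cell[key].add(driver_id)
    return key

k = update_location("d1", 37.7749, -122.4194)
# A rider a few meters away hashes to the same cell and finds the driver.
nearby = drivers_by_cell[cell_key(37.7750, -122.4195)]
```

Real systems query the target cell plus its neighbors, then filter by precise distance, since a nearby driver can sit just across a cell boundary.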
Netflix: Content Recommendation Indexes
Netflix maintains index tables mapping (genre, release_year, rating) → content_id to power browse pages and recommendations. These composite indexes enable queries like “show me action movies from 2023 with rating > 4.0.”
Interesting detail: Netflix rebuilds these indexes nightly from the source of truth (main content catalog) rather than maintaining them transactionally. This eventual consistency model (up to 24 hours stale) is acceptable because content metadata changes infrequently. Nightly rebuilds also catch any inconsistencies from failed writes, providing automatic reconciliation.
Instagram Username Change: Two-Phase Commit
sequenceDiagram
participant User
participant API
participant UsersTable as users table
participant UsernameIdx as users_by_username
participant TempIdx as username_redirects<br/>(14-day grace period)
User->>API: Change username<br/>old='alice' → new='alice_new'
Note over API: Phase 1: Reserve New Username
API->>UsernameIdx: 1. INSERT 'alice_new' → user_id=123<br/>(reserve, prevent collisions)
UsernameIdx-->>API: ✓ Reserved
Note over API: Phase 2: Update Main Table
API->>UsersTable: 2. UPDATE user_id=123<br/>SET username='alice_new'
UsersTable-->>API: ✓ Updated
Note over API: Phase 3: Setup Redirect
API->>TempIdx: 3. INSERT 'alice' → 'alice_new'<br/>(14-day redirect)
TempIdx-->>API: ✓ Redirect active
Note over API: Phase 4: Release Old Username
API->>UsernameIdx: 4. DELETE 'alice' → user_id=123<br/>(after 14 days)
UsernameIdx-->>API: ✓ Deleted
API-->>User: Username changed successfully
Note over TempIdx: Old username redirects to new<br/>for 14 days, then becomes available
Instagram’s username change uses a two-phase commit to prevent collisions: reserve the new username, update the main table, create a temporary redirect for 14 days, then release the old username. This ensures no other user can claim the new username during the transition.
Interview Essentials
Mid-Level
Explain the two-hop lookup process: query index table for primary key, then query main table. Discuss why this is faster than full table scans (O(1) vs O(n)). Describe basic write flow: write to main table and index tables. Calculate storage overhead: if main table has N records and you create M index tables, you store N + (N × M) total records. Understand when eventual consistency is acceptable vs when you need synchronous updates.
Senior
Design index maintenance strategies for different consistency requirements. Explain how to handle partial write failures (what if main table write succeeds but index write fails?). Discuss trade-offs between synchronous and asynchronous index updates with concrete latency numbers. Calculate write amplification: if you have 5 index tables, each write becomes 6 writes (1 main + 5 indexes). Propose reconciliation strategies: periodic scans, version numbers, or event sourcing. Compare index tables to materialized views (see Materialized View)—both denormalize data but with different consistency models.
Staff+
Architect index table systems for multi-region deployments with different consistency requirements per region. Design index evolution strategies: how do you add new indexes to a live system with billions of records? Propose solutions for index table hotspots (popular usernames, trending hashtags). Calculate cost-benefit analysis: storage costs vs query performance gains vs operational complexity. Design hybrid approaches: use index tables for hot data, fall back to scans for cold data. Discuss how index tables interact with caching layers—should you cache index lookups or main table lookups?
Common Interview Questions
How do you handle index updates when the indexed attribute changes? (Delete old entry, insert new entry, handle race conditions)
What happens if index table write fails but main table write succeeds? (Reconciliation strategies, idempotent updates, event sourcing)
How do you decide which attributes to index? (Query patterns, read/write ratio, cardinality analysis)
Can you use index tables with eventual consistency? (Yes, for non-critical features; discuss staleness tolerance)
How do index tables differ from database secondary indexes? (Manual control, custom consistency, works with NoSQL)
Red Flags to Avoid
Saying ‘just add an index’ without discussing write amplification or storage costs
Not considering partial failure scenarios (what if one index write fails?)
Proposing synchronous updates for all indexes without discussing latency impact
Creating indexes for every attribute without analyzing query patterns
Not mentioning reconciliation or consistency verification strategies
Key Takeaways
Index tables trade storage and write complexity for fast non-primary-key queries by maintaining separate lookup tables that map search attributes to primary keys
Every write hits multiple tables (main + all indexes), increasing write latency and failure surface area. For read-heavy workloads (>10:1 ratio), this trade-off is worthwhile
Two-hop lookup (index table → primary key → main table) is far faster than full table scans: 2-5ms vs minutes for large datasets
Choose synchronous updates for critical operations requiring consistency (authentication, payments) and asynchronous updates for non-critical features (search, recommendations)
Index tables are essential for NoSQL databases lacking native secondary indexes (DynamoDB, Cassandra) but may be unnecessary if your database has efficient built-in indexing