Data Analytics Infrastructure Requirements

Business Context & Use Case

Primary analytics use case? (BI reporting, ML/AI, real-time analytics, data warehousing)
Business stakeholders and their requirements?
Compliance requirements? (HIPAA, GDPR, SOX, industry-specific)
Data sensitivity level? (public, internal, confidential, restricted)
Project timeline and milestones?

Data Sources & Ingestion

What data sources? (databases, APIs, files, streaming, IoT devices)
Data formats? (JSON, CSV, Parquet, Avro, XML, logs)
Data volume? (GB/TB per day, total expected volume)
Data velocity? (batch, real-time, near real-time)
Data ingestion frequency? (hourly, daily, weekly, streaming)
Historical data migration needs?
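
The volume and frequency answers above can be turned into a rough storage projection. The sketch below is only an illustration: the function name, the 30-day month, the compounding monthly growth model, and the 1 TB = 1024 GB conversion are all assumptions, not part of this questionnaire.

```python
def projected_storage_tb(daily_gb: float, days: int, monthly_growth_rate: float = 0.0) -> float:
    """Estimate total ingested volume in TB over `days` days.

    Assumptions (hypothetical model): a month is 30 days, daily volume
    grows by `monthly_growth_rate` at each month boundary, and
    1 TB = 1024 GB. Compression and deletion are ignored.
    """
    total_gb = 0.0
    for day in range(days):
        months_elapsed = day // 30
        total_gb += daily_gb * (1 + monthly_growth_rate) ** months_elapsed
    return total_gb / 1024


if __name__ == "__main__":
    # Example inputs are placeholders, not recommendations.
    print(f"{projected_storage_tb(100, 365, 0.05):.1f} TB in the first year")
```

An estimate like this is mainly useful for deciding whether the answers point at gigabyte-scale or terabyte-scale tooling; the real figure depends on format, compression, and retention.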

Data Processing Requirements

Processing type? (ETL, ELT, stream processing, batch processing)
Data transformation complexity? (simple aggregations vs complex joins)
Processing schedule? (real-time, scheduled batches, ad-hoc)
Data quality requirements? (validation, cleansing, deduplication)
Processing framework preference? (Spark, Glue, EMR, Lambda, Kinesis)

Storage & Data Lake Architecture

Data lake requirements? (raw, processed, curated zones)
Storage tiers needed? (hot, warm, cold, archive)
Data retention policies?
Partitioning strategy? (by date, region, source, etc.)
Data catalog requirements? (Glue Data Catalog, metadata management)
Data lineage tracking needs?
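
As an illustration of how the zone and partitioning answers combine, the sketch below builds a Hive-style `key=value` object prefix. The zone, source, and region names and the key order are hypothetical; the actual layout should follow the answers given above.

```python
from datetime import date


def partition_prefix(zone: str, source: str, region: str, day: date) -> str:
    """Build a Hive-style partition prefix for one day of data.

    The key names (source, region, year, month, day) and their order are
    assumptions chosen for illustration.
    """
    return (
        f"{zone}/source={source}/region={region}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


if __name__ == "__main__":
    # Hypothetical values for a raw-zone dataset.
    print(partition_prefix("raw", "orders", "eu-west-1", date(2024, 3, 5)))
```

Putting the most frequently filtered keys first generally lets query engines skip more data, so the order is worth settling together with the query patterns in the next section.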

Analytics & Querying

Query patterns? (ad-hoc, scheduled reports, dashboards, ML training)
Query complexity? (simple aggregations vs complex analytics)
Query performance requirements? (sub-second, minutes, hours)
Concurrent user requirements?
Preferred query engine? (Athena, Redshift, EMR)

Machine Learning Requirements

ML use cases? (predictive analytics, recommendations, classification, etc.)
Model training frequency?
Model deployment requirements? (batch inference, real-time inference)
ML platform or framework preferences? (SageMaker, custom models)
Feature store requirements?

Data Visualization & BI

Dashboard requirements? (executive, operational, self-service)
User types and access patterns?
Visualization tool preference? (QuickSight, Tableau, Power BI, custom)
Report scheduling and distribution needs?
Mobile access requirements?

Security & Governance

Data access controls? (role-based, attribute-based)
Data masking or anonymization needs?
Audit trail requirements?
Encryption requirements? (at rest, in transit, key management)
Data sharing requirements? (internal, external, partners)

Performance & Scalability

Expected data growth rate?
Peak usage patterns?
Query response time SLAs?
Scalability requirements? (auto-scaling preferences)
Disaster recovery requirements? (RTO/RPO)

Budget & Cost Optimization

Monthly budget range?
Cost allocation requirements? (by department, project, etc.)
Data lifecycle cost optimization? (storage tiering, archival)
Reserved capacity vs on-demand preferences?
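
To compare storage tiering options against the monthly budget, a back-of-the-envelope calculation is often enough. The sketch below assumes a flat per-GB monthly price per tier; the tier names and prices are placeholders, not real provider pricing, and request, retrieval, and transfer charges are ignored.

```python
def monthly_storage_cost(tier_gb: dict[str, float], price_per_gb: dict[str, float]) -> float:
    """Sum the monthly storage cost across tiers.

    `tier_gb` maps a tier name to stored GB; `price_per_gb` maps the same
    tier name to a per-GB monthly price. Both are supplied by the caller.
    """
    return sum(gb * price_per_gb[tier] for tier, gb in tier_gb.items())


if __name__ == "__main__":
    # Placeholder volumes and prices for illustration only.
    volumes = {"hot": 1_000, "warm": 5_000, "archive": 20_000}
    prices = {"hot": 0.02, "warm": 0.01, "archive": 0.001}
    print(f"Estimated storage cost: {monthly_storage_cost(volumes, prices):.2f} per month")
```

Running the same calculation with everything in the hot tier shows how much of the budget the lifecycle policy is expected to recover.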