Definition: Data ingestion involves collecting data from various sources (databases, APIs, files, real-time streams) and loading it into a system for further processing or storage.
Concepts:
Batch Ingestion: Data is collected in large chunks and ingested at scheduled intervals.
Streaming Ingestion: Data is ingested in real-time as it arrives, continuously processing events.
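A minimal sketch of batch ingestion in Python, assuming pandas is available; the source file daily_orders.csv and the SQLite target warehouse.db are hypothetical stand-ins for a real source and destination:

```python
import sqlite3
import pandas as pd

# Batch ingestion: read one scheduled chunk of records from the source...
df = pd.read_csv("daily_orders.csv")       # hypothetical source file

# ...and append it to the target system in a single load.
conn = sqlite3.connect("warehouse.db")     # hypothetical target store
df.to_sql("orders", conn, if_exists="append", index=False)
conn.close()
```

A streaming ingester would instead consume events one at a time (or in micro-batches) from a source such as Kafka and write them continuously rather than on a schedule.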
2. Data Storage
Definition: Data storage involves persisting data in structured or unstructured formats, depending on the use case.
Concepts:
Relational Databases: Store data in structured, tabular formats using SQL (e.g., MySQL, PostgreSQL).
NoSQL Databases: Handle unstructured or semi-structured data (e.g., MongoDB, Cassandra).
Data Lakes: Store raw data in its original format (e.g., Amazon S3, Azure Data Lake).
Data Warehouses: Structured storage for analytics and reporting (e.g., Snowflake, BigQuery).
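To make the structured-versus-semi-structured distinction concrete, here is a small sketch using only the Python standard library; the schema and document are hypothetical, and the JSON file merely stands in for how a document store such as MongoDB would hold schemaless records:

```python
import json
import sqlite3

# Relational (structured) storage: a fixed, tabular schema enforced up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?, ?)", (1, "Ada", "ada@example.com"))

# Semi-structured storage: a schemaless document whose fields can vary per record,
# written to a file here only to illustrate the shape a document store would keep.
doc = {"id": 1, "name": "Ada", "preferences": {"theme": "dark"}, "tags": ["admin"]}
with open("user_1.json", "w") as f:
    json.dump(doc, f)
```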
3. Data Pipelines
Definition: Data pipelines are automated workflows that transport data from source to destination, often with multiple transformation stages in between.
Concepts:
ETL (Extract, Transform, Load): Data is extracted from a source, transformed into a desired format, and loaded into a target system.
ELT (Extract, Load, Transform): Data is extracted and loaded as-is, and transformations are performed afterward, often within the storage system.
Data Orchestration: Scheduling and coordinating the execution of data pipelines and workflows (e.g., Apache Airflow).
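A minimal ETL sketch in plain Python, assuming pandas; the file and table names are hypothetical, and in production an orchestrator such as Apache Airflow would schedule and monitor these stages rather than a direct function call:

```python
import sqlite3
import pandas as pd

def extract() -> pd.DataFrame:
    # Extract: pull raw records from the source (hypothetical CSV).
    return pd.read_csv("raw_sales.csv")

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean and reshape into the target format.
    df = df.dropna(subset=["amount"])
    df["amount"] = df["amount"].astype(float)
    return df

def load(df: pd.DataFrame) -> None:
    # Load: write the transformed data to the destination table.
    conn = sqlite3.connect("warehouse.db")
    df.to_sql("sales", conn, if_exists="append", index=False)
    conn.close()

load(transform(extract()))  # ETL order; ELT would load first, then transform inside the warehouse
```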
4. Data Transformation
Definition: Transforming raw data into formats suitable for analysis, machine learning, or storage.
Concepts:
Data Cleansing: Removing or correcting erroneous data.
Aggregation: Summarizing data (e.g., calculating averages, totals).
Normalization and Denormalization: Structuring data for optimized storage and querying.
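A short pandas sketch of cleansing and aggregation on a made-up data set:

```python
import pandas as pd

# Raw input with a missing value and an obviously erroneous (negative) sale.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "sales":  [100.0, -5.0, 200.0, None],
})

# Data cleansing: drop missing values and remove invalid rows.
clean = df.dropna(subset=["sales"])
clean = clean[clean["sales"] >= 0]

# Aggregation: summarize total sales per region.
summary = clean.groupby("region")["sales"].sum().reset_index()
print(summary)
```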
5. Data Processing
Definition: Processing data to generate insights or transform it into a usable format.
Concepts:
Batch Processing: Processing data in bulk at scheduled intervals (e.g., Apache Spark, Hadoop).
Stream Processing: Real-time data processing for time-sensitive applications (e.g., Apache Flink, Kafka Streams).
Distributed Processing: Using multiple machines to process large-scale data simultaneously, increasing speed and efficiency.
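A small batch-processing sketch with PySpark, assuming pyspark is installed; the data is inlined so the example runs locally, but the same code distributes across a cluster:

```python
from pyspark.sql import SparkSession

# Batch processing: Spark splits the work across executors, so identical
# code runs on a laptop or on a multi-node cluster.
spark = SparkSession.builder.appName("batch_example").getOrCreate()

df = spark.createDataFrame(
    [("north", 100.0), ("north", 50.0), ("south", 200.0)],
    ["region", "sales"],
)

# A distributed aggregation: data is partitioned, aggregated per partition,
# and the partial results are combined.
df.groupBy("region").sum("sales").show()

spark.stop()
```

Stream processing engines such as Flink or Kafka Streams apply the same kinds of operations, but over unbounded event streams instead of finite batches.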
6. Data Quality
Definition: Ensuring that data is accurate, consistent, complete, and reliable for analysis.
Concepts:
Data Validation: Checking data for correctness and consistency.
Data Profiling: Analyzing data to understand its structure, quality, and content.
Error Handling: Managing and resolving data issues during ingestion and processing.
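A minimal validation sketch in plain Python; the required fields and rules are hypothetical examples of the checks a pipeline might apply:

```python
def validate_record(record: dict) -> list[str]:
    """Return the data quality issues found in a single record."""
    errors = []
    if not record.get("id"):
        errors.append("missing id")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)):
        errors.append("amount is not numeric")
    elif amount < 0:
        errors.append("amount is negative")
    return errors

records = [{"id": 1, "amount": 9.5}, {"id": None, "amount": "oops"}]
for r in records:
    issues = validate_record(r)
    # Error handling: route bad records aside rather than failing the whole run.
    print(("ok" if not issues else f"rejected: {issues}"), r)
```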
7. Data Governance
Definition: The practice of managing data assets with policies, processes, and standards that ensure data integrity, security, and compliance.
Concepts:
Data Cataloging: Organizing and documenting data assets.
Data Lineage: Tracking the flow and transformation of data through systems.
Access Control: Defining who can access and manipulate data (e.g., role-based access).
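As a sketch of the access-control idea, a tiny role-based check in Python; the roles and permissions are hypothetical, and real systems delegate this to the database or platform (grants, IAM policies, and so on):

```python
# Role-based access control: permissions attach to roles, and a user's
# request is checked against the roles they hold.
ROLE_PERMISSIONS = {
    "analyst":  {"read"},
    "engineer": {"read", "write"},
    "admin":    {"read", "write", "grant"},
}

def is_allowed(user_roles: list[str], action: str) -> bool:
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in user_roles)

print(is_allowed(["analyst"], "write"))   # False
print(is_allowed(["engineer"], "write"))  # True
```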
8. Data Security
Definition: Protecting data from unauthorized access and ensuring compliance with regulations.
Concepts:
Encryption: Securing data in transit and at rest.
Data Masking: Hiding sensitive data elements.
Compliance: Adhering to data privacy regulations such as GDPR and HIPAA.
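As an illustration of the masking concept above, a small Python sketch that hides all but the last few characters of a sensitive value (the sample values are made up):

```python
def mask(value: str, visible: int = 4) -> str:
    """Data masking: replace all but the last `visible` characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]

print(mask("4111111111111111"))   # ************1111
print(mask("alice@example.com"))  # *************.com
```

Encryption, by contrast, is reversible with the right key and is typically applied both in transit (TLS) and at rest (disk- or column-level encryption).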
9. Data Scaling and Performance
Definition: Ensuring that data systems can handle growing data volumes and perform efficiently under heavy loads.
Concepts:
Horizontal Scaling: Adding more machines to handle larger workloads.
Vertical Scaling: Increasing the power of a single machine (CPU, RAM, etc.).
Partitioning and Sharding: Distributing data across multiple storage units for efficient querying.
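A small sketch of hash partitioning (the basis of sharding) in plain Python; the keys and partition count are arbitrary, and a stable hash is used so a key always lands on the same partition:

```python
import zlib
from collections import defaultdict

def partition_for(key: str, num_partitions: int = 4) -> int:
    # Hash partitioning: a stable hash maps each key to a fixed partition,
    # so reads and writes for that key always go to the same shard.
    return zlib.crc32(key.encode()) % num_partitions

shards = defaultdict(list)
for user_id in ["u1", "u2", "u3", "u4", "u5", "u6"]:
    shards[partition_for(user_id)].append(user_id)

print(dict(shards))  # user ids spread across the 4 partitions
```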
10. Data Monitoring and Maintenance
Definition: Regularly monitoring the health and performance of data pipelines and systems, ensuring they run smoothly.
Concepts:
Monitoring Tools: Tools such as Prometheus and Grafana for tracking pipeline health and system performance.
Alerting: Notifying engineers of pipeline failures, data quality issues, or performance bottlenecks.
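A minimal sketch of exposing pipeline metrics for Prometheus to scrape, assuming the prometheus_client Python package is installed; the metric names and the simulated batch are made up:

```python
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

records_processed = Counter("records_processed_total", "Records processed by the pipeline")
last_batch_seconds = Gauge("last_batch_duration_seconds", "Duration of the last batch in seconds")

start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics

while True:
    start = time.time()
    time.sleep(random.uniform(0.1, 0.5))  # stand-in for real batch work
    records_processed.inc(100)
    last_batch_seconds.set(time.time() - start)
```

Alerting rules (in Prometheus/Alertmanager or Grafana) would then fire when, for example, the counter stops increasing or batch duration exceeds a threshold.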
11. Cloud Data Engineering
Definition: Using cloud platforms to build and manage data infrastructure, offering scalability and flexibility.
Concepts:
Cloud Storage: Scalable object storage for files and raw data (e.g., AWS S3, Google Cloud Storage).
Managed Data Warehouses: Fully managed analytical databases that scale without manual administration (e.g., BigQuery, Redshift, Snowflake).
Serverless Computing: Running code on demand without provisioning or managing servers (e.g., AWS Lambda, Google Cloud Functions).
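A brief sketch of working with cloud object storage from Python, assuming boto3 is installed and AWS credentials are already configured; the bucket name and object keys are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Upload a local file into an object-storage bucket acting as a data lake.
s3.upload_file("daily_orders.csv", "my-data-lake-bucket", "raw/2024/daily_orders.csv")

# List what has landed under the raw/ prefix.
response = s3.list_objects_v2(Bucket="my-data-lake-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```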
12. Data Modeling
Definition: Designing the structure of data to represent relationships and enable efficient querying and analytics.
Concepts:
Star Schema: A data warehouse model in which a central fact table of measures references surrounding dimension tables.
Entity-Relationship Model: Mapping entities and their relationships in databases.
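A minimal star schema sketch using SQLite through Python's standard library; the tables and columns are hypothetical but show the typical shape of one fact table surrounded by dimension tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: a central fact table of measures that references dimension tables.
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER,
    amount     REAL
);
""")

# Analytical queries join the fact table out to its dimensions, for example:
#   SELECT p.category, SUM(f.amount)
#   FROM fact_sales f JOIN dim_product p ON f.product_id = p.product_id
#   GROUP BY p.category;
```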