Saturday, January 4, 2025

Where do we use Pandas in Data Engineering?

 

Pandas is used extensively in Data Engineering for data manipulation, cleaning, transformation, and analysis. While it is more commonly associated with Data Analysis, its capabilities make it an essential tool for small-to-medium-scale Data Engineering work and for prototyping.

Here are specific areas where Pandas is used in Data Engineering:

 

1. Data Extraction (ETL Process)

Extracting data from various sources such as CSV files, Excel spreadsheets, databases, APIs, and more.

import pandas as pd

# Load data from a CSV file
df = pd.read_csv("data.csv")

# Load data from a SQL database
import sqlite3
conn = sqlite3.connect("database.db")
df = pd.read_sql("SELECT * FROM table_name", conn)
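The same pattern covers the other sources mentioned above. A minimal sketch for Excel and a JSON API; the file name, sheet name, and URL are placeholders:

import pandas as pd
import requests

# Load data from an Excel spreadsheet (needs an engine such as openpyxl installed)
df_xlsx = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# Load data from a JSON API (the URL is a placeholder)
response = requests.get("https://api.example.com/records", timeout=10)
response.raise_for_status()
df_api = pd.json_normalize(response.json())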

2. Data Cleaning

Handling missing values, duplicates, and inconsistent data types.

# Drop rows with missing values
df.dropna(inplace=True)

# Fill missing values with default
df.fillna(value={"column": 0}, inplace=True)

# Remove duplicate rows
df.drop_duplicates(inplace=True)
 
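The description above also mentions inconsistent data types, which the snippet does not cover. A minimal sketch; the column names are assumptions:

# Coerce a numeric column stored as strings; unparseable values become NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Parse date strings into proper datetime values
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Normalize whitespace and casing in a text column
df["status"] = df["status"].str.strip().str.lower()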

3. Data Transformation

Aggregation, filtering, sorting, and feature engineering.


# Filter rows based on a condition
df_filtered = df[df["age"] > 30]

# Add a new column
df["salary_with_bonus"] = df["salary"] * 1.1

# Grouping and aggregation
df_grouped = df.groupby("department").agg({"salary": "mean"})
 
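Sorting is listed above but missing from the snippet; it follows the same pattern (column names assumed):

# Sort rows by salary, highest first
df_sorted = df.sort_values("salary", ascending=False)

# Sort by department, then by salary within each department
df_sorted = df.sort_values(["department", "salary"], ascending=[True, False])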

4. Data Validation and Profiling

 

Ensuring data consistency and profiling the dataset.

# Check for data types
print(df.dtypes)

# Get summary statistics
print(df.describe())

# Validate uniqueness of a column
assert df["id"].is_unique
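Two more quick checks that fit the same step; a sketch, with the column name and valid range as assumptions:

# Count missing values per column
print(df.isna().sum())

# Validate that a numeric column stays within an expected range
assert df["age"].between(0, 120).all()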

5. Small-scale Data Pipelines

Building a complete extract-transform-load flow in a single script.

# Extract data
df = pd.read_csv("input.csv")

# Transform data
df["total_cost"] = df["quantity"] * df["price"]
df = df[df["total_cost"] > 100]

# Load data to a new file
df.to_csv("output.csv", index=False)
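If the load target is a database rather than a file, the same pipeline can end with to_sql instead. A sketch; the database file and table name are placeholders:

import sqlite3

conn = sqlite3.connect("warehouse.db")
df.to_sql("orders", conn, if_exists="replace", index=False)
conn.close()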


6. Data Integration

Merging and joining datasets from different sources.

# Merge two datasets
df_merged = pd.merge(df1, df2, on="id", how="inner")
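Stacking datasets that share the same columns is the other common integration pattern; a minimal sketch:

# Append the rows of two datasets with the same schema
df_combined = pd.concat([df1, df2], ignore_index=True)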


7. Prototyping and Testing

Rapidly prototyping ETL logic before scaling it to frameworks like PySpark, Dask, or Apache Beam.

 

# Prototype logic using Pandas
df_transformed = df.groupby("category").sum()

# Scale up the logic in PySpark later
 

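For reference, the scaled-up version of that logic could look like this in PySpark; a sketch only, assuming the same CSV input and numeric columns alongside "category":

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prototype-scale-up").getOrCreate()

# Same logic as the Pandas prototype, expressed in PySpark
sdf = spark.read.csv("data.csv", header=True, inferSchema=True)
sdf_transformed = sdf.groupBy("category").sum()
sdf_transformed.show()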

When to Use Pandas in Data Engineering:

  1. Small to Medium Datasets: When data fits in memory (up to a few GBs).
  2. Prototyping: Quick tests before scaling with distributed systems like PySpark or Dask.
  3. Ad-hoc Analysis: One-off or exploratory analysis.
  4. Preprocessing: Cleaning and transforming data before loading it into a database or data warehouse.

For larger datasets, distributed computing frameworks like PySpark, Dask, or Apache Beam are often more appropriate.
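One middle ground before reaching for those frameworks is Pandas' own chunked reading, which streams a large CSV through memory in pieces. A minimal sketch, reusing the column names from the pipeline example:

import pandas as pd

# Process a large CSV in 100,000-row chunks instead of loading it all at once
filtered_chunks = []
for chunk in pd.read_csv("big_input.csv", chunksize=100_000):
    chunk["total_cost"] = chunk["quantity"] * chunk["price"]
    filtered_chunks.append(chunk[chunk["total_cost"] > 100])

pd.concat(filtered_chunks).to_csv("output.csv", index=False)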

