Sunday, December 10, 2023

PySpark Spark SQL

from pyspark.sql import SparkSession


# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Create a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+
# Register the DataFrame as a temporary SQL table
df.createOrReplaceTempView("people")

# Use Spark SQL to query the table
result = spark.sql("SELECT * FROM people WHERE Age >= 30")

# Show the result
result.show()


+-------+---+
|   Name|Age|
+-------+---+
|    Bob| 30|
|Charlie| 35|
+-------+---+
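
If you need the rows back in Python rather than printed output, the standard DataFrame actions apply. A minimal sketch (note that collect() pulls everything to the driver, so it is only safe on small results):

# Materialize the query result on the driver as Row objects
rows = result.collect()
for row in rows:
    print(row["Name"], row["Age"])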


In this example, we:

  1. Create a Spark session.
  2. Create a DataFrame from a list of tuples.
  3. Register the DataFrame as a temporary SQL table named "people".
  4. Use spark.sql() to execute a SQL query on the "people" table; the same filter can also be written with the DataFrame API, as sketched below.
  5. Show the result.
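
For comparison, here is a minimal sketch of step 4 written with the DataFrame API instead of SQL text; it produces the same result:

# The same filter expressed with the DataFrame API rather than SQL
from pyspark.sql import functions as F

result_df = df.filter(F.col("Age") >= 30)
result_df.show()  # same two rows as the SQL version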

This is a simple example, but spark.sql() lets you run arbitrarily complex SQL (joins, aggregations, window functions) over large-scale distributed data. Make sure to adjust the session configuration to match your Spark cluster setup; a sketch of both follows.
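
For instance, here is a sketch of a slightly richer query (an aggregation) against the same temp view, plus an illustration of passing settings through the session builder. The config key shown is a real Spark setting, but the value is illustrative and depends on your deployment:

# An aggregation over the registered view, still plain SQL
stats = spark.sql("""
    SELECT COUNT(*) AS n, AVG(Age) AS avg_age
    FROM people
    WHERE Age >= 30
""")
stats.show()

# Cluster-dependent settings go on the builder (value here is illustrative)
spark = (
    SparkSession.builder
    .appName("example")
    .config("spark.sql.shuffle.partitions", "200")  # shuffle parallelism
    .getOrCreate()
)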
