Sunday, January 21, 2024

Write a query to find the employee count under each manager.

from pyspark.sql import SparkSession

from pyspark.sql.functions import *
spark = SparkSession.builder.appName('Joins_On_Dataframe').master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

data = [('4529', 'Nancy', 'Young', '4125'),
('4238','John', 'Simon', '4329'),
('4329', 'Martina', 'Candreva', '4125'),
('4009', 'Klaus', 'Koch', '4329'),
('4125', 'Mafalda', 'Ranieri', 'NULL'),
('4500', 'Jakub', 'Hrabal', '4529'),
('4118', 'Moira', 'Areas', '4952'),
('4012', 'Jon', 'Nilssen', '4952'),
('4952', 'Sandra', 'Rajkovic', '4529'),
('4444', 'Seamus', 'Quinn', '4329')]
schema = ['employee_id' ,'first_name', 'last_name', 'manager_id']


df = spark.createDataFrame(data=data, schema=schema)

df.createOrReplaceTempView('emp_mgr')
df.show()

query = '''
select e.manager_id as manager_id, count(e.employee_id) as no_of_emp, m.first_name as manager_name
from emp_mgr e
inner join emp_mgr m
on m.employee_id = e.manager_id
group by e.manager_id, m.first_name '''

spark.sql(query).show()

# Self-join the DataFrame to get manager names
result_df = df.alias("e").join(df.alias("m"), col("e.manager_id") == col("m.employee_id"), "inner") \
    .select(col("e.employee_id"), col("e.first_name"), col("e.last_name"),
            col("e.manager_id"), col("m.first_name").alias("manager_first_name"))

result_df.show()

result_df.groupBy("manager_id", "manager_first_name").count().show()
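
For reference, with the sample data above both the SQL and the DataFrame versions should report the same counts per manager (column names and row order differ slightly; the DataFrame version labels the count column simply count):

# Expected counts derived from the sample data:
# manager_id | no_of_emp | manager_name
# 4125       | 2         | Mafalda
# 4329       | 3         | Martina
# 4529       | 2         | Nancy
# 4952       | 2         | Sandra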

Tuesday, January 9, 2024

Hive Partitioning and Bucketing:

hive> create table non_partition_table (bankid string, bankrepname string, banktover string, country string, year int)
      row format delimited fields terminated by '|' lines terminated by '\n' stored as textfile;
hive> load data local inpath 'SBA*' into table non_partition_table;
hive> select * from non_partition_table where bankid like '%TZ%';
-- This goes to each of the 7 files in HDFS and searches for TZ.
-- It is a sequential reading process.

STATIC Partitioning:

hive> create table static_partition_table (bankid string, bankrepname string, banktover string)
      PARTITIONED BY (country string, year int)
      row format delimited fields terminated by '|' lines terminated by '\n' stored as textfile;
hive> load data local inpath 'SBA_NM_Data.log' into table static_partition_table;
-- Fails because the table is partitioned and no partition was specified:
-- FAILED: SemanticException [Error 10001]: Line 1:52 Table not found 'parttab'
hive> use batch122;
hive> set hive.cli.print.current.db=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> load data local inpath 'SBA_NM_Data.log' into table static_partition_table PARTITION(country='NAMIBIA', year=2011);
hive> load data local inpath 'SBA_UG_Data.log' into table static_partition_table PARTITION(country='UGANDA', year=2011);
hive> load data local inpath 'SBA_NG_Data.log' into table static_partition_table PARTITION(country='NIGERIA', year=2013);

STATIC partition drawbacks:
1. Only a limited amount of data can be loaded this way; with thousands of files it becomes difficult.
2. For NIGERIA, part of the data is from 2014, but we mistakenly loaded it into the 2013 partition.
3. For TANZANIA, the 2017 and 2012 data have to be loaded separately.

Dynamic Partitioning:
*********************
In dynamic partitioning we create a non-partitioned parent table and a partitioned child table, then insert records from the parent into the child:

INSERT OVERWRITE TABLE <<childtable>> PARTITION(country, year) SELECT * FROM <<parenttable>>;

hive> create table non_partition_table (bankid string, bankrepname string, banktover string, country string, year int)
      row format delimited fields terminated by '|' lines terminated by '\n' stored as textfile;
hive> load data local inpath 'SBA*' into table non_partition_table;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> create table dynamic_partition_table (bankid string, bankrepname string, banktover string)
      PARTITIONED BY (country string, year int)
      row format delimited fields terminated by '|' lines terminated by '\n' stored as textfile;
hive> INSERT OVERWRITE TABLE dynamic_partition_table PARTITION (country, year) select * from non_partition_table;
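
The same dynamic-partition load can also be driven from PySpark; a minimal sketch, assuming a Hive-enabled SparkSession and that the two tables above already exist in the current database:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Hive_Dynamic_Partitioning') \
    .enableHiveSupport().getOrCreate()

# Allow fully dynamic partition values, as in the Hive CLI session above
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Hive derives the country/year partition values from the last two selected columns
spark.sql("""
    INSERT OVERWRITE TABLE dynamic_partition_table PARTITION (country, year)
    SELECT bankid, bankrepname, banktover, country, year
    FROM non_partition_table
""")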

Friday, January 5, 2024

Amazon Redshift Spectrum:

Amazon Redshift Spectrum is a feature within Amazon Web Services' Redshift data warehousing service that lets a data analyst conduct fast, complex analysis on objects stored in the AWS cloud.

With Redshift Spectrum, an analyst can perform SQL queries on data stored in Amazon S3 buckets. This can save time and money because it eliminates the need to move data from a storage service to a database, and instead directly queries data inside an S3 bucket. Redshift Spectrum also expands the scope of a given query because it extends beyond a user's existing Redshift data warehouse nodes and into large volumes of unstructured S3 data lakes.
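
As an illustration only (a minimal sketch; the cluster endpoint, credentials, IAM role, Glue database, and S3 path below are made-up placeholders, not details from this post), the usual flow is to register an external schema, define an external table over the S3 files, and then query it like any other table from a SQL client:

import psycopg2  # Redshift speaks the PostgreSQL wire protocol

# Placeholder connection details
conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="analyst", password="***")
conn.autocommit = True  # external DDL cannot run inside a transaction block
cur = conn.cursor()

# External schema backed by the AWS Glue Data Catalog
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_schema
    FROM DATA CATALOG DATABASE 'spectrum_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
""")

# External table that points at files already sitting in S3
cur.execute("""
    CREATE EXTERNAL TABLE spectrum_schema.sales (
        sale_id   int,
        amount    decimal(10,2),
        sale_date date
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/sales/'
""")

# The query runs against the S3 data directly; nothing is loaded into the cluster first
cur.execute("SELECT sale_date, SUM(amount) FROM spectrum_schema.sales GROUP BY sale_date")
print(cur.fetchall())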

How Redshift Spectrum works

Redshift Spectrum breaks a user query into filtered subsets that are run concurrently. Those requests are spread across thousands of AWS-managed nodes to maintain query speed and consistent performance. Redshift Spectrum can scale to run a query across more than an exabyte of data, and once the S3 data is aggregated, it's sent back to the local Redshift cluster for final processing.

Redshift Spectrum requires a Redshift cluster and a connected SQL client. Multiple clusters can access the same S3 data set at the same time, but queries can be run only on data stored in the same AWS region.

Redshift Spectrum can be used in conjunction with any other AWS compute service with direct S3 access, including Amazon Athena, as well as Amazon Elastic Map Reduce for Apache Spark, Apache Hive and Presto.

Redshift Spectrum vs. Athena

Amazon Athena is similar to Redshift Spectrum, though the two services typically address different needs. An analyst who already works with Redshift will benefit most from Redshift Spectrum because it can quickly access data in the cluster and extend out to infrequently accessed, external tables in S3. It's also better suited for fast, complex queries on multiple data sets.

Alternatively, Athena is a simpler way to run interactive, ad hoc queries on data stored in S3. It doesn't require any cluster management, and an analyst only needs to define a table to make a standard SQL query.

Other cloud vendors also offer similar services, such as Google BigQuery and Microsoft Azure SQL Data Warehouse.

Pricing

Amazon Redshift Spectrum follows a per-use billing model, at $5 per terabyte of data pulled from S3, with a 10 MB minimum per query. AWS recommends that customers compress their data or store it in a column-oriented format to save money. Those costs do not include Redshift cluster and S3 storage fees.
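
As a rough worked example at that rate: a query that scans 2 TB of S3 data would be billed about 2 × $5 = $10, while a query that scans only 5 MB would still be billed as if it had scanned the 10 MB minimum.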

Thursday, January 4, 2024

Python Programs for Interview FAQs

#Q1: Reverse a String in Python

def reverse_string(input_str):
    reversed_str = ''
    for character in input_str:
        reversed_str = character + reversed_str
    return reversed_str

# Example
original_string = "Satyam"
reversed_string = reverse_string(original_string)
print("Original String:", original_string)
print("Reversed String:", reversed_string)

#Q2: Check whether a word is a palindrome in Python

def is_palindrome(input_str):
    # Remove spaces and convert to lowercase for case-insensitive comparison
    cleaned_str = ''.join(input_str.split()).lower()
    # Check if the cleaned string is the same when reversed
    return cleaned_str == cleaned_str[::-1]

# Example
word = input("Enter a word: ")
if is_palindrome(word):
    print(f"{word} is a palindrome.")
else:
    print(f"{word} is not a palindrome.")

#Q3: Write a program for the Fibonacci series

def fibonacci(n):
    fib_series = [0, 1]
    while len(fib_series) < n:
        fib_series.append(fib_series[-1] + fib_series[-2])
    return fib_series

# Example: Generate the first 10 numbers in the Fibonacci series
k = 10
fibonacci_series = fibonacci(k)
print(f"Fibonacci Series (first {k} numbers):", fibonacci_series)

# Output:
# Fibonacci Series (first 10 numbers): [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
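
As a follow-up that often comes up in the same interview question, the reversal can also be written in one line with Python's slice syntax; a minimal sketch, equivalent to the loop-based reverse_string above for plain strings:

# One-line alternatives to the loop above
print("Satyam"[::-1])                 # maytaS
print(''.join(reversed("Satyam")))    # maytaS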