Sunday, January 21, 2024

write a query to find the employee count under each manager

from pyspark.sql import SparkSession

from pyspark.sql.functions import *
spark = SparkSession.builder.appName('Joins_On_Dataframe').master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

data = [('4529', 'Nancy', 'Young', '4125'),
('4238','John', 'Simon', '4329'),
('4329', 'Martina', 'Candreva', '4125'),
('4009', 'Klaus', 'Koch', '4329'),
('4125', 'Mafalda', 'Ranieri', 'NULL'),
('4500', 'Jakub', 'Hrabal', '4529'),
('4118', 'Moira', 'Areas', '4952'),
('4012', 'Jon', 'Nilssen', '4952'),
('4952', 'Sandra', 'Rajkovic', '4529'),
('4444', 'Seamus', 'Quinn', '4329')]
schema = ['employee_id' ,'first_name', 'last_name', 'manager_id']


df = spark.createDataFrame(data=data, schema=schema)

df.createOrReplaceTempView('emp_mgr')
df.show()

query = '''
select e.manager_id as manager_id, m.first_name as manager_name, count(e.employee_id) as no_of_emp
from emp_mgr e
inner join emp_mgr m
on m.employee_id = e.manager_id
group by e.manager_id, m.first_name '''

spark.sql(query).show()

# Self-join the DataFrame to get manager names
result_df = df.alias("e").join(df.alias("m"), col("e.manager_id") == col("m.employee_id"), "inner") \
    .select(col("e.employee_id"), col("e.first_name"), col("e.last_name"),
            col("e.manager_id"), col("m.first_name").alias("manager_first_name"))

result_df.show()

result_df.groupBy("manager_id", "manager_first_name").count().show()

Tuesday, January 9, 2024

Hive Partitioning and Bucketing:

hive> create table non_partition_table (bankid string, bankrepname string, banktover string, country string, year int)
      row format delimited fields terminated by '|' lines terminated by '\n' stored as textfile;
hive> load data local inpath 'SBA*' into table non_partition_table;
hive> select * from non_partition_table where bankid like '%TZ%';
// It will go to each of the 7 files in HDFS and search for 'TZ'.
// It is a sequential reading process.

STATIC Partitioning:

hive> create table static_partition_table (bankid string, bankrepname string, banktover string)
      PARTITIONED BY (country string, year int)
      row format delimited fields terminated by '|' lines terminated by '\n' stored as textfile;
hive> load data local inpath 'SBA_NM_Data.log' into table static_partition_table;
/* failed, as it is a partitioned table */
FAILED: SemanticException [Error 10001]: Line 1:52 Table not found 'parttab'

hive> use batch122;
hive> set hive.cli.print.current.db=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> load data local inpath 'SBA_NM_Data.log' into table static_partition_table PARTITION(country='NIMEBIA', year=2011);
hive> load data local inpath 'SBA_UG_Data.log' into table static_partition_table PARTITION(country='UGANDA', year=2011);
hive> load data local inpath 'SBA_NG_Data.log' into table static_partition_table PARTITION(country='NIGERIA', year=2013);

STATIC partition drawbacks:
1. Only limited data can be loaded this way; if there are thousands of files, loading each one becomes difficult.
2. For NIGERIA, one of the files actually holds 2014 data, but we mistakenly loaded it into the 2013 partition.
3. For TANZANIA, the 2017 and 2012 data have to be loaded separately.

Dynamic Partitioning:
*********************
In DYNAMIC PARTITIONING we create a non-partitioned parent table and a partitioned child table, then insert records from the parent table into the child table:

INSERT OVERWRITE TABLE <<childtable>> PARTITION(country, year) SELECT * FROM <<parenttable>>;

hive> create table non_partition_table (bankid string, bankrepname string, banktover string, country string, year int)
      row format delimited fields terminated by '|' lines terminated by '\n' stored as textfile;
hive> load data local inpath 'SBA*' into table non_partition_table;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> create table dynamic_partition_table (bankid string, bankrepname string, banktover string)
      PARTITIONED BY (country string, year int)
      row format delimited fields terminated by '|' lines terminated by '\n' stored as textfile;
hive> INSERT OVERWRITE TABLE dynamic_partition_table PARTITION (country, year) select * from non_partition_table;
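
For comparison, here is a minimal PySpark sketch of the same dynamic-partitioning idea, writing a Hive-style partitioned table from a non-partitioned DataFrame. The file path, delimiter handling and app name are illustrative assumptions, not the exact environment used above.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("DynamicPartitioningDemo")
         .enableHiveSupport()   # assumes Spark is configured with a Hive metastore
         .getOrCreate())

# Hypothetical source files with the same '|'-delimited layout as the SBA files above
df = (spark.read
      .option("delimiter", "|")
      .csv("/data/sba/SBA_*.log")
      .toDF("bankid", "bankrepname", "banktover", "country", "year"))

# Spark creates one country=<value>/year=<value> directory per partition value
(df.write
   .mode("overwrite")
   .partitionBy("country", "year")
   .saveAsTable("dynamic_partition_table"))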

Friday, January 5, 2024

Amazon Redshift Spectrum:

Amazon Redshift Spectrum is a feature within Amazon Web Services' Redshift data warehousing service that lets a data analyst conduct fast, complex analysis on objects stored in the AWS cloud.

With Redshift Spectrum, an analyst can perform SQL queries on data stored in Amazon S3 buckets. This can save time and money because it eliminates the need to move data from a storage service to a database, and instead directly queries data inside an S3 bucket. Redshift Spectrum also expands the scope of a given query because it extends beyond a user's existing Redshift data warehouse nodes and into large volumes of unstructured S3 data lakes.

How Redshift Spectrum works

Redshift Spectrum breaks a user query into filtered subsets that are run concurrently. Those requests are spread across thousands of AWS-managed nodes to maintain query speed and consistent performance. Redshift Spectrum can scale to run a query across more than an exabyte of data, and once the S3 data is aggregated, it's sent back to the local Redshift cluster for final processing.

Using Redshift Spectrum requires a Redshift cluster and a connected SQL client. Multiple clusters can access the same S3 data set at the same time, but queries can only be run against data stored in the same AWS region.

Redshift Spectrum can be used in conjunction with any other AWS compute service with direct S3 access, including Amazon Athena, as well as Amazon Elastic Map Reduce for Apache Spark, Apache Hive and Presto.
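
As a rough illustration of the workflow, the sketch below uses Python with psycopg2 (Redshift speaks the PostgreSQL protocol) to define an external schema backed by the AWS Glue Data Catalog and then query an S3-backed external table. The connection details, IAM role, database, schema and table names are all hypothetical, and the external table is assumed to already be defined over files in S3.

import psycopg2  # standard PostgreSQL driver, also works for Redshift

# Hypothetical connection details for an existing Redshift cluster
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="***")
conn.autocommit = True
cur = conn.cursor()

# One-time setup: external schema pointing at the Glue Data Catalog (role ARN is hypothetical)
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_schema
    FROM DATA CATALOG DATABASE 'sales_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
""")

# Query an external table whose files live in S3; Spectrum performs the S3 scan
cur.execute("SELECT region, SUM(price) FROM spectrum_schema.sales GROUP BY region")
for row in cur.fetchall():
    print(row)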

Redshift Spectrum vs. Athena

Amazon Athena is similar to Redshift Spectrum, though the two services typically address different needs. An analyst that already works with Redshift will benefit most from Redshift Spectrum because it can quickly access data in the cluster and extend out to infrequently accessed, external tables in S3. It's also better suited for fast, complex queries on multiple data sets.

Alternatively, Athena is a simpler way to run interactive, ad hoc queries on data stored in S3. It doesn't require any cluster management, and an analyst only needs to define a table to make a standard SQL query.

Other cloud vendors also offer similar services, such as Google BigQuery and Microsoft Azure SQL Data Warehouse.

Pricing

Amazon Redshift Spectrum follows a per-use billing model, at $5 per terabyte of data pulled from S3, with a 10 MB minimum query. AWS recommends that a customer compresses its data or stores it in column-oriented form to save money. Those costs do not include Redshift cluster and S3 storage fees.
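
As a quick worked example at that rate, a query that scans 2 TB of S3 data costs about $10, while a query that reads only a few kilobytes is still billed for the 10 MB minimum (a small fraction of a cent).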

Thursday, January 4, 2024

Python Programs for interview FAQ

 #Q1: Reverse a String in Python

def reverse_string(input_str):
    reversed_str = ''
    for character in input_str:
        reversed_str = character + reversed_str
    return reversed_str

# Example
original_string = "Satyam"
reversed_string = reverse_string(original_string)
print("Original String:", original_string)
print("Reversed String:", reversed_string)

#Q2: Check whether a word is a palindrome in Python

def is_palindrome(input_str):
    # Remove spaces and convert to lowercase for case-insensitive comparison
    cleaned_str = ''.join(input_str.split()).lower()
    # Check if the cleaned string is the same when reversed
    return cleaned_str == cleaned_str[::-1]

# Example
word = input("Enter a word: ")
if is_palindrome(word):
    print(f"{word} is a palindrome.")
else:
    print(f"{word} is not a palindrome.")

#Q3: Write a program for the Fibonacci series

def fibonacci(n):
    fib_series = [0, 1]
    while len(fib_series) < n:
        fib_series.append(fib_series[-1] + fib_series[-2])
    return fib_series[:n]   # slice so n = 0 or 1 also returns the right length

# Example: Generate the first 10 numbers in the Fibonacci series
k = 10
fibonacci_series = fibonacci(k)
print(f"Fibonacci Series (first {k} numbers):", fibonacci_series)

# Output:
# Fibonacci Series (first 10 numbers): [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

Wednesday, December 20, 2023

Remove Duplicate Items from a Python List

#Remove duplicate items from a Python list
old_list = ['k', 'l', 'k', 'm', 'm', 'n', 'o']
new_list = []
for element in old_list:
    if element not in new_list:
        new_list.append(element)
print(new_list)   # ['k', 'l', 'm', 'n', 'o']

#2nd approach:
old_list = ['k', 'l', 'k', 'm', 'm', 'n', 'o']
# dict.fromkeys(old_list)        -> {'k': None, 'l': None, 'm': None, 'n': None, 'o': None}
# dict.fromkeys(old_list).keys() -> dict_keys(['k', 'l', 'm', 'n', 'o'])
new_list = list(dict.fromkeys(old_list).keys())
print(new_list)   # ['k', 'l', 'm', 'n', 'o']
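
A short note on why the second approach keeps the original order: since Python 3.7, regular dicts preserve insertion order, so dict.fromkeys keeps the first occurrence of each item. If order does not matter, a plain set is a simpler alternative (shown here only as an illustrative variation):

old_list = ['k', 'l', 'k', 'm', 'm', 'n', 'o']
unique_items = list(set(old_list))   # order is not guaranteed
print(unique_items)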


Functional Programming and OOP Comparison Table

Basis for Comparison | Functional Programming | OOP
Definition | Functional programming emphasizes the evaluation of functions. | Object-oriented programming is based on the concept of objects.
Data | Functional programming uses immutable data. | Object-oriented programming uses mutable data.
Model | Functional programming follows a declarative programming model. | Object-oriented programming follows an imperative programming model.
Support | Functional programming supports parallel programming. | Object-oriented programming does not support parallel programming.
Execution | In functional programming, statements can be executed in any order. | In OOP, statements should be executed in a particular order.
Iteration | In functional programming, recursion is used for iteration. | In OOP, loops are used for iteration.
Elements | The basic elements of functional programming are variables and functions. | The basic elements of object-oriented programming are objects and methods.
Use | Functional programming fits cases with few things and many operations. | Object-oriented programming fits cases with many things and few operations.
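
To make the Data and Iteration rows concrete, here is a small illustrative Python sketch (not from the original post) that sums a list of prices both ways: a recursive function that never mutates its input versus a class that updates its own state in a loop.

# Functional style: no mutation, recursion instead of a loop
def total(prices):
    if not prices:
        return 0
    return prices[0] + total(prices[1:])

# OOP style: an object holds mutable state and updates it imperatively
class Cart:
    def __init__(self):
        self.total = 0          # mutable state

    def add_all(self, prices):
        for p in prices:        # imperative iteration
            self.total += p
        return self.total

prices = [10, 40, 55]
print(total(prices))            # 105
cart = Cart()
print(cart.add_all(prices))     # 105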

Sunday, December 17, 2023

Dense_Rank with Group function

 Source_Path: "E:\source_files\CUSTOMER_CGI.CSV"

sdate,custid,pname,region,pcategory,price
01-01-2019,111,pen,NY,stationary,10.00
01-01-2019,111,pencil,NY,stationary,40.00
01-01-2019,222,pen,NY,stationary,10.00
01-01-2019,333,pen,NY,stationary,10.00
01-01-2019,444,book,CT,stationary,55.00
01-01-2019,222,pen,NY,stationary,10.00
01-02-2019,333,pen,NY,stationary,10.00
01-02-2019,333,pen,NY,stationary,10.00
01-02-2019,111,slate,NY,stationary,40.00
01-02-2019,111,slate,NY,stationary,40.00
01-02-2019,555,bag,NY,stationary,100.00
01-03-2019,444,pen,CT,stationary,10.00
01-03-2019,444,pencil,CT,stationary,15.00
01-03-2019,444,pencil,CT,stationary,15.00
01-03-2019,444,pen,CT,stationary,10.00
01-03-2019,111,pen,NY,stationary,10.00
01-03-2019,111,pen,NY,stationary,10.00
01-04-2019,111,pen,NY,stationary,10.00
01-04-2019,222,pen,NY,stationary,10.00
01-04-2019,333,pen,NY,stationary,10.00
01-04-2019,444,choco,CT,stationary,50.00
01-04-2019,444,choco,CT,stationary,50.00

SET LINES 100
ALTER SESSION SET NLS_DATE_FORMAT = 'DD-MM-YYYY';

select * from CUSTOMER_CGI ORDER BY SDATE;

SDATE      CUSTID     PNAME      REGION     PCATEGORY  PRICE
---------- ---------- ---------- ---------- ---------- ----------
01-01-2019 333        pen        NY         stationary 10
01-01-2019 222        pen        NY         stationary 10
01-01-2019 111        pencil     NY         stationary 40
01-01-2019 111        pen        NY         stationary 10
01-01-2019 444        book       CT         stationary 55
01-01-2019 222        pen        NY         stationary 10
01-02-2019 111        slate      NY         stationary 40
01-02-2019 111        slate      NY         stationary 40
01-02-2019 333        pen        NY         stationary 10
01-02-2019 333        pen        NY         stationary 10
01-02-2019 555        bag        NY         stationary 100
01-03-2019 444        pen        CT         stationary 10
01-03-2019 444        pencil     CT         stationary 15
01-03-2019 444        pencil     CT         stationary 15
01-03-2019 444        pen        CT         stationary 10
01-03-2019 111        pen        NY         stationary 10
01-03-2019 111        pen        NY         stationary 10
01-04-2019 111        pen        NY         stationary 10
01-04-2019 222        pen        NY         stationary 10
01-04-2019 333        pen        NY         stationary 10
01-04-2019 444        choco      CT         stationary 50
01-04-2019 444        choco      CT         stationary 50

By SQL Method:

Q1: -- Get the customer details with the highest sales per day (date-wise highest sales)
SELECT *
  FROM (SELECT custid,
               sdate,
               SUM(price) AS total_price,
               DENSE_RANK() OVER (PARTITION BY sdate ORDER BY SUM(price) DESC) AS rnk
               -- DESC for the maximum, ASC for the minimum
          FROM CUSTOMER_CGI
         GROUP BY sdate, custid) tt
 WHERE tt.rnk = 1;

Q2: -- Get the customer details (with date) where the total sales price is the highest overall
SELECT *
  FROM (SELECT custid,
               sdate,
               SUM(price) AS total_price,
               DENSE_RANK() OVER (ORDER BY SUM(price) DESC) AS rnk
               -- DESC for the maximum, ASC for the minimum
          FROM CUSTOMER_CGI
         GROUP BY sdate, custid) tt
 WHERE tt.rnk = 1;

By Dataframe Method:
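
The post stops before the DataFrame version; a minimal PySpark sketch for Q1 (date-wise highest total sales per customer) might look like the following, assuming the CSV at the Source_Path above is readable with its header row and that numeric types can be inferred.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum, dense_rank
from pyspark.sql.window import Window

spark = SparkSession.builder.appName('Dense_Rank_With_Group').master("local[*]").getOrCreate()

# Assumed load of the source file shown above
cust_df = (spark.read
           .option("header", True)
           .option("inferSchema", True)
           .csv(r"E:\source_files\CUSTOMER_CGI.CSV"))

# Q1: total sales per customer per day
daily_totals = cust_df.groupBy("sdate", "custid").agg(_sum("price").alias("total_price"))

# Rank customers within each date by total sales (desc for maximum, asc for minimum)
w = Window.partitionBy("sdate").orderBy(col("total_price").desc())

(daily_totals
 .withColumn("rnk", dense_rank().over(w))
 .filter(col("rnk") == 1)
 .show())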