- Worked on analyzing the Hadoop cluster using different big data analytic tools, including Pig, Hive, and MapReduce on EC2.
- Involved in the complete Software Development Life Cycle (SDLC) to develop the application.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.

Environment: HDFS, YARN, MapReduce, Hive, Sqoop, Flume, Oozie, HBase, Kafka, Impala, Spark SQL, Spark Streaming, Eclipse, Oracle, Teradata, PL/SQL, UNIX Shell Scripting, Cloudera.

- Used Jira for bug tracking and Bitbucket to check in and check out code changes.
- Orchestrated a number of Sqoop and Hive scripts using Oozie workflows and scheduled them with the Oozie coordinator.
- Developed shell scripts for running Hive scripts in Hive and Impala.
- Integrated Hive with Tableau Desktop reports and published them to Tableau Server.
- Involved in designing and developing HBase tables and storing aggregated data from Hive tables in them.
- Developed Kafka consumer APIs in Scala for consuming data from Kafka topics (a sketch follows this list).
- Developed Spark Core and Spark SQL scripts in Scala for faster data processing.
- Developed Spark scripts to import large files from Amazon S3 buckets (sketched below).
- Collected JSON data from an HTTP source and developed Spark APIs that perform inserts and updates in Hive tables (sketched below).
- Developed a Flume ETL job handling data from an HTTP source with HDFS as the sink.
- Involved in performance tuning of Hive from the design, storage, and query perspectives.
- Ran Hive scripts through Hive, Impala, Hive on Spark, and some through Spark SQL.
- Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression, then loaded data into the Parquet Hive tables from Avro Hive tables (sketched below).
- Developed Sqoop jobs to import data in Avro file format from an Oracle database and created Hive tables on top of it.
- Developed a Spark API to import data into HDFS from Teradata and created Hive tables on top of it (sketched below).
- Involved in the complete big data flow of the application, from ingesting upstream data into HDFS to processing and analyzing that data in HDFS.
- Experience with different compression techniques such as Gzip, LZO, Snappy, and Bzip2.
- Experienced in working with different file formats: Avro, Parquet, RC, and ORC.
- Experience with Azure components such as Azure SQL Database and Data Factory.
- Experience with AWS components such as Amazon EC2 instances, S3 buckets, CloudFormation templates, and the Boto library.
- Experience in scheduling jobs using the Oozie coordinator, Oozie bundles, and crontab.
- Rich experience in automating Sqoop and Hive queries using Oozie workflows.
- Experience in creating DStreams from sources such as Flume and Kafka and performing different Spark transformations and actions on them (sketched below).
- Experienced in developing Spark applications using the Spark Core, Spark SQL, and Spark Streaming APIs.
- Experience in designing tables and views for reporting using Impala.
- Experience in developing Hive UDFs and running Hive scripts on different execution engines such as Tez and Spark (Hive on Spark).
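A minimal sketch of the Kafka consumer work described above, using the plain Kafka clients API from Scala. The broker address, consumer group, and topic name are placeholders rather than details from the original project:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

object EventConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder broker
    props.put("group.id", "event-consumer-group")    // placeholder group id
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("auto.offset.reset", "earliest")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("events")) // placeholder topic

    try {
      while (true) {
        // Poll the topic and process each record in the returned batch.
        val records = consumer.poll(Duration.ofMillis(500))
        for (record <- records.asScala)
          println(s"offset=${record.offset} key=${record.key} value=${record.value}")
      }
    } finally consumer.close()
  }
}
```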
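The S3 import bullet is essentially one read and one write in Spark once the `s3a://` connector is on the classpath; the bucket, prefix, and HDFS target below are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object S3Import {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("S3Import").getOrCreate()

    // Hypothetical bucket and prefix; credentials are assumed to come from an
    // instance profile or the usual fs.s3a.* Hadoop configuration.
    val df = spark.read
      .option("header", "true")
      .csv("s3a://example-bucket/raw/2019/*.csv")

    // Land the data in HDFS as Parquet for downstream Hive processing.
    df.write.mode("overwrite").parquet("hdfs:///data/raw/2019")

    spark.stop()
  }
}
```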
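One way the JSON insert-and-update bullet can work in practice is the overwrite-and-deduplicate pattern, since plain Hive tables have no in-place UPDATE. The landing path, table, and column names (`events_hive`, `event_id`, `ts`) are assumptions for the sketch, not the original schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object JsonToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JsonToHive")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical landing path where the HTTP source drops JSON files.
    val incoming = spark.read.json("hdfs:///landing/http_events/")
      .select("event_id", "payload", "ts") // hypothetical columns
    val existing = spark.table("events_hive")

    // Union old and new rows and keep the newest row per key: this emulates
    // insert-plus-update for a table format without in-place updates.
    val newestPerKey = Window.partitionBy("event_id").orderBy(col("ts").desc)
    val merged = existing.unionByName(incoming)
      .withColumn("rn", row_number().over(newestPerKey))
      .where("rn = 1")
      .drop("rn")

    // Stage the merged result first so we never overwrite a table while
    // reading from it, then swap it into the target table.
    merged.write.mode("overwrite").saveAsTable("events_hive_staging")
    spark.table("events_hive_staging")
      .write.mode("overwrite").insertInto("events_hive")

    spark.stop()
  }
}
```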
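For the partitioned Parquet tables, the sketch below uses Spark SQL with Hive support, matching the Hive-on-Spark and Spark SQL bullets; table and column names are hypothetical, and only the partitioning-plus-Snappy part is shown. Bucketing would be added with a `CLUSTERED BY ... INTO n BUCKETS` clause in the DDL:

```scala
import org.apache.spark.sql.SparkSession

object AvroToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AvroToParquet")
      .enableHiveSupport()
      .getOrCreate()

    // Snappy is commonly the default Parquet codec; set it explicitly anyway.
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
    // Allow writing all partitions of the target in one dynamic insert.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // Hypothetical schema; the source Avro table is assumed to exist already.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS sales_parquet (
        |  order_id BIGINT,
        |  amount   DOUBLE)
        |PARTITIONED BY (order_date STRING)
        |STORED AS PARQUET""".stripMargin)

    // Dynamic partition column goes last in the SELECT list.
    spark.sql(
      """INSERT OVERWRITE TABLE sales_parquet PARTITION (order_date)
        |SELECT order_id, amount, order_date FROM sales_avro""".stripMargin)

    spark.stop()
  }
}
```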
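The Teradata-to-HDFS import can be done with Spark's generic JDBC reader; the host, database, table, and credential handling below are placeholders, and the Teradata JDBC driver jar is assumed to be on the driver and executor classpaths:

```scala
import org.apache.spark.sql.SparkSession

object TeradataImport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TeradataImport")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder connection details; the password is read from the
    // environment rather than hard-coded.
    val orders = spark.read.format("jdbc")
      .option("url", "jdbc:teradata://td-host/DATABASE=sales")
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("dbtable", "orders")
      .option("user", "etl_user")
      .option("password", sys.env.getOrElse("TD_PASSWORD", ""))
      .load()

    // saveAsTable registers a Hive table whose files live in HDFS.
    orders.write.mode("overwrite").saveAsTable("staging.orders")

    spark.stop()
  }
}
```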
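Finally, the DStream bullet maps onto the spark-streaming-kafka-0-10 direct stream API; the brokers, group id, topic, and batch interval below are illustrative only:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaDStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaDStream")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092", // placeholder brokers
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "stream-group",   // placeholder group id
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // A transformation chain (map, filter) followed by an action (print)
    // that runs once per batch interval.
    stream.map(_.value)
      .filter(_.nonEmpty)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```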