to join our dynamic team for a long-term contract. The ideal candidate should have a solid understanding of distributed computing, data pipeline design, and large-scale data processing.
Key Responsibilities:
Develop and maintain scalable, reliable big data pipelines using Hadoop, Python, and PySpark (a minimal sketch follows this list)
Optimize and troubleshoot Spark jobs for performance and efficiency
Collaborate with data scientists, analysts, and other stakeholders to understand data requirements
Integrate data from various sources, ensuring consistency and quality
Implement data governance, security, and privacy best practices
Perform unit testing and validation of data pipelines and processes
Participate in design reviews and contribute to architecture decisions
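To give a concrete picture of the pipeline work described in the first responsibility, a minimal PySpark batch job is sketched below. It is a sketch only: the input path, column names, and Hive table name are hypothetical placeholders rather than details of an actual project.

# Minimal PySpark batch pipeline sketch; the HDFS path, columns, and
# Hive table name are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("daily_orders_pipeline")   # hypothetical job name
    .enableHiveSupport()                # results land in a Hive table
    .getOrCreate()
)

# Read raw events from HDFS (hypothetical location and format)
raw = spark.read.parquet("hdfs:///data/raw/orders/")

# Basic cleansing and daily aggregation
daily_totals = (
    raw.filter(F.col("status") == "COMPLETED")
       .withColumn("order_date", F.to_date("created_at"))
       .groupBy("order_date", "region")
       .agg(F.sum("amount").alias("total_amount"),
            F.count("*").alias("order_count"))
)

# Persist to a partitioned Hive table (hypothetical table name)
(daily_totals.write
    .mode("overwrite")
    .partitionBy("order_date")
    .saveAsTable("analytics.daily_order_totals"))

spark.stop()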
Technical Skills Required:
Strong experience in the Hadoop ecosystem (HDFS, Hive, YARN, etc.)
Proficient in Python programming for data manipulation and scripting
Hands-on experience with PySpark for distributed data processing
Experience with data ingestion tools and frameworks (Kafka, Sqoop, Flume, etc.)
Good understanding of performance tuning and optimization techniques in Spark (an illustrative configuration sketch follows this list)
Familiarity with version control tools like Git and CI/CD pipelines
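As an illustration of the Spark performance tuning mentioned in the skills list, the sketch below sets a few common configuration options from PySpark. The values shown and the input path are placeholders; suitable settings depend on cluster size, data volume, and workload, so treat this as a starting point rather than a recommendation.

# Illustrative Spark tuning knobs set from PySpark; values are
# placeholders, not recommendations for any particular cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned_job")                           # hypothetical job name
    .config("spark.sql.shuffle.partitions", "400")  # size shuffle partitions to the data volume
    .config("spark.sql.adaptive.enabled", "true")   # let AQE coalesce small shuffle partitions
    .config("spark.serializer",
            "org.apache.spark.serializer.KryoSerializer")  # faster than default Java serialization
    .getOrCreate()
)

df = spark.read.parquet("hdfs:///data/raw/events/")  # hypothetical input

# Repartition on the aggregation key to reduce shuffle skew, and cache
# only because the result is reused below.
events = df.repartition("customer_id").cache()
print(events.count())                                 # trigger evaluation and caching
print(events.groupBy("customer_id").count().count())  # reuse the cached data

spark.stop()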
Preferred Qualifications:
Experience with cloud platforms (AWS, Azure, or GCP) for data processing
Knowledge of SQL and NoSQL databases
Familiarity with workflow orchestration tools (Airflow, Oozie, etc.); see the sketch after this list
Excellent problem-solving and communication skills
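For the workflow orchestration item above, a minimal Airflow DAG that schedules a daily spark-submit run might look like the sketch below (assuming Airflow 2.4 or later for the schedule argument). The DAG id, schedule, and command are hypothetical.

# Minimal Airflow DAG sketch; the DAG id, schedule, and command are
# hypothetical, and Airflow 2.4+ is assumed for the `schedule` argument.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,     # do not backfill past runs
) as dag:
    # Run the PySpark batch job from the earlier sketch on YARN
    run_spark_job = BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit --master yarn daily_orders_pipeline.py",
    )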
Education:
Bachelor's or Master's degree in Computer Science, IT, or a related field