CLOUDSUFI, a Google Cloud Premier Partner, is a leading global provider of data-driven digital transformation for cloud-based enterprises. With a global presence and a focus on Software & Platforms, Life Sciences & Healthcare, Retail, CPG, Financial Services, and Supply Chain, CLOUDSUFI is positioned to meet customers where they are in their data monetization journey.
Our Values
We are a passionate and empathetic team that prioritizes human values. Our purpose is to elevate the quality of life for our families, customers, partners, and the community.
Equal Opportunity Statement
CLOUDSUFI is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. All qualified candidates receive consideration for employment without regard to race, colour, religion, gender, gender identity or expression, sexual orientation, or national origin. We provide equal opportunities in employment, advancement, and all other areas of our workplace. Please explore more at https://www.cloudsufi.com/
Location:
Noida, Uttar Pradesh, India (Hybrid)
Job Summary
We are seeking a highly skilled and motivated Data Engineer to join our Development POD for the Integration Project. The ideal candidate will be responsible for designing, building, and maintaining robust data pipelines to ingest, clean, transform, and integrate diverse public datasets into our knowledge graph. This role requires a strong understanding of Google Cloud Platform (GCP) services, data engineering best practices, and a commitment to data quality and scalability.
Key Responsibilities
ETL Development:
Design, develop, and optimize data ingestion, cleaning, and transformation pipelines for various data sources (e.g., CSV, API, XLS, JSON, SDMX) using Google Cloud Platform services (Cloud Run, Dataflow) and Python.
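By way of illustration, the sketch below outlines one minimal ingest-clean-load step of the kind such a pipeline might contain; the bucket, file, and table names are hypothetical, and a production pipeline on Cloud Run or Dataflow would add retries, logging, and schema enforcement.

```python
# Minimal sketch of one ingest -> clean -> load step (hypothetical names throughout).
import io

import pandas as pd
from google.cloud import bigquery, storage


def ingest_csv_to_bigquery(bucket_name: str, blob_path: str, table_id: str) -> None:
    # Download the raw CSV from Cloud Storage.
    blob = storage.Client().bucket(bucket_name).blob(blob_path)
    df = pd.read_csv(io.BytesIO(blob.download_as_bytes()))

    # Basic cleaning: normalize column names and drop fully empty rows.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df = df.dropna(how="all")

    # Load the transformed frame into BigQuery and wait for the job to finish.
    bigquery.Client().load_table_from_dataframe(df, table_id).result()


if __name__ == "__main__":
    ingest_csv_to_bigquery("example-raw-data", "public/indicators.csv",
                           "example_project.staging.indicators")
```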
Data Modelling & Storage Design:
Perform data modelling and architecture design for storing structured, semi-structured, and unstructured data across databases and GCS. Design highly scalable schemas to support taxonomy, metadata, transactional data, and hierarchical relationships. Ensure models support efficient querying, versioning, extensibility, and downstream analytics use cases.
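As a rough illustration of the adjacency-list pattern often used for taxonomies and other hierarchical data, the sketch below defines a self-referencing table in SQLAlchemy; the table, column, and version fields are hypothetical.

```python
# Hypothetical self-referencing taxonomy table supporting hierarchical relationships.
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class TaxonomyNode(Base):
    __tablename__ = "taxonomy_node"

    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False, index=True)
    version = Column(String, nullable=False, default="v1")  # explicit versioning
    parent_id = Column(Integer, ForeignKey("taxonomy_node.id"), nullable=True)

    # Children resolve through the self-referencing foreign key (adjacency list).
    children = relationship("TaxonomyNode")


if __name__ == "__main__":
    # Create the schema in an in-memory SQLite database for a quick sanity check.
    Base.metadata.create_all(create_engine("sqlite:///:memory:"))
```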
Database & SQL Expertise:
Demonstrate strong expertise in relational databases (e.g., PostgreSQL), including:
Writing complex SQL queries, joins, and subqueries
Designing tables, indexes, keys, and constraints
Developing stored procedures, functions, and views
Optimizing database performance and ensuring data consistency and integrity
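The snippet below gives a flavour of this kind of query work: a join with a subquery against a hypothetical PostgreSQL schema, run through psycopg2. Connection details and table names are illustrative only.

```python
# Illustrative only: a join plus a subquery against hypothetical tables.
import psycopg2

QUERY = """
SELECT d.dataset_id,
       d.name,
       COUNT(o.observation_id) AS observation_count
FROM   dataset d
JOIN   observation o ON o.dataset_id = d.dataset_id
WHERE  d.updated_at > (
           SELECT MAX(run_finished_at) FROM pipeline_run WHERE status = 'success'
       )
GROUP  BY d.dataset_id, d.name
ORDER  BY observation_count DESC;
"""


def recently_updated_report(dsn: str) -> list[tuple]:
    # Datasets updated since the last successful pipeline run, with observation counts.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        return cur.fetchall()
```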
Data Hunting & Ingestion:
Proactively identify, evaluate, and hunt for high-quality data sources for specific technologies, domains, and business use cases. Build automation to ingest newly discovered datasets into the data corpus with minimal manual effort. Leverage LLM APIs to handle unknown schemas, unstructured inputs, and edge cases in data extraction and ingestion.
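One hedged sketch of that idea: send a small sample of an unfamiliar file to an LLM endpoint and ask it to propose a column schema. The endpoint URL and response shape below are purely hypothetical placeholders for whichever LLM API the team standardizes on.

```python
# Hypothetical: ask an LLM endpoint to propose a schema for an unfamiliar delimited file.
import json

import requests

LLM_ENDPOINT = "https://llm.example.com/v1/complete"  # placeholder URL


def propose_schema(sample_rows: list[str]) -> dict:
    prompt = (
        "Given these sample rows from an unknown dataset, return a JSON object "
        "mapping column names to types (string, integer, float, date):\n"
        + "\n".join(sample_rows)
    )
    resp = requests.post(LLM_ENDPOINT, json={"prompt": prompt}, timeout=60)
    resp.raise_for_status()
    # Assumes the endpoint returns {"text": "<json schema>"}; validate before trusting it.
    return json.loads(resp.json()["text"])
```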
Data Validation & Quality Assurance:
Implement comprehensive data validation and quality checks (statistical, schema, anomaly detection, consistency) to ensure data integrity, accuracy, and freshness. Troubleshoot and resolve data quality errors.
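A minimal sketch of such checks, assuming a pandas DataFrame and illustrative column names and thresholds:

```python
# Minimal validation pass: required columns, null rates, and a simple z-score anomaly check.
import pandas as pd

REQUIRED_COLUMNS = {"country_code", "year", "value"}  # illustrative schema


def validate(df: pd.DataFrame) -> list[str]:
    issues = []

    # Schema check: every required column must be present.
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")

    # Consistency check: flag excessive nulls in the measure column.
    null_rate = df["value"].isna().mean() if "value" in df else 1.0
    if null_rate > 0.05:  # illustrative threshold
        issues.append(f"null rate too high: {null_rate:.1%}")

    # Statistical check: simple z-score outlier detection.
    if "value" in df and df["value"].std(ddof=0) > 0:
        z = (df["value"] - df["value"].mean()) / df["value"].std(ddof=0)
        if (z.abs() > 4).any():
            issues.append("possible outliers (|z| > 4)")

    return issues
```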
Knowledge Graph Integration:
Integrate transformed data into the Knowledge Graph, ensuring proper versioning and adherence to existing standards.
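For orientation, a record destined for the knowledge graph might be serialized as JSON-LD using the Schema.org vocabulary; the identifiers and version value below are illustrative, not the project's actual conventions.

```python
# Illustrative JSON-LD record using the Schema.org Dataset type (identifiers are made up).
import json

node = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "@id": "https://example.org/dataset/population-by-country",
    "name": "Population by Country",
    "version": "2024-06-01",  # explicit versioning on ingest
    "isBasedOn": "https://example.org/source/un-wpp",
}

print(json.dumps(node, indent=2))
```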
Collaboration:
Work closely with cross-functional teams and relevant stakeholders.
Qualifications and Skills
Education:
Bachelor's or Master's degree in Computer Science, Data Engineering, Information Technology, or a related quantitative field.
Experience:
3+ years of proven experience as a Data Engineer, with a strong portfolio of successfully implemented data pipelines.
Programming Languages:
Proficiency in Python for data manipulation, scripting, and pipeline development.
Cloud Platforms and Tools:
Expertise in Google Cloud Platform (GCP) services, including Cloud Storage, Cloud SQL, Cloud Run, Dataflow, Pub/Sub, BigQuery, and Apigee. Proficiency with Git-based version control.
Core Competencies:
Solid understanding of data modelling, schema design, ETL, Python, and knowledge graph concepts (e.g., Schema.org, RDF, SPARQL, JSON-LD).
Experience with data validation techniques and tools.
Familiarity with CI/CD practices and the ability to work in an Agile framework.
Strong problem-solving skills and keen attention to detail.
Preferred Qualifications:
Experience with LLM-based tools or concepts for data automation (e.g., auto-schematization).
Familiarity with similar large-scale public dataset integration initiatives.
Experience with multilingual data integration.