Job Title: Python Developer with PySpark
Location: Northampton
Job Type: Contract
About the Role:
We are seeking a skilled Python Developer with expertise in PySpark to join our dynamic team. The ideal candidate will have strong experience in building and optimizing large-scale data processing pipelines and a deep understanding of distributed data systems. You will play a key role in designing and implementing data solutions that drive critical business decisions.
Key Responsibilities:
- Develop, optimize, and maintain large-scale data pipelines using PySpark and Python.
- Collaborate with data engineers, analysts, and stakeholders to gather requirements and implement data solutions.
- Perform ETL (Extract, Transform, Load) processes on large datasets and ensure efficient data workflows.
- Analyze and debug data processing issues to ensure accuracy and reliability of pipelines.
- Work with distributed computing frameworks to handle large datasets efficiently.
- Develop reusable components, libraries, and frameworks for data processing.
- Optimize PySpark jobs for performance and scalability.
- Integrate data pipelines with cloud platforms such as AWS, Azure, or Google Cloud, where applicable.
- Monitor and troubleshoot production data pipelines to minimize downtime and data issues.
Key Skills and Qualifications:
Technical Skills:
- Strong programming skills in Python with hands-on experience in PySpark.
- Experience with distributed data processing frameworks (e.g., Spark).
- Proficiency in SQL for querying and transforming data.
- Understanding of data partitioning, serialization formats (Parquet, ORC, Avro), and data compression techniques.
- Familiarity with Big Data technologies such as Hadoop, Hive, and Kafka (preferred but not required).
Cloud Platforms (Preferred):
- Hands-on experience with AWS services like S3, EMR, Glue, or Redshift.
- Knowledge of Azure Data Lake, Databricks, or Google BigQuery is a plus.
Additional Tools and Frameworks:
- Familiarity with CI/CD pipelines and version control tools (Git, Jenkins).
- Experience with orchestration tools like Apache Airflow or Luigi.
- Understanding of containerization and orchestration tools like Docker and Kubernetes (preferred).
Education and Experience:
- Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field.
- 5+ years of experience in Python programming.
- 4+ years of hands-on experience with PySpark.
- Experience with Big Data ecosystems and tools.