

# ETL Pipeline with Airflow, Spark, s3, MongoDB and Amazon Redshift

Educational project on how to build an ETL (Extract, Transform, Load) data pipeline, orchestrated with Airflow.

An AWS s3 bucket is used as a Data Lake in which json files are stored. The data is extracted from a json file and parsed (cleaned). It is then transformed/processed with Spark (PySpark) and loaded/stored in either a MongoDB database or in an Amazon Redshift Data Warehouse.

The pipeline architecture - author's interpretation:

Note: Since this project was built for learning purposes and as an example, it functions only for a single scenario and data schema.

The project is built in Python and it has 2 main parts:

- The Airflow DAG file, dags/dagRun.py, which orchestrates the data pipeline tasks (a minimal sketch of such a DAG is shown below).
- The PySpark data transformation/processing script, located in sparkFiles/sparkProcess.py.

Note: The code and especially the comments in the python files dags/dagRun.py and sparkFiles/sparkProcess.py are intentionally verbose, for a better understanding of the functionality.
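For orientation, here is a minimal, hypothetical sketch of how a DAG like analyze_json_data could wire the pipeline steps together. It is not the actual dags/dagRun.py: the operator choice (PythonOperator), the task names, the schedule and the placeholder callables are all assumptions.

```python
# Hypothetical sketch, not the project's dagRun.py: shows one way the
# analyze_json_data DAG could order extract -> transform -> load tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_s3():
    """Placeholder: pull the raw COVID json from the s3 data lake."""


def transform_with_spark():
    """Placeholder: run sparkFiles/sparkProcess.py to compute day-to-day differences."""


def load_to_sink():
    """Placeholder: write the processed data to MongoDB or Amazon Redshift."""


with DAG(
    dag_id="analyze_json_data",
    start_date=datetime(2021, 1, 1),  # assumed start date
    schedule_interval="@daily",       # assumed schedule
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_from_s3", python_callable=extract_from_s3)
    transform = PythonOperator(task_id="transform_with_spark", python_callable=transform_with_spark)
    load = PythonOperator(task_id="load_to_sink", python_callable=load_to_sink)

    extract >> transform >> load
```
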
The Romanian COVID-19 data, provided by , contains COVID-19 data for each county, including the total COVID numbers from one day to the next. It does not contain the difference in numbers between the days (i.e. for county X in day 1 there were 7 cases, in day 2 there were 37 cases). This data is loaded as a json file in the s3 bucket.

The goal is to find the differences between days for all counties (i.e. for county X there were 30 more cases in day 2 than in day 1). If the difference is smaller than 0 (e.g. because of a data recording error), then the difference for that day should be 0 (see the PySpark sketch after the list of tools below).

Tools used include:

- Apache Spark, specifically the PySpark api (wikipedia page)
- Amazon Web Services (AWS) (wikipedia page)
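To make that transformation concrete, here is a minimal PySpark sketch of the day-over-day difference logic, with negative differences clamped to 0. It is not the code in sparkFiles/sparkProcess.py; the column names (county, date, total_cases) and the inline sample rows are assumptions chosen to mirror the county X example above.

```python
# Minimal sketch (not sparkFiles/sparkProcess.py): compute per-county differences
# between consecutive days and clamp negative values (data recording errors) to 0.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("covid-day-differences").getOrCreate()

# Illustrative data: cumulative totals per county per day (column names are assumed).
df = spark.createDataFrame(
    [("X", "2021-01-01", 7), ("X", "2021-01-02", 37), ("X", "2021-01-03", 30)],
    ["county", "date", "total_cases"],
)

w = Window.partitionBy("county").orderBy("date")

diffs = (
    df.withColumn("prev_total", F.lag("total_cases").over(w))
    # Difference vs. the previous day; the first day has no previous total.
    .withColumn("new_cases", F.col("total_cases") - F.coalesce(F.col("prev_total"), F.lit(0)))
    # If the difference is smaller than 0, report 0 for that day.
    .withColumn("new_cases", F.when(F.col("new_cases") < 0, F.lit(0)).otherwise(F.col("new_cases")))
)

diffs.show()
# County X: day 2 shows 30 new cases (37 - 7); day 3 would be -7, so it is reported as 0.
```
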
Download / pull the repo to your desired location.

You will have to create an AWS s3 user specifically for Airflow to interact with the s3 bucket. The credentials for that user will have to be saved in the s3 file found in the directory /airflow-data/creds.

Note: it might take up to 30 seconds for the containers to have the healthy flag after starting.

You can now access the Airflow web interface in your browser. If you have not changed them in the docker-compose.yml file, the default user is airflow and the password is airflow.

After signing in, the Airflow home page is the DAGs list page. Here you will see all your DAGs and the Airflow example DAGs, sorted alphabetically. Any DAG python script saved in the directory dags/ will show up on the DAGs page (e.g. the first DAG, analyze_json_data, is the one built for this project).

Note: If you update the code in the python DAG script, the Airflow DAGs page has to be refreshed.
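If you want to verify that the credentials and the bucket are wired up correctly, the sketch below shows one way they could be read and used with boto3. The relative path, the INI section/key names and the bucket name are assumptions for illustration; they are not necessarily the format this project expects.

```python
# Hypothetical sanity check: read the Airflow s3 user's credentials and list the bucket.
# The creds file layout and the bucket name below are assumed, not taken from the project.
import configparser

import boto3

config = configparser.ConfigParser()
config.read("airflow-data/creds/s3")  # assumed to be an INI-style file

session = boto3.Session(
    aws_access_key_id=config["s3"]["aws_access_key_id"],          # assumed section/key names
    aws_secret_access_key=config["s3"]["aws_secret_access_key"],  # assumed section/key names
)

s3 = session.client("s3")

# List the objects the Airflow user can see (replace the bucket name with your own).
for obj in s3.list_objects_v2(Bucket="my-covid-data-lake").get("Contents", []):
    print(obj["Key"])
```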
