Pipeline Management Made Easy with Git
February 16, 2017|Data Intelligence
Submitted to 2017 Hadoop Summit, San Jose
By Tony Philip, Big Data DevOps Engineer
Pipeline management can be a hassle. There is a plethora of pipeline tools, some with nice, pretty interfaces, but almost all are fairly complex systems that are often difficult to set up and maintain. At Cheetah Mobile we have created a process for managing our ETL pipelines with a single framework.
We call it Chronos.
Why create something new? Well, actually we didn't. We used the tools we already had in the infrastructure to keep things simple, easy to manage, and easy for someone else to pick up.
So, what is this magic tool? As you may have guessed from the title, it's Git. Almost everyone in development has dealt with or used Git at some point. It's one of the most popular source control solutions in the world today. Because of this, there is little to nothing new to learn.
Using Git we can manage our SDLC processes with code reviews, checkouts, etc.
You may use something fancy like Jenkins or something simpler like Bitbucket, but most server-side Git management systems have some process for gating, usually using code reviews and approvals. It really doesn't matter which tool you use. You may have legions of QA people poring over every line of code and testing every possibility before you elevate the code to production-ready. These are just the processes you put in front of your control system. For us, in the name of simplicity, we have a peer review and a two-person approval process. We also control deployment with the source repository pull order. First we pull to master so everyone can see. Then we pull to Dev1, where we can test on the development cluster. Finally we pull to the PROD branches. Each step requires the approvals.
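The promotion order above can be sketched with plain Git commands. This is only an illustrative dry run against a throwaway local repository: the branch names master, Dev1, and PROD are ours, but the feature branch and commits are made up, and on the server each merge would of course be gated by the review and approval steps.

```shell
# Simulate the promotion order in a throwaway local repo (branch names
# master/Dev1/PROD are ours; the feature branch and commits are made up).
set -euo pipefail
repo=$(mktemp -d); cd "$repo"
git init -q && git checkout -q -b master
g() { git -c user.email=demo@example.com -c user.name=demo "$@"; }
g commit -q --allow-empty -m "initial"
git branch Dev1 && git branch PROD

# A peer-reviewed change, ready to promote:
git checkout -q -b feature/new-etl master
g commit -q --allow-empty -m "add new ETL job"

# Step 1: pull to master so everyone can see it
git checkout -q master && g merge -q --no-ff -m "promote to master" feature/new-etl
# Step 2: pull to Dev1 and test on the development cluster
git checkout -q Dev1 && g merge -q master
# Step 3: finally, pull to the production branch
git checkout -q PROD && g merge -q Dev1
git log --oneline -1 PROD
```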
We use Git in our deployment process as well. We have created a branch within the Chronos codebase for each cluster. One client node from each cluster is chosen, and using our configuration management system we pull the branch which corresponds to that cluster. The tool used doesn't matter much here; we just need to accomplish a few tasks:
– Pull down the latest code
– Change some ownerships of the projects folders
– Copy a cron file into /etc/cron.d/
– Copy a rotate file into /etc/logrotate.d/
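Those four tasks fit in one small script. The sketch below exercises them against temporary directories standing in for /etc/cron.d and /etc/logrotate.d; the logrotate folder name, file layout, and the hive service account are assumptions, and the chown step is commented out since it needs root.

```shell
# Sketch of the four deployment tasks, run against temp dirs standing in
# for /etc/cron.d and /etc/logrotate.d (layout and names are assumptions).
set -euo pipefail

deploy_chronos() {
  local checkout=$1 branch=$2 cron_dest=$3 rotate_dest=$4 owner=$5
  # 1. Pull down the latest code for this cluster's branch
  git -C "$checkout" pull -q --ff-only origin "$branch"
  # 2. Fix ownership of the project folders (requires root on a real node)
  # chown -R "$owner:$owner" "$checkout/base/$owner"
  # 3. Copy the cluster-specific cron file into place
  install -m 644 "$checkout/cron/$branch.cron" "$cron_dest/chronos"
  # 4. Copy the logrotate file into place
  install -m 644 "$checkout/logrotate/chronos" "$rotate_dest/chronos"
}

# Demo against a throwaway upstream repo and fake /etc dirs:
src=$(mktemp -d); etc=$(mktemp -d)
mkdir -p "$src/cron" "$src/logrotate" "$etc/cron.d" "$etc/logrotate.d"
(
  cd "$src"
  git init -q && git checkout -q -b Dev1
  echo '# Dev1 pipeline jobs' > cron/Dev1.cron
  echo '# rotation rules'     > logrotate/chronos
  git add -A
  git -c user.email=demo@example.com -c user.name=demo commit -q -m "files"
)
clone=$(mktemp -d)/checkout
git clone -q -b Dev1 "$src" "$clone"
deploy_chronos "$clone" Dev1 "$etc/cron.d" "$etc/logrotate.d" hive
ls "$etc/cron.d" "$etc/logrotate.d"
```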
The Chronos codebase contains some files that need to be copied into place. The most important is the cron file with the jobs that make up our pipeline. In the Chronos codebase there is a root folder called cron. We create a unique cron file for each cluster and store it in that folder. During deployment, the system copies the cluster-specific cron file from Chronos's cron source folder into /etc/cron.d/.
During deployment we also copy the logrotate configuration file to /etc/logrotate.d/. We use this file to clean up logs created by Chronos jobs.
So now that we are using Git for deployment and code versioning, we can roll back any changes in the pipeline with a simple git reset, or patch with a simple Git commit to the cluster's branch.
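For example, a rollback on a cluster branch might look like the dry run below in a throwaway repo. The PROD branch name is ours; the job file and commits are made up, and whether you reset or revert is a team choice (revert has the advantage of keeping history intact).

```shell
# Demonstrate rolling back a bad change on a cluster branch
# (PROD is ours; the job file and commits are made up).
set -euo pipefail
repo=$(mktemp -d); cd "$repo"
git init -q && git checkout -q -b PROD
g() { git -c user.email=demo@example.com -c user.name=demo "$@"; }
echo 'good version' > job.sh; git add job.sh; g commit -q -m "working job"
echo 'bad version'  > job.sh; git add job.sh; g commit -q -m "broken change"
# Roll back: revert keeps history; git reset --hard HEAD~1 also works
# if you are comfortable rewriting the branch
g revert --no-edit HEAD
cat job.sh
```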
Now that we have covered the front-end process side of Chronos, let's dig into the framework a bit. The base structure of the codebase is just a few folders:
We will not talk about this folder much. Just know that a part of our process deploys keys and creates users on the client node automatically based on the contents of this folder.
This folder contains a cron file for each cluster under Chronos control. Using a single flat file, rather than individual crontabs or even separate files within cron.d, lets us order the jobs by dependency and time.
Before jobs go to production, they are timed, and those timings are used to space the cron jobs, giving us as much utilization of the cluster as possible without running too many or too few jobs at one time. This also allows us to scale our clusters up and down based on low- or heavy-use periods, should they exist.
Contains a logrotate configuration file. Each job in the cron file outputs a log file. These logs are stored under /var/log with the rest of the system logs. The logrotate file contained within Chronos is used to clean up any log files it produces.
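A minimal version of that file might look like the following; the log path pattern and retention settings here are assumptions for illustration, not our actual configuration.

```
/var/log/chronos_*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}
```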
Here we find the meat of the Chronos framework. Let's take a high-level pass at its core. To start, we have a folder for each service account that will be running jobs on the cluster. Each service account has its own subfolders. Looking back at the deployment process, we changed ownership of the project folders; it is by the names of these folders that we set the owner of all subfolders. So if we have a service account called hive running jobs in our pipeline, we create a folder called hive and put all its job code under that folder branch. During deployment, that folder and its subfolders will have ownership assigned based on the folder name: in this example, hive.
Under the service account folder we have a couple of folders:
The projects folder contains all the pipeline jobs that will be run by the named service account. Each job under that service account gets a unique folder. Within that folder are the scripts used to run the job, be it Scala, a jar, PySpark, Hive, etc. There are also some optional support files for interfacing with the core Chronos framework. More on those files later.
The base folder is where the Chronos framework lives. One of the problems we faced in reining in pipeline resource usage was managing our developers' use of those resources, since each part of the pipeline may be written by different individuals with different techniques. All developers tend to take as many resources as possible to complete their jobs. It's in our nature; we all want to be the fastest. After all, who wants to be the pipeline "bottleneck"?
To get around these resource issues, Chronos is used to gate some of these resources. It does this by deriving memory and CPU stats specific to the cluster at runtime, so jobs run on one cluster get different resources than on another cluster, invisible to the developer. Of course there are overrides for most settings should they be needed. Overriding variables is usually as simple as exporting specific environment variables that are picked up by Chronos at runtime.
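A sketch of how runtime derivation plus an environment-variable override might look. The CHRONOS_EXECUTORS name and the sizing arithmetic are illustrative assumptions, not the actual Chronos logic.

```shell
# Derive per-cluster defaults at runtime, but honor a developer override
# exported in the environment (names and formulas are illustrative).
set -euo pipefail
cores=$(nproc)
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
# Leave headroom for the OS, then honor any override from the environment
default_executors=$(( cores > 2 ? cores - 2 : 1 ))
executors=${CHRONOS_EXECUTORS:-$default_executors}
mem_per_executor_mb=$(( mem_kb / 1024 / (executors + 1) ))
echo "executors=$executors memory=${mem_per_executor_mb}m"
```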
Chronos provides plugins we call launchers. These launchers live under the base folder. Each job language and job type has its own launcher. For example, we have a spark_launcher to launch Scala scripts, one for Python jobs, ones for launching jars, and so on. We can create a launcher plugin to drive any job type.
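What a launcher plugin might look like, shown as a dry run that only prints the command it would execute: the spark_launcher name is ours, but the flags and defaults below are assumptions.

```shell
# A launcher receives the job artifact plus the resolved paths and wraps
# the real submit command; here it echoes instead of executing (a sketch).
set -euo pipefail
spark_launcher() {
  local job=$1 inputs=$2 output=$3; shift 3
  echo spark-submit \
    --num-executors "${CHRONOS_EXECUTORS:-4}" \
    "$job" --inputs "$inputs" --output "$output" "$@"
}
cmd=$(spark_launcher etl.jar /data/a,/data/b /data/out --date 2017-02-01)
echo "$cmd"
```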
The next and most important script in the base folder is the runme script. This script is called with parameters that are used to locate and run the job, and it appears in every cron job found in the cluster's cron file. Runme has a few functions, but first, let's look at its execution parameters.
runme <launcher> <input_folder(s)> <output_folder> <extra args>
First we specify the launcher plugin to use, followed by the input paths for the upstream data to process; this is a comma-separated list of inputs. Next comes the output path, where the data goes during computation. Last are any extra arguments the job may need.
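Putting it together, a line in a cluster's cron file might read as follows; the schedule, user, paths, and log name are all illustrative.

```
# /etc/cron.d/chronos (illustrative entry)
30 2 * * *  hive  /opt/chronos/base/runme spark_launcher /data/raw/events /data/daily/events >> /var/log/chronos_daily_events.log 2>&1
```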
Our pipeline, like most, is aggregating data over time: hours, weeks, months, etc. One of the main functions of Chronos's runme is to provide the job the next available date to process. It does this by looking at the input paths and comparing them to the dates already processed in the output path. Any dates not found in the output are then processed by the job. If multiple inputs are given, runme waits for all input folders to have the same next date to process. If they do not match, the process goes into a holding pattern, occasionally checking the inputs and waiting for them to be in sync before finally calling the launcher.
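The date-matching step can be sketched as below, assuming date-partitioned folders like /data/<name>/YYYY-MM-DD. The layout and function name are assumptions, and the real runme's waiting loop is omitted.

```shell
# Find the earliest date present in every input but not yet in the output
# (folder layout and function name are assumptions for illustration).
set -euo pipefail
next_date() {  # usage: next_date OUTPUT_DIR INPUT_DIR...
  local out=$1; shift
  local candidates
  candidates=$(ls "$1" | sort); shift
  for dir in "$@"; do
    # Intersect with each remaining input: all inputs must share the date
    candidates=$(comm -12 <(printf '%s\n' "$candidates") <(ls "$dir" | sort))
  done
  # Earliest candidate the output has not processed yet
  comm -23 <(printf '%s\n' "$candidates") <(ls "$out" | sort) | head -n 1
}

# Demo: both inputs have 2017-02-02; the output only has 2017-02-01
in1=$(mktemp -d); in2=$(mktemp -d); out=$(mktemp -d)
mkdir -p "$in1"/2017-02-0{1,2,3} "$in2"/2017-02-0{1,2} "$out"/2017-02-01
next_date "$out" "$in1" "$in2"   # prints 2017-02-02
```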
Lastly, some additional optional files are provided for use within the project folders themselves.
First, any bash script files Chronos finds in the project folder are executed before the launcher is called. Some of these additional files might look like 1Delete-Tmp-Folder-Data.sh, 2Validate_System_Resource_Availability.sh, etc. Numbering the file names allows Chronos to run them in order when there are dependencies.
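That ordered execution can be sketched like this. The discovery glob is an assumption about how Chronos finds the scripts; note also that plain lexical ordering would put 10... before 2..., so zero-padding names is a safe habit.

```shell
# Run numbered pre-scripts in lexical order before launching the job
# (the glob pattern is an assumption about how Chronos discovers them).
set -euo pipefail
project_dir=$(mktemp -d)
printf 'echo cleaned\n'   > "$project_dir/1Delete-Tmp-Folder-Data.sh"
printf 'echo validated\n' > "$project_dir/2Validate_System_Resource_Availability.sh"
order=""
for script in "$project_dir"/[0-9]*.sh; do
  order="$order$(bash "$script") "
done
echo "$order"   # prints: cleaned validated
```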
Next we provide a resource file to be sourced before the launcher is called. This file, called deployed_rc, contains any extra variables we may want exported for the job, or for overriding general launcher defaults, like the number of executors, classes, packages, and even the Spark version used.
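An illustrative deployed_rc: the file name is ours, but every variable name below is an assumption about what a launcher might read, not a documented Chronos setting.

```shell
# deployed_rc -- sourced before the launcher runs; all variable names here
# are illustrative assumptions, not documented Chronos settings.
export CHRONOS_EXECUTORS=8
export CHRONOS_SPARK_VERSION=2.1.0
export CHRONOS_EXTRA_PACKAGES="com.databricks:spark-csv_2.11:1.5.0"
export CHRONOS_MAIN_CLASS=com.example.etl.DailyAggregate
```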
Last, we provide a file called finally. "Finally" is just that: it runs after the job has finished executing. An example use of "finally" may be cleaning up old data or moving data from HDFS to S3.
In conclusion, it's often important to consider the operability of a pipeline, especially more complex pipelines with many ETLs and sources. Keeping it simple reduces complexity, which in turn reduces time spent troubleshooting and reduces operations costs. Managing runtimes and resource usage maximizes the value returned from your clusters.
In summary, I think the takeaway here is process. We are really only using two tools: Git and cron. Those two technologies live in almost all Linux environments and are most likely already being used. It's really the process that is the magic behind Chronos: the steps and gates it helps define and enforce. Remember, simple is almost always better than complex.