At Haptik, most of our backend stack revolves around Python. We use the Django framework to help us get started and build scalable applications on the go. Python, being an interpreted language, reduces the time between writing code and deploying the code in production.
Python has become one of the most popular coding languages out there. Here are a few reasons why:
- Easy to understand
- Millions of application packages via PIP library
- Highly Scalable and performant
Reasons for migrating to Python3:
- Haptik’s backend systems (1 of them) were running on Python2 and we needed to make a move to Python3 since Python2 was getting obsolete. We did not want to delay this any further.
- Python3 does not only offer a lot of bug fixes, better performance & security but now helps use Python the right way with a lot of changes around the same.
- Also, Python2 support is supposed to be dropped by the end of 2019.
Instagram was arguably our role model for this process. This is how they went about it.
We had this migration on our minds for some time and wanted to work towards implementing it in the first quarter of this year. We took up-gradation to Python3 as an OKR for the Jan-March quarter and this blog series will be centered around the same.
The entire process of up-gradation took approximately 3 months. Our applications are highly complex since this backend handles all the chatbot information, interface user message and response systems.
One important thing to note is that we had already containerized all our applications back in 2018, so right now all we needed to worry about was application code.
We divided the entire task into 7 steps/phases:
- Identify Python3 Compatible code (January 16-20th, 2019)
- Handling new features development
- Converting older code to being Python3 compatible
- Add new and update older test Cases
- Knowledge Sharing Session (March 2019)
- Testing (Side by side)
- Deployment (April 12th, 1 am)
At Haptik, we do in-depth research before starting any project to uncover as many unknowns as possible and prevent delays. This project was a great example of the benefits of spending time up front to do research rather than directly diving deep into the execution. The plan that you see below was put in place in January. The research was done upfront and we delivered the project as per the timeline set!
Identify Python3 Compatible code
- The best way to start with any up-gradation process is identifying how much of the old code is compatible and need not change. There are n number of packages available to do that. Of course, they are not always 100% correct, but that’s a good enough spot to start with.
- One such Pip package is caniusepython3 and another one is 2to3 & modernize. These helped us take note of code we would need to change and packages we would need to update if we were to move to Python3.
- caniusepython3 -r requirements.txt (to check if the packages in requirements.txt have support for Python3 or not)
Continuous Development (not stopping new features development)
- We did not want to stop the development of new features etc. so we had to figure out a way of letting people add new code while trying to reduce the amount of double work required when we would be switching our application to Python3.
- To start with writing new code in Python3 it was necessary to figure out what are the best practices and changes in Python3.
Though we have been using Python3 internally for a lot of our applications already, most of them were not this complex. Our main Haptik Backend is sort of a monolith right now which increases complexity for such changes.
- Also, we could not stop people from coding and pushing changes to the repository while the up-gradation process was going on.
- We setup linters when PRs were created (Lint Review for Python3 for this application) which helped us know if anyone’s PR had incompatible code. That way we forced people to write new code in Python3 compatible format only.
- 2 code reviewers check was enforced on GitHub.
Converting Old Code to Python3
Before we could even start with the conversion of old code there a few things we realized with our codebase:
- Developers who had worked on the features were no more working with Haptik, so documentation was the only way forward.
- How to go about setting owners for each piece of code that needed to be converted.
- Converting old code while the new code is getting written.
The two main packages which helped us get there were:
Python modernize – This library is a very thin wrapper around lib2to3 to utilize it to make Python 2 code more modern with the intention of eventually porting it over to Python 3.
Python six – Six is a Python 2 and 3 compatibility library. It provides utility functions for smoothing over the differences between the Python versions with the goal of writing Python code that is compatible with both Python versions.
We used Python modernize to find all the files which do not have python 3 compatible code. After we have seen those files, then we will decide which one of these are obvious fixes and compatible with both python 2 and python 3 without six. Then we will push these fixes. After this, we will find all the obvious fixes which require the six Library. For these, we will fix them one modernize rule at a time (e.g. except). For those rules which affect lots of files, we will fix them application-wise e.g. user, chats, ml_pipeline, etc.
During this phase, we also found some more code that was not compatible with python 3. We fixed that too.
Once we determined which packages do not have python 3 support, then we either upgraded the package version or found a replacement. If this required code changes, we did that on a case to case and priority basis.
We started doing GitHub Tag-Based releases to keep track of changes being made to the repositories.
One key task to all of this was to make sure we update the existing test cases and write new test cases which were Python3 compatible.
We were still working on improving our coverage, so we made a decision that since we can’t cover all the lines of code, we would cover the most important modules of our application.
10% of the most core/important modules to be covered with test cases. Luckily for us, our Machine Learning team had taken it as one of their OKRs, which helped us to complete this in parallel.
Knowledge Sharing Session
Not working in Silos is what we promote and prefer at Haptik. Back in January, we decided that by the time all the research and basic fixes are done, we would have a knowledge sharing session for the same.
So, it was time to share the plan outlined below with the Engineering team, along with the release plan:
- Best practices for writing new code which is Python3 compatible
- Changes being made for Python2 to Python3 and how that would affect future development activities
- Standard logging pattern
- Final Release/Deployment plan
- Steps to move development environments to Python 3
A lot of testing went into making our Python3 release possible in a seamless manner. We carried out multiple phases of testing. Listing out a few of these below:
- API testing using Postman
- Bot testing – A lot of bots to be tested. Haptik Bot testing Tool came to the rescue and helped us check if all the bots are working fine or not.
- New Python3 based environment
- End-to-end QA
- Load testing with locust + Update load test scripts
Once we started writing Python3 compatible code, we started to push the code in small bits and pieces into our develop branch (we follow git-flow). The idea was to test this code and check if we are implementing the changes in the right manner or not.
The APIs we updated was tested using Postman and we checked if we were getting the right response body, headers, etc. We tried to test as many different scenarios as possible. If anything wrong went into production, it would mean a huge downtime, at least for the public-facing APIs.
Another thing that helped us achieve this feat quickly was us having our entire application code & systems infrastructure as code (IAC). We could launch the entire Haptik system in a new AWS account, change the containers to be running on Python3 base images and do end-end QA on the entire setup. We used Terraform for the same and the templates were around for a while now.
The last couple of weeks were supposed to be very crucial. We planned to go live entirely on Python3.
Release weeks. We also wanted to do a ZERO downtime deployment.
1. Code Freeze & Sanity
- New code changes were stopped a few days before the release. We created a release version tag for all code in master as that would be the last working Python2 version code.
- All developers were asked to check their features and code and if everything was working fine in UAT. Post that, code would be merged into release & then into master.
(As of today, UAT runs on the release branch, but back then it was running on develop)
- Try to improve code coverage even further
- If there are any fixes, push those to develop on priority
MORE TIME WE SPEND HERE, THE LESSER WE NEED FOR PRODUCTION RELEASE
2. Switch UAT to Python3
April 5th, 11 am
- All Python3 transition changes to be merged into develop branch
- Switch Python3 Docker container image, Python3 supervisor, workers, etc. – This will mean that switch will happen in UAT at this point to Python3.
3. Python3-Day (D-day)
April 11th – 11 pm
- All clients were informed about this Major release
- We asked all engineers need to be available and asked at least 1 person from every team to be in the office from where the release was happening. If anything breaks, we would have someone who knows about that feature.
Things we did just before the release:
- Backup all Databases: We backed up each and every data store. To reduce the time we already took a manual backup every hour during the day so that before the release, the difference is less and the backup time is also less.
- In case of failure, we still had a way to revert code immediately to the previous release GitHub tag.
- Enable Blockers – We had a system in place which helped us put banners on user-facing interfaces which told the users that from x to y duration, the system might respond slowly, or the system would be under maintenance. That helped us with the Python3 release as well. It acted as a shield if anything went wrong.
Release begins (Blue-green & Zero Downtime)
- The first step was to reduce the number of nodes for all the microservices
- Next step was to make changes to the Django migration files on our Cron VMs (we have a central system to store migration files for our Django applications) and deploy code (along with Docker container image changes)
- The next step was to deploy code to each of the set of ECS services one by one using our Continuous Deployment pipeline. (Read more here.)
- Slowly, we brought up the platform step by step.
- Checked errors on NewRelic, if any.
- Ran a simple load test to send messages to the Pipeline to check if everything was working normally.
- Enable endpoints for the clients one at a time. (We have a system where we have different DNS entries for different clients we have on the platform, it helps in situations like these).
And it’s a WRAP! Python 3 migration is complete.
All teams being available was a boost for us to do the release even more freely.
We are on Python3. The release went as smooth as it can get. We started the release at 1 am in the night and wrapped it up by 5:00 am with all sanity and testing.
Things to be careful about while migrating in the next blog.
- Performance improvements
- Support for threading & better scalability for millions and millions of users
- Security improvements
- Zero Errors shipped into Production
- Breakdown our main monolith service just like other microservices we have
- Go the Serverless way
More details in the following blog. We will also come up with an e-book around this with more stats etc.