Wallet Scalability

Jun 9, 2023

By Gaurav Arora — Engineering Team, Hike

Back when we started building Rush in November 2020, our approach was to quickly hack together the required backend services without worrying about long-term requirements, the right architecture, or code quality. Our goal was to build a real-money gaming application in about a month. With a small team of three backend developers and two client developers, this was the only way we could launch an entire application in such a short time frame.

As part of building the system this way, we also reused a few systems and pieces of code that already existed at Hike and modified them for our new use case. One such system was the "Wallet", which today forms the backbone of the entire Rush application, handling hundreds of transactions every second. The wallet had originally been built for internal purposes; it was never designed to handle significant user traffic and was therefore never optimized for scale.

The modified wallet service worked well, and so did our gaming app. The business was scaling new heights every week, and with it came an ever-increasing workload of new features, which meant we couldn't focus on fixing the tech debt and scalability issues we had inherited. This went well until the wallet suffered a major outage that continued intermittently for five days.

Challenge

In April 2021, we suffered our first and (so far) only big downtime: inserts into the wallet DB's transactions table started taking a considerably long time. This had a domino effect, and soon all incoming requests were waiting for a free connection from the pool and eventually timing out. Something had to be done immediately, as this caused a complete outage of our application, which meant 100% business loss.

Immediate Solution

Since the inserts into one particular table (Transactions) were taking considerably longer than expected, and we knew the indexing on this table was not appropriate, we decided to create a new transactions table. This gave us a table free of the inefficiencies of bad indexing and without tens of millions of existing records.

This strategy worked, and the system was instantly back up and running. However, our users could no longer see their historical transactions. To limit the impact, we backfilled the last 15 days' worth of older transactions. This solved most of the problems, and the service ran well for more than 24 hours.

The next day, we started facing the same issues again and finally concluded that the existing system would not be able to handle the current traffic. So we stopped all paid user acquisition, marketing, notifications, etc., to reduce traffic until we found a solution.

Then came the marathon all-nighter, where a bunch of us went through the code looking for inefficiencies that could be resolved quickly, and we found a few problems with the way the system was architected. There were multiple "SELECT … FOR UPDATE" statements that filtered on a non-unique index. Under InnoDB's default REPEATABLE READ isolation level, such a statement acquires a gap (next-key) lock on the index range instead of locking a single record, as it would with a unique index. This meant that each "SELECT … FOR UPDATE" locked a portion of the index space, preventing subsequent inserts until the lock was released.
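For illustration, the kind of pattern described above can be reproduced with something like the sketch below; the table and column names are hypothetical, and pymysql simply stands in for whichever MySQL client the service actually uses.

```python
# Hypothetical example of the pattern that triggers gap locks in InnoDB.
import pymysql

conn = pymysql.connect(host="wallet-db", user="app", password="secret",
                       database="wallet", autocommit=False)

def process_transaction(txn_id: str) -> None:
    with conn.cursor() as cur:
        # txn_id is assumed to be indexed but NOT unique, so under the default
        # REPEATABLE READ isolation this statement takes a next-key (gap) lock
        # on the index range, and concurrent INSERTs into that range must wait.
        cur.execute(
            "SELECT id, status FROM transactions WHERE txn_id = %s FOR UPDATE",
            (txn_id,),
        )
        cur.fetchall()
        # ... business logic, followed by INSERTs into the same table ...
    conn.commit()
```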

A quick fix would have been to change the transaction isolation level from the default REPEATABLE READ to READ COMMITTED. However, we weren't confident this wouldn't break anything else, so we decided against it and instead came up with a different approach that eliminated all "SELECT … FOR UPDATE" statements.

We introduced Redis-based locks right at the entry point of our service to ensure that only a single request is processed for a given transactionId at any point in time. This, along with a few other fixes, meant we no longer needed any "SELECT … FOR UPDATE" statements, and thus no gap locks were acquired.
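Below is a minimal sketch of what such a lock can look like, assuming redis-py and a hypothetical key naming scheme; the post doesn't show the actual service code, so the TTL, key format and error handling here are illustrative only.

```python
# Hypothetical sketch of a per-transactionId lock using redis-py.
import uuid
import redis

r = redis.Redis(host="localhost", port=6379)

# Lua script: delete the lock only if we still own it, so a lock that expired
# and was taken over by another request is never released by us.
RELEASE = r.register_script("""
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
""")

def with_transaction_lock(transaction_id: str, handler, ttl_ms: int = 5000):
    key = f"wallet:lock:{transaction_id}"   # assumed key scheme
    token = str(uuid.uuid4())
    # SET NX PX: acquire only if no other request currently holds the lock.
    if not r.set(key, token, nx=True, px=ttl_ms):
        raise RuntimeError(f"transaction {transaction_id} is already being processed")
    try:
        return handler()                    # the actual wallet operation
    finally:
        RELEASE(keys=[key], args=[token])
```

With the lock guaranteeing a single in-flight request per transactionId, the row-level locking that "SELECT … FOR UPDATE" used to provide is no longer needed inside the database transaction.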

This fix immediately resolved the scale challenge: our DB insert latency dropped back to milliseconds, and we were able to handle the entire traffic seamlessly.

Long Term Solution

We ran multiple load tests to be fully confident of the scale we could handle without fundamentally changing the wallet architecture. At the rate we were growing, however, this would only cover the next 6–9 months.

So, instead of redesigning the system from scratch, we decided to migrate from a single database to a sharded database, gaining significantly higher scalability by splitting DB transactions across multiple database machines.

When you design a new system with a sharded database in mind, the biggest concern is choosing the right shard key based on all the query patterns you will need. Sharding an existing database that wasn't designed with this in mind is trickier, because many use cases may have to change according to the chosen shard key. Another hurdle when sharding the database of a running system is migrating from a single database to a sharded one with no downtime while maintaining data sanctity and consistency.

So, we chose a shard key that required minimal changes to the existing system while still covering all of our use cases. To avoid hot-spotting (a few shards receiving a disproportionate share of the traffic), we take the MurmurHash of the shard key to decide the final shard, which gave us a very even distribution in our simulations.
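A sketch of the shard-routing logic, assuming the mmh3 Python package and the 32 logical shards described later in this post; the actual shard key and implementation are not detailed here.

```python
# Hypothetical shard routing: MurmurHash of the shard key, modulo shard count.
import mmh3

NUM_SHARDS = 32  # logical shards (see the cost section below)

def shard_for(shard_key: str) -> int:
    # mmh3.hash returns a signed 32-bit integer; Python's % with a positive
    # modulus always yields a value in [0, NUM_SHARDS), so no abs() is needed.
    return mmh3.hash(shard_key) % NUM_SHARDS
```

Because the hash distributes keys uniformly, sequential or clustered ids do not all land on the same shard, which is what avoids hot-spotting.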

Once our code changes to support sharding were in place came the big task of migrating the data without any downtime. We considered multiple strategies and finalized the following:

  1. Setup the sharded DB cluster.
  2. Make code-level changes to ensure writes after a particular timestamp (t1) are happening in both clusters (single DB and sharded DB).
  3. Backfill all the historical data from start until timestamp t1 into the sharded cluster.
  4. Validate the entire data in each shard and compare it with the data available in the single DB.
  5. Once everything is successfully validated, deploy code changes to start reading from the sharded database cluster instead of the single DB, and let writes continue to happen on both clusters.
  6. Let the system run like this for 3–4 days and if everything is stable, deploy code changes to remove writes from the single DB.

One point to note: in step 2, when we enabled dual writes, they increased latency significantly, which was not acceptable. So we switched to writing to both DB clusters in parallel. Parallel writes had their own challenges, as any write failure in one DB cluster meant the corresponding transaction on the other cluster had to be rolled back.
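A rough sketch of the parallel dual-write step, with hypothetical write and rollback callables standing in for the real single-DB and sharded-DB repositories:

```python
# Hypothetical parallel dual-write: write to both clusters concurrently and
# compensate on the cluster whose write succeeded if the other one failed.
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=2)

def dual_write(txn, write_single, write_sharded, rollback_single, rollback_sharded):
    f_single = _pool.submit(write_single, txn)
    f_sharded = _pool.submit(write_sharded, txn)
    err_single = f_single.exception()    # blocks until the write finishes
    err_sharded = f_sharded.exception()

    if err_single and not err_sharded:
        rollback_sharded(txn)            # undo the write that did succeed
    elif err_sharded and not err_single:
        rollback_single(txn)

    if err_single or err_sharded:
        raise RuntimeError("dual write failed") from (err_single or err_sharded)
    return f_single.result()
```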

This strategy ensured we faced absolutely no downtime while migrating from the single DB to the sharded DB cluster, while giving each stage sufficient time to check data validity and system stability. It also gave us a mechanism to switch back to the single-DB system until the very end in case any issues were observed.

Future Scalability and Cost Considerations

While sharding the data, we decided on 32 shards, which would normally mean the DB machine cost becomes 32 times what it was earlier. Given our current scale, however, we won't need that many shards for the next 2–3 years, so why incur the cost of 32 machines from day one? At the same time, changing the number of shards later comes with its own challenges. So we kept 32 logical shards but used only 8 DB machines to host them, with each machine holding 4 databases, each representing one shard.
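A sketch of how 32 logical shards can be mapped onto 8 physical machines; the host names are made up, and the exact mapping scheme used in production is not described in the post.

```python
# Hypothetical mapping of logical shards to physical machines: 8 hosts,
# 4 databases per host, one database per logical shard.
NUM_SHARDS = 32
HOSTS = [f"wallet-db-{i}.internal" for i in range(8)]   # assumed host names

def physical_location(logical_shard: int) -> tuple[str, str]:
    shards_per_host = NUM_SHARDS // len(HOSTS)           # currently 4
    host = HOSTS[logical_shard // shards_per_host]       # shards 0-3 -> host 0, ...
    database = f"wallet_shard_{logical_shard}"
    return host, database
```

Expanding later to 32 machines only requires moving databases to new hosts and updating this mapping; the shard count, and therefore the data layout, stays the same.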

This way, we incur the cost of running only 8 DB machines while keeping the option to expand to 32 machines with very little operational work. This takes care of scaling the system further when required while keeping our current costs in check.

Another area that needs attention is the transactions table, which is ever-growing and grows significantly faster than the table containing user wallets (since each user keeps making transactions over their lifetime). Currently, the transactions table is, like the other tables, sharded on the basis of user id/wallet id. In the future, it will need to be sharded on user id/wallet id plus a time range. In simpler terms, transactions will still be sharded on user id/wallet id, but they will live in different database clusters depending on the time range: for example, transactions from 2021 could live in cluster 1, transactions from 2022 in cluster 2, and so on.
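A sketch of what this future routing could look like, combining the hash-based shard with a year-based cluster split; the cluster names and the yearly granularity are illustrative assumptions.

```python
# Hypothetical routing for the transactions table: wallet id picks the shard,
# the transaction's timestamp picks the cluster.
from datetime import datetime
import mmh3

NUM_SHARDS = 32
YEAR_TO_CLUSTER = {2021: "txn-cluster-1", 2022: "txn-cluster-2", 2023: "txn-cluster-3"}

def transaction_location(wallet_id: str, created_at: datetime) -> tuple[str, int]:
    cluster = YEAR_TO_CLUSTER[created_at.year]          # time-range split
    shard = mmh3.hash(wallet_id) % NUM_SHARDS           # same MurmurHash routing
    return cluster, shard
```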

These future iterations should give our wallet system long-term scalability, and we don't foresee any scalability challenges for the next few years.
