I was troubleshooting a production issue last week. The issue stated that some shared files weren’t showing up in our application even though they had been shared multiple times.
It seemed like a simple enough bug to triage.
I went to CloudWatch to peek at the logs for the lambda that was responsible for sharing files. But there were no errors. Everything seemed to be succeeding without issue.
So I took a look at some of the “missing” files in DynamoDB to see if they were in a weird status. Once again, everything seemed to be in place. It made sense that there weren’t any errors in the logs; the data all looked right.
I tried to reproduce it myself. I started over from scratch, and everything showed up fine for me. So I went to the area reported in the bug and, sure enough, files were missing.
What the heck was going on?
I spent a lot of time looking at the same 100 lines of code in that lambda file. Just staring at it. Walking through it in my head trying to figure out where it could possibly be going wrong.
Then I hit a turning point. I compared the area with the reported issue to the one I tried to reproduce and something stuck out to me. I had only shared 7 or 8 files, but the one reported in the issue had shared 2,000. This had to be a data size issue.
Back in the code, I saw that when we query for data we load everything: all files, related entities, the works. But we weren’t paginating. DynamoDB can return a maximum of 1 MB of data in a single query. This was my problem.
When DynamoDB has more results than the 1 MB limit, it returns a LastEvaluatedKey in the response so you can run subsequent queries starting where it left off.
The code was ignoring that. It never occurred to me that a user would have this much data in production. So I took a step back to think about what we had built and where it went wrong.
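Handling that key is just a loop that keeps querying until DynamoDB stops returning one. Here’s a minimal sketch using the AWS SDK for JavaScript v3; the table name and key attributes are placeholders, not our actual schema.

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Keep querying until DynamoDB stops returning a LastEvaluatedKey.
// "shared-files" and the pk attribute are placeholder names.
async function loadAllSharedFiles(ownerId: string) {
  const items: Record<string, any>[] = [];
  let lastEvaluatedKey: Record<string, any> | undefined;

  do {
    const page = await ddb.send(
      new QueryCommand({
        TableName: "shared-files",
        KeyConditionExpression: "pk = :pk",
        ExpressionAttributeValues: { ":pk": ownerId },
        ExclusiveStartKey: lastEvaluatedKey, // pick up where the last page left off
      })
    );

    items.push(...(page.Items ?? []));
    lastEvaluatedKey = page.LastEvaluatedKey;
  } while (lastEvaluatedKey);

  return items;
}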
It all came down to poor REST API design and bad NoSQL data modeling.
A great aspect of REST is that it allows intuitive drill-down into your entities. If an entity has children, you should have an endpoint to load the entity and an endpoint to load the children. If the children have children, you’d have an endpoint to load the children’s children (and so on). Let’s say we had the entity model described below:
Entity Relationship Diagram (ERD) for a neighborhood
This view shows a four-level hierarchy of entities, with neighborhood at the top and two child entities: property and HOA (Homeowners Association). When it comes to RESTful API design, the endpoints would be structured like this:
/neighborhoods/{neighborhoodId}
/neighborhoods/{neighborhoodId}/properties
/neighborhoods/{neighborhoodId}/hoa
The endpoint structure starts at the top of the hierarchy and adds a path segment for each layer you traverse.
In our next layer, the property entity has two child entities, trees and buildings. We would structure these to be:
/neighborhoods/{neighborhoodId}/properties/{propertyId}/trees
/neighborhoods/{neighborhoodId}/properties/{propertyId}/buildings
We can see that buildings has a child entity named rooms. So we’d structure those endpoints as:
/neighborhoods/{neighborhoodId}/properties/{propertyId}/buildings/{buildingId}/rooms
/neighborhoods/{neighborhoodId}/properties/{propertyId}/buildings/{buildingId}/rooms/{roomId}
Each one of the endpoints listed above would return the last entity listed in the url (assuming these were all GET endpoints).
/neighborhoods/{neighborhoodId}
returns data about the neighborhood entity
/neighborhoods/{neighborhoodId}/properties/{propertyId}
returns data about a specific property entity
/neighborhoods/{neighborhoodId}/properties/{propertyId}/buildings/{buildingId}/rooms
returns a list of all rooms in a specific building
Structuring your data and endpoints in this fashion is not only an industry-standard way to implement REST, it’s also a way to help you identify and design your NoSQL access patterns.
If you made the /neighborhoods/{neighborhoodId} endpoint return details on the neighborhood, properties, HOA, trees, buildings, and rooms, you would have a massive API call that drastically over-fetches data and costs you way too much money in DynamoDB calls.
If you adhere to REST standards, you should always have an easy way to get a list of entities and get a single entity by id.
With DynamoDB, you can take that same drill-down approach with composite keys. This means you can structure your hash and range keys to include multiple entities so you can query for lists or get a single entity.
For example, I might structure my hash and range key like this for a property:
pk: `${neighborhoodId}#${propertyId}`,
sk: `metadata`
And I could overload the sort/range key like this:
pk: `${neighborhoodId}#${propertyId}`,
sk: `building#${buildingId}`
So if I wanted to get details on the property, I could do a GetItem with the first pk/sk combo. If I wanted to get a list of buildings on that property, I could use the same pk but do a Query where begins_with(sk, 'building#'). If I wanted to get details about a specific building, I could do a GetItem with the pk/sk for the building.
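As a sketch (using the AWS SDK for JavaScript v3, a placeholder table name, and the IDs assumed to already be in scope), those three access patterns might look like this:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE_NAME = "neighborhoods"; // placeholder table name
declare const neighborhoodId: string, propertyId: string, buildingId: string;

// Get details on a single property (pk/sk point at its metadata item).
const property = await ddb.send(
  new GetCommand({
    TableName: TABLE_NAME,
    Key: { pk: `${neighborhoodId}#${propertyId}`, sk: "metadata" },
  })
);

// List the buildings on that property by querying the overloaded sort key.
const buildings = await ddb.send(
  new QueryCommand({
    TableName: TABLE_NAME,
    KeyConditionExpression: "pk = :pk AND begins_with(sk, :prefix)",
    ExpressionAttributeValues: {
      ":pk": `${neighborhoodId}#${propertyId}`,
      ":prefix": "building#",
    },
  })
);

// Get details on one specific building.
const building = await ddb.send(
  new GetCommand({
    TableName: TABLE_NAME,
    Key: { pk: `${neighborhoodId}#${propertyId}`, sk: `building#${buildingId}` },
  })
);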
What you want to avoid is building a structure that requires a FilterExpression to find the data you want. A FilterExpression is applied after the items are read, which means you get charged for all of the Read Capacity Units (RCUs) the query consumes, not just the items that pass the filter.
Serverless is all about paying for what you use, and you don’t want to pay for reads you’re never going to use. If you find yourself overusing FilterExpression, take a step back and see if you can approach your data model differently.
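For contrast, this is the shape of query you want to avoid (reusing the document client and table from the sketch above; fileStatus is a made-up attribute). The key condition matches the whole partition, so you pay RCUs for every item read, even though the filter throws most of them away.

// Anti-pattern: reads every item under the partition key, then filters.
const shared = await ddb.send(
  new QueryCommand({
    TableName: TABLE_NAME,
    KeyConditionExpression: "pk = :pk",
    FilterExpression: "fileStatus = :status", // applied AFTER the read, so you still pay for everything
    ExpressionAttributeValues: {
      ":pk": `${neighborhoodId}#${propertyId}`,
      ":status": "SHARED",
    },
  })
);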
The bug I was working on ran into this. We had an improperly designed data model around the shared files, which forced us into a combination of FilterExpressions and post-query filtering. It was some serious over-fetching!
Because of this, we quickly hit the issue (overlooked for so long) where DynamoDB was returning a LastEvaluatedKey to tell us there was more data to load. But because we hadn’t anticipated data of this magnitude, our code never checked for it in the response. We didn’t process all of the entities, which gave the appearance that files were missing.
All of this could have been avoided if we had designed the data model correctly.
With proper REST design, your endpoints should be a string of nouns. They offer an easy way to get at a hierarchy of elements. But what about scenarios where your application is going to be updating hundreds (or thousands) of entities at the same time?
You could require a single API call for each entity update. Serverless APIs can absolutely scale to meet that demand. But should they? If a single user takes an action that results in 1,000 API calls, what happens when 10 users do it at the same time? What about 100?
You will quickly run into some service limits with concurrent lambda executions. You can always increase those limits, but again, should you?
It’s ok to break the rules sometimes. Don’t be a fundamentalist.
In scenarios like this, where a single user is trying to update hundreds or thousands of entities at the same time, it might be time for a batch action endpoint. Whether a batch action endpoint is RESTful is debatable, but in some situations it’s absolutely necessary.
If we’re trying to update 1,000 entities, it might be better to make 10 calls of 100 instead of 1,000 calls of 1. This keeps your concurrent lambda count down and helps avoid bottlenecks.
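Here’s a minimal client-side sketch of that idea, assuming a hypothetical POST /files/batch-share endpoint that accepts up to 100 file IDs per request:

const BATCH_SIZE = 100;

// Split the full list of file IDs into chunks and make one API call per chunk
// instead of one call per file. The endpoint and payload shape are hypothetical.
async function shareFilesInBatches(fileIds: string[]): Promise<void> {
  for (let i = 0; i < fileIds.length; i += BATCH_SIZE) {
    const batch = fileIds.slice(i, i + BATCH_SIZE);

    await fetch("https://api.example.com/files/batch-share", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ fileIds: batch }),
    });
  }
}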
For high throughput endpoints, another alternative would be to integrate directly from API Gateway to SQS so you can control batch sizes and limit concurrent executions of the lambda function that is reading from the queue.
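One way to sketch that wiring with the AWS CDK (construct names are illustrative, maxConcurrency requires a fairly recent CDK version, and the API Gateway-to-SQS integration itself is omitted here):

import { Duration } from "aws-cdk-lib";
import { Construct } from "constructs";
import { Queue } from "aws-cdk-lib/aws-sqs";
import { Function as LambdaFunction } from "aws-cdk-lib/aws-lambda";
import { SqsEventSource } from "aws-cdk-lib/aws-lambda-event-sources";

// Assumed to exist in your stack: a construct scope and the consuming function.
declare const scope: Construct;
declare const shareFilesFunction: LambdaFunction;

// The queue buffers share requests coming from API Gateway.
const shareQueue = new Queue(scope, "ShareFilesQueue", {
  visibilityTimeout: Duration.minutes(5),
});

// Read messages in batches of 10 and cap how many lambda instances can run
// against this queue at once.
shareFilesFunction.addEventSource(
  new SqsEventSource(shareQueue, {
    batchSize: 10,
    maxConcurrency: 5,
  })
);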
If you find yourself not knowing the access patterns or high traffic endpoints when you begin your project, that’s ok! Work with what you know and what you anticipate to be the primary use case.
There are ways to get insight before you’re done. Alberto Savoia’s book “The Right It” talks about strategies you can use to mitigate risk and do your design work up front.
In the event that you’ve already built your software and you’re beginning to run into issues like I was, it’s never too late to fix the problem. You can build new endpoints, modify existing ones, or deprecate some that should no longer be used. Just be sure you aren’t creating breaking changes if your app is in production.
You can redesign a data model. Migrating between serverless data models comes in five steps.
This approach applies specifically to applications practicing CI/CD: you have to make sure your migration succeeds without any downtime as you make your changes piece by piece.
Use metrics to determine your most-called endpoints. Are any running with significantly more concurrent executions than others? Is the same user making a significant number of calls? Your lambda function metrics can give you incredibly valuable insights and help you pick out the outliers and candidates for a batch action.
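As a hedged sketch, here’s one way to pull the peak concurrent executions per function with the AWS SDK for JavaScript v3 so you can compare functions side by side (the function name in the usage example is a placeholder):

import { CloudWatchClient, GetMetricStatisticsCommand } from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({});

// Peak concurrent executions for one function over the last 24 hours,
// one data point per hour.
async function peakConcurrency(functionName: string): Promise<number> {
  const now = new Date();
  const response = await cloudwatch.send(
    new GetMetricStatisticsCommand({
      Namespace: "AWS/Lambda",
      MetricName: "ConcurrentExecutions",
      Dimensions: [{ Name: "FunctionName", Value: functionName }],
      StartTime: new Date(now.getTime() - 24 * 60 * 60 * 1000),
      EndTime: now,
      Period: 3600,
      Statistics: ["Maximum"],
    })
  );

  const maximums = (response.Datapoints ?? []).map((d) => d.Maximum ?? 0);
  return maximums.length ? Math.max(...maximums) : 0;
}

// Example usage: const peak = await peakConcurrency("share-files-function");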
Software, especially serverless software, should be in a constant state of refactoring. You gain insights into how your application is being used every day, you add features, you debug problems. Every one of these is an opportunity to improve your application.
You will never know less than you do right now.
Take the insights to improve your data models, optimize for performance and cost, and give your users a delightful experience.
Happy coding!