I've always considered myself a Java guy. I really liked C# back in the day when I actually used it, but I haven't for more than 3-4 years, so I can't say I'm a .NET guy anymore.
Anything serious I have to develop, I think Java first, but lately I've found myself changing course a little bit.[1]
Python is so easy to use; the barrier to start is so low; what needs to be done to have a full working Python script is so little that it's just hard to justify not doing it in Python.
If I open my command line interface, type python and hit enter, I can start testing theories and trying out code right away without anything else. No IDE, no "New program...", no required main class, nothing.
The language itself is clean, easy to understand, and you can use any editor you have available. No Eclipse, no NetBeans, no IntelliJ, and not 5GB of Visual Studio for sure.
Like Java, there are a bunch of people writing awesome (open source) libraries for Python. Like Java, you can run your Python code everywhere. Like Java, Python is fully supported in Google App Engine (and I'm in love with it!).
But of course, this doesn't mean I don't love Java. These two are both great languages and I'm not planning to ditch Java anytime soon. Python however is winning the lightweight title with me. More complicated stuff? We'll have to wait and see.
(If you are starting out in a development career, I seriously recommend you to look at Python. You'll love it and it will teach you all the fundamentals necessary to step up to more serious stuff later. And of course you can always buy this book, love Python forever, and never ever move on.)
[1] Of the last 4 serious projects I've coded, I used Python three times and Java once (and Java only because it was an Android exercise for my Master's class).
This is a very common problem everyone new to the Datastore faces sooner rather than later: you've got a specific entity kind and you want to remove multiple entries from it. Unfortunately there's no DELETE FROM Entity instruction, so the problem is a little more complicated than it seems.
There are multiple options that you'll need to evaluate to make sure you choose the most appropriate approach. Below, I'll try to explore each of the possible scenarios/solutions to give you enough options to choose from:
Do you really need to delete those entries?
First of all, I want you to ask yourself if you really need to delete the information. It might be cheaper to just keep it. It might be cheaper to soft-delete it by adding an "Archived" field instead. It all depends on multiple factors, but you should spend some time thinking about it.
Deleting is expensive, especially if you are deleting entries represented in multiple indexes. The cheapest way to delete something is by leaving it alone.
So don't jump to conclusions right away, and try to think about what would happen if you decide not to remove the entries at all.
Special case: Deleting all entries by hand
Before getting into more complicated procedures, if all you want to do is remove all existing entries by hand, you can use the Datastore Admin interface provided by Google:
Navigate to the Datastore Admin tab in the old appengine.google.com console.
Select the entity kind you want to remove, and click on the Delete Entities button.
Note: At the time of this writing, the old App Engine console still exists, but Google is migrating everything to the new console. The Datastore Admin feature only exists in the old console, but that will hopefully change. I'll make sure to update this post when (if) that happens.
Deleting entries using the Remote API
If you want a little bit more flexibility than removing all entries for a given kind, you may want to consider the Remote API (Python, Java).
The Remote API provides an interactive shell for you to execute Datastore commands locally. Something like:
>>> from google.appengine.ext import db
>>> entries = Entry.all(keys_only=True)
>>> db.delete(entries)
The above code will select all keys for the Entry kind and delete them. Instead of running from a file in App Engine, the code will be running directly on your local computer, and the Remote API will take care of executing each command in the remote Datastore.
Deleting entries programmatically - The simplest approach
Of course, removing entries manually is very easy, but things start getting complicated if you want to remove them programmatically.
If you are looking at just a few entries, you may get away by doing something like this:
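A minimal sketch of what that looks like, assuming an `Entry` model defined with the `db` API:

```python
from google.appengine.ext import db

# Load the keys for every Entry and delete them in a single call
db.delete(Entry.all(keys_only=True))
```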
This is exactly the same code we used with the Remote API example above, but put together in a single line. Very likely this is the simplest approach that you can follow to remove entries of a given kind in code: load the keys, then delete them.
Deleting entries in multiple batches
Unfortunately, the method described above to remove multiple entries breaks as soon as we need to get rid of a larger set of entries. The Datastore has a 30-second deadline limit, which means that we need to come up with a different solution for scenarios involving bigger data sets.
A well-understood approach to do this is to remove the data in multiple batches. For this you'd use two things: Tasks (Java, Python) and Cursors (Java, Python).
Let's say that you already identified that you can easily remove 1,000 entries at a time without hitting the Datastore 30-second deadline. Using a cursor you can limit your Datastore queries to 1,000 entries at a time, and using a Task Queue you can distribute multiple operations over time to avoid hitting another App Engine deadline: the 60-second per-request limit.
import webapp2
from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class DeleteEntriesHandler(webapp2.RequestHandler):
    def post(self):
        bookmark = self.request.get('bookmark')
        cursor = ndb.Cursor.from_websafe_string(bookmark) if bookmark else None
        # Remove up to 1,000 entries, starting where the previous run stopped
        keys, next_cursor, more = Entry.query().fetch_page(
            1000, keys_only=True, start_cursor=cursor)
        ndb.delete_multi(keys)
        if more and next_cursor:
            # Schedule the next batch, passing the cursor as a bookmark
            taskqueue.add(url='/tasks/delete_entries',
                          params={'bookmark': next_cursor.to_websafe_string()})
The above code shows the implementation of a Task that removes 1,000 entries at a time from the Datastore. Note how this task schedules itself at the end of the method, passing a "bookmark" as an argument so the next execution starts from where the previous one left off.
Deleting entries using MapReduce
Although the above solution works fine, if you are planning to remove a large number of entries, I'd recommend looking into MapReduce:
MapReduce is a programming model for processing large amounts of data in a parallel and distributed fashion. It is useful for large, long-running jobs that cannot be handled within the scope of a single request (...)
By using MapReduce, you can delegate to the framework all the plumbing needed to make your solution work in parallel, and concentrate only on deleting the entries. The resulting code will be clearer and the processing will be executed optimally in App Engine.
Spreading the costs over time
Before finishing this post, I want to make sure you keep something in mind: deleting is an expensive operation, so depending on how many entries you want to remove, you may be looking at a very large bill at the end of the month.
One solution to this, when applicable, is to spread the costs over time by removing a fixed number of entries every day. Here you can play with the Datastore's free quotas, or simply spread the costs to avoid paying a huge one-time lump sum.
As long as you don't jump into code right away and properly evaluate your options beforehand, you should be fine. The Datastore is extremely powerful, but complicated to use as soon as we start manipulating large amounts of data, so it requires a more careful approach than what some of us are used to.
Developers are weird (I can tell because I'm a developer). We are usually very proud of our careers, and very protective of our work, so the weirdness shows whenever we have to interact with other developers.
I've seen it over and over again: developers have a hard time saying nice things about the work of other developers.
It's a shame, but it's also a fact.
So it's not hard to understand that one of the biggest risks of two development teams working together is precisely the fact that they are working together. It's a fragile relationship bound to break at any time, and when it does, the source code will be caught in the middle of it.
I probably made it sound a little bit dramatic above, but I think the point stands: you don't want your two teams to be fighting.
I have this situation at my company right now: two development teams a thousand miles apart that just started working on the same project together.
I'm afraid of what could possibly come down the road, so I want to fight it before it even shows.
Here is an idea (not mine, somebody mentioned it and now I'm championing it left and right): getting both teams together doing something fun may smooth out rough edges and bring up the human side of everyone when things get tight.
It's easier to blame somebody you've never seen before. It's easier to point fingers if you don't have a face for that name. I think we can prevent some bumpiness if we take these steps early on and have everyone sharing beers together for a couple of days.
So before going over the first line of code, we are going to try and have everyone sharing some fun time together. Maybe lunch, drinks, and then dinner. Have some laughs and show everyone that we can easily be a team.
In every new interview I find myself explaining how the engineering team at Levatas works. Every new candidate wants to know about our company, and I have to go over all the details not only during our conversation, but every time I'm pitching our group to potential candidates.
(The video above is the background of our website. It has no audio because it shouldn't. I'm putting it here because it's awesome and shows a little bit about our culture.)
This is a quick recollection of all those small conversations about Levatas. Hopefully I'll be able to send this link to every new candidate going forward.
Levatas is a small company in the always sunny and beautiful Palm Beach Gardens, Florida. We currently have two offices located a few yards away from each other.
We serve free food on Fridays and snacks 24/7. We play ping-pong and Xbox, and write in the walls all over the place. We brew our own beer and fun stuff is always welcome.
The Technology Group
That's how we call our group, which is more than pure software developers: we also have UX, UI, QA, and PMO.
And of course, a lot of developers.
But we are not just technology at Levatas; we also have a marketing department that complements our services very well. Clients seem to love the fusion.
"Software Developer" is not cool anymore, so we call ourselves "Engineers", and we have a bunch of those around here.
We are a well-balanced team. We have developers all the way from juniors to ninjas. All together, we cover a wide variety of technologies and skills.
Most people work from the office (if you are full time, you have a desk). Contractors can work from home or can come to the office if they prefer. Offshore contractors (we have a bunch) work from their countries (duh!).
You have to be in the office during our core hours 10:00am to 4:00pm. You need a badge to get in, and if you want others making fun of you, I dare you to come wearing a suit.
Management positions come with a lot of meetings, but regular employees get the bare minimum. As a company we only meet on Fridays for 20 - 30 minutes for a quick catch-up.
We also run Scrum, which comes with a bunch of meetings disguised as "ceremonies". Each project team has their own.
Internally we use Slack, but I know some people can't let go and still use Skype and Google Hangouts. Of course, email is a big part of internal and external communication.
We brag about being an agile company. We run Scrum. Not by the book, but our own version of Scrum, crafted and modified over the last few years.
There's still a ton of stuff we need to get better at with this process, but we've totally moved away from rigid processes.
Windows vs Mac
There are more Windows machines among engineers (designers lean heavily toward the Mac side). Discussions about which one is better are frequent and heated, but we Mac people always end up winning (of course!)
We don't have anything in the books that you have to use and I love that. We love to explore and try out the latest and the greatest. Over time we've settled in a core set of technologies we use the most, but we keep our eyes open.
Our bread and butter
We do not support IE8 anymore, unless a client pays a lot.
Lunch and Learn
We do these from time to time (pizza included). Somebody stands up and teaches the crowd something they don't know about (the crowd, not the speaker). We use this time to increase our general knowledge about hot topics in the industry.
Source Control Management
We used Subversion for years, until we decided to migrate to Git. We kept the existing code base in Subversion, and only new projects are created in GitHub.
I love Git and don't like Subversion. I'm glad most people at Levatas feel the same. It takes some time to understand the nuances of Git when you come from the SVN world, but I think the time is totally worth it.
We don't, most of the time, but we want to. We do have some projects with a lot of unit test coverage, which is amazing, but we haven't gotten our entire team on board yet.
Certain things are easier to test than others. Most of the stuff we do fall in this "other" group. The stuff that doesn't, we try to test.
We started doing them as a general practice, but we haven't gotten too far with it. Most code reviews happen informally, and it's something we definitely want to make better.
Now that we are almost exclusively working with GitHub, our plan is to use Pull Requests as a tool to propel Code Reviews forward.
We don't do automated Continuous Integration for most of our projects. We do have it for some though using a tool that we built internally. We have to grow in this area a lot, but we have taken the right steps so far.
We built a tool for one of our projects that takes care of deploying the app in multiple environments. We love it and it works great.
Then we built another tool for another client, and we use it to manage a lot of small projects. Last year alone it managed around 45 applications, and this year's number will make that pale in comparison.
And now we are building another one, this time to Rule Them All. It's not done yet at the time of this writing, but it will be soon.
Last but not least, we work for a bunch of awesome clients: IBM, HSBC, Duffy's, Bennett, Cisco, HP, Penn Mutual, Dell, Leap... and the list goes on and on.
Some of these you probably know already, some are less prominent but equally fun to work with. For all of them we go the whole nine yards.
A final word
We are just like any other engineering team out there. We try to do great work and have fun while at it.
We aren't perfect and have a lot to change and learn, but we are moving in the right direction. The best quality we have is our passion to make our company a better place.
When using frontend instances in Google App Engine, every request has a 60-second budget to process and return a response back to the caller. If this time goes by and the code hasn't returned, a DeadlineExceededError (Python) or DeadlineExceededException (Java) is thrown.
And everyone is always bitching about this (I'm thinking about myself here).
Why a deadline in the first place?
I think there are several reasons:
First of all, we need to remember that we are using a platform shared with other applications. The best way to make sure we are good citizens and don't do stupid things is by having the platform itself enforce the rules.
Can you imagine if, just for the fun of it, we made our requests never return, keeping our frontend instances working 24x7? Yes, we'd pay for it (literally, with cash), but it would also hurt the platform's ability to share unused resources.
Or we may have an honest mistake in the code that causes an infinite loop, and the enforcement of a deadline will save us from a surprising bill at the end of the month.
The deadline makes sure everyone plays by the same rules, keeping the entire ecosystem healthy. I also see it as an opportunity for developers to make sure we write the best possible (performant) code, removing any probable sloppiness from the equation.
What causes deadline errors?
The simple answer is "anything that makes a frontend instance request last longer than 60 seconds", but that's probably not too helpful, so here are some of the things you might want to consider when trying to discover potential problematic areas:
Fetching external URLs: Are you using any external services via the URL Fetch API (Java, Python)? Keep in mind that HTTP requests may take a long time, especially if the requested service is down or fails to return in a short amount of time.
Datastore contention: Updating the same entity group in the Datastore too frequently may lead to datastore contention, which in turn will cause your application to hang for longer than needed.
Sending emails: A very common case I've seen is trying to blast multiple emails during a user request.
Startup time: This is a personal pet peeve of mine: using the Spring Framework on App Engine is a one way ticket to a bunch of deadline exceptions during startup time. I wrote about this some time ago, and unfortunately I don't think it has changed.
Any other long running operations: Are you doing any heavy processing using frontend instances? Long loops over a lot of data, or complicated computation that takes a bunch of time?
The above is not by any means a comprehensive list, but rather some of the areas I've found problematic in the past. Your application will certainly have its own characteristics that may or may not align with the above list.
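Taking the URL Fetch case as an example, one mitigation is to set an explicit deadline on the call so a slow external service fails fast instead of eating the whole request budget. A minimal sketch (the URL and the 10-second value are just placeholders):

```python
from google.appengine.api import urlfetch

def fetch_external_service():
    try:
        # Give the external service 10 seconds at most, rather than
        # letting it consume the entire 60-second request budget
        return urlfetch.fetch('http://example.com/api', deadline=10)
    except urlfetch.DownloadError:
        # The service was too slow or unreachable; degrade gracefully
        return None
```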
How to avoid deadline errors?
Now that you know what's going on with your application, it's time to find a solution.
I was thinking about how exactly to write this section of the post, and decided to simply list some general ideas that will help with deadline exceptions. You'll have to decide which one of these "tools" is the best answer to your problem:
Consider processing large amounts of data in a parallel and distributed fashion using the MapReduce library.
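For the email scenario mentioned above, for instance, a common pattern is to push the work to a task queue with the deferred library so the user-facing request returns immediately. A hedged sketch (the function name and addresses are made up):

```python
from google.appengine.ext import deferred

def send_welcome_emails(recipients):
    # Runs later on a task queue, with its own (longer) deadline
    for recipient in recipients:
        pass  # send the email here

# Inside the request handler: enqueue the work and return right away
deferred.defer(send_welcome_emails, ['one@example.com', 'two@example.com'])
```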
A few weeks ago somebody told me in a meeting: "You are always thinking that things are gonna go wrong." and honestly, that made me feel very good. (This probably makes me a pessimist, but I'm OK with that.)
Apparently I'm always the one considering what's gonna happen when (and not if) real life proves to be more challenging than our idealistic dreams.
When we are designing distributed systems, our application needs to tolerate the failure of any component at any given time. Any service call could fail, and it's our responsibility to respond to this as gracefully as possible.
(Distributed systems are probably where failure is most common, but I can see this approach being valuable for any type of system.)
Pretending that everything is going to work all the time every time is not only irresponsible, but totally unprofessional.
Designing for failure is hard. It introduces extra complexity in our code, but not doing it is not an option.
A very common problem when working with the Datastore is counting all the entries of a given type. If you come from the relational world, this is as simple as a SELECT COUNT(*) statement, but the Datastore works differently.
Returns the number of results matching the query. (...) Unless the result count is expected to be small, it is best to specify a limit argument; otherwise the method will continue until it finishes counting or times out.
If you are expecting to have a large number of entries, the count method may time out before finishing counting, and even if it doesn't, count scales linearly with the total number of entries to count, so you can't expect it to be fast.
(The definition of "large number of entries" depends on multiple factors. As a rule of thumb, think approximately anything greater than 1,000.)
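In ndb, for instance, calling count with an explicit limit looks like this (the `Entry` kind and the 1,000 cap are just examples):

```python
# Cap the count so the RPC doesn't run unbounded; the result maxes out
# at the limit, which also tells you "there are at least this many"
total = Entry.query().count(limit=1000)
```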
Let's see what we can do to solve this problem.
Solution 1: Accessing the Datastore Statistics
Behind the scenes, the Datastore keeps several statistics about every stored entity kind, including the total number of entries. You can see these values by opening the Google Cloud Console and navigating to Storage > Cloud Datastore > Dashboard, then selecting a specific Kind in the dropdown.
With only a few lines we can access these same values programmatically. In the official documentation you can find a simple code example in Java and Python.
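As a sketch of what that looks like in Python with ndb (assuming an Entry kind exists):

```python
from google.appengine.ext.ndb import stats

# KindStat entities hold per-kind aggregates, refreshed periodically
kind_stat = stats.KindStat.query(
    stats.KindStat.kind_name == 'Entry').get()
if kind_stat:
    total_entries = kind_stat.count
```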
Take into account that these values aren't updated in real time. If you need dead-on statistics, then keep reading.
Solution 2: Keeping a counter
The second solution to this problem is to keep a counter saved in the Datastore. Every time a new entry is written, you'll have to increment the counter, and every time you delete an entry, you'll have to decrement it.
Implementing this solution is very simple, and the main advantage is that retrieving the number of entries is as fast as retrieving just a value from the Datastore (as fast as it gets!)
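A minimal ndb sketch of the idea (the model and key name are made up); the transaction keeps concurrent increments from stepping on each other:

```python
from google.appengine.ext import ndb

class EntryCounter(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment_counter(delta=1):
    # A single well-known entity holds the running total
    counter = EntryCounter.get_by_id('total') or EntryCounter(id='total')
    counter.count += delta
    counter.put()
```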
There's a problem though: depending on the frequency that new entries are added or deleted, this solution may lead to Datastore contention:
Datastore contention occurs when a single entity or entity group is updated too rapidly. The datastore will queue concurrent requests to wait their turn. Requests waiting in the queue past the timeout period will throw a concurrency exception. If you're expecting to update a single entity or write to an entity group more than several times per second, it's best to re-work your design early-on to avoid possible contention once your application is deployed.
If you are expecting a high number of concurrent updates, then you'll need to upgrade to sharding counters.
Solution 3: Sharding counters
To avoid writing too rapidly to the same Datastore entity that keeps the total number of entries, you can break your counter up into different entities that don't belong to the same entity group.
So instead of having a single entity holding the total count, you'll have several entities with a partial result, which you will randomly select at the time of making an update.
This article does a better job explaining the algorithm and presenting an example source code in several languages.
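Stripped of the Datastore plumbing, the core idea can be simulated in a few lines of plain Python (the shard count and names are arbitrary):

```python
import random

NUM_SHARDS = 20

# Each shard holds a partial count; in the Datastore each one would be
# a separate entity, so they never share an entity group.
shards = {'shard-%d' % i: 0 for i in range(NUM_SHARDS)}

def increment():
    # Writers pick a random shard, so concurrent updates rarely collide
    shards[random.choice(list(shards))] += 1

def total():
    # Reading the total means summing the partial count of every shard
    return sum(shards.values())
```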
It gets better from here
Depending on the throughput of your system, you might need to get creative about how to select and distribute the shards. Unfortunately, every problem is different, so it's hard to cover every possible solution. Hopefully this gave you the right foundation.
And always remember to try the easy solution first, and only keep progressing from there if that doesn't fit the bill. It's all about finding the right compromises.
The best part of writing a blog is having thousands of people following you. This doesn't mean you are going to use this privilege responsibly, refraining from writing nonsense. I write crap all the time, so I apologize for that.