Lately I've been using more Python than Java and here is why

I've always considered myself a Java guy. I really liked C# back in the day when I actually used it, but I haven't for more than 3-4 years, so I can't say I'm a .NET guy anymore.

Anything serious I have to develop, I think Java first, but lately I've found myself changing course a little bit¹.

Python is so easy to use; the barrier to start is so low; what needs to be done to have a full working Python script is so little that it's just hard to justify not doing it in Python.

If I open my command line interface, type python and hit enter, I can start testing theories and trying out code right away without anything else. No IDE, no "New program...", no required main class, nothing.
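For instance, a quick back-of-the-envelope check like this one (a made-up example) takes seconds straight in the interpreter:

```python
# A throwaway calculation, typed straight into the interpreter:
# sum the prices above a threshold, no project setup required.
prices = [9.99, 4.50, 12.00]
total = sum(p for p in prices if p > 5)
print(total)
```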

The language itself is clean, easy to understand, and you can use any editor you have available. No Eclipse, no NetBeans, no IntelliJ, and not 5GB of Visual Studio for sure.

Like Java, Python has a bunch of people writing awesome (open source) libraries for it. Like Java, you can run your Python code everywhere. And like Java, Python is fully supported in Google App Engine (and I'm in love with it!).

And like I mentioned before in a different post, it's not only me, which is really encouraging.

But of course, this doesn't mean I don't love Java. These two are both great languages and I'm not planning to ditch Java anytime soon. Python however is winning the lightweight title with me. More complicated stuff? We'll have to wait and see.

(If you are starting out in a development career, I seriously recommend you to look at Python. You'll love it and it will teach you all the fundamentals necessary to step up to more serious stuff later. And of course you can always buy this book, love Python forever, and never ever move on.)

--

¹ For the last four serious projects I've coded, I used Python three times and Java once (and Java only because it was an Android exercise for my Master's class.)

How to bulk delete entries in App Engine's Datastore

This is a very common problem everyone new to the Datastore faces sooner rather than later: you've got a specific entity kind and you want to remove multiple entries of it. Unfortunately, there's no DELETE FROM Entity instruction, so the problem is a little bit more complicated than it seems.

There are multiple options that you'll need to evaluate to make sure you take the most appropriate route. Below, I'll try to explore each of the possible scenarios/solutions to give you enough options to choose from:

Do you really need to delete those entries?

First of all, I want you to ask yourself if you really need to delete the information. It might be cheaper to just keep it. It might be cheaper to soft-delete it by adding an "Archived" field instead. It all depends on multiple factors, but you should spend some time thinking about it.

Deleting is expensive, especially if you are deleting entries represented in multiple indexes. The cheapest way to delete something is by leaving it alone.

So don't jump to conclusions right away, and try to think about what would happen if you decided not to remove the entries at all.
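To make the soft-delete idea concrete, here's a framework-free sketch (plain Python, with made-up field names) of what flagging entries instead of deleting them looks like:

```python
# Instead of removing an entry, flag it as archived and filter it
# out of normal queries. Deletes become cheap writes, and the data
# is still there if you ever need it back.
entries = [
    {'id': 1, 'title': 'first', 'archived': False},
    {'id': 2, 'title': 'second', 'archived': False},
]

def soft_delete(entry_id):
    for entry in entries:
        if entry['id'] == entry_id:
            entry['archived'] = True

def active_entries():
    return [e for e in entries if not e['archived']]

soft_delete(2)
```

In the Datastore, `archived` would simply be a regular property you add as a filter to your queries.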

Special case: Deleting all entries by hand

Before getting into more complicated procedures, if all you want to do is remove all existing entries by hand, you can use the Datastore Admin interface provided by Google:

  1. Enable the Datastore Admin
  2. Navigate to the Datastore Admin tab in the old appengine.google.com console.
  3. Select the entity kind you want to remove, and click on the Delete Entities button.

Note: At the time of this writing, the old App Engine console still exists, but Google is migrating everything to the new console. The Datastore Admin feature only exists in the old console, but that will hopefully change. I'll make sure to update this post when (if) that happens.

Deleting entries using the Remote API

If you want a little bit more flexibility than removing all entries for a given kind, you may want to consider the Remote API (Python, Java).

The Remote API provides an interactive shell for you to execute Datastore commands locally. Something like:

>>> from google.appengine.ext import db
>>> entries = Entry.all(keys_only=True)
>>> db.delete(entries)

The above code will select all keys for the Entry kind and delete them. Instead of running from a file in App Engine, the code will be running directly on your local computer, and the Remote API will take care of executing each command in the remote Datastore.

Deleting entries programmatically - The simplest approach

Of course, removing entries manually is very easy, but things start getting complicated if you want to remove them programmatically.

If you are looking at just a few entries, you may get away with doing something like this:

db.delete(Entry.all(keys_only=True))

This is exactly the same code we used with the Remote API example above, but put together in a single line. Very likely this is the simplest approach that you can follow to remove entries of a given kind in code: load the keys, then delete them.

Deleting entries in multiple batches

Unfortunately, the method described above to remove multiple entries breaks as soon as we need to get rid of a larger set of entries. The Datastore has a 30-second deadline limit, which means that we need to come up with a different solution for scenarios involving bigger data sets.

A well-understood approach to do this is to remove the data in multiple batches. For this you'd use two things: Tasks (Java, Python) and Cursors (Java, Python).

Let's say you've already identified that you can easily remove 1,000 entries at a time without hitting the Datastore's 30-second deadline. Using a cursor, you can limit your Datastore queries to 1,000 entries at a time, and using a Task Queue you can distribute multiple operations over time to avoid hitting another App Engine deadline: the 60-second per-request limit.

import webapp2

from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class Task(webapp2.RequestHandler):
    def post(self):
        # Resume from the cursor the previous run passed along (if any).
        cursor = None
        bookmark = self.request.get('bookmark')
        if bookmark:
            cursor = ndb.Cursor.from_websafe_string(bookmark)

        # Fetch the next page of (at most) 1,000 keys of the Entry kind.
        entries, next_cursor, more = Entry.query().fetch_page(
            1000,
            keys_only=True,
            start_cursor=cursor)

        ndb.delete_multi(entries)

        # Re-schedule this same task until no entries are left.
        if more:
            taskqueue.add(
                url='/task',
                params={'bookmark': next_cursor.to_websafe_string()})

The above code shows the implementation of a task that removes 1,000 entries at a time from the Datastore. Note how this task schedules itself at the end of the method, passing a "bookmark" as an argument so the next execution starts from where the previous one left off.
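If the cursor mechanics feel opaque, this framework-free simulation (plain Python, no App Engine APIs) shows the same control flow: each run processes one page of keys and hands a bookmark to the next run:

```python
def delete_batch(keys, bookmark=0, page_size=1000):
    # Take the next page of keys starting at the bookmark.
    page = keys[bookmark:bookmark + page_size]
    # ... the real task would delete `page` here ...
    next_bookmark = bookmark + len(page)
    more = next_bookmark < len(keys)
    return next_bookmark, more

keys = list(range(2500))
bookmark, more, runs = 0, True, 0
while more:
    bookmark, more = delete_batch(keys, bookmark)
    runs += 1
# 2,500 keys take three runs: 1,000 + 1,000 + 500
```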

Deleting entries using MapReduce

Although the above solution works fine, if you are planning to remove a large number of entries, I'd recommend looking into MapReduce:

MapReduce is a programming model for processing large amounts of data in a parallel and distributed fashion. It is useful for large, long-running jobs that cannot be handled within the scope of a single request (...)

By using MapReduce, you can delegate to the framework all the plumbing needed to make your solution work in parallel, and concentrate only on deleting the entries. The resulting code will be clearer, and the processing will be executed optimally in App Engine.

Spreading the costs over time

Before finishing this post, I want to make sure you keep something in mind: deleting is an expensive operation, so depending on how many entries you want to remove, you may be looking at a very large bill at the end of the month.

One solution to this, when applicable, is to spread the costs over time by removing a fixed number of entries every day. Here you can play with the Datastore's free quotas, or simply spread the costs to avoid paying a huge one-time lump sum.
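For example, a daily cron job can trigger a handler that removes a fixed batch of entries (the URL and description below are just placeholders; the schedule syntax is App Engine's):

```yaml
# cron.yaml
cron:
- description: delete a fixed batch of old entries every day
  url: /tasks/cleanup
  schedule: every 24 hours
```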

As long as you don't jump into code right away and properly evaluate your options beforehand, you should be fine. The Datastore is extremely powerful, but complicated to use as soon as we start manipulating large amounts of data, so it requires a more careful approach than what some of us are used to.

Quick thought about smoothing rough edges of two scattered development teams

Developers are weird (I can tell because I'm a developer). We are usually very proud of our careers, and very protective of our work, so the weirdness shows whenever we have to interact with other developers.

I've seen it over and over again: developers have a hard time saying nice things about the work of other developers.

It's a shame, but it's also a fact.

So it's not hard to understand that one of the biggest risks of two development teams working together is precisely the fact that they are working together. It's a fragile relationship bound to break at any time, and when it does, the source code will be caught in the middle of it.

I probably made it sound a little bit dramatic above, but I think the point stands: you don't want your two teams to be fighting.

I have this situation at my company right now: two development teams a thousand miles apart that just started working on the same project together.

I'm afraid of what could possibly come down the road, so I want to fight it before it even shows.

Here is an idea (not mine, somebody mentioned it and now I'm championing it left and right): getting both teams together doing something fun may smooth out rough edges and bring up the human side of everyone when things get tight.

It's easier to blame somebody you've never seen before. It's easier to point fingers if you don't have a face for that name. I think we can prevent some bumpiness if we take these steps early on and have everyone sharing beers together for a couple of days.

So before going over the first line of code, we are going to try and have everyone sharing some fun time together. Maybe lunch, drinks, and then dinner. Have some laughs and show everyone that we can easily be a team.

TDD is not about the tests

This is an interesting thought that makes a lot of sense when I think about it:

Test-Driven Development is not about testing code, but about designing it. The tests happen to be a very useful byproduct of the process.

I've always thought about it the other way: let's test, and oh, nice! We also get a better design! But in my book the design is a little bit more important, so it makes sense to flip the equation.

Engineering at Levatas

In every new interview I find myself explaining how the engineering team at Levatas works. Every new candidate wants to know about our company, and I have to go over all the details, not only during our conversations, but every time I'm pitching our group to potential candidates.

(The video above is the background of our website. It has no audio because it shouldn't. I'm putting it here because it's awesome and shows a little bit about our culture.)

This is a quick recollection of all those small conversations about Levatas. Hopefully I'll be able to send this link to every new candidate going forward.

Levatas

Levatas is a small company in the always sunny and beautiful Palm Beach Gardens, Florida. We currently have two offices located a few yards away from each other.

We serve free food on Fridays and snacks 24/7. We play ping-pong and Xbox, and write on the walls all over the place. We brew our own beer, and fun stuff is always welcome.

The Technology Group

That's what we call our group, which is more than pure software developers: we also have UX, UI, QA, and PMO.

And of course, a lot of developers.

But we are not just technology at Levatas; we also have a marketing department that complements our services very well. Clients seem to love the fusion.

Engineering

"Software Developer" is not cool anymore, so we call ourselves "Engineers", and we have a bunch of those around here.

We are a well-balanced team. We have developers all the way from juniors to ninjas. All together, we cover a wide variety of technologies and skills.

The Office

Most people work from the office (if you are full time, you have a desk). Contractors can work from home or can come to the office if they prefer. Offshore contractors (we have a bunch) work from their countries (duh!).

You have to be in the office during our core hours, 10:00am to 4:00pm. You need a badge to get in, and if you want others to make fun of you, I dare you to come wearing a suit.

Our Meetings

Management positions come with a lot of meetings, but regular employees get the bare minimum. As a company we only meet on Fridays for 20 - 30 minutes for a quick catch-up.

We also run Scrum, which comes with a bunch of meetings disguised as "ceremonies". Each project team has their own.

Communication

Internally we use Slack, but I know some people can't let go and still use Skype and Google Hangouts. Of course, email is a big part of internal and external communication.

We do have office phones, and use WebEx, GoToMeeting, and JoinMe heavily.

Scrum

We brag about being an agile company. We run Scrum. Not by the book, but our own version of Scrum, crafted and modified over the last few years.

There's still a ton of stuff we need to get better at with this process, but we've totally moved away from rigid processes.

Windows vs Mac

There are more Windows machines among engineers (designers lean heavily toward the Mac side). Discussions about which one is better are frequent and heated, but we Mac people always end up winning (of course!)

Technologies

We don't have anything in the books that you have to use, and I love that. We love to explore and try out the latest and greatest. Over time we've settled on a core set of technologies we use the most, but we keep our eyes open.

If you really want to know, here are some of the buzzwords that might do it for you: Git, HTML5/CSS3/JavaScript, AngularJS, .NET, Google App Engine, PHP, MySQL, SQL Server, Dojo...

Our bread and butter

We are a digital agency, but in 2014 - 2015 that mostly means "a web company". Almost everything we do in our Technology group is for the web, so our bread and butter is written in JavaScript and runs in multiple browsers.

We do not support IE8 anymore unless the company pays a lot.

Lunch and Learn

We do these from time to time (pizza included). Somebody stands up and teaches the crowd something they don't know about (the crowd, not the speaker). We use this time to increase our general knowledge about hot topics in the industry.

Source Control Management

We used Subversion for years, until we decided to migrate to Git. We kept the existing code base in Subversion, and only new projects are created in GitHub.

I love Git and don't like Subversion. I'm glad most people at Levatas feel the same. It takes some time to understand the nuances of Git when you come from the SVN world, but I think the time is totally worth it.

Unit Testing

We don't most of the time, but we want to. We do have some projects with a lot of unit test coverage, which is amazing, but we haven't gotten our entire team on board yet.

Certain things are easier to test than others, and most of the stuff we do falls in the "others" group. The stuff that doesn't, we try to test.

Code Reviews

We started doing them as a general practice, but we haven't gotten too far with it. Most code reviews happen informally, and it's something that we definitely want to get better at.

Now that we are almost exclusively working with GitHub, our plan is to use Pull Requests as a tool to propel Code Reviews forward.

Continuous Integration

We don't do automated Continuous Integration for most of our projects. We do have it for some, though, using a tool that we built internally. We have to grow a lot in this area, but we have taken the right steps so far.

Deployments

We built a tool for one of our projects that takes care of deploying the app in multiple environments. We love it and it works great.

Then we built another tool for another client, and we use it to manage a lot of small projects. Last year alone it managed around 45 applications, and this year's number will make that one pale in comparison.

And now we are building another one, this time one to Rule Them All. It's not done yet at the time of this writing, but it will be soon.

Our Clients

Last but not least, we work for a bunch of awesome clients: IBM, HSBC, Duffy's, Bennett, Cisco, HP, Penn Mutual, Dell, Leap... and the list goes on and on.

Some of these you probably know already, some are less prominent but equally fun to work with. For all of them we go the whole nine yards.

A final word

We are just like any other engineering team out there. We try to do great work and have fun while at it.

We aren't perfect and have a lot to change and learn, but we are moving in the right direction. The best quality we have is our passion to make our company a better place.

Deadline errors: 60 seconds or less in Google App Engine

When using frontend instances in Google App Engine, every request has a 60-second budget to process and return a response back to the caller. If this time goes by and the code hasn't returned, a DeadlineExceededError (Python) or DeadlineExceededException (Java) is thrown.

And everyone is always bitching about this (I'm thinking about myself here).

Why a deadline in the first place?

I think there are several reasons:

First of all, we need to remember that we are using a platform shared with other applications. The best way to make sure we are good citizens and don't do stupid things is by having the platform itself enforce the rules.

Can you imagine if, just for the fun of it, we made our requests never return, keeping our frontend instances working 24x7? Yes, we'd pay for it (literally, with cash), but it would also hurt the platform's ability to share unused resources.

Or we may have an honest mistake in the code that causes an infinite loop, and the enforcement of a deadline will save us from a surprising bill at the end of the month.

The deadline makes sure everyone plays by the same rules, keeping the entire ecosystem healthy. I also see it as an opportunity for developers to make sure we write the best possible (performant) code, removing any probable sloppiness from the equation.

What causes deadline errors?

The simple answer is "anything that makes a frontend instance request last longer than 60 seconds", but that's probably not too helpful, so here are some of the things you might want to consider when trying to discover potential problematic areas:

  • Fetching external URLs: Are you using any external services via the URL Fetch API (Java, Python)? Keep in mind that HTTP requests may take a long time, especially if the requested service is down or fails to return in a short amount of time.

  • Datastore queries: Are you doing any heavy processing with the Datastore? Maybe queries that are taking a long time to return? Remember it's very easy to get caught in long-running operations that time out after a long time.

  • Datastore contention: Updating the same entity group in the Datastore too frequently may lead to Datastore contention, which in turn will cause your application to hang for longer than needed.

  • Sending emails: A very common case I've seen is trying to blast multiple emails during a user request.

  • Startup time: This is a personal pet peeve of mine: using the Spring Framework on App Engine is a one-way ticket to a bunch of deadline exceptions during startup time. I wrote about this some time ago, and unfortunately I don't think it has changed.

  • Any other long running operations: Are you doing any heavy processing using frontend instances? Long loops over a lot of data, or complicated computation that takes a bunch of time?

The above is not by any means a comprehensive list, but rather some of the areas I've found problematic in the past. Your application will certainly have its own characteristics that may or may not align with the above list.

How to avoid deadline errors?

Now that you know what's going on with your application, it's time to find a solution.

I was thinking about how exactly to write this section of the post, and decided to simply list some general ideas that will help with deadline exceptions. You'll have to decide which one of these "tools" is the best answer to your problem:

  • Consider processing large amounts of data in a parallel and distributed fashion using the MapReduce library.

  • Consider redesigning your data model to avoid Datastore contention.

  • Consider executing operations outside the original user request by using Task Queues (Java, Python) (which will give you 10 minutes instead of 60 seconds).

  • Consider using asynchronous Datastore requests to execute as many operations in parallel as possible (Java, Python).

  • Memcache (Java, Python) is always your friend, and can offload some heavy lifting from your Datastore (and your bill), making everything faster in the process.

  • Make sure your code is ready to serve requests as fast as possible (minimum startup time). I can't stress this one enough. While you are at it, make sure you enable Warmup Requests (Java, Python).
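The "move it out of the request" idea from the list above is easy to see in a framework-free sketch (Python's standard `queue` module standing in for a Task Queue): the user request only enqueues a description of the work and returns immediately, while a worker with a longer budget does the slow part later:

```python
import queue

# Stand-in for a Task Queue: the user request enqueues the slow job
# and returns fast; a worker with a longer budget picks it up later.
work = queue.Queue()

def handle_request(recipients):
    # Instead of sending N emails inside the 60-second request,
    # describe the job and return right away.
    work.put({'action': 'send_emails', 'recipients': recipients})
    return 'accepted'

def worker_step():
    # The worker runs outside the user request, so slow processing
    # doesn't count against the request deadline.
    return work.get()

status = handle_request(['a@example.com', 'b@example.com'])
```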

So what's next?

You may want to read this article about this same topic. You might also want to think hard about your specific problem because chances are you'll need to get creative finding the solution you need.

(Certainly one of the things that I love about this type of problem: there's plenty of room for creative thinking.)

Designing for failure

A few weeks ago somebody told me in a meeting: "You are always thinking that things are gonna go wrong," and honestly, that made me feel very good. (This probably makes me a pessimist, but I'm OK with that.)

Apparently I'm always the one considering what's gonna happen when (and not if) real life proves to be more challenging than our idealistic dreams.

When we are designing distributed systems, our application needs to tolerate the failure of any component at any given time. Any service call could fail, and it's our responsibility to respond to this as gracefully as possible.

(Distributed systems are probably where failure is most common, but I can see this approach being valuable for any type of system.)

Pretending that everything is going to work all the time every time is not only irresponsible, but totally unprofessional.

Designing for failure is hard. It introduces extra complexity in our code, but not doing it is not an option.

The browser cache has become the scapegoat for every web development mistake

I sort of touched on this topic sometime ago while talking about the browser cache and people having to clear it. If I remember correctly, I was frustrated, and this is how I ended the post:

Learn about leveraging cache, and stop giving the "please clear your cache" answer for each one of your problems.

So this post is just to formally state that I believe that web developers use the browser cache as the ultimate scapegoat for every mistake they make.

Instead of spending 30 seconds thinking about what the problem really is, I've seen how the first reaction is always "please, clear your cache".

It really bugs me. It's wrong and unprofessional. I've gone as far as counting how many times I receive this answer and how many times it turns out to be the wrong one.

My advice is to think twice every time you feel the urge to blame the cache. And if you're right and the cache is giving you a hard time, then keep working until you fix that problem.

How to count all entries of a given type in App Engine's Datastore?

A very common problem when working with the Datastore is counting all the entries of a given type. If you come from the relational world, this is as simple as a SELECT COUNT(*) statement, but the Datastore works differently.

Here is the documentation for the count method in the Query class:

Returns the number of results matching the query. (...) Unless the result count is expected to be small, it is best to specify a limit argument; otherwise the method will continue until it finishes counting or times out.

If you are expecting to have a large number of entries, the count method may time out before finishing counting, and even if it doesn't, count scales linearly with the total number of entries to count, so you can't expect it to be fast.

(The definition of "large number of entries" depends on multiple factors. As a rule of thumb, think approximately anything greater than 1,000.)

Let's see what we can do to solve this problem.

Solution 1: Accessing the Datastore Statistics

Behind the scenes, the Datastore keeps several statistics about every stored entity kind, including the total number of entries. You can see these values by opening the Google Cloud Console and navigating to Storage > Cloud Datastore > Dashboard, then selecting a specific Kind in the dropdown.

With only a few lines we can access these same values programmatically. In the official documentation you can find a simple code example in Java and Python.

Take into account that these values aren't updated in real time. If you need dead-on statistics, then keep reading.

Solution 2: Keeping a counter

The second solution to this problem is to keep a counter saved in the Datastore. Every time a new entry is written, you'll have to increment the counter, and every time you delete an entry, you'll have to decrement it.

Implementing this solution is very simple, and the main advantage is that retrieving the number of entries is as fast as retrieving just a value from the Datastore (as fast as it gets!)
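Here's the idea in a framework-free sketch (a plain dict standing in for the Datastore): every write and delete also touches a single counter, so counting becomes a single read:

```python
store = {'entries': {}, 'count': 0}

def put_entry(key, value):
    store['entries'][key] = value
    store['count'] += 1

def delete_entry(key):
    del store['entries'][key]
    store['count'] -= 1

def count_entries():
    # A single lookup; no scan over the entries.
    return store['count']

put_entry('a', 'first')
put_entry('b', 'second')
delete_entry('a')
```

In the real Datastore, each increment is a transactional read-modify-write on the counter entity, which is exactly where contention can come from.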

There's a problem though: depending on how frequently new entries are added or deleted, this solution may lead to Datastore contention:

Datastore contention occurs when a single entity or entity group is updated too rapidly. The datastore will queue concurrent requests to wait their turn. Requests waiting in the queue past the timeout period will throw a concurrency exception. If you're expecting to update a single entity or write to an entity group more than several times per second, it's best to re-work your design early-on to avoid possible contention once your application is deployed.

If you are expecting a high number of concurrent updates, then you'll need to upgrade to sharding counters.

Solution 3: Sharding counters

To avoid writing too rapidly to the same entity in the Datastore just to keep the total number of entries, you can break your counter up into different entities that don't belong to the same entity group.

So instead of having a single entity holding the total count, you'll have several entities, each with a partial count, and you'll randomly select one of them at the time of making an update.
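A framework-free sketch of the shard selection (a dict standing in for the shard entities; the shard count is arbitrary):

```python
import random

NUM_SHARDS = 20

def increment(shards, counter_name):
    # Pick one of N shards at random so writes spread out instead of
    # hitting a single entity too rapidly.
    index = random.randint(0, NUM_SHARDS - 1)
    key = '%s-shard-%d' % (counter_name, index)
    shards[key] = shards.get(key, 0) + 1

def total(shards):
    # The real count is the sum of all the partial counts.
    return sum(shards.values())

shards = {}
for _ in range(1000):
    increment(shards, 'Entry')
```

In the actual Datastore, each shard would be its own entity (outside any shared entity group), and reading the total means fetching all the shards and summing them.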

This article does a better job explaining the algorithm and presenting an example source code in several languages.

It gets better from here

Depending on the throughput of your system, you might need to get creative about how to select and distribute the shards. Unfortunately, every problem is different, so it's hard to cover every possible solution here. Hopefully this gave you the right foundation.

And always remember to try the easy solution first, and only keep progressing from there if that doesn't fit the bill. It's all about finding the right compromises.

Probably the simplest stupid-sounding tip about unit testing

If you are tiptoeing into the unit testing world, this is a very simple tip, but one that I've found extremely valuable:

Forget about the code. Think about the functionality.

Yeah, I know it sounds stupid, but I frequently find that people think about the method they want to test instead of the functionality of that method that needs to be tested.

See the difference?

(I'm really bad at explaining this, and at this point you are probably totally confused, so I'm going to try again below.)

Every time somebody asks me "How should I test this method?" my answer is always the same: "Test everything that can possibly fail".

The point is that you need to abstract yourself away from the code of the method. It really doesn't matter if it does this or that; that's not what you should focus on.
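A small, made-up example of the difference: the test below doesn't care how `slugify` is written, only about the ways its behavior can possibly fail.

```python
def slugify(title):
    # Implementation detail; the test shouldn't depend on it.
    return '-'.join(title.lower().split())

def test_slugify_behavior():
    # Functionality: lowercase, words joined by dashes, extra
    # whitespace collapsed, empty input handled.
    assert slugify('Hello World') == 'hello-world'
    assert slugify('  spaced   out ') == 'spaced-out'
    assert slugify('') == ''

test_slugify_behavior()
```

You could rewrite `slugify` with a regular expression tomorrow and the test wouldn't change, because it describes the functionality, not the code.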

Better now?