Holograms and Data Science

This will be a quick and different post, recording one of these A-HA moments we sometimes have, before I forget it. It is a random thought that’s got more than 140 characters. A tweet would not do it.

I just finished reading The Black Hole War, by Leonard Susskind (one of the most important physicists alive today). It is a great book if you are interested in modern physics. Here is an excerpt of my review on Goodreads:

(…). For non-physicists like me, this was a fantastic introduction to what we currently know about quantum gravity and its relation to other areas of science. As a bonus, it also (finally) helped me start grasping string theory, and better understand entropy, the event horizon (complementarity, information paradox) and the holographic principle.

My A-HA moment came while the book discussed the Holographic Principle. Coincidence or not, I have been studying a whole bunch of Machine Learning, and things converged. I love when (seemingly) completely different areas of science suddenly converge. It reinforces my belief that there must be something common underlying all of it (call it Grand Unified Theory, or God, or as you please).

Principal Component Analysis and Dimensionality Reduction are the Data Science (Machine Learning) topics that seem most closely related to the Holographic Principle. They are techniques for removing redundancy from data sets and extracting the minimal data necessary to represent something. They allow you to reduce the number of columns in a database without losing meaningful information, for example. The main technique can be seen as a linear algebra transformation: a projection of the N-dimensional data onto a smaller K-dimensional space.
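To make the projection idea concrete, here is a minimal sketch of PCA in the smallest possible case: two columns reduced to one. It assumes plain Java with no libraries, and the class and method names (PcaSketch, principalAxis, project) are made up for this illustration. The axis is the eigenvector of the 2x2 covariance matrix with the largest eigenvalue, computed in closed form.

```java
import java.util.Arrays;

// Hypothetical sketch: reduce 2-D points to 1-D along their first principal component.
class PcaSketch {
    // Mean of each of the two columns.
    static double[] mean(double[][] points) {
        double mx = 0, my = 0;
        for (double[] p : points) { mx += p[0]; my += p[1]; }
        return new double[] {mx / points.length, my / points.length};
    }

    // Unit eigenvector of the covariance matrix with the largest eigenvalue.
    static double[] principalAxis(double[][] points) {
        double[] m = mean(points);
        double sxx = 0, sxy = 0, syy = 0;
        for (double[] p : points) {
            double dx = p[0] - m[0], dy = p[1] - m[1];
            sxx += dx * dx; sxy += dx * dy; syy += dy * dy;
        }
        int n = points.length;
        sxx /= n; sxy /= n; syy /= n;

        // Largest eigenvalue of [[sxx, sxy], [sxy, syy]].
        double trace = sxx + syy;
        double det = sxx * syy - sxy * sxy;
        double lambda = trace / 2 + Math.sqrt(trace * trace / 4 - det);

        double vx, vy;
        if (Math.abs(sxy) > 1e-12) { vx = sxy; vy = lambda - sxx; }
        else if (sxx >= syy)       { vx = 1;   vy = 0; }   // columns uncorrelated
        else                       { vx = 0;   vy = 1; }
        double norm = Math.hypot(vx, vy);
        return new double[] {vx / norm, vy / norm};
    }

    // The K = 1 projection: one coordinate per point instead of two.
    static double[] project(double[][] points) {
        double[] m = mean(points);
        double[] axis = principalAxis(points);
        double[] out = new double[points.length];
        for (int i = 0; i < points.length; i++) {
            out[i] = (points[i][0] - m[0]) * axis[0] + (points[i][1] - m[1]) * axis[1];
        }
        return out;
    }

    public static void main(String[] args) {
        // The second column is exactly twice the first, so it is fully redundant.
        double[][] points = {{0, 0}, {1, 2}, {2, 4}, {3, 6}};
        System.out.println(Arrays.toString(project(points)));
    }
}
```

In this toy data set the second column carries no extra information, so the single projected coordinate is enough to rebuild every point. With real data the points rarely sit exactly on a line, and whatever varies along the dropped axis is lost.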

In a sense, holograms are the same thing: a projection (encoding) of a higher dimensional space onto a lower dimensional space (3D to 2D, for example). The Holographic Principle states that all the information inside an N-dimensional space is contained in its (N-1)-dimensional boundary. For example, all the information inside a volume (3D) is contained in (or described by) its surface (2D).

This is fascinating. Holograms are fascinating. It could mean that not all dimensions (e.g. columns in your database) are necessary to perfectly describe a piece of information, and removing that redundancy is precisely what Dimensionality Reduction (and PCA) tries to do. I wonder if we can use the Holographic Principle to find ways to do Dimensionality Reduction without losing any information (a lossless compression, if you will).

Linear algebra is another fascinating aspect of all this. Projecting data and extracting meaningful information (compression) always seems to involve the calculation of eigenvectors and eigenvalues. Google is constantly calculating eigenvectors to power your searches: PageRank is the principal eigenvector of the web's link matrix. Somehow, eigenvectors seem to be central to all of this.

I am sure this is not something I’m inventing or discovering, and there are plenty of papers about it out there. I just happen not to have stumbled upon any of them. A quick search tells me that Holography and Dimensionality Reduction are being used in many different areas of science, including genetics and biology. If you know of any such papers (proving or disproving my random thought), let me know in the comments.

Software Development Guidelines

From Merriam-Webster:

guide·line (noun): an indication or outline of policy or conduct

U.S. Dept. of Veterans Affairs:

A guideline is a statement by which to determine a course of action. A guideline aims to streamline particular processes according to a set routine or sound practice. By definition, following a guideline is never mandatory. Guidelines are not binding and are not enforced.

These definitions are very important. What I am listing here are my guidelines, a set of items that usually drive my particular style of coding and engineering. They are not rules. As someone once wisely pointed out (no references, sorry):

“guidelines account for judgement, rules don’t”

They are also my guidelines. It is ok if they do not fit your case; I am not trying to describe the definitive way of writing, maintaining and operating software. Most of the items here should be obvious to many people, but I hope this helps you think about your own guidelines and better understand those of others.

After a quick disclaimer, let us go to the list. There is a lot more I would like to publish someday, but I will try to not make this too long and painful.

Extremism

There is a big chance that yin-yang is in my blood; my father is Chinese. One of the most important lessons I have learned so far is that extremism is bad: everything needs balance.

This is at the top of my list, mainly because it nicely applies to the other items as well. I consider them to be good ideas, but I will not just blindly apply them to every situation. I usually do, unless I can come up with a very good justification not to.

The recursiveness of this item is also beautiful. It means that you can even be extreme (passionate?) about something, as long as you can give a damn good reason for it. You should not be extreme about not being extreme.

Take for example my deep hate for inheritance (in the OO context): there are good reasons for it. But it is not true that I would never use inheritance: the fact that it should be avoided is something to keep in mind, not something to block you from delivering software.

While we are on the topic, I really like how Go approaches inheritance.

Enabling vs. Directing

I took this very important lesson from some good discussions I had with my friend Tiago many years ago, back when we were coworkers in Germany.

Enabling means that something can be used in many different ways, even in ways that we have not considered yet. Humans are creative and will come up with different ways of using and applying an enabling idea. Directing, however, means that something is designed to be used in one particular way, or for a specific action.

A good example is how to design Java annotations. Here is how we could annotate a class to define that its objects need to be saved to a database (i.e. they are persistent):

@Save
public class Person {
  // ...
}

This is an example of a directing implementation: it says specifically that objects of this class need to be saved. Using that information for anything else would be awkward.

Here is how the Java Persistence API (thanks to Hibernate) defines it:

@Entity
public class Person {
  // ...
}

This information could be used by many different components on a system. For instance, a logging component could read and use it to determine that some of its fields need to be filtered from logs (such as passwords), just because persistent entities usually contain sensitive data. A formatting component could use different colors for entity objects. There are all sorts of different uses for that information and creative people will come up with even more.

It does not mean that all of those uses will be correct, but it enables more uses of your code and makes it potentially more flexible.
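The logging idea can be sketched in a few lines. This is only an illustration under stated assumptions: the Entity annotation below is a local stand-in declared inside the example (not the real JPA one, so the sketch has no dependencies), and EnablingSketch and describeForLog are hypothetical names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

class EnablingSketch {
    // Stand-in for the persistence annotation; RUNTIME retention makes it visible to reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Entity {}

    @Entity
    static class Person {
        final String name;
        final String password;
        Person(String name, String password) { this.name = name; this.password = password; }
        @Override public String toString() { return "Person(" + name + ", " + password + ")"; }
    }

    // A use the annotation's designer never planned for:
    // anything marked as an entity is treated as sensitive and kept out of the logs.
    static String describeForLog(Object value) {
        if (value.getClass().isAnnotationPresent(Entity.class)) {
            return value.getClass().getSimpleName() + "[fields hidden]";
        }
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        System.out.println(describeForLog(new Person("Ana", "secret")));
        System.out.println(describeForLog("plain message"));
    }
}
```

The logging code knows nothing about databases. It only relies on the annotation stating what the class is, which is what makes the second use possible at all.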

This concept applies to almost any decision we need to make in software development. Think about shipping a new feature, designing a component, planning an API, coding style guides, people and project management styles, technology (framework?) choices, etc. In each context there is probably a more enabling way of doing things than the directing one, and it would empower people more.

In practice, it can be very hard. I have experienced many cases where it was really hard to tell if we were directing or enabling people. Keeping this in mind already helps a lot though.

Premature Optimization

This has been discussed a lot already, no need to talk too much about it. We all know that “Premature optimization is the root of all evil”.

However, to quote a line usually attributed to Albert Einstein:

“Everything should be made as simple as possible, but not simpler.”

Do not use premature optimization as an excuse to be lazy or irresponsible about what you ship. Keep it in mind, but balance it with your previous experience and feedback from others. There are times when you just know it is going to bite you soon.

Which leads us to the next item…

Productionization

I have learnt a lot of this over the past years running large production services (both at Locaweb and now at Heroku).

From the beginning, think about how your code is going to run in production. Do you understand the platform (runtime) it is going to run on? Are you comfortable troubleshooting hard problems? While we are on the topic, how are you planning to debug those problems?

Are you collecting metrics on how it is being used and how it serves its requests? Even if you do not officially publish an SLA, understanding your service times is invaluable when digging into the issues that come with production load. Remember to track more than mean values: the mean alone is useless. Always combine it with variance, or better, with percentiles. Target high percentiles: the 95th and 99th are good choices, depending on your scale.
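A small sketch shows why the mean hides what percentiles reveal. It assumes the nearest-rank definition of a percentile (one of several in use), and the class name LatencyStats and the sample numbers are invented for the illustration.

```java
import java.util.Arrays;

class LatencyStats {
    static double mean(double[] samples) {
        double sum = 0;
        for (double s : samples) sum += s;
        return sum / samples.length;
    }

    // Nearest-rank percentile: the smallest sample that is >= p percent of all samples.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Made-up service times in milliseconds: 98 fast requests and 2 very slow ones.
        double[] millis = new double[100];
        Arrays.fill(millis, 10.0);
        millis[98] = 1000.0;
        millis[99] = 1000.0;

        System.out.println("mean = " + mean(millis));
        System.out.println("p95  = " + percentile(millis, 95));
        System.out.println("p99  = " + percentile(millis, 99));
    }
}
```

With these numbers the mean is about 30 ms, a value that no request actually experienced, while the 99th percentile exposes the one-second outliers that some customers really hit.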

Last but not least: are you confident that you can notice (and be notified of) problems before your customers start complaining about your service on Twitter? Metrics and automated monitoring are essential for running production services.

Testing

It does not really matter whether it is automated or not. Yes, I said it. I have seen successful software (for whatever that means) both ways. Testing what you ship is just being responsible about it. And that includes having the proper infrastructure to do it: staging environments, gradual rollouts, feature flagging, etc.

Do not get me wrong. Automated testing should usually be preferred, but it is not the ultimate goal. I have seen many “evangelists” speak for hours about how important automated testing is, while the kernel of the operating system they use prefers a more traditional approach with lots of manual (or semi-automated) tests, QA testing teams and few (or even zero) automated tests in its codebase.

Quick note about test (or behavior) driven development: it has more to do with software design methods than with the tests themselves. IMO it is a good practice, which works for me sometimes, but not in every project, every day.

Dependency Inversion

I am really proud of what some of my heavy Java development days have taught me.

Designing loosely coupled components is a big one. Dependency Injection, Dependency Inversion Principle and Inversion of Control are all related topics that to me mean a simple thing: design for single (or few) responsibilities.

It means that when you are writing that piece of code (be it a method, function, class, module, script or whatever), focus it on a single responsibility, and keep it small. If it gets too big, break it up. If it needs resources to do its job, do not have it go after them itself; receive (inject) them instead.

In summary: instead of going after your dependencies, inject them. When a component needs to go after a resource it needs, it usually means that:

  1. The component is now responsible for its dependency lifecycle: this adds more responsibility. Now that it opens that damn connection to your DB (or your queue service), it needs to decide when to close, and needs to know how to do it. Also, what happens when you need to share the same connection in other parts of the system?
  2. The resource needs to be globally accessible: if the component did not create the resource it needs, or if that resource is shared in many parts of the codebase, it needs to be globally available. Hiding the resource creation behind factories does not help when the factory objects themselves need to be globally reachable. I hope I do not need to talk much about how globally accessible things are bad, but it is usually hard to test components using them, and it leads to tighter coupling. Changing that global resource gets harder with more stuff using it.

When you invert control and inject dependencies instead, you have the opportunity to centralize resource management in a single place. That single place has the (single) responsibility of deciding where to inject that resource and how to share it. In fancier environments, something like this is called a Dependency Injection Framework, but it does not need to be a big bloated framework once you understand the mechanics. In fact, it does not require a framework at all: it is just plain old method/function/constructor invocation with the proper parameters.
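Here is what that plain constructor invocation can look like, as a minimal sketch. Every name in it (Connection, InMemoryConnection, SignupService, AuditService) is hypothetical, and the in-memory connection simply stands in for a database or queue client.

```java
import java.util.ArrayList;
import java.util.List;

class InjectionSketch {
    // The resource the components need; they depend only on this interface.
    interface Connection {
        void send(String message);
    }

    // Stand-in for a real DB or queue connection.
    static class InMemoryConnection implements Connection {
        final List<String> sent = new ArrayList<>();
        @Override public void send(String message) { sent.add(message); }
    }

    // Neither service opens, closes or looks up the connection: it is handed in.
    static class SignupService {
        private final Connection connection;
        SignupService(Connection connection) { this.connection = connection; }
        void register(String email) { connection.send("signup:" + email); }
    }

    static class AuditService {
        private final Connection connection;
        AuditService(Connection connection) { this.connection = connection; }
        void record(String event) { connection.send("audit:" + event); }
    }

    public static void main(String[] args) {
        // The single place that creates the resource and decides who shares it.
        InMemoryConnection connection = new InMemoryConnection();
        SignupService signup = new SignupService(connection);
        AuditService audit = new AuditService(connection);

        signup.register("ana@example.com");
        audit.record("signup-completed");
        System.out.println(connection.sent);
    }
}
```

Because the services only receive a Connection, a test can pass in the in-memory version and inspect what was sent, with no global state and no factory lookups.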

All in all, let us please not forget about balance. This is not a rule, there are times when what you need is just a goddamn simple function.

Do Not Block The Event Loop

Evented programming (or event-driven programming) is very popular these days and one of the side effects is that some projects will want to be evented without careful consideration.

Going evented or not is one of the big decisions involved in writing code, with big implications. Things blocking the event loop are usually very hard to debug. When you go that route, all the code you call (including external libraries) needs to be aware of the event loop and must not block it.

There are many advantages though, notably it leads to much more lightweight servers which support much higher concurrency levels. When I am deciding if I should do event-driven code or not, here is what I consider (feel free to add a comment below with your own thoughts):

  • Is it an event-driven runtime/platform? Node.js, for example, has been designed from the ground up to be evented, meaning that all libraries and code written to run on it are already aware of the event loop. If the platform was not designed to be evented (like Ruby with EventMachine), much more care must be taken not to call code that will block the event loop. It is hard to control all the code in the libraries included in your project. Take that into consideration.
  • Consider evented if the piece of code you are writing is mostly a data multiplexer, meaning that it just takes data from one side and sends it to another, acting like a pipe, distributor, load balancer, or router. This type of component is usually I/O bound.
  • Avoid evented if the piece of code you are writing is CPU bound and does a lot of processing. The chance it will block the event loop is much higher. I have seen projects having to resort to threads (or external processes) to move that part out of the event loop, often leading to spaghetti concurrency that is very hard to follow: part evented, part threaded.
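The cost of blocking can be seen with a toy event loop. The sketch below is an assumption-laden illustration, not how any particular runtime works: a single-threaded Java executor plays the role of the loop, Thread.sleep stands in for a blocking or CPU-heavy handler, and the names EventLoopSketch and delayBehindBlockingHandler are made up.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class EventLoopSketch {
    // Returns how many milliseconds a trivial event waited behind a handler
    // that holds the loop's only thread for blockMillis.
    static long delayBehindBlockingHandler(long blockMillis) {
        ExecutorService loop = Executors.newSingleThreadExecutor(); // one thread = one event loop
        try {
            Runnable blockingHandler = () -> {
                try {
                    Thread.sleep(blockMillis); // stands in for blocking I/O or heavy computation
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
            Callable<Long> trivialEvent = System::nanoTime; // records when it finally ran

            long enqueuedAt = System.nanoTime();
            loop.submit(blockingHandler);
            Future<Long> ranAt = loop.submit(trivialEvent);
            return TimeUnit.NANOSECONDS.toMillis(ranAt.get() - enqueuedAt);
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            loop.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("waited ms behind a quick handler: " + delayBehindBlockingHandler(0));
        System.out.println("waited ms behind a slow handler:  " + delayBehindBlockingHandler(200));
    }
}
```

The second event does no work at all, yet it waits for as long as the first handler holds the thread. On a real server, every connected client is in the position of that second event.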

There Is Always Something To Learn

As I learn more, I expect this list to change, but I would say that it currently contains the factors influencing my development style the most. It was a very healthy exercise to think about what defines me as a programmer. I hope it is for you too, try it out!

Spider Man’s professionalism at RailsConf 09

Professionalism definition from Uncle Bob in his recent talk at RailsConf 2009:

“Discipline to wielding of power.”

— Robert Martin (aka Uncle Bob)

To me, it sounds a lot like the definition of “Ruby” from Chad Fowler at the last Rails Summit Brasil:

“Ruby is a dangerous tool.”

— Chad Fowler

Which reminds me of Uncle Ben’s advice to Spider Man:

“With great power comes great responsibility.”

— Uncle Ben

Could Uncle Bob have been inspired by Uncle Ben? What if they are the same person? 😀

Rfactor: Ruby Refactoring for your loved editor

I know we all love Ruby, and don’t care that much about not having auto completion/IntelliSense available.

I don’t care that much about auto completion, when coding in Ruby, myself. What I really like in Java IDEs is their refactoring support. Eclipse and IntelliJ IDEA are simply awesome in this space for Java. We still have ReSharper for Visual Studio and others, targeting other languages. Ruby has NetBeans, Aptana RadRails, RubyMine and TurboRuby/3rdRail doing a great job in this area.

But I have this feeling that most Ruby developers do not use IDEs (myself included). We are using good text editors, such as TextMate, Vim, Emacs and GEdit. They are good enough. Why would I need something else?

I have to admit, I really miss some refactorings while programming in Ruby. In particular, the lack of “Extract Method” and “Extract Variable” bothers me. They aren’t even complicated, so why hasn’t someone already implemented them?

So, I would like to introduce Rfactor. It is a Ruby gem, which aims to provide common and simple refactorings for Ruby code. RubyParser from Ryan Davis is being used to analyze and manipulate the source code AST, in the form of Sexps.

In theory, we should be able to use Rfactor to power any editor, adding refactoring capabilities to it. I’m targeting TextMate, but I would love to see contributions for others. The TextMate Bundle is hosted on GitHub:

Rfactor TextMate Bundle, with installation instructions

This very first release only supports a basic “Extract Method”: it works inside methods and does not try to guess the parameters or return value of the extracted method.

Stay in touch, there is much more coming!

Word Movement in OS X Leopard Terminal.app

Word movement in OS X Leopard Terminal.app is a pain! After a long time searching, I want to keep the solution documented here.

I had been searching for a long time for a way to fix the home/end keys and to jump between words. In every other OS X application, cmd + arrows and option + arrows do the trick, but not in Terminal.app. I had once fixed it for OS X Tiger, but I couldn’t remember how…

Finally, I’ve found it. Thanks, TextMate guys!

My choice is fn + arrows (home/end) to go to the beginning/end of the line and ctrl + arrows to jump words. Fire up Terminal.app and hit cmd + , (yes, comma); the alternative is Terminal -> Preferences. Go to the Settings area, then the Keyboard tab. Edit your combos as below:

Terminal.app keyboard settings

The trick is the code \033b. It is produced through esc (\033) + b and represents “move one word backward”. Forward movement is esc + f, home is ctrl + a (\001) and end is ctrl + e (\005).