Improving the quality of user stories in a multi-team scrum process

March 3rd, 2011

At Turtle Entertainment we have been using Scrum for over a year. We have three teams, each with one dedicated product owner. Thanks to a lot of very valuable help from bor!sgloger, the teams have been working in sync since the beginning of 2011. All three teams take part in a combined planning and review meeting. This was a great improvement because we can now work through the company backlog very efficiently. Yet we are now facing the consequences of forcing teams to work together in a highly efficient process.

Why improve quality?

Team synchronization forces the teams to work together. Our product owners also have to form a team. This revealed a bunch of problems to us. These problems had existed for a long time but only became visible now:

  • The isolation of teams was close to total. They did not talk to each other, so knowledge got stuck within a single team. For example: How does our testing system work? How do I test most efficiently? Only one team had the answer and they did not share it.
  • The product owners were isolated too. They did not share their experience in writing stories, nor did they communicate their stories to each other.
  • The product owner of team A could not explain a story to team B, and team B in turn was not able to understand product owner A.
  • The product owners tried to compensate for the lack of communication by writing more “precise” stories. Eventually the stories were so detailed that they were no longer negotiable. The stories did not comply with the INVEST model.
  • The new synchronized stories were estimated unrealistically. Comparing the sum of story points to our velocity told us that the whole backlog would be done in one sprint. This was totally unrealistic.
  • Due to the unrealistic estimates we are not able to plan, which puts our company backlog meeting in jeopardy.

These problems are dangerous to our efficiency, they reduce the reliability of our Scrum process and, last but not least, they weigh on our teams' motivation.

The root cause is easy to identify: There is no flow of information between teams & product owners, teams & teams and product owners & product owners.

We need to fix this fast.

What do we need to do?

Our measures:

  • Transparency - We want more discussion of stories upfront. So we need to start discussing the stories really early, at the beginning of the writing process. We need to make the writing process a transparent part of the Scrum process.
  • Guideline - Next we need a guide on how to discuss a story. What is important? What needs to be discussed next? When is the story ready for the selected backlog?
  • Location - We need a room which encourages discussion and enables creativity.

Scrum teams discuss a user story with the product owner (dressed as Mario?! It's Carnival :-))

How to do this?

Transparency:

  • Get rid of Excel files owned by a single product owner. We want the entire company backlog in one simple PowerPoint file. The product owners can keep their own backlogs, but they need to pitch their new ideas to our management, which then puts them into the company backlog. From then on the story exists only in the PowerPoint file and is removed from the product owner's backlog.
  • Product owners start to talk about theme-sized stories with the teams. The product owners are responsible for logging the conversation between team and product owner. The scrum masters must push their teams to discuss the stories on a regular, reliable basis.

Guideline:

  • Since we are very inexperienced in discussing user stories, we need guidelines to help us. This guideline is a “Definition of done for a user story” that answers the question: “Is this story ready for the selected backlog yet?” A checklist is appended to each story so one can easily see the progress of the conversation. At a glance you can see whether the story is doing well or not, so you are hopefully encouraged to talk about it.
  • We can later introduce a “Level of Done” for the story writing process, which also helps us with transparency.

Location:

  • The discussion itself must be made transparent. With three teams it is not possible to have all members discuss one story at the same time. But all members must have an equal chance to know what the other teams or product owners said about a story. So we must go for an asynchronous discussion by writing on sticky notes and pinning them to a story. Take a look at the photos to see this in action.
  • The room where this discussion happens must not be a scrum space or meeting room. We want a living, open and ongoing discussion and even brainstorms and other creative methods should be applied. This leads to the thought of a cosy, relaxed location with enough space and creativity tools like whiteboards, flipcharts, pens in different colors and colored sticky notes.

Summary

At the end of the day we take a bit of team building and mix it with a lot of transparency to achieve better user stories. Almost no formalization is needed, except for the checklists and the definition of done for user stories.

I think this is a good, noninvasive extension to the Scrum process, which will be accepted by the teams and product owners quickly. It will take some time until we have the user stories rewritten and discussed as intended. The (hopefully positive) effect will be visible next month.

Supermicro Servers use IPMI

January 6th, 2011

Never heard of IPMI? Me neither. When we were installing our brand-new Supermicro servers in our datacenter (pictures here), we were wondering which lights-out management they use. Having a look at the BIOS, we quickly discovered IPMI options.

So what is IPMI in detail? IPMI gives you full remote control over your server, even if it's powered off. You can toggle power states, see system information, inspect server health conditions and, most interesting of all, you have full remote KVM over IP including virtual media. This means you don’t have to spend additional money on hardware KVM-over-IP solutions like Avocent. It's already on board!

So you think, what's the big deal? All major vendors have this on board, such as IBM's RSA or HP iLO. You are right, but often this again means additional spending on the hardware board or on licenses to activate options like remote KVM or virtual media. Supermicro's IPMI functions have all this included.

Despite the additional fees, we were actually quite happy with IBM's RSA when administering our blade center. But there were many small bugs that made daily operations work annoying. For example, there is a keyboard error that registers every keypress twice when using the remote keyboard, and sometimes you won't even get a remote picture. After three months of intense use of Supermicro's IPMI there has not been a single administration issue. It's very responsive, reliable and stable.

Because Supermicro's IPMI implementation supports WBEM, you can, for instance, let automated scripts power cycle your server.

In our case we developed a Nagios script which automatically reboots a server when it fails to come up.
To power cycle a Supermicro server from the CLI, first install wbemcli (on Debian, for example) and then execute:

wbemcli cm http://${LOGIN}:${PW}@${IP}:5988/root/cimv2:CIM_PowerManagementService.CreationClassName="CIM_PowerManagementService",Name="CIM:HostPowerManagement",SystemCreationClassName="CIM_ComputerSystem",SystemName="IPMI Controller 32" RequestPowerStateChange.PowerState="5",ManagedElement=CIM_ComputerSystem.CreationClassName="CIM_ComputerSystem",Name="IPMI Controller 32"
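The Nagios script itself is not part of this post. As a minimal sketch of how an event handler could wrap the command above: Nagios can pass the service state and state type to an event handler as arguments. The function names, the argument order and the choice to act only on a confirmed (HARD) CRITICAL state are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of a Nagios event handler (hypothetical names, not the original script).
# Assumed arguments: <state> <state type> <ip> <login> <password>

# Succeed only for a confirmed failure: state CRITICAL, state type HARD.
should_power_cycle() {
    [ "$1" = "CRITICAL" ] && [ "$2" = "HARD" ]
}

# Run the wbemcli call from the article against one server.
power_cycle() {
    IP="$1"; LOGIN="$2"; PW="$3"
    wbemcli cm http://${LOGIN}:${PW}@${IP}:5988/root/cimv2:CIM_PowerManagementService.CreationClassName="CIM_PowerManagementService",Name="CIM:HostPowerManagement",SystemCreationClassName="CIM_ComputerSystem",SystemName="IPMI Controller 32" RequestPowerStateChange.PowerState="5",ManagedElement=CIM_ComputerSystem.CreationClassName="CIM_ComputerSystem",Name="IPMI Controller 32"
}

# Glue: power cycle only when the check has failed for good.
handle_event() {
    if should_power_cycle "$1" "$2"; then
        power_cycle "$3" "$4" "$5"
    fi
}
```

Ending the script with handle_event "$@" would turn it into something a Nagios command definition can call. Soft states are ignored on purpose so that a single failed check does not reboot a machine.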

Some Pictures of the IPMI Java Applet:

Setting up a symfony project with PHPUnit on Hudson

October 6th, 2010

We have been using Hudson as our continuous integration system for several months now. This short article describes how we configured a Hudson project for a symfony 1.4 project. I am assuming that the reader is already familiar with Hudson and knows how a normal project is configured.

PHPUnit is the test framework of our choice (surprise, surprise) and we are using the sfPHPUnit2Plugin for all our projects. If you do not know this plugin, you may want to first read another post in which its usage and features are described in detail.

All requirements in short:

  • Hudson has to be installed
  • Hudson plugin xUnit Plugin has to be installed
  • the symfony project needs the sfPHPUnit2Plugin
  • PHPUnit has to be installed on your test server

OK, here are the configuration steps for the Hudson project:

1. Configure your project

Configure the standard settings for a Hudson project, such as source-code management settings or email notifications. Please check the official docs if you do not know how to handle this.

2.  Add a shell build step

Building a symfony project in a test environment is pretty easy. With the help of some shell commands the project is completely configured and ready for testing. Those shell commands can be entered in the build step section of the Hudson project. Defining the correct commands is the main part of the configuration process.

Our configuration looks like this:

cd $WORKSPACE/trunk
sh _deployment/install_test.sh
php symfony cc
php symfony phpunit:test-all --configuration --options="--log-junit=build/testresult_$BUILD_NUMBER.xml"
cd build
ln -s -f testresult_$BUILD_NUMBER.xml currentTestResult.xml

The six commands, in order:

  1. Change into the project root
  2. Install the project on the test server with the help of an internal shell script. This step includes, for example, the generation of the databases.yml.
  3. Clear the symfony cache (always a good choice)
  4. Run all PHPUnit tests, including unit and functional tests. The test result is written to a JUnit-compatible log file (needed for the xUnit Plugin).
  5. Change into the build directory, which Hudson uses internally
  6. Symlink the latest test result
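The result file name changes with every build, so the symlink gives the post-build configuration one fixed path to refer to. A minimal sketch of steps 5 and 6 as a function that can be tried outside Hudson; link_latest_result is a hypothetical helper name, not part of Hudson or symfony.

```shell
#!/bin/sh
# Point a stable name at the newest per-build result file (steps 5 and 6).
# link_latest_result is a hypothetical helper name.
link_latest_result() {
    build_dir="$1"
    build_number="$2"
    # Run in a subshell so the caller's working directory is left untouched.
    (
        cd "$build_dir" || exit 1
        # -f replaces the symlink left behind by the previous build.
        ln -s -f "testresult_${build_number}.xml" currentTestResult.xml
    )
}
```

Calling link_latest_result "$WORKSPACE/trunk/build" "$BUILD_NUMBER" would have the same effect as the last two lines of the build step.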

3. Configure Post-Build-Action

After the xUnit Plugin is installed correctly, an additional PHPUnit Pattern field should be displayed in the post-build action section. Enter the following in this field:

trunk/build/currentTestResult.xml

The options “Fail the build if test results were not updated this run” and “Delete temporary JUnit file” should both be checked.

The xUnit Plugin takes the currentTestResult.xml file, which was previously created with the help of the sfPHPUnit2Plugin and analyzes it. When everything works fine, you should be able to review the created test reports.

Here are some screenshots of what the result can look like.

Build history:

hudson-01

Trend graph of the test results:

hudson-02

In search of the anti-ddos device

September 10th, 2010

Preamble

The following text describes how we evaluated the most reasonable solution for protecting our infrastructure from DDoS attacks. None of the collected values and impressions claims to be correct or complete. This article only reflects our experiences and data and is meant to help you make your own decisions.

Expectations

The Electronic Sports League (ESL) is Europe’s largest online computer gaming league. Over 2.6 million registered members generate more than 100 million page impressions per month. Dealing with such a huge amount of data requires an extremely stable IT infrastructure. The ESL was increasingly targeted by DDoS attacks: a distributed network of computers hammering our servers with thousands of requests (see Wikipedia: “Denial of service attack”). The goal of DDoS attacks is to make the target unavailable to its intended users, thereby causing economic loss.

In search of a solution to the problem of these DDoS attacks we tried a number of different approaches.

Requirements

The attacks we had to deal with were mostly simple SYN attacks, between 80k and 500k pkts/s in size. Our primary goal was to be resistant to these SYN attacks. The device should mitigate these attacks as soon as possible, and the learning curve and thus the amount of configuration should be manageable. The more types of attacks it detects, the better. It should also not be attackable itself, so it should be able to operate in transparent mode. As an alternative there are different providers of proxy services. The IP which should be protected is pointed at the proxy of the provider, who is then in charge of fending off the attacks. However, looking at the number of service IPs we offer, together with the amount of traffic we generate, this option was not feasible in our case.

First Steps

The firewall we had used so far was an HP DL380 with an additional Intel network card running Debian. This hardware had massive problems handling the number of packets per second. System interrupts between 20k and 25k led to si values in “top” between 90% and 100%, with ksoftirqd leading in CPU usage. The consequences were dropped packets and a website that became slow and unresponsive. A brief look at Google promised a solution to this problem.
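The si value shown by top is the share of CPU time spent on software interrupts, and the same counter can be read from /proc/stat. A minimal sketch, assuming a Linux host; softirq_percent is a hypothetical helper name and the field positions follow the documented layout of the aggregate "cpu" line.

```shell
#!/bin/sh
# Share of CPU time spent in softirq between two samples of the aggregate
# "cpu" line from /proc/stat. softirq_percent is a hypothetical helper name.
softirq_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Fields 2..9: user nice system idle iowait irq softirq steal
            total = 0
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            t[NR] = total
            s[NR] = $8    # softirq counter
        }
        END {
            d = t[2] - t[1]
            if (d > 0) printf "%d\n", (s[2] - s[1]) * 100 / d
            else print 0
        }'
}

# Usage on a Linux host:
#   a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
#   softirq_percent "$a" "$b"
```

Sampling this once per second during an attack gives a number that can be logged or graphed, which is easier to compare across tuning attempts than watching top.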

I don’t want to go into detail below, but rather give a brief overview of the actions we took. It must be pointed out that all actions were taken to the best of our knowledge at the time, but we cannot rule out a configuration mistake that led to wrong results.

The first hit on Google was NAPI, which is designed to reduce the CPU load caused by high interrupt rates. We tried it out, but it had no effect on our problem.

Next we tried tuning the Intel network card driver. The InterruptThrottleRate parameter was especially interesting for us. Like NAPI, InterruptThrottleRate is about optimizing interrupts: it limits the interrupt rate by delaying packet delivery, thereby reducing CPU load. To our disappointment this also had no effect in our tests.
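For readers who want to repeat the experiment, the parameter is set as a kernel module option. A minimal sketch that only prints the corresponding modprobe.d line; the driver name e1000e and the value 3000 are assumptions, since the driver and values we used are not recorded here, and throttle_option_line is a hypothetical helper name.

```shell
#!/bin/sh
# Print a modprobe.d options line that pins the interrupt throttle rate.
# Assumptions: the driver name (e1000e) and the value 3000 are examples only;
# check `modinfo -p <driver>` for the parameters your driver really accepts.
throttle_option_line() {
    driver="$1"
    rate="$2"
    printf 'options %s InterruptThrottleRate=%s\n' "$driver" "$rate"
}

# Example (as root), followed by a driver reload:
#   throttle_option_line e1000e 3000 > /etc/modprobe.d/e1000e.conf
```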

Another approach was using SYN cookies to fend off at least SYN-based DDoS attacks. With SYN cookies the usual table of half-open TCP connections becomes unnecessary, so it cannot overflow. Instead, the sequence number is calculated for each handshake. This is a good option for servers which terminate the attacked IP, but it has no influence on the firewall routing those packets. So it would still be a case of just too many packets.
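On Linux, SYN cookies are controlled by the net.ipv4.tcp_syncookies kernel setting. A minimal sketch for checking it on a server that terminates the attacked IP; syncookies_enabled is a hypothetical helper name, and the optional path argument is there only so the function can be tried against a test file.

```shell
#!/bin/sh
# Report whether SYN cookies are switched on by reading the kernel setting.
# The optional path argument exists only to make the function testable.
syncookies_enabled() {
    setting_file="${1:-/proc/sys/net/ipv4/tcp_syncookies}"
    value=$(cat "$setting_file" 2>/dev/null) || return 1
    # 0 = off, 1 = used when the SYN backlog overflows, 2 = always on
    [ "${value:-0}" -ge 1 ]
}

# To switch them on (as root): sysctl -w net.ipv4.tcp_syncookies=1
```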

Further approaches like SYN proxy (not available under Linux) and iptables tuning did not lead to success either, so we were forced to search for a hardware solution. So what exactly are we looking for?

Hardware solution

Looking at the market for usable devices, you are promised that nearly every device is suitable for our situation. To form our own opinion beyond what the marketing would have us believe, we tried to reproduce the attacks in our test environment.

The test scenario

We set up four servers to reproduce the online scenario: two acting as attackers, one as web server and one as client.

Network diagram of the test environment

Software

  • Webserver: Lighttpd serving a simple HTML page containing a few pictures
  • Attacker: sudo hping3 192.168.0.11 --interface eth0 --flood --destport 80 --syn --rand-source --verbose
  • Client: curl-loader constantly loading the static HTML page and 4 small pictures

Hardware

Let's take a look at our nominees ;) We evaluated the following devices, in chronological order: Fortigate 310-B, Juniper SRX650 (routing mode), Palo Alto PA-2050, TopLayer IPS-1000E and RioRey RX2310U. Except for the Juniper, all devices are able to operate in transparent mode.

Fortigate 310-B

The Fortigate 310-B was recommended and made available for testing by a local computer retailer, which also helped us configure it so that the risk of misconfiguration was minimized. The device offers a great many configuration options and can be categorized as an all-rounder. We especially liked the virtual firewall feature. Here you can set up completely independent configurations for different scenarios, which you can simply enable or disable. For our main problem, the DDoS attacks, the Fortigate offers a set of special anti-DDoS policies which can be applied to each of the virtual firewalls. These policies in turn have thousands of configuration options you can adjust to your needs. The idea of those policies is to gain control over DDoS attacks by limiting packet rates. Sadly, it emerged in our test scenario that the device pretty quickly ran into the same problems as our Linux firewall: CPU load rises to 100% and all further packets are dropped completely. Even with all packet inspection rules disabled, it cannot handle the volume of packets correctly, so we refrained from enabling further IPS functions.

Juniper SRX650

The Juniper SRX650 is a classic Layer 3 firewall. It does not support transparent mode, which forced us to test it in routing mode. Besides rate limiting there are no special anti-DDoS policies that can be configured. Our tests quickly confirmed our assumption that this simply is not the device we are looking for. Just a few seconds after the attack began, the SRX650 buckled under the amount of packets. The interface was completely unresponsive and needed about 5 minutes to return to normal behavior after the attack stopped. The next model up does support Layer 2 mode but exceeded our price range.

Palo Alto PA-2050

The PA-2050 from Palo Alto Networks also promised to solve our problem. We had direct support from the vendor, who was familiar with our test setup and should have led us to quick success with the optimum configuration. We were surprised, however, when we saw the Palo Alto behaving the same way the Juniper did. After a few seconds of attack traffic no further traffic was handled and the client tried in vain to access the test page. The Palo Alto crew tried its best, but under our circumstances we did not find a solution within our time frame. We still think, however, that for other fields of application this device is just as good as the Juniper.

TopLayer IPS-1000E

The IPS-1000E belongs to the class of devices which are specialized as Intrusion Prevention Systems (IPS). As we were being attacked at the time of the evaluation, we decided to test it in the real production environment.

The attacks lasted 10 minutes. In the first minutes we were hardly reachable and incoming traffic was cut in half. After 2-5 minutes the situation became more stable and everything felt faster. After 10 minutes our monitoring system changed its state back to normal. The TopLayer solution is obviously not capable of protecting us from a DDoS attack completely. The firewall reported being overloaded at only 48k packets/s, while we have already had attacks in the range of half a million packets per second. We believe further investigation and tuning could result in more effective protection, but because the TopLayer was far too expensive, we did not pursue this approach any further.

RioRey

The RioRey device is specialized in mitigating DDoS attacks and only DDoS attacks. If you are looking for a firewall that also does routing and so on, this is not the right device for you.

It was the last in our test series and turned out to be the best.

At first, disappointed by the other tests, we did not expect much. A mitigation time of 90 seconds and the need for a Windows client to administer the device were not good signs.

After some time stuck in German customs, the device finally found its way to our office and we began installing it. Installation in a production environment was done easily and without risk or downtime, because its default configuration is set to monitor mode. That means that all attacks are recognized and reported as in filter mode, but no packets are dropped; traffic is just passed through. The device offers a WAN, a LAN and a MGMT interface. Once connected to the MGMT interface, you configure the basic setup by browsing to the preconfigured IP over HTTPS.

Here you configure the basics like IP address, syslog server, SNMP, passwords and so on. To get a detailed view of monitor and filter mode you need a Windows client with RioRey's software “rVIEW” installed. You connect to the configured IP and then get much more information and many more configuration options. So let's start the test:

As you can see, it takes 90 seconds to analyze the traffic. After this time, about 90% of legitimate traffic is passed through and all illegitimate traffic is blocked. And this happened with zero configuration (except IP and password). You just put the RioRey in place, switch to filter mode and that’s it. Besides the really simple installation, the most important point is that it is the only device that actually lives up to its promise. So RioRey rightly call themselves “The DDoS specialist”.

After these tests we installed the device in our production environment in direct communication with the RioRey tech team. They analyzed our traffic and suggested the optimum settings for our environment. What really impressed our team was the detailed analysis they provided of an attack we dealt with while the RioRey was active. It turned out that not only is the device itself of high quality but, even more important, so is the staff behind it.

After some weeks in operation, we found a few things which are not perfect at the moment. Many alerts are reported which seem to have had no effect, whether filtered or not. At this moment we cannot really say which of those attacks are a real threat to our website, so some more tweaking is needed. Another point is the weak status log. It does not display a history of hardware events such as link up/down changes or power failures; you only see the last event that changed state.

On the flip side we also experienced many positives. You can reboot and update the device without downtime; the traffic is simply unfiltered during this time. The reports do a good job, so we can identify the duration of the attack and the attackers' IP(s). All things considered, we are fully satisfied with this device. It just does what we expect of it. Here's another attack defended by the RioRey in the production environment:

Links

Report on RioRey at tweakers.net

Own a MacBook Pro? Don’t use Google Chrome!

August 9th, 2010

It’s not really work related, but looking around the office I see a lot of co-workers using a MacBook. And since we are doing web development, I guess these same co-workers could or should be interested in browser interoperability, and maybe they have ended up with Google Chrome as their default browser. So here is the catch:

My MacBook constantly ran at very high temperatures, and most of the time the fan was running for cooling. Searching the internet, everyone seemed to say “it’s normal”, but no one said that for temperatures constantly above 80°C. So I went out to find the root cause and found that a bug causes Chrome to write something like 20 lines to system.log _per pageload_. This led to my system.log silently growing to about 40 GB and Spotlight indexing the hell out of it while I was browsing the web… hence the heat. Here are the bugs:

http://code.google.com/p/chromium/issues/detail?id=43786
http://code.google.com/p/chromium/issues/detail?id=26621
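To check whether your own machine is affected, look at the size of the log and at how many of its lines come from Chrome. A minimal sketch; the helper names are hypothetical, and both the log path /var/log/system.log and the search string "Chrome" are assumptions about how the entries show up on your system.

```shell
#!/bin/sh
# Size of a log file in megabytes (rounded down). Hypothetical helper name.
log_size_mb() {
    [ -r "$1" ] || return 1
    bytes=$(wc -c < "$1" | tr -d ' ')
    echo $(( bytes / 1024 / 1024 ))
}

# Number of lines in the log that mention a given string, e.g. "Chrome".
log_mentions() {
    grep -c -- "$2" "$1"
}

# Usage on the Mac:
#   log_size_mb /var/log/system.log
#   log_mentions /var/log/system.log Chrome
```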

So what is the fix? Using Safari instead of Chrome dropped the temperature on my MacBook by about 30°C to a smooth 50°C. There you go: Don’t use Google Chrome!