In search of the anti-DDoS device

September 10th, 2010 by Thomas Poehler

Preamble

The following text describes how we evaluated the most reasonable way to protect our infrastructure from DDoS attacks. The values and impressions collected here do not claim to be correct or complete. This article reflects only our own experiences and data, and is meant to help you make your own decisions.

Expectations

The Electronic Sports League (ESL) is Europe’s largest online computer gaming league. Over 2.6 million registered members generate more than 100 million page impressions per month. Dealing with this amount of traffic requires an extremely stable IT infrastructure. The ESL was increasingly targeted by DDoS attacks: a distributed network of computers hammering our servers with thousands of requests (see Wikipedia: “Denial of service attack”). The goal of a DDoS attack is to make the target unavailable to its intended users, thereby causing economic loss.

In search of a solution to these DDoS attacks we tried a number of different approaches.

Requirements

The attacks we had to deal with were mostly simple SYN floods of between 80k and 500k packets/s. Our primary goal was to withstand these SYN attacks. The device should mitigate an attack as quickly as possible, and the learning curve and the amount of configuration should be manageable. The more types of attack it detects, the better. It should also not be attackable itself, so it should be able to operate in transparent mode.

As an alternative, there are various providers of proxy services: the IP to be protected is pointed at the provider’s proxy, and the provider is in charge of fending off the attacks. However, given the number of service IPs we offer and the amount of traffic we generate, this option was not feasible in our case.

First Steps

The firewall we had used so far was an HP DL380 with an additional Intel network card, running Debian. This hardware had massive problems handling the number of packets per second. Between 20k and 25k system interrupts led to si values in “top” between 90% and 100%, with ksoftirqd leading in CPU usage. The consequences were dropped packets and a slow, unresponsive website. A brief look at Google promised a solution to this problem.

I don’t want to go into detail below, but rather give a brief overview of the actions we took. All of them were taken to the best of our knowledge at the time, but we cannot rule out that a configuration mistake led to wrong results.

The first hit on Google was NAPI, which is designed to reduce the CPU load caused by high interrupt rates. We tried it out, but it had no effect on our problem.

Next we tried tuning the Intel network card driver. The InterruptThrottleRate parameter was especially interesting for us. Like NAPI, it is meant to optimize interrupt handling: it limits how many interrupts the card raises per second, which delays packets slightly and should lead to less CPU load. To our disappointment, this also had no effect in our tests.

Another approach was using SYN cookies to fend off at least SYN-based DDoS attacks. In this model the usual table of half-open TCP connections is no longer needed, so it cannot overflow. Instead, the sequence number is calculated each time a handshake takes place. This is a good option for the servers terminating the attacked IP, but it has no influence on the firewall routing those packets. So it would still be a case of simply too many packets.
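To make the idea more concrete, here is a minimal Python sketch of how a SYN cookie can be built and checked. It is a simplified illustration, not the Linux kernel’s implementation: the secret, the MSS table, the bit layout (5 bits of time counter, 3 bits of MSS index, 24 bits of keyed hash) and all function names are assumptions made for this example.

```python
import hashlib
import hmac

SECRET = b"example-secret"           # hypothetical server-side key
MSS_TABLE = [536, 1300, 1440, 1460]  # hypothetical table of supported MSS values


def _mac24(src, sport, dst, dport, counter):
    """Keyed 24-bit hash over the connection 4-tuple and a time counter."""
    msg = f"{src}:{sport}>{dst}:{dport}@{counter}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:3], "big")


def make_cookie(src, sport, dst, dport, mss_index, counter):
    """Pack time slot, MSS index and hash into a 32-bit sequence number."""
    return (
        ((counter & 0x1F) << 27)
        | ((mss_index & 0x07) << 24)
        | _mac24(src, sport, dst, dport, counter)
    )


def check_cookie(cookie, src, sport, dst, dport, now_counter):
    """Return the negotiated MSS if the cookie is valid, otherwise None."""
    slot = (cookie >> 27) & 0x1F
    mss_index = (cookie >> 24) & 0x07
    mac = cookie & 0xFFFFFF
    # Accept the current and the previous time slot only.
    for counter in (now_counter, now_counter - 1):
        if (counter & 0x1F) == slot and _mac24(src, sport, dst, dport, counter) == mac:
            return MSS_TABLE[mss_index] if mss_index < len(MSS_TABLE) else None
    return None
```

Because everything the server needs is packed into the sequence number, it keeps no per-connection state until the final ACK of the handshake arrives. The firewall in front of it still has to forward every single SYN.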

Further approaches such as a SYN proxy (not available under Linux) and iptables tuning did not lead to success either, so we were forced to search for a hardware solution. So what exactly were we looking for?

Hardware solution

Looking at the market for usable devices, you are promised that nearly every device is suitable for our situation. In order to form our own opinion beyond what the marketing would have us believe, we tried to reproduce the attacks in our test environment.

The test scenario

We set up 4 servers to reproduce the online scenario: 2 acting as attackers, 1 as web server and 1 as client.

Figure: network diagram of the test environment

       Web server                 Attacker                   Client
CPU    2x AMD Opteron @ 2.2GHz    2x Intel Xeon @ 3.20GHz    Intel Pentium 4 @ 3GHz
RAM    4GB                        4GB                        2GB
NIC    BCM5704 Gigabit            82540EM Gigabit            82573L Gigabit
IP     192.168.0.11               192.168.0.13               192.168.0.12
OS     Ubuntu lucid (10.04)       Ubuntu lucid (10.04)       Ubuntu lucid (10.04)

Software

  • Web server: Lighttpd, serving a simple HTML page containing a few pictures
  • Attacker: sudo hping3 192.168.0.11 --interface eth0 --flood --destport 80 --syn --rand-source --verbose
  • Client: curl-loader, constantly loading the static HTML page and 4 small pictures

Hardware

Let’s take a look at our nominees ;) We evaluated the following devices, in chronological order: Fortigate 310-B, Juniper SRX650 (routing mode), Palo Alto PA-2050, TopLayer IPS-1000E and RioRey RX2310U. Except for the Juniper, all devices are able to operate in transparent mode.

Fortigate 310-B

The Fortigate 310-B was recommended and made available for testing by a local computer retailer, who also supported us in configuring it so that the risk of misconfiguration would be minimized. The device offers a great many configuration options and can be categorized as an all-rounder. We especially liked the virtual firewalls feature: you can set up completely independent configurations for different scenarios and simply enable or disable them. For our main problem, the DDoS attacks, the Fortigate offers a set of special anti-DDoS policies which can be applied to each of the virtual firewalls. These policies again have a vast number of options you can adjust to your needs. The idea behind them is to gain control over DDoS attacks by limiting packet rates. Sadly, it emerged in our test scenario that the device pretty quickly ran into the same problems as our Linux firewall: CPU load rose to 100% and all further packets were dropped completely. Even with all packet inspection rules disabled it could not handle the volume of packets correctly, so we refrained from enabling further IPS functions.
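Limiting packet rates is often done with something like a token bucket. The following Python sketch shows the general principle only; we do not know how the Fortigate implements its policies, and the class name and parameters are made up for illustration.

```python
class TokenBucket:
    """Allow on average `rate` packets per second, with bursts of up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = float(rate)     # tokens added per second
        self.burst = float(burst)   # maximum number of stored tokens
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous packet

    def allow(self, now):
        """Return True if a packet arriving at time `now` (seconds) may pass."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # bucket empty: the packet is dropped
```

Note that a limiter like this still has to look at every packet before dropping it, so it does not help if the device cannot keep up with the raw packet rate in the first place.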

Juniper SRX650

The Juniper SRX650 is a classic Layer 3 firewall. It does not support transparent mode, which forced us to test it in routing mode. Besides rate limiting there are no special anti-DDoS policies that can be configured. Our tests quickly confirmed our presumption that this is simply not the device we were looking for. Just a few seconds after the attack began, the SRX650 buckled under the number of packets. The interface became completely unresponsive and needed about 5 minutes to return to normal behavior after the attack stopped. The next model up does support Layer 2 mode but exceeded our price range.

Palo Alto PA-2050

The PA-2050 from Palo Alto Networks also promised to solve our problem. We had direct support from the vendor, who was familiar with our test setup and should have led us quickly to the optimum configuration. We were surprised, however, to see the Palo Alto behave the same way the Juniper did. After a few seconds of attack traffic no further traffic was handled, and the client tried to access the test page to no avail. The Palo Alto crew tried its best, but under our circumstances we did not find a solution within our time frame. We still think, however, that this device, like the Juniper, is a good choice for other fields of application.
Attachments: our test report in detail; overview of traffic; overview of packets

TopLayer IPS-1000E

The IPS-1000E belongs to the class of devices specialized as Intrusion Prevention Systems (IPS). As we were being attacked at the time of evaluation, we decided to test it in the real production environment.

The attacks lasted 10 minutes. In the first minutes we were hardly reachable and incoming traffic was cut in half. After 2-5 minutes the situation became more stable and everything felt faster. After 10 minutes our monitoring system changed its state back to normal. The TopLayer solution is evidently not capable of protecting us from a DDoS attack completely: the firewall reported being overloaded at only 48k packets/s, and we had already seen attacks in the range of half a million packets per second. We believe further investigation and tuning could result in more effective protection, but because the TopLayer was far too expensive, we did not pursue this approach any further.

RioRey

The RioRey device is specialized in mitigating DDoS attacks, and only DDoS attacks. If you are searching for a firewall that also does routing and so on, this is not the right device for you.

It was the last in our test series and turned out to be the best.

At first, disappointed by the other tests, we did not expect much. A mitigation time of 90 seconds and the need for a Windows client to administer the device were not good signs.

After some time stuck in German customs, the device finally found its way to our office and we began installing it. Installation in a production environment was easy and involved no risk or downtime, because the default configuration is monitor mode: all attacks are recognized and reported as in filter mode, but no packets are dropped; traffic is just passed through. The device offers a WAN, a LAN and a MGMT interface. Once connected to the MGMT interface, you do the basic setup by browsing to the preconfigured IP over HTTPS.

Here you configure the basics such as IP address, syslog server, SNMP, passwords and so on. To get a detailed view of monitor and filter mode you need a Windows client with RioRey’s software “rVIEW” installed. You connect to the configured IP and get much more information and many more configuration options. So let’s start the test:

As you can see, it takes 90 seconds to analyze the traffic. After this time, about 90% of legitimate traffic is passed through and all illegitimate traffic is blocked. And this happened with zero configuration (except IP and password). You just put the RioRey in place, switch to filter mode and that’s it. Besides the really simple installation, the most important point is that it is the only device that actually lives up to its promise. So RioRey rightly call themselves “The DDoS specialist”.

After these tests we installed the device in our production environment, in direct communication with the RioRey tech team. They analyzed our traffic and suggested the optimum settings for our environment. What really impressed our team was the detailed analysis they provided of an attack we dealt with while the RioRey was active. It turned out that not only is the device itself of high quality but, even more important, so is the staff behind it.

After some weeks in operation, we found a few things that are not perfect at the moment. Many alerts are reported that seem to have had no effect, whether filtered or not. At this point we cannot really say which of those attacks are a real threat to our website, so some more tweaking is needed. Another point is the weak status log. It does not display a history of recent hardware events such as link up/down changes or power failures; you only see the last event that changed the state.

On the flip side, we also experienced many positive behaviors. You can reboot and update the device without downtime; the traffic is simply unfiltered during this time. The reports do a good job, so we can identify the duration of an attack and the attackers’ IP(s). All things considered, we are fully satisfied with this device. It just does what we expect of it. Here’s another attack defended by the RioRey in the production environment:

Links

Report on RioRey at tweakers.net

Own a MacBook Pro? Don’t use Google Chrome!

August 9th, 2010 by Stephan Maihöfer

It’s not really work related, but looking around the office I see a lot of co-workers using a MacBook. And since we do web development, I guess these same co-workers could or should be interested in browser interoperability, and maybe they have ended up with Google Chrome as their default browser. So here is the catch:

My MacBook constantly ran at very high temperatures, and most of the time the fan was running to cool it. Searching the internet, everyone seemed to say “it’s normal”, but no one said that about temperatures constantly above 80°C. So I set out to find the root cause and found that a bug causes Chrome to write something like 20 lines to system.log _per pageload_. This led to my system.log silently growing to about 40 GB, with Spotlight indexing the hell out of it while I was browsing the web… hence the heat. Here are the bugs:

  • http://code.google.com/p/chromium/issues/detail?id=43786
  • http://code.google.com/p/chromium/issues/detail?id=26621

So what is the fix? Using Safari instead of Chrome dropped the temperature on my MacBook by about 30°C to a smooth 50°C. There you go: Don’t use Google Chrome!

 

Functional test on uploads for a web service – a workaround

August 4th, 2010 by Franz Stelzer

Yesterday I was faced with a problem while implementing a functional test for a web service that handles a file upload.
At first the workflow seemed pretty easy, as it is documented in the Jobeet Tutorial:

$browser->get('/job/new')->
  with('request')->begin()->
    isParameter('module', 'job')->
    isParameter('action', 'new')->
  end()->

  click('Preview your job', array('job' => array(
    'company' => 'Sensio Labs',
    'url'     => 'http://www.sensio.com/',
    'logo'    => sfConfig::get('sf_upload_dir').'/jobs/sensio-labs.gif',
    // other parameters ...
  )));

First the form is requested and rendered from the job/new page, and then the test browser clicks on the submit button. The call of the click method is the crucial point here. The click method does some magic and attaches the file at the given absolute path to the request. This magic is not done in a normal post request. Trying to submit a file directly with the post method during a functional test will end up with empty file values in the action code.

The following will not work, although our web service requires it:

$browser->post('/job/new', array('job' => array('company' => 'Sensio Labs', 'logo' => sfConfig::get('sf_upload_dir').'/jobs/sensio-labs.gif')))->....

Now we are in a bad situation: the web service requires the post request, but the functional test only works with the click method.

Searching the web for a while revealed that I am not the only one with this problem. Similar topics and questions can be found in the forum, on Stack Overflow and on the user mailing list.

The workaround for this problem is as easy as it is ugly: we have to render a form for this functional test, but must make sure that the rendered form is not accessible in the production environment.

The action code of the route test/upload looks like this:

public function executeUpload(sfWebRequest $request)
{
  $this->form = new TestUploadForm();
  if($request->isMethod('post'))
  {
    // bind the posted values and the uploaded file to the form
    $this->form->bind($request->getParameter('upload'), $request->getFiles($this->form->getName()));

    if($this->form->isValid())
    {
      // do something with the form ...
      $values = $this->form->getValues();
      $file = $values['file'];
      $filepath = sfConfig::get('sf_data_dir').'/test/some.thing';
      $file->save($filepath);
    }
    else
    {
      // do some error handling ...
    }
  }
  else
  {
    // only render the form in the dev or test environment
    if(!in_array(sfConfig::get('sf_environment'), array('dev', 'test')))
    {
      throw new sfException('not enabled for this environment');
    }
  }
}

To keep the example simple, this action always saves the incoming file to the same location.

The template for the fake form looks like this (a standard template for rendering a form):

<?php echo form_tag($some_route, array('multipart' => true)); ?>
  <table>
    <?php echo $form; ?>
    <tr>
      <td colspan="2">
        <input type="submit" value="GO" />
      </td>
    </tr>
  </table>
</form>

After calling test/upload in your browser, you should now see the rendered form. The form is rendered in the test environment, too.
The functional test can now handle this form as if it were placed on a normal website.

The functional test code could look like this:

// $testfile must hold the absolute path of the file to upload (defined elsewhere)
$generatedFile = sfConfig::get('sf_data_dir').'/test/some.thing';

// cleanup
@unlink($generatedFile);

// do the fake request
$browser->
  get('/test/upload')->
    with('request')->begin()->
      isParameter('module', 'test')->
      isParameter('action', 'upload')->
    end()->

// do the real web service request
  click('GO', array('upload' => array('name' => 'foo', 'file' => $testfile)))->
    with('request')->begin()->
      isParameter('module', 'test')->
      isParameter('action', 'upload')->
    end()->

    with('response')->begin()->
      isStatusCode(200)->
    end();

// action should have saved the incoming file to the expected location
$this->assertTrue(file_exists($generatedFile));

This now works like the example from the Jobeet Tutorial. First we apply our workaround and have the fake form rendered by the get request. Afterwards we make the post request through the click call.
If everything went fine, the incoming file has been saved to the expected location.

I am not proud of this workaround, yet it is the only simple solution that does the trick without hacking the symfony core code.
If you have any other ideas or better solutions, please let me know!

Postgres ERROR: tuple already updated by self on deleting duplicate rows

July 1st, 2010 by Alexander Schöcke

We’ve had some serious hassle with duplicate row issues on some of our PostgreSQL databases.

While simply deleting the rows worked for most of the issues, some rows simply could not be deleted. Trying to delete or update them resulted in the curious error message: ERROR: tuple already updated by self.

The only way for us to delete these rows was to find a difference between them. Unfortunately, they were completely identical, including their oid values. After some searching on the internet, we found a solution:

SELECT ctid, *
FROM TABLE;

ctid is a system column that holds the physical location of the row within the table’s file on disk, as a pair of block number and item number. Unlike all the other fields, this id was different for the two rows. Thus we were able to delete one of them:

DELETE FROM TABLE
WHERE id = 123
AND ctid = '(2134,1234)';
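If there are many duplicates, picking the ctids by hand gets tedious. Here is a small Python sketch that takes (ctid, id) pairs, as a SELECT of ctid and id would return them, and works out which row versions to delete, keeping one per id. The function names and the choice to keep the lowest ctid are assumptions made for this illustration; the code only computes the list and does not talk to the database.

```python
from collections import defaultdict


def parse_ctid(ctid):
    """Turn a ctid string such as '(2134,1234)' into a (block, item) tuple."""
    block, item = ctid.strip("()").split(",")
    return int(block), int(item)


def ctids_to_delete(rows):
    """Given (ctid, id) pairs, return the (id, ctid) pairs of surplus duplicates.

    For every id that occurs more than once, the row with the lowest ctid
    is kept and all others are returned for deletion.
    """
    by_id = defaultdict(list)
    for ctid, row_id in rows:
        by_id[row_id].append(ctid)

    doomed = []
    for row_id, ctids in by_id.items():
        for ctid in sorted(ctids, key=parse_ctid)[1:]:
            doomed.append((row_id, ctid))
    return doomed
```

Each returned pair can then be fed into a DELETE with both the id and the ctid in the WHERE clause. Keep in mind that a row’s ctid changes when the row is updated or moved by VACUUM FULL, so read the ctids and delete within the same transaction.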

We hope this blog article can help out people who are having the same hard time as we had.

World Cup 2010 Half Time: Toilet or ESL?

June 24th, 2010 by Koalabaerchen

Graph shows “Users Online” on June 23rd, 2010.

When did mouse inversion die?

June 21st, 2010 by Alexander Schöcke

Ever had the discussion whether it is “correct” to invert your mouse axis in first person shooters or not? So did we, and we wanted to get a little more science out of it. So we made a little poll to see a) how many people were using mouse inversion, and b) how old these people were.

Here are the results:
a) NOT inverted won by far. There were 7517 people with mouse inversion disabled, only 1154 with mouse inversion enabled.
b) Mouse inversion is old-school:

Age Distribution of mouse inversion

Well, nothing new here then. Mouse inversion is old-school, mouse inversion is elite.
I’m using mouse inversion btw.

Evolution of Post-Its: Layout for work unit tickets in Scrum

June 16th, 2010 by Alexander Schöcke

Since the beginning of 2010 we have been using Scrum to manage our development projects.
Last month we were able to welcome members of other departments to the team – including IT Operations, Community Management and Marketing – for the first time.

The team decided to use Post-Its for their work units in order to easily visualize the progress they were achieving during the sprint.
This led, among other oddities, to tickets like these:

Evolution of Post-Its - Step 1

Can’t read anything? You have no clue what you would need to do to finish these work units? That’s what the Scrum team experienced as well.

In the Sprint Retrospective Meeting the team decided to improve the informative value and readability of the tickets by designing a layout for these pieces of paper.
It clearly states which information belongs in which position on the ticket, and led to a digitally available layout:

Evolution of Post-Its: Part 2

The “Notes/Dependencies” part has been defined as optional; it can be folded behind the main part of the ticket.

If this is of any help for you, you can download the Open Office Document here (Sorry, file is missing).

My personal opinion: great teamwork, enhancing productivity with a simple change of method after only a single sprint.

sfPHPUnit2Plugin version 0.9.0 is out

June 10th, 2010 by Franz Stelzer

The sfPHPUnit2Plugin is now available in version 0.9.0. The new release is the result of feedback I received on my older blog post and by mail, and I hope it contains all the requested things.

Two changes/enhancements deserve to be pointed out:
The older release had some trouble with symfony 1.2 projects, so a compatibility task has been added.
The biggest new feature is the support for Selenium tests, made possible by a great contribution from Richard Shank. Thanks Richard!

And here is the complete changelog:

  • added compatibility task for symfony 1.2
  • added selenium support (Special Thanks to Richard Shank!)
  • added experimental support for plugin tests
  • added possibility to customize skeleton template files
  • adapted to the latest changes in the lime_test class

The next big thing will be the support of plugin tests. The sfTaskExtraPlugin provides plugin tests for lime. I will have a deeper look into this mechanism and try to migrate it to this plugin.

Creating a <table/>-less tournament tree (Part #1)

May 12th, 2010 by Koalabaerchen

Everyone knows tournament trees. They are used in nearly every competitive event, for example in the finals of the 2010 FIFA World Cup.

Or this little example:

Tournament Tree

Rendering such a tree in a web browser can be quite a pain. There is always the problem of the vertical distance (red and purple in the graphic). How do you solve it without using tables, plugins like Flash or incredibly ugly hacks? Use CSS and JS!
To calculate the tree with good ol’ <div> tags you can simply use the following formula:

Tournament Formula

p0 = number of pairings in the first round
pn = number of pairings in the current round
h = height of a pairing in pixels
x = vertical distance unit in pixels

Example: We have a tournament with 16 contestants, as in the graphic above. We want to know the vertical alignment of the pairings in round 3.
With 16 contestants we have 8 pairings in the first round, so p0 is 8. In round three we have two pairings, so pn is 2. The height of a pairing in this example is 50 pixels: 40 for the grey box, plus 5 pixels of margin at top and bottom. So h is 50.

Now we have the distance. For round three the result x is 75 pixels. So every pairing in round 3 has a 75 pixel margin at top and bottom, and the distance between the pairings p1-p7 and p10-p15 is de facto twice the vertical distance unit, i.e. 150 pixels.
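The formula graphic is not reproduced in this text. One formula that fits the definitions and the worked example above is x = (p0 / pn - 1) * h / 2; treat it as an assumption rather than the original. As a small Python sketch (the function name is hypothetical):

```python
def vertical_margin(p0, pn, h):
    """Vertical distance unit x in pixels for a round with pn pairings.

    p0: number of pairings in the first round
    pn: number of pairings in the current round
    h:  height of one pairing in pixels (box plus its own margins)
    """
    # Each pairing in the current round spans p0 / pn first-round slots;
    # whatever is not covered by the pairing itself is split evenly
    # between top and bottom.
    return (p0 / pn - 1) * h / 2


# 16 contestants: 8, 4, 2 and 1 pairings in rounds 1 to 4
margins = [vertical_margin(8, pn, 50) for pn in (8, 4, 2, 1)]
```

For round 3 (pn = 2) this gives the 75 pixels from the example, and pairings in the first round get no extra margin at all.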

Next week I’ll show you how to draw the lines between the pairings of two consecutive rounds.

Starting erlang appmon from a Windows Client

March 19th, 2010 by Stephan Maihöfer
  • First of all, download and install Erlang for Windows.
  • Start Erlang like this:
    C:\Programme\erl5.7.5\bin\werl.exe -name myself@mypc -setcookie 1217983712938DKAJSHD
  • Make sure to set the correct cookie for the cluster you want to connect to.
  • In the Erlang shell, announce yourself to the cluster you want to connect to like this:
    net_adm:ping(list_to_atom("ejabberd@yourdomain.org")).
  • Check whether this worked by listing the nodes:
    nodes().
  • Now you can start appmon:
    appmon:start().
  • In the Nodes tab, choose the node you want to inspect.