Solution monitoring and data analytics with Brocade FC switches

Brocade Fibre Channel switches sit at the center of many SAN deployments. With such a critical piece of infrastructure, monitoring performance metrics becomes very important. Sensu is a popular open-source monitoring framework, so it makes perfect sense to monitor your switch with it!


Getting data out of FabricOS

FabricOS (the embedded operating system that runs Brocade switches) doesn’t offer much in the way of monitoring facilities aside from the (often disliked) SNMP protocol. It does, however, have a rich command-line interface exposed over SSH. This is what we’ll be using to get vital stats out of FabricOS.

SSH in Ruby

Running SSH commands and capturing their output from inside Ruby is made easy by the net/ssh gem. Unfortunately, you sometimes have to make do with whatever is available on the system, so we’ll concoct a fallback that resorts to invoking the OpenSSH client and scraping its output.

begin
  require 'net/ssh'

  SSHCommand = lambda do |host, user, password, command|
    lambda do
      Net::SSH.start host, user, password: password do |ssh|
        ssh.exec! command
      end
    end
  end
rescue LoadError
  require 'pty'
  require 'expect'

  SSHCommand = lambda do |host, user, password, command|
    lambda do
      cmd = ['ssh', '-o', 'PubkeyAuthentication=no',
                    '-o', 'NumberOfPasswordPrompts=1',
                    '-o', 'StrictHostkeyChecking=no',
                    "#{user}@#{host}",
                    '--',
                    command]
      out = ''
      PTY.spawn *cmd do |r, w, pid|
        r.expect /password/i
        w.write password
        w.write "\n"
        r.expect "\n"

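        begin
          until r.eof?
            out << r.readline
          end
        rescue Errno::EIO
          # On Linux, reading from the PTY after the child has exited
          # raises EIO; treat it as the end of the output.
        end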
      end
      out
    end
  end
end

Now this can be used (in tandem with a timeout wrapper to ensure we don’t get jammed) to define parameterless functions that execute commands remotely and return their output!

require 'timeout'

WithTimeout = lambda do |t, cmd|
  lambda do
    Timeout.timeout t do
      cmd.call
    end
  end
end

command = SSHCommand.call 'example.com', 'admin', 'password', 'uname'
timedout_command = WithTimeout.call 15, command
timedout_command.call # returns the command output, e.g. "Linux\n" if the target is Linux!
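
The wrapper raises Timeout::Error when the deadline passes, so whoever calls the wrapped command has to decide what to do about it. Here is a minimal sketch of that behaviour using local lambdas in place of real SSH commands. The names with_timeout, run_or_nil, FAST_COMMAND and SLOW_COMMAND are made up for this illustration.

```ruby
require 'timeout'

# Same shape as the WithTimeout wrapper, written with the explicit
# Timeout.timeout module method from Ruby's standard library.
def with_timeout(seconds, cmd)
  lambda do
    Timeout.timeout(seconds) { cmd.call }
  end
end

# Hypothetical stand-ins for remote commands (no network involved).
FAST_COMMAND = -> { "ok\n" }
SLOW_COMMAND = -> { sleep 5; "never returned\n" }

# Returns the command output, or nil if the command timed out.
def run_or_nil(timed_command)
  timed_command.call
rescue Timeout::Error
  warn 'command timed out'
  nil
end

p run_or_nil(with_timeout(1, FAST_COMMAND))   # prints "ok\n"
p run_or_nil(with_timeout(0.1, SLOW_COMMAND)) # prints nil
```

Returning nil is only one option; a monitoring plugin might prefer to let the exception propagate so the failed probe is visible.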

FabricOS commands

With FabricOS, a good amount of information can be gleaned from three commands:

switchShow

switchshow
switchName:	brocade-san
switchType:	120.3
switchState:	Online   
switchMode:	Native
switchRole:	Principal
switchDomain:	1
switchId:	fffc01
switchWwn:	10:00:01:21:f8:50:4a:48
zoning:		ON (Switch_cfg001)
switchBeacon:	OFF
FC Router:	OFF
Allow XISL Use:	OFF
LS Attributes:	[FID: 128, Base Switch: No, Default Switch: Yes, Address Mode 0]

Index Slot Port Address Media  Speed        State    Proto
============================================================
  16    2    0   011000   id    N8	   No_Light    FC  
  17    2    1   011100   id    N8	   No_Light    FC  
  18    2    2   011200   id    N8	   Online      FC  F-Port  50:03:08:02:c0:88:b0:20 
  19    2    3   011300   id    N8	   Online      FC  F-Port  50:03:08:02:c0:88:b0:1a 
  20    2    4   011400   id    N8	   Online      FC  F-Port  21:00:00:86:10:03:c5:3e 
  21    2    5   011500   id    N8	   Online      FC  F-Port  50:03:08:02:c0:88:b0:44 
  22    2    6   011600   id    N8	   No_Light    FC  
.....
 306    4   34   01bac0   id    N8	   Online      FC  E-Port  10:00:00:27:f0:1e:a4:37 "backup-SAN" 
 307    4   35   01bbc0   id    N8	   Online      FC  E-Port  10:00:00:27:f0:1e:a4:37 "backup-SAN" (downstream)
....

This gives a high-level overview of what the switch is up to. We’re only after general information about our ports here, such as whether something is actually plugged in and what the speed is.
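
To make the port table concrete, here is a rough sketch that pulls the columns out of individual lines. It assumes the chassis-style layout shown above (with a Slot column); other models or FabricOS versions may print different columns. PORT_LINE, parse_port_line and the field names are my own labels, not anything FabricOS defines.

```ruby
# Hypothetical helper for illustration only.
# Columns: Index Slot Port Address Media Speed State Proto [type WWN]
PORT_LINE = /\A(\d+)\s+(\d+)\s+(\d+)\s+(\h+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)(?:\s+(\S+-Port)\s+((?:\h{2}:){7}\h{2}))?/

def parse_port_line(line)
  m = PORT_LINE.match(line.strip)
  return nil unless m # header, separator, blank or elided line

  { index: m[1], slot: m[2], number: m[3], address: m[4], media: m[5],
    speed: m[6], state: m[7], proto: m[8], port_type: m[9], wwn: m[10] }
end

# A few lines in the format of the sample output above.
SAMPLE_LINES = [
  '  16    2    0   011000   id    N8    No_Light    FC',
  '  18    2    2   011200   id    N8    Online      FC  F-Port  50:03:08:02:c0:88:b0:20',
  '.....'
].freeze

PORTS = SAMPLE_LINES.map { |line| parse_port_line(line) }.compact
ONLINE_COUNT = PORTS.count { |port| port[:state] == 'Online' }

puts "#{ONLINE_COUNT} of #{PORTS.size} ports online" # prints "1 of 2 ports online"
```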

portPerfShow

This command lets us get at instantaneous (sampled) throughput for all the ports on the switch.

      0      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15  
========================================================================================================================
slot 2:    0      0    239.1m   0    239.1m   0      0      0     12.4m   7.3k   0      0      6.8k   6.7k   0      1.1m

     16     17     18     19     20     21     22     23     24     25     26     27     28     29     30     31  
========================================================================================================================
slot 2:    0      3.8k   0      0      0      0     55.5m   0     44.5m   7.3m   9.5m  14.7m   2.6m   7.2m 129.7m   0   

     32     33     34     35     36     37     38     39     40     41     42     43     44     45     46     47   Total
================================================================================================================================
slot 2:    0      0      0      0      0      0      0      0      0      0      0      0      0    117.9m  55.5m  14.2m 950.8m

      0      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15  
========================================================================================================================
slot 3:    0      0      0      0    239.0m   0      0      0      0      0    700      0      0      7.2k   0      0   

     16     17     18     19     20     21     22     23     24     25     26     27     28     29     30     31  
========================================================================================================================
slot 3:    0      0      0      0      0      0      0      0      8.4m  24.7m   3.5m  24.6m   2.1m   7.1m 129.7m   0   

     32     33     34     35     36     37     38     39     40     41     42     43     44     45     46     47   Total
================================================================================================================================
slot 3:    0      0      0      0      0      0      0      0      0      0      0      0      0    117.9m  48.5m   0    606.0m
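
The values are abbreviated with k, m and g suffixes, so they need expanding before they can be used as numbers. Below is a small sketch of that conversion, assuming the suffixes are decimal multipliers (10^3, 10^6, 10^9). That assumption is mine; check the FabricOS documentation for your version. expand_abbr and SUFFIX_EXPONENT are names invented for this example.

```ruby
# Assumed decimal multipliers for the suffixes seen in the output.
SUFFIX_EXPONENT = { nil => 0, 'k' => 3, 'm' => 6, 'g' => 9 }.freeze

# Expands strings such as "239.1m" or "700" to an integer.
# Returns nil for anything that does not look like a number.
def expand_abbr(text)
  m = text.strip.downcase.match(/\A(\d+(?:\.\d+)?)([kmg])?\z/)
  return nil unless m

  (m[1].to_f * 10**SUFFIX_EXPONENT[m[2]]).round
end

p expand_abbr('239.1m') # prints 239100000
p expand_abbr('7.3k')   # prints 7300
p expand_abbr('700')    # prints 700
p expand_abbr('Total')  # prints nil
```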

portErrShow

This command shows the per-port counters (frames transmitted/received, number of CRC errors, and so on) accumulated since the last reset. Since these are ever-increasing (save for the occasional reset), we’ll have to store intermediate data to be able to get the actual numbers for a particular sampling interval.

          frames      enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy    c3timeout    pcs
       tx     rx      in    err    g_eof  shrt   long   eof     out   c3    fail    sync   sig                   tx    rx     err
 16:    0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0   
 17:    0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0   
 18:    2.1g   2.8g   0      0      0      0      0      0      1     16      5      0     10      0      0      0      0      0   
 19:  420.2m 546.2m   0      0      0      0      0      0      0     16      5      0      9      0      0      0      0      0   
 20:    2.3g   1.6g   0      0      0      0      0      0      0      0      4      0      4      0      0      0      0      0   
 21:    1.3g   3.0g   0      0      0      0      0      0      0     16      5      0     10      0      0      0      0      0   
 22:    0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0   
 23:    0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0   
 24:    4.2g   2.5g   0      0      0      0      0      0     22.0k   0     40      0     40      0      0      0      0      0   
 25:  130.0m  67.7m   0      0      0      0      0      0      1.9k   0     18      0     18      0      0      0      0      0   
 26:   51.2m  25.4m   0      0      0      0      0      0      0      0      2      0      2      0      0      0      0      0   
 27:    0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0   
 28:    1.4g 519.7m   0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0   
 29:    1.7g 362.8m   0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0   
 30:    2.1g   2.0g   0      0      0      0      0      0    845      0      7      0      7      0      0      0      0      0   
 31:  530.7m   3.2g   0      0      0      0      0      0    141      0      1      0      1      0      0      0      0      0   
 32:    0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0   
 33:    0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0   
 34:    1.8g   2.5g   0      0      0      0      0      0      0     16      5      0      9      0      0      0      0      0   
 35:  238.4m   3.7g   0      0      0      0      0      0      0     16      5      0      9      0      0      0      0      0   
...
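
The per-interval numbers come from subtracting the previous sample from the current one. Here is a minimal sketch of that step, assuming both samples have already been parsed into hashes keyed by port index. It also treats a negative difference as a counter reset, which is one possible policy and not necessarily what you want. counter_deltas is a hypothetical helper.

```ruby
# prev and curr map a port index to a hash of counter name => value.
def counter_deltas(prev, curr)
  curr.each_with_object({}) do |(index, counters), deltas|
    before = prev.fetch(index, {})
    deltas[index] = counters.each_with_object({}) do |(name, value), d|
      diff = value - before.fetch(name, 0)
      # A negative difference means the counter was reset in between;
      # the current value is then everything counted since the reset.
      d[name] = diff.negative? ? value : diff
    end
  end
end

previous = { '18' => { crc_err: 3, link_fail: 5 } }
current  = { '18' => { crc_err: 7, link_fail: 2 }, '19' => { crc_err: 1 } }

p counter_deltas(previous, current)
```

Keep in mind that the counters are printed in abbreviated form (2.1g, 420.2m), so small changes in a large counter will not show up in the difference.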

Output parsing

Parsing data meant for human consumption is usually not pretty, and this case is no exception.

require 'fileutils'

class FabricOSStats
  attr_accessor :switchshow_cmd, :porterrshow_cmd, :portperfshow_cmd, :scratch_file_path

  ERROR_FIELDS = [
    :frames_tx,
    :frames_rx,
    :enc_in,
    :crc_err,
    :crc_g_eof,
    :too_shrt,
    :too_long,
    :bad_eof,
    :enc_out,
    :disc_c3,
    :link_fail,
    :loss_sync,
    :loss_sig,
    :frjt,
    :fbsy,
    :c3_timeout_tx,
    :c3_timeout_rx,
    :pcs_err
  ]

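  # Collects one sample: merges the switchshow, porterrshow and
  # portperfshow output into one hash per port, then saves the raw
  # porterrshow output so the next run can compute counter deltas.
  def pop_stats
    switchshow_ports = self.switchshow_ports
    by_index = Hash[switchshow_ports.map { |p| [p[:index], p] }]
    by_slot_port = Hash[switchshow_ports.map { |p| [[p[:slot], p[:number]], p] }]
    porterrshow_output = porterrshow_cmd.call
    porterrshow_ports(porterrshow_output).each do |port|
      target = by_index[port[:index]]
      target.merge! port if target # skip ports switchshow did not list
    end
    prev_porterrshow_ports.each do |port|
      target = by_index[port[:index]]
      next unless target
      # On duplicate keys, curr is the fresh value and prev the saved one.
      target.merge! port do |key, curr, prev|
        if ERROR_FIELDS.include? key
          curr - prev
        else
          curr
        end
      end
    end
    portperfshow_ports.each do |port|
      target = by_slot_port[[port[:slot], port[:number]]]
      target.merge! port if target
    end
    FileUtils.mkdir_p File.dirname scratch_file_path
    File.write(scratch_file_path, porterrshow_output)
    switchshow_ports
  end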

  def switchshow_ports
    result = switchshow_cmd.call
    FabricOSStats.parse_switchshow result.lines
  end

  def porterrshow_ports input
    FabricOSStats.parse_porterrshow input.lines
  end

  def prev_porterrshow_ports
    FabricOSStats.parse_porterrshow File.read(scratch_file_path).lines
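  rescue StandardError => e
    # Usually just the first run, before any scratch file exists.
    warn "could not read previous counters from #{scratch_file_path}: #{e.message}"
    []
  end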

  def portperfshow_ports
    result = portperfshow_cmd.call
    FabricOSStats.parse_portperfshow result.lines
  end

  class << self
    def parse_switchshow lines
      ports = []
      lines.drop_while do |line|
        not line.strip.match /^=+$/
      end.drop(1).each do |line|
        m = line.strip.match /^(\d+)\s+(\d+)\s+(\d+)\s+[^\s]+\s+[^\s]+\s+([^\s]+)\s+([^\s]+)/
        next unless m # skip blank or otherwise unparseable lines
        ports << {index: m[1], slot: m[2], number: m[3],
                  speed: m[4], state: m[5]}
      end
      ports
    end

    def parse_porterrshow lines
      ports = []
      lines.drop_while do |line|
        not line.strip.match /^\d+:/
      end.each do |line|
        next unless line.strip.match /^\d+:/ # skip blank or trailing non-data lines
        cols = line.strip.split /\s+/
        port = {index: cols[0].split(':')[0]}
        cols[1..-1].each_with_index do |s, i|
          port[ERROR_FIELDS[i]] = parse_abbr_num s if ERROR_FIELDS[i]
        end
        ports << port
      end
      ports
    end

    def parse_portperfshow lines
      ports = []
      partially_ready_ports = []
      lines.each do |line|
        if line.strip.match /^=*$/
          next
        elsif line.match /^\s*slot/
          line = line.strip.sub(/^slot\s*/, '')
          slot, *vals = line.split(/\s+/)
          slot.sub!(':', '')
          partially_ready_ports.zip(vals).each do |port, val|
            port[:slot] = slot
            port[:throughput] = parse_abbr_num val
          end
        else
          ports += partially_ready_ports
          partially_ready_ports = line.strip.split(/\s+/).map { |port_num| {number: port_num} }
        end
      end
      (ports + partially_ready_ports).reject { |p| p[:number].downcase == 'total'}
    end

    def parse_abbr_num s
      match = s.strip.downcase.match /^(\d+(?:[.]\d+)?)([[:alpha:]])?$/
      # round rather than truncate, so float error cannot shave off a unit
      (match[1].to_f * (10 ** {nil => 0, 'k' => 3, 'm' => 6, 'g' => 9}[match[2]])).round
    end
  end
end

We define a class that’s responsible for invoking commands, parsing their outputs and storing intermediate state. Now we can get JSON-serializable documents, each representing a port, with a sample of its throughput and its error counts since the last probe.

getter = FabricOSStats.new
getter.porterrshow_cmd = WithTimeout.call(15, SSHCommand.call(host, user, password, 'porterrshow'))
getter.portperfshow_cmd = WithTimeout.call(15, SSHCommand.call(host, user, password, 'portperfshow -t 0'))
getter.switchshow_cmd = WithTimeout.call(15, SSHCommand.call(host, user, password, 'switchshow'))
getter.scratch_file_path = './scratch'

getter.pop_stats # returns an array of hashes representing individual switch ports

Bringing it together with Sensu

At this point the code can be packaged up as a Sensu metric plugin. Subclassing Sensu::Plugin::Metric::CLI::Graphite adds an output method to the plugin class. This method is invoked with a key, a value and a timestamp, and simply prints the space-delimited triple.

require 'sensu-plugin/metric/cli'

class FabricOS < Sensu::Plugin::Metric::CLI::Graphite
  option :host,
    description: 'ssh host',
    long: '--host HOST'

  option :user,
    description: 'ssh user',
    long: '--user USER'

  option :password,
    description: 'ssh password',
    long: '--password PASSWORD'

  option :scheme,
    description: 'metric naming scheme',
    long: '--scheme SCHEME'

  option :scratch_file,
    description: 'file where state is saved between runs',
    long: '--scratch SCRATCH',
    default: '/var/lib/fabricos-metrics/scratch'

  def run
    # Exit with an explicit status instead of returning silently.
    unknown 'no host given (--host)' unless config[:host]
    getter = FabricOSStats.new
    getter.porterrshow_cmd = WithTimeout.call(15, SSHCommand.call(config[:host], config[:user], config[:password], 'porterrshow'))
    getter.portperfshow_cmd = WithTimeout.call(15, SSHCommand.call(config[:host], config[:user], config[:password], 'portperfshow -t 0'))
    getter.switchshow_cmd = WithTimeout.call(15, SSHCommand.call(config[:host], config[:user], config[:password], 'switchshow'))
    getter.scratch_file_path = config[:scratch_file]

    stats = getter.pop_stats

    stamp = Time.now.to_i
    stats.each do |stat|
      output_sensu_metrics stat, stamp
    end
    ok
  end

  def output_sensu_metrics stat, stamp
    id = "#{stat[:slot]}-#{stat[:number]}"
    stat.each do |key, value|
      next unless value.is_a? Numeric
      output "#{config[:scheme] || 'brocade.' + config[:host].gsub('.', '-')}.fabricos-stats.#{id}.#{key}", value, stamp
    end
  end
end
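
To see the naming scheme on its own, without Sensu installed, here is a rough sketch that builds the same lines as plain strings. graphite_lines is a hypothetical helper, and the host name, sample values and timestamp are made up.

```ruby
# Builds one "key value timestamp" line per numeric field of a port hash.
def graphite_lines(stat, host:, scheme: nil, stamp: Time.now.to_i)
  prefix = scheme || "brocade.#{host.gsub('.', '-')}"
  id = "#{stat[:slot]}-#{stat[:number]}"
  stat.select { |_key, value| value.is_a?(Numeric) }.map do |key, value|
    "#{prefix}.fabricos-stats.#{id}.#{key} #{value} #{stamp}"
  end
end

# Made-up port document in the shape returned by pop_stats.
sample_stat = { index: '18', slot: '2', number: '2', speed: 'N8',
                state: 'Online', throughput: 239_100_000, crc_err: 0 }

puts graphite_lines(sample_stat, host: 'san1.example.com', stamp: 1_400_000_000)
# prints:
# brocade.san1-example-com.fabricos-stats.2-2.throughput 239100000 1400000000
# brocade.san1-example-com.fabricos-stats.2-2.crc_err 0 1400000000
```

String fields such as the speed and state are skipped, since Graphite only stores numeric values.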

Sending log lines to Logstash

Because the parsed data is basically JSON already, it is just as simple to plumb the output directly into Logstash. For instance, if you have configured Logstash with a UDP input:

require 'json'
require 'socket'
 
socket = UDPSocket.new

getter.pop_stats.each do |stat|
  # UDPSocket#send takes the host before the port
  socket.send JSON.dump(stat), 0, logstash_host, logstash_port
end
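
If you want to try the sending side without a Logstash instance at hand, a local UDP socket can stand in for the input. The sketch below sends one document to a socket bound on the loopback interface and parses what arrives. send_stat, with_local_receiver and roundtrip are hypothetical helpers and the port document is made up.

```ruby
require 'json'
require 'socket'

# Sends one port document as a single UDP datagram to host:port.
def send_stat(socket, stat, host, port)
  socket.send(JSON.dump(stat), 0, host, port)
end

# Stand-in for a Logstash UDP input: a local socket on an ephemeral port.
def with_local_receiver
  receiver = UDPSocket.new
  receiver.bind('127.0.0.1', 0)
  yield receiver, receiver.local_address.ip_port
ensure
  receiver.close if receiver
end

# Sends stat to the local receiver and returns the parsed datagram,
# or nil if nothing arrives within one second.
def roundtrip(stat)
  with_local_receiver do |receiver, port|
    sender = UDPSocket.new
    send_stat(sender, stat, '127.0.0.1', port)
    sender.close
    return nil unless IO.select([receiver], nil, nil, 1)

    payload, _sender_addr = receiver.recvfrom(65_535)
    JSON.parse(payload, symbolize_names: true)
  end
end

sample_stat = { index: '18', slot: '2', number: '2', speed: 'N8',
                state: 'Online', throughput: 239_100_000 }
p roundtrip(sample_stat)
```

Each port becomes its own datagram, and UDP gives no delivery guarantee, so an occasional lost sample is possible with this transport.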

Visualizing with Kibana

Now that our logs are plumbed into Elasticsearch, we can visualize them with Kibana. For instance, to visualize the throughput of our ports, we simply query for index: * AND slot: * AND number: * and set the Y axis to an average of the throughput field and the X axis to a date histogram on the timestamp field.


We can then split up the chart by slot using a split chart, and by port number using a nested split lines aggregation.
