Customer Meeting (3/26/08, 2:02 PM)

Late

  • Hubert (5 mins)
  • Dave (7 mins)

View Client

  • Latency over time
  • Sliding time window that keeps relevant data on the screen (see the sketch after this list)
  • Currently running on random data (our dataset is too small to graph well)
  • The size of the sliding window can be changed
  • Working to make the time-of-day label more useful
  • RD4J restrictions require creating a database for each server being viewed, but the data can be sent from the manager instead
  • Currently up to 25 kB if stored on the client machine
  • Jacy: all those machines have 2 GB of RAM, so not worried about space. Also, there is a use case for keeping old data around (pulling in ad hoc data), so it wouldn’t simply be replaced and the 25 kB limit would be exceeded
  • Brad: may be able to save processing time by tying granularity of data to the size of the window
  • Jacy: we need to know how the system load will scale (will a second graph double the load, or is most of the load overhead?); we cannot overload the client. Test with at least 20 server graphs.
  • Can we trick RD4J into not writing to the disk?
  • This component would fit into the main UI by coming up when a particular server is clicked in the dashboard; the time window could be changed via a slider or button. Should also see whether the sliding-window view could be displayed in the dashboard view itself (by setting the shape to an image); this may make sense for intra-server correlation
  • Aggregation currently occurs on the view manager, with the data pulled from disk
  • Brad: hopes to have something from the view server to show by next week
  • Matt: promises to absolutely, definitely have live data running through an entire connected system (across Mule) by next week
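
A minimal sketch, assuming Java on the view client, of how the sliding window could keep only relevant data in memory (class and method names are illustrative, not from the actual code): samples older than the window are evicted as new ones arrive, and the UI slider or button would call setWindowMillis to resize it.

    import java.util.ArrayDeque;
    import java.util.Deque;

    /** Illustrative sliding-window buffer; evicts samples older than the window. */
    public class SlidingWindow {
        /** One latency measurement with its wall-clock timestamp. */
        public static class Sample {
            final long timestampMillis;
            final double latencyMillis;
            Sample(long timestampMillis, double latencyMillis) {
                this.timestampMillis = timestampMillis;
                this.latencyMillis = latencyMillis;
            }
        }

        private final Deque<Sample> samples = new ArrayDeque<Sample>();
        private long windowMillis;

        public SlidingWindow(long windowMillis) {
            this.windowMillis = windowMillis;
        }

        /** Add a new sample and drop everything that has aged out of the window. */
        public synchronized void add(Sample s) {
            samples.addLast(s);
            evictOlderThan(s.timestampMillis - windowMillis);
        }

        /** Called when the user resizes the window via the slider or button. */
        public synchronized void setWindowMillis(long windowMillis) {
            this.windowMillis = windowMillis;
            if (!samples.isEmpty()) {
                evictOlderThan(samples.getLast().timestampMillis - windowMillis);
            }
        }

        private void evictOlderThan(long cutoff) {
            while (!samples.isEmpty() && samples.getFirst().timestampMillis < cutoff) {
                samples.removeFirst();
            }
        }
    }

Brad’s suggestion of tying granularity to window size would then amount to downsampling this buffer before plotting, rather than drawing every sample.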

Recoverability

  • Camp 1: the parser tags messages (it must be taken offline and restarted to change the rules)
  • Camp 2: the parser passes rules along to the data client, and the data client tags messages (modifiable at runtime)
  • Rules form a tree and can be combined with logical connectives (see the first sketch after this list)
  • Jacy: all the systems have different life cycles, so we don’t want to force everything to die in order to update the parser
  • Chelsea: by the same token, a parser could easily be taken offline and then be brought back up with changes without affecting the message flow
  • Jacy: don’t want to focus too much on making the parser easy to write if it causes us headaches
  • Brad: it probably makes life easier to make the parser easy to write and tag messages on its own, because it makes the data client easier to write (which is our module)
  • Each piece must pass along some knowledge of itself to the piece that is able to recover it
  • Jacy: it seems to couple the pieces too tightly; this is only worthwhile if it follows a standard API. Look at FIX’s recoverability protocol as an example and emulate the uniformity and isolation it uses
  • Jacy: the JPM standard generally states that handshaking upon startup determines what action must be taken to recover. The actual system restart is performed manually by Operators (the run book describes how to bring systems back up). Some systems continually ping their components and send misses along to scripts that can restart them; don’t focus on this. It may be worthwhile to do whatever is necessary to keep the Controller running (or start it); an external script would actually restart it, but we must write each piece so that it automatically restores its last state (see the state-restoration sketch after this list)
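
A minimal sketch of the rule tree in Java (all names illustrative; the real parser and data-client interfaces may differ): leaf rules match message content, and And/Or/Not nodes combine them with logical connectives, so under the Camp 2 design a tree built at runtime could be handed to the data client.

    /** Illustrative rule tree: leaves match messages, inner nodes combine them. */
    interface Rule {
        boolean matches(String message);
    }

    /** Leaf: true if the message contains a token. */
    class ContainsRule implements Rule {
        private final String token;
        ContainsRule(String token) { this.token = token; }
        public boolean matches(String m) { return m.contains(token); }
    }

    /** Logical connectives as inner nodes of the tree. */
    class AndRule implements Rule {
        private final Rule left, right;
        AndRule(Rule left, Rule right) { this.left = left; this.right = right; }
        public boolean matches(String m) { return left.matches(m) && right.matches(m); }
    }

    class OrRule implements Rule {
        private final Rule left, right;
        OrRule(Rule left, Rule right) { this.left = left; this.right = right; }
        public boolean matches(String m) { return left.matches(m) || right.matches(m); }
    }

    class NotRule implements Rule {
        private final Rule inner;
        NotRule(Rule inner) { this.inner = inner; }
        public boolean matches(String m) { return !inner.matches(m); }
    }

For example (tokens made up), new AndRule(new ContainsRule("ORDER"), new NotRule(new ContainsRule("TEST"))) would tag order messages that are not test traffic; because the tree is plain data, it can be replaced at runtime without restarting anything.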
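
And a minimal sketch of the restore-last-state behavior, assuming each piece checkpoints a small properties file (the file name and keys are placeholders): state is written out on every significant change, and when the external script restarts the process, the constructor reloads whatever was last saved.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.Properties;

    /** Illustrative checkpoint/restore sketch for a recoverable piece. */
    public class RecoverableComponent {
        private final File stateFile;
        private final Properties state = new Properties();

        /** On startup, reload the last checkpoint if one exists. */
        public RecoverableComponent(File stateFile) throws IOException {
            this.stateFile = stateFile;
            if (stateFile.exists()) {
                FileInputStream in = new FileInputStream(stateFile);
                try {
                    state.load(in);
                } finally {
                    in.close();
                }
            }
        }

        /** Record a piece of state and checkpoint it immediately. */
        public synchronized void checkpoint(String key, String value) throws IOException {
            state.setProperty(key, value);
            FileOutputStream out = new FileOutputStream(stateFile);
            try {
                state.store(out, "last known state");
            } finally {
                out.close();
            }
        }
    }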

Correlation

  • Concurrency design relies on a dynamically sized thread pool (see the sketch after this list)
  • We’ve shifted to thinking of the system as always being in Learning Mode
  • Administrators should be able to explicitly remove an edge and short-circuit the effort to correlate along that edge
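
A minimal sketch of the dynamically sized pool using java.util.concurrent.ThreadPoolExecutor (the sizes are placeholders, not tuned values): with a SynchronousQueue the pool spawns a new thread, up to the maximum, whenever all current threads are busy, and idle threads above the core are reclaimed after the keep-alive expires, so the pool size tracks the correlation load.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.SynchronousQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class CorrelationPool {
        /** Builds a pool that grows under load and shrinks when idle. */
        public static ExecutorService create() {
            return new ThreadPoolExecutor(
                    2,                        // core threads kept alive (placeholder)
                    16,                       // hard cap on threads (placeholder)
                    60L, TimeUnit.SECONDS,    // idle threads above the core die off
                    new SynchronousQueue<Runnable>());
        }
    }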

Mule

  • Currently at the end of the design phase and running into startup bugs
  • Still have questions about dynamic endpoints and TCP (Jacy hasn’t been able to get in contact with JPM Mule users yet but is trying)

Questions

  • Matt: how does logging in JPM work?
  • Jacy: each system has its own log or set of logs configured to send alerts to Operate; a set of patterns is then set up for Operate to handle. We should create a log for the Controller and provide this set of rules (see the sketch after this list)
  • Matt: is it appropriate to log other modules?
  • Jacy: at appropriate levels, yes
  • Matt: do we need to provide an implementation of “go back and correlate old messages” behavior?
  • Jacy: given the timeframe, more worried about preserving the ability to add that later
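
A minimal sketch of the Controller log using java.util.logging (our assumption; the file name and levels are placeholders): a dedicated file handler whose WARNING-and-above lines would become the pattern set handed to Operate.

    import java.io.IOException;
    import java.util.logging.FileHandler;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import java.util.logging.SimpleFormatter;

    public class ControllerLog {
        /** Creates a dedicated log file for the Controller. */
        public static Logger create() throws IOException {
            Logger log = Logger.getLogger("controller");
            FileHandler handler = new FileHandler("controller.log", true); // append
            handler.setFormatter(new SimpleFormatter());
            handler.setLevel(Level.INFO);  // INFO+ to file; Operate patterns would match WARNING+
            log.addHandler(handler);
            log.setLevel(Level.INFO);
            return log;
        }
    }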

Concerns

  • We’re still at the proof-of-concept level
  • No individual components are running, and certainly not the whole system end-to-end
  • Integration needs to happen more frequently

Want to see everything working together by next week, plus a discussion of view client performance