Next gen Log Management is no longer a visionary concept – Network World Blog

Jon Oltsik of Network World wrote an insightful blog entry on “Next Generation Log Management” and highlighted the need for traditional SIEM / logging vendors to go beyond conventional log management capabilities to enable real actionable intelligence and effective forensics.

Jon specifically speaks about adding these features into SIEM products to make data more useful:

1. consolidation of logs and network flows
2. adding automatic geo location awareness into the correlation
3. providing deeper granular visibility and visual tools

Putting log and flow data in context is critical, and how vendors do this varies greatly.  Moreover, how easy it is to access, search/analyze and retain the raw and correlated data is where the rubber meets the road.

Regarding location awareness features, it’s also important to show the location of not just the external IPs, but also to track the user and location of internal IP addresses. In a larger sense, user identity and location involves associating a network identity (e.g. IP address, MAC address) with a user identity (e.g. user name, computer name, domain) and a location (e.g. wired switch port, wireless LAN controller or VPN gateway).  SIEMs should auto-resolve true identity, not just log-reported identity. For external IPs, the AccelOps solution also includes lookups from the SANS Internet Storm Center, Cisco SenderBase reputation, and the Honeypot database.

Visibility and granularity (and analytics) must evolve beyond alerts and reams of syslogs so that the infosec professional can be more efficient and effective – reducing the time and effort to obtain and analyze data from many data sources and IT functional domains.

In addition to the above features, to provide a true single pane of glass, a next generation product should consolidate and correlate security/log/netflow events with performance, availability, virtualization and configuration change metrics and events (not to mention having the means to prioritize based on business impact beyond event severity).

It is time to go beyond conventional SIEM/logging – breaking away from more silo’d tools – by having a unified platform and console that empowers security professionals (and the IT organization) to eliminate extraneous operational noise, resolve problems faster, conduct investigations more efficiently, enable better collaboration, and support SLAs.

From a larger perspective, it’s time to consolidate the NOCs with SOCs and provide a true integrated monitoring tool for data centers and IT organizations.

In previous blogs, we have stressed the importance of security and network operations converging, along with the need to move away from the siloed approach that current SIEM tools have taken up to this point.

Security Operations and IT Services, Competition or Cooperation?

In other words…I got it, you take it!

Lately I’ve seen many customers struggling with how to spend their very limited IT budgets.  Everyone says Security is top of mind, but since security tools are often looked at as an insurance policy, and appear to do little to help IT satisfy its SLAs or align with the company’s objectives (customer acquisition/retention, improved product margins), it is a tough decision to spend hundreds of thousands of dollars on a tool that will only help one aspect of the organization.

What most IT Directors are telling me is that they have “many tools, but no big picture”. They don’t know how to streamline their operation (do more with less) while making it more “user friendly” and effective for multiple teams to address IT Services.

Oftentimes, Security Operations and IT Services use completely different metrics, and different tools to measure those metrics.  Network ops have performance monitoring tools, Security ops have SIEM, Server teams have Application Monitoring tools, and on and on. Add to that the teams who are tasked with tracking user identity, activities or locations, and the unfortunate help desk personnel who have no ghost buster to call, even though they are the front line of IT.

Large (and I mean very big) companies have the luxury of having hundreds of people who use dozens of disparate tools to detect and identify events that could cause harm to the company, or at least disrupt user productivity. These companies have built their operation over long periods of time around groups of individuals who are comfortable with their favorite “tool du jour”.  Many times, these groups have been created through acquisitions, so they are really still speaking different languages and not communicating effectively.

That is not to say the IT Director doesn’t feel the pressure to do more with less, but it is a lot tougher to introduce innovation and change in a very large organization, so they are often motivated to not “rock the boat” and to continue down the same path that got them into this situation in the first place.

Mid-market companies have many of the same issues and concerns; the problem for them is often greater since they may only have a fraction of the headcount doing everything from monitoring the network, securing databases, resetting passwords and installing patches on web servers to adding new devices or changing configurations on existing devices.  In fact, they are also the Help Desk for the entire company. Many of these companies operate under the same regulations as their larger counterparts and therefore feel even more pain and must do more with less of everything.

At their worst, some events can actually take down revenue-producing e-commerce web sites for minutes, hours, or days.  At best, using disparate tools to monitor these activities only complicates accountability, and does not serve to align IT with business goals and objectives.  During an outage of any kind and for any reason (including maintenance), IT is under pressure to report who did what, where, how, and hopefully why this disruption occurred, and frankly how the disruption can be avoided in the future.

In my humble opinion, all IT organizations benefit when each unique (not disparate) group under IT has a similar vision of how all operations are interconnected, each device, each application, and most importantly each event.  One view, one set of metrics, one source of accountability, one IT Service goal, all while maintaining a separation of duties for regulatory compliance and ultimately avoiding catastrophe.

Only when Security is viewed as a cooperative effort within IT Services, will businesses truly “get what they are paying for”.  Disparate legacy tools served a purpose in their day, but that day has passed. Just like each agency under our federal government has been tasked to share information to protect our country, each area of IT must do the same in order to serve their respective stakeholders.

Beth Schultz has read the VM tea leaves well: more expansive virtualization management tools are required

Beth Schultz of Network World has read the VM tea leaves well in her recent blog – virtualization management demands extensive visibility beyond that of conventional systems / VM management tools.

Virtualization, cloud computing, security and data center complexities have introduced new functional requirements.  Given virtualization dynamics and the challenges of implementing controls across a dynamic and extended infrastructure – organizations require a more effective way to optimize resources and assure IT service / application performance reliability.

This demands holistic infrastructure coverage and cross-correlation in order to respond to VM and operational issues in a proactive manner and with pertinent details to support efficient root-cause analysis and accurate capacity planning.

If enterprises are to realize bigger virtualization gains – managing the virtual environment should extend beyond configuration, provisioning and health alerts.  IT organizations (of all sizes) need the means to discover, map and monitor physical / virtual infrastructure, their respective application dependencies, their relationship to the delivery of IT services, as well as available operational controls.

The result provides a valuable approach to understand, track and improve upon the use of VM resources as well as the delivery of IT services.

While we can commend the legacy systems vendors that are introducing new products by way of acquisition – many enterprises still can’t afford the high price, modularity and implementation effort.  This leaves the greater mid-tier and service provider market amenable for new data center management platforms that address the above virtualization management / systems management challenges.

“Staging an IT Management Evolution” – time to reassess infrastructure monitoring tools indeed

Denise Dube is spot-on in her commentary regarding IT management tools (http://bit.ly/dmV7mZ).

The dynamic nature of VM auto/semi-auto provisioning and the means for VMs to move to a different vSwitch or content switching port (due to load balancing, dynamic resource allocation, etc.) challenge conventional fault and performance monitoring tools.

As Denise and others point out… beyond resource capacity issues, isolating VM contention, excessive movement and service impact requires a broader, more holistic monitoring footprint that extends VMotion. Otherwise, service reliability will go down and MTTR will lengthen.  Where did that app move to and why? Did a network fault, configuration change or hardware issue degrade VM capacity supporting the app?

In terms of automation (beyond provisioning), a means to effectively map (and monitor) physical and virtual infrastructures and applications to business and business services should also be considered.  Can you fully baseline capacity and performance without accurately linking, monitoring, controlling and maintaining the aggregated resources supporting an IT business service?

Given the demand for service-oriented management in the brave new world of dynamic on-site, hosted or cloud-based resources – isolated or outmoded tools no longer fit the bill.

Great call to action from Denise.

AccelOps SIEM vs other SIEM Solutions – FAQ

In the past few weeks, a couple of people asked specific questions about AccelOps support for 3rd party devices, the flexibility of the reporting / rule framework, the scalability of the system in general, etc. Instead of replying to those emails individually, I decided to write a blog on those questions.  It’s a bit longer than I would like, and highly technical in nature.

#1. How good is AccelOps’ support for 3rd party devices? What’s your framework for supporting 3rd party devices/applications? I want to add support for my favorite device XXX – what do I need to do?

For our data center and cloud monitoring solution to be useful for availability, performance, security, change and compliance, we need to support the heterogeneous environment that reflects a real data center with best-of-breed equipment from various vendors. So we are committed to supporting third party software and devices in a timely fashion.

Since AccelOps monitors all aspects of a data center, our support for a single device or application tends to be comprehensive, ranging from auto discovery to categorization and normalization of SNMP traps, syslogs, netflow, WMI metrics and other event/protocol formats concerning availability, performance, security and change. While a significant majority of Tier 1 and 2 vendors are already supported, we are continuously adding new device support and keeping the existing device support up to date.

There are some technical innovations that we have developed to accelerate the device support process.

  • Typically, there are two ways to add device support: custom coding and scripts.

Custom coding involves parsing the device information within the shipping product code (Java or C++, within the main code or via an SDK in an agent). Scripts typically reside outside the main product code. The main tradeoff between custom coding and scripts is performance versus flexibility.  Scripts are flexible but custom coding gives you performance – a Perl-based program designed to parse Netflow data or firewall logs would certainly not be able to keep up with the event rate of high-end routers/firewalls.

AccelOps has developed a unique XML based scripting language through which comprehensive device support can be added without sacrificing performance.

While XML based parsing definitions exist in other products (such as Splunk), the AccelOps XML parsing language has the power of programming languages (e.g. if-then-else, switch-case, temporary variables, etc.), which makes comprehensive device support possible. In addition, we have developed an XML compiler and execution environment that enables AccelOps to execute the XML code without losing performance. In fact, all our device support is written using the XML parsing language.

  • The AccelOps device support library includes a large number (over 300 and growing) of parsed event attributes that encompass events and logs from various IT management domains. This enables flexible support for a wide range of devices. More importantly, this is done without losing event processing performance and storage efficiency.

These technological innovations enable rapid and flexible device support. All it takes is to modify an existing parser XML file, or create a new one, and add it to the AccelOps system. In this way, new versions of supported devices can be easily added, since they often simply add a few new logs. AccelOps has a dedicated team focused on device support, allowing us to provide high-quality and timely coverage. The user community can also be easily leveraged – if a partner has introduced a parser, that parser’s XML file can be redistributed to other customers.

Please see our current device support list here.

#2. My current SIEM solution has very slow reporting in general and especially during high event rate processing. How is AccelOps’ reporting performance? And what if I need even faster processing?

The slowness in query response times in many SIEM products often comes from the use of a relational database. While relational databases are easy to build a SIEM system around, read-only monitoring data is ill-suited to a relational database for the following reasons:

  • If data is inserted at high event rates (e.g. when dealing with firewall, netflow or Active Directory data), the database limit is quickly reached, causing the vendors to archive old data. In many systems, only a few months of data can be kept in a relational database. The effect is that another system needs to be brought up to analyze the old data.
  • Event data may require many attributes (over a few hundred) to be parsed; a relational database table with so many columns can be unwieldy and cause performance and storage inefficiencies (also known as degradation and bloat).
  • Parallelizing a relational database for faster query performance is a non-trivial matter, both in terms of cost and in terms of implementation and maintenance complexity.

On the other hand, the discovered information about devices, systems and applications (the so-called CMDB) is highly structured, frequently updated data that merits a relational database.

AccelOps has developed a hybrid data management system that stores unstructured event data in a flat-file-based database and structured CMDB data in an embedded relational database (PostgreSQL). A data management layer unifies the two data management technologies and presents a single relational-database-like interface, achieving the best of both worlds.  As an embedded RDBMS, the system does not require administrative tuning / index optimization.

More importantly, the ability to store events in a flat file database also enables query parallelization and solves the slow reporting problem. In clustered mode, the AccelOps solution is deployed in a hierarchical supervisor-worker setup. The supervisor node divides a query into many sub-queries, distributes the sub-queries to the worker nodes, and creates the final query result by combining the results from the various worker nodes. Since the flat files are stored in NFS on a separate system, query response time can be reduced immediately by simply bringing up additional worker nodes.
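To make the supervisor-worker idea concrete, here is a minimal Python sketch of the scatter-gather pattern described above. It is not AccelOps code: the shard layout, the event fields and the function names are all assumptions made for illustration, and threads stand in for worker nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical event shards. In the deployment described above these would be
# flat files on shared NFS storage, each readable by any worker node.
SHARDS = [
    [{"src_ip": "10.0.0.1", "type": "login_failed"},
     {"src_ip": "10.0.0.2", "type": "login_ok"}],
    [{"src_ip": "10.0.0.1", "type": "login_failed"}],
    [{"src_ip": "10.0.0.3", "type": "login_failed"},
     {"src_ip": "10.0.0.1", "type": "login_failed"}],
]


def worker_subquery(shard, event_type):
    """Worker: count matching events per source IP within one shard."""
    return Counter(e["src_ip"] for e in shard if e["type"] == event_type)


def supervisor_query(shards, event_type, workers=3):
    """Supervisor: fan out one sub-query per shard, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda s: worker_subquery(s, event_type), shards))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


result = supervisor_query(SHARDS, "login_failed")
```

Because each sub-query touches only its own shard, adding workers shortens the slowest step without changing the merge logic, which is the property the clustered setup relies on.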

#3.  I need a more flexible rules architecture so that I can change firing frequency, and create more sophisticated rules to catch security incidents, for example: “3 login failures followed by a success within a 10 minute time window”, or “multiple login failures not followed by a success to the same system within a 1 day time window”. Does your rule architecture support this?

AccelOps contains a sophisticated rule framework that can support anything from simple threshold performance rules, to highly complex security rules, all with a simple user interface. It supports the following constructs:

  • More than 300 event attributes with which to form rule conditions
  • Operators such as equals, greater than, IN, CONTAINS, BETWEEN, IS and their negative conditions
  • Ability to create multiple sub-patterns, then combine them using logical and temporal operators: AND, OR, FOLLOWED_BY, OR_NOT, AND_NOT, and NOT_FOLLOWED_BY
  • Ability to create exceptions to rules in order to fine-tune their output
  • Ability to exclude rules from firing during specific time ranges
  • Ability to send resulting incident alerts via email, SMS, SNMP Traps, or XML via HTTP

The framework also supports both simple and advanced workflows when creating or editing rules.
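As an illustration of what a FOLLOWED_BY pattern has to do, here is a minimal Python sketch of the “3 login failures followed by a success within a 10 minute time window” rule from the question. It does not use the AccelOps rule syntax; the event tuple layout and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def failures_followed_by_success(events, min_failures=3,
                                 window=timedelta(minutes=10)):
    """Return (host, time) pairs where at least min_failures login failures
    are followed by a success, all within the given time window.

    events: iterable of (timestamp, host, outcome), outcome "fail" or "success".
    """
    recent_failures = {}  # host -> failure timestamps still inside the window
    matches = []
    for ts, host, outcome in sorted(events):
        # Drop failures that have aged out of the window for this host.
        failures = [t for t in recent_failures.get(host, []) if ts - t <= window]
        if outcome == "fail":
            failures.append(ts)
        elif outcome == "success":
            if len(failures) >= min_failures:
                matches.append((host, ts))
            failures = []  # a success resets the pattern for this host
        recent_failures[host] = failures
    return matches


t0 = datetime(2010, 5, 1, 9, 0)
minute = timedelta(minutes=1)
sample_events = [
    (t0, "web01", "fail"),
    (t0 + minute, "web01", "fail"),
    (t0 + 2 * minute, "web01", "fail"),
    (t0 + 3 * minute, "web01", "success"),
    (t0, "db01", "fail"),
    (t0 + minute, "db01", "success"),
]
incidents = failures_followed_by_success(sample_events)
```

The NOT_FOLLOWED_BY variant from the question would instead fire when the window closes without a success, which needs a timer in addition to the per-host state kept here.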

#4.  I would like to have a single solution for long term Log Management and real time log analysis. My current SIEM product does not support this.

Our optimized file-based event database, coupled with parallel data management and analysis, enables AccelOps customers to have a single solution for analyzing both real-time data and historical data. Computing and storage can be incrementally added without service disruption. In contrast, most SIEM vendors must purge and archive long term data to avoid overwhelming their real-time relational databases, necessitating the use of separate tools – one set of tools to manage real-time data, and another to manage historical data.  This approach also limits the amount of data that can be analyzed – so while the data stored may meet retention requirements, the ability to actually analyze / cross-correlate across the stored data is often severely limited.

#5. I need to collect events from hundreds of Windows servers without agents, and I need support for the latest Windows 2008. Can you do that?


AccelOps in clustered mode can be deployed to accomplish this. The solution consists of many worker nodes and one supervisor node. The job of pulling Windows logs from many servers via WMI is load balanced among the worker nodes. Each worker node is multi-threaded and can pull from many servers simultaneously.  The events are parsed and indexed by each worker node, and the correlation to trigger rules is done by the supervisor and worker nodes in a collaborative fashion. Additional worker nodes can be deployed if more Windows servers need to be monitored and the system is running out of capacity.

#6.  My current SIEM system can only correlate and alert within 1 system. As I deploy event collection to many servers or collect netflow from many routers, I need to deploy many SIEM systems and need to correlate across them. Does your architecture support this?

The clustered mode AccelOps solution can do real time global cross-correlation across multiple supervisor and worker nodes. One simple way to do this would be to filter and forward all events to the supervisor node, but that would bog down the supervisor node. AccelOps employs a novel summarized information exchange mechanism where the worker nodes pre-process the events and send only summarized values to the supervisor node, which can then do the final analysis and trigger alerts. An entire category of rules can be parallelized this way by the AccelOps clustered system, giving customers a way to scale event processing and alerting.
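The following Python sketch shows the general shape of such a summarized exchange for a simple threshold rule. The actual AccelOps mechanism is not documented here, so the summary format (per-destination deny counts), the threshold and the function names are assumptions for illustration only.

```python
from collections import Counter


def summarize(events):
    """Worker side: reduce raw events to per-destination deny counts."""
    return Counter(e["dst"] for e in events if e["action"] == "deny")


def correlate(summaries, threshold):
    """Supervisor side: merge worker summaries and return the destinations
    whose combined deny count reaches the threshold."""
    merged = Counter()
    for summary in summaries:
        merged.update(summary)
    return sorted(dst for dst, count in merged.items() if count >= threshold)


# Hypothetical raw events seen by two different worker nodes.
worker_a = [
    {"dst": "10.1.1.5", "action": "deny"},
    {"dst": "10.1.1.5", "action": "deny"},
    {"dst": "10.1.1.5", "action": "permit"},
]
worker_b = [
    {"dst": "10.1.1.5", "action": "deny"},
    {"dst": "10.1.1.5", "action": "deny"},
    {"dst": "10.1.1.9", "action": "deny"},
]

alerts = correlate([summarize(worker_a), summarize(worker_b)], threshold=4)
```

Neither worker alone sees four denies to 10.1.1.5, yet the supervisor can fire the rule from two small summaries rather than from every raw event. This works for rules whose conditions can be expressed as mergeable aggregates such as counts or sums.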

#7.  When I see an IP address in my dashboard or alert, I would like to know the user behind that IP and the user’s network location if it is an internal one, or learn about the owner, domain, etc. from internet sites for external IPs. Can your system do this?


AccelOps provides full identity and location information for IP addresses – both external and internal, and in real time. For internal IP addresses, AccelOps derives the identity information by combining Active Directory discovery, domain logon information, DHCP events, and Wireless and VPN logons, and derives the location information from Wireless and VPN logons and AccelOps’ own layer 2 discoveries. The challenge here is that each source of information is partial, e.g. DHCP address assignments provide (IP Address, MAC address, Host name), domain logons provide (IP Address, Host name, User name), etc. The various pieces need to be stitched together into one consistent identity and location entry, which should also dynamically reflect changes as the user moves around. AccelOps has a novel in-memory database based approach for merging the various pieces of identity and location information on a first and last seen time basis. This contextual information is available to the user for every IP address displayed on the user interface.  AccelOps binds the user identity and location information to events to allow for historical analysis.
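To show why the partial sources have to be merged, here is a minimal Python sketch that folds DHCP, domain logon and wireless observations into one entry per IP address, tracking first and last seen times. This is only a sketch of the general idea, not the AccelOps in-memory database; the dictionary layout, field names and integer timestamps are assumptions.

```python
def merge_observation(table, ts, ip, **attrs):
    """Fold one partial observation (from DHCP, domain logon, wireless, ...)
    into the identity/location entry for an IP address.

    Newer observations overwrite attributes; older, out-of-order observations
    only fill in attributes that are still missing.
    """
    entry = table.setdefault(ip, {"ip": ip, "first_seen": ts, "last_seen": ts})
    entry["first_seen"] = min(entry["first_seen"], ts)
    if ts >= entry["last_seen"]:
        entry["last_seen"] = ts
        entry.update(attrs)
    else:
        for key, value in attrs.items():
            entry.setdefault(key, value)
    return entry


identities = {}
# DHCP assignment: IP, MAC address and host name.
merge_observation(identities, 100, "10.0.0.7", mac="00:11:22:33:44:55", host="laptop-17")
# Domain logon: IP, host name and user name.
merge_observation(identities, 160, "10.0.0.7", host="laptop-17", user="alice")
# Wireless logon: location of the client.
merge_observation(identities, 200, "10.0.0.7", location="wlc-1/ap-lobby")
# A late-arriving older event must not overwrite the newer user binding.
merge_observation(identities, 150, "10.0.0.7", user="bob")

who = identities["10.0.0.7"]
```

A production version would also expire bindings, for example when a DHCP lease is released, so that a reassigned IP address does not inherit the previous user.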

For external IP addresses, AccelOps provides information such as geo-location, whois lookup and trace-route information. Also, with a single click, an administrator can find out whether the IP is listed in known spam databases using tools like the SANS Internet Storm Center, Cisco SenderBase or the HoneyPot database.

For more details on this feature, please see the blog entry, SIEM – The Importance of Displaying Contextual Information with an IP address.

#8.  How do I prioritize my alerts in AccelOps?

AccelOps has the notion of a business service that is a smart container of network devices, servers and applications serving a common business purpose. Every incident is tagged with the affected business service and can be used to prioritize incidents.  AccelOps goes beyond traditional event severity by providing users business impact context to incidents.
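As a rough illustration of prioritizing by business impact rather than by severity alone, the Python sketch below tags each incident with the business services its device belongs to and ranks by a combined score. The service definitions, weights and scoring formula are hypothetical and are not taken from the product.

```python
SEVERITY = {"low": 1, "medium": 2, "high": 3}

# Hypothetical business services: a weight for business importance and the
# devices that belong to each service.
SERVICES = {
    "e-commerce": {"weight": 3, "members": {"web01", "db01"}},
    "intranet wiki": {"weight": 1, "members": {"wiki01"}},
}


def tag_and_rank(incidents, services=SERVICES):
    """Tag each incident with its affected services, then sort by
    severity multiplied by the highest weight among those services."""
    ranked = []
    for incident in incidents:
        affected = sorted(name for name, svc in services.items()
                          if incident["device"] in svc["members"])
        weight = max((services[name]["weight"] for name in affected), default=1)
        score = SEVERITY[incident["severity"]] * weight
        ranked.append({**incident, "services": affected, "score": score})
    return sorted(ranked, key=lambda item: item["score"], reverse=True)


queue = tag_and_rank([
    {"id": 1, "device": "wiki01", "severity": "high"},
    {"id": 2, "device": "web01", "severity": "medium"},
    {"id": 3, "device": "printer9", "severity": "low"},
])
```

Here a medium-severity incident on an e-commerce server outranks a high-severity incident on the wiki, which is the kind of reordering that business service context is meant to provide.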

#9.  As I investigate my incidents, I would also like to know additional context. For example, when there are lots of denied connections to a server, I would like to know the CPU and memory usage on the server and what changes, if any, were made on the server in the preceding time period. Can I quickly do that in AccelOps itself, or do I have to jump to another console?

This is easily possible in AccelOps since all aspects of a device are monitored. All the user needs to do is to discover the device and set up monitoring. Then the basic system level CPU/memory/disk space/disk I/O/network interface utilization, the top applications consuming most resources on that server as well as the changes made on that server are all available within 1 click. The information is also kept up to date on a periodic basis.

#10.  I would like to create reports in PDF format with nice charts that I can show to my manager.  I would also like to customize various dashboards. My current SIEM cannot do this.

AccelOps supports the exporting of real-time, historic and saved search results / reports in both PDF and CSV formats.  The PDF reports contain multiple colored trend charts and can be customized with customer logos and custom notes. The CSV format can be imported directly into spreadsheet products or easily fed to other applications. Furthermore, saved reports are available as templates that can be used as dashboard widgets – enabling fully customized dashboards.

Please see the sample PDF and CSV files.

top-fw-report login-failed-report login-failed-report as CSV

#11.  We like seeing a topology map, but the topology supplied with our current SIEM product is inflexible to the point of being largely useless. My manager tells me that the topology diagrams look like a “ball of yarn”. Do you have a better solution?

AccelOps’ user interface is built using a Web 2.0, Adobe Flex RIA (Rich Internet Application) framework. This framework allows us to present a more engaging desktop application experience, while still running within any browser, and offering universal anytime, anywhere accessibility.

This more flexible user-interface technology allows AccelOps to generate more dynamic, up-to-date layer 3 and 2 network topology maps with interactive alerts, service overlays, filters and drill-through details.

See more information about our topology map feature here

#12.  How flexible is your reporting framework? How many system reports do you ship with the product? Can I issue a simple Google-like search to find keywords in order to perform root cause analysis?

AccelOps features an advanced SQL-like search and cross-correlation engine with multiple patterns and advanced filtering and aggregation capabilities that can be computed in a distributed manner. This enables support of IT infrastructure, availability, performance, change and security scenarios, as well as allowing compliance requirements to be handled in a unified manner.

AccelOps ships with more than 850 (and growing) built-in and extensible reports spanning availability, performance, security and change management, as well as compliance and inventory.

We support simple Google-like keyword searches with operators such as AND, NOT, etc., and also feature the capability of searching through real-time event data using either structured (condition-based) or simple keyword queries.
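For readers unfamiliar with keyword-style log search, the following Python sketch shows the basic semantics of AND and NOT over raw log lines. It is a toy matcher under simple assumptions (case-insensitive substring matching, no grouping or OR) and does not reflect the AccelOps query syntax.

```python
def keyword_search(lines, query):
    """Tiny keyword matcher: terms are ANDed together, and a term that
    directly follows NOT must be absent from the line."""
    required, excluded = [], []
    negate = False
    for token in query.split():
        if token == "AND":
            continue
        if token == "NOT":
            negate = True
            continue
        (excluded if negate else required).append(token.lower())
        negate = False
    return [
        line for line in lines
        if all(term in line.lower() for term in required)
        and not any(term in line.lower() for term in excluded)
    ]


log_lines = [
    "sshd: Failed password for root from 10.0.0.1",
    "sshd: Failed password for alice from 10.0.0.2",
    "sshd: Accepted password for alice from 10.0.0.2",
]
hits = keyword_search(log_lines, "failed AND password NOT root")
```

A structured (condition-based) query would instead test parsed attributes, such as a user or source IP field, rather than raw text.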

#13.  What is your database architecture and how scalable is it?  Do I need to worry about purging and other issues? Can I add more storage capacity to the system as I expand my data centers? How good is the system performance when you have millions of events coming in and are performing queries on the data simultaneously?

AccelOps uses a hybrid database, storing events in indexed flat files and storing device configuration in an embedded relational database (PostgreSQL). AccelOps has a patent-pending, multi-tiered, clustered architecture, where computing and storage can be seamlessly added to the cluster to increase performance and event storage capacity. This combination of a proprietary database and parallel processing gives AccelOps the dual advantage of unlimited low cost storage and high event analysis performance that other monitoring solutions strive for.

Sign up for a 30 day free trial and see for yourself.

SIEM is a SILO, and the clock is ticking!

Otherwise known as Nero Fiddled while Rome Burned

How long is it going to take before our industry realizes that the promise of SIEM cannot be realized until organizations conclude that current SIEM solutions are a Silo, and that Security Operations require a more holistic view to do their job?

Good Security people are hard to find and retain and Security products are difficult to cost justify. Making things worse, Security operations can often be ancillary to Network and Systems groups.  Doesn’t it make sense to integrate the needs of the IT organization into a system that “includes” Security while addressing the broader need of Network and Systems Operations?

Over the last few years, most SIEM vendors have attempted to retool their 10-15 year old architectures to keep up with an ever-increasing amount of data produced by an ever-increasing number of diverse data sources.  The notion of correlating security events to “find the needle in the haystack” works ok for a “black hat” sitting alone in a dark room with his or her monitor, but in reality they are fooling themselves if they think they have the situational awareness necessary to quickly and accurately determine the root cause of the alerts they are viewing.

As an example, if my help desk gets a call from an end user stating they cannot get access to a critical application on a web server, what tools do they turn to in order to quickly determine the root cause and the scale of the incident?  Do they need:

  1. Network performance monitoring
  2. Security event information management
  3. Application performance data
  4. Device configuration data
  5. All the above?

Maybe someone changed a configuration on a router or firewall, effectively creating a “Self-Imposed DoS”. What tool is used to figure that out, and how long will it take to determine who did it and when? Most likely, many different people will check many different tools, while the clock is ticking and Nero fiddles.

Up to now, there has been very little innovation regarding ‘Root Cause Analysis’, largely because most vendors are highly leveraged into their 10-15 year old code, which primarily focused on doing one thing well. Some large vendors have acquired adjacent technologies to address the 800lb Gorilla in the room (root cause analysis), or improve time to market, but they’ve ended up with a cobbled together bundle of “expensive to purchase and maintain” modules; and they still don’t really work well as a system.

IT organizations exist to serve their respective businesses, and regardless of the technology in use, stakeholders only care about the health and growth of that business.  It is time to break down the silos of SIEM, Network Performance, Server and Application Monitoring, and Change Management so IT staff can work together to solve the major issues businesses face today. Whether the driver is regulatory compliance, maintaining profitability, protecting the brand, or “doing more with less”, IT teams need to break down the logical and physical barriers that have only served each individual team, but have not provided any meaningful results towards serving the business.

What is needed is a centralized dashboard that everyone in IT can use to determine whether an incident is occurring due to an attack on the network (maybe a bot), because someone opened up the network by putting a rogue wireless access point in their cube, or, again, because someone made a configuration change to a firewall, causing what is essentially a self-imposed denial of service.

For my money, I want one tool that gives me a quick view of all security, performance, and availability related events, and a few clicks into the CMDB to quickly identify devices that are running low on memory or where a faulty fan is causing a device to run hot.  In fact, I want a system where it’s as easy to see how VMotion is optimizing performance by moving applications from one server to another as it is to determine whether failed logons come from valid internal users or from someone who came in from overseas via VPN (or a rogue wireless router) without valid credentials.

I suppose we can continue to throw huge sums of money at multiple modules and thousands of agents, and add dozens of headcount, but what good is a tool if you spend more time with the care and feeding of the tool rather than solving problems and serving the business?

In my opinion, IT organizations benefit when each unique (not disparate) group under IT has a similar vision of how everything is interconnected, each device, each application, each event, and most importantly how these elements are grouped based on their unique business services.  One view, one set of metrics, one source of accountability, and one IT Service goal, all while maintaining separation of duties for regulatory compliance and accountability.

Silo’d legacy tools served a purpose in their day, but that day has passed. Just like each agency under our federal government has been tasked to share information to protect our country, each area of IT must do the same in order to serve and protect their stakeholders.

Sign up for a 30-day free trial and see for yourself.

Datacenter & Network Management: AccelOps – Jack of All Trades and Master of None?

It has been a while since my last blog, where I promised to write about whether the AccelOps monitoring solution is a ‘Jack of all trades, and master of none’.

So here I am, on Mother’s Day, sitting in front of the computer writing a geek’s blog. A workaholic mom and entrepreneur? I guess both are true. To me, building a startup is like raising kids: it requires 120% attention, hard work and commitment, with no other choices.

Before I answer the ‘Jack of all trades and master of none’ question, let me try to start with the requirement for datacenter and network management in 2010, by quoting Evelyn Hubbert, Forrester Research:

The element-based network management era is over. Today, network management teams need to manage and understand network-related issues across silos such as servers, storage, security, databases, and applications. They need to manage complex and dynamic IP networks to connect customers, vendors, and employees. Forrester sees the traditional network management space becoming service-oriented: The attention is on the service that is being delivered to the business. Innovations such as IT automation and Web services management techniques, and best practices such as ITIL, have changed the network management market and will continue to shape it in the years to come.

How true!  Data center and network management cannot be at the element level anymore. That used to be the case in the ’80s when I worked at AT&T Bell Labs, and in the ’90s when I was working in the Network Management Business Unit at Cisco, but not in 2010. Times have changed; the datacenter infrastructure has evolved and so have its requirements. Managing the business services that the datacenter infrastructure and elements deliver is now the key.

In order to meet these requirements, two fundamental pieces in a management solution have to be done well first: CMDB and mapping infrastructure impact to business services.

CMDB is a great concept and a cornerstone of the new management paradigm: managing by services. But we often see mid-size and large enterprises embark on the process and fail to make much progress. This is because they usually start with a top-down approach: cross-functional teams with Excel files trying to map out the organization, the ownership of the infrastructure, and the dependencies of the business services. The process is too heavy, outweighs the benefits and defeats the original intent.

As Evelyn Hubbert from Forrester Research sees it:

A CMDB is a fundamental component of an ITIL framework. The CMDB records Configuration Items (CIs) and the details about the important relationship between CIs. A CI is an instance of an entity that has configurable attributes – for example, a computer, a process, or an employee. A key success factor in implementing a CMDB is the ability to automatically discover information about the CIs – Autodiscovery

In complete agreement with her, we believe that the bottom-up approach via auto-discovery is the right way to go: automatically discover what is in the datacenter including the network, map the applications to the infrastructure, and map out their relationships. Gaining visibility is key. Once the IT organization has the map, it can easily start defining the business services’ relationship to the infrastructure via the map. This discovery-driven CMDB approach not only makes it easy to populate the CMDB for the first time, it also helps to keep the CMDB up to date. Rediscover periodically, or rediscover upon changes, and you are done!

Many years of experience working in the network and security management field have taught me that the discovery process has to be very easy for the user; otherwise it defeats the purpose again.

With that philosophy, what we have built requires very little from the user to get to the final goal quickly: simply define the credentials for the devices and applications, define the appropriate network range(s), and the tool takes over from there. The AccelOps discovery engine discovers all the pieces in the datacenter infrastructure, their attributes and inter-relationships, and how they relate to and impact the critical business services and applications. It discovers the configuration inside the devices, the installed and running software, the patches… It discovers L2/L3 relationships, guest OS to ESX relationships, wireless AP to controller relationships, switch module to switch relationships… It understands the changes: differences between saved and running configurations, between saved configurations, ports going up and down, applications going up and down… It categorizes devices and applications and presents them in a logical, easy-to-understand graphical way. Doing all of the above requires an understanding of networks, systems, applications and storage. In a large and complex datacenter environment this is a non-trivial job, as there are so many network scenarios and so many combinations of network configurations.
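To make the "differences between saved and running configurations" idea concrete, here is a minimal Python sketch using the standard library's difflib. It is only an illustration of the concept, not how the AccelOps engine works; the function name config_drift and the sample configuration lines are made up for the example.

```python
import difflib


def config_drift(saved: str, running: str) -> list[str]:
    """Return the lines that differ between a saved and a running configuration.

    Lines prefixed with "- " exist only in the saved config; lines prefixed
    with "+ " exist only in the running config.
    """
    diff = difflib.ndiff(saved.splitlines(), running.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop those.
    return [line for line in diff if line.startswith(("+ ", "- "))]


if __name__ == "__main__":
    saved_cfg = "hostname fw1\nntp server 10.0.0.1"
    running_cfg = "hostname fw1\nntp server 10.0.0.2"
    for line in config_drift(saved_cfg, running_cfg):
        print(line)
```

An empty result means the device is running exactly what was saved; anything else is a change worth recording against the device in the CMDB.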

Undertaking the above tasks does not sound like something a ‘jack of all trades and master of none’ would be able to do, does it? It requires deep understanding and domain knowledge in network, systems and application management.

Now let’s get to the second fundamental in today’s datacenter and network management: mapping infrastructure impacts to business services. Here I would like to use examples to show how the requirement of managing by business services cannot tolerate a ‘jack of all trades and master of none’.

In order to manage by business services and map the infrastructure’s impact, a solution must be able to do the following, at a minimum:

(1)  Define a problem, an exception or a vulnerability involving any datacenter infrastructure component and detect the issue in real-time. Here are a few datacenter scenarios:

Example 1:  Service health critical

For the same hostIP, if

average cpu utilization >90% or (average memory utilization >98% and paging rate > ) or (disk I/O utilization > ) or max interface utilization >50% for 3 consecutive samples within 10 minutes

then generate an incident (alert)
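As a rough sketch of how Example 1 might be evaluated, the Python below checks one host's samples in time order. The field names (cpu, mem, paging_rate, disk_io, max_if_util) are hypothetical, and because the rule above does not state the paging-rate and disk I/O thresholds, they are required parameters here rather than defaults. It also assumes "3 consecutive samples within 10 minutes" means three back-to-back breaching samples spanning at most 600 seconds.

```python
def sample_is_critical(sample, paging_rate_limit, disk_io_limit):
    """Apply the per-sample condition from Example 1 (percent values)."""
    return (
        sample["cpu"] > 90
        or (sample["mem"] > 98 and sample["paging_rate"] > paging_rate_limit)
        or sample["disk_io"] > disk_io_limit
        or sample["max_if_util"] > 50
    )


def service_health_critical(samples, paging_rate_limit, disk_io_limit,
                            needed=3, window_seconds=600):
    """Return True if `needed` consecutive samples breach within the window.

    `samples` is a list of dicts for a single host IP, sorted by "ts" (seconds).
    """
    streak = []  # timestamps of the current run of breaching samples
    for sample in samples:
        if sample_is_critical(sample, paging_rate_limit, disk_io_limit):
            streak.append(sample["ts"])
            streak = streak[-needed:]  # keep only the most recent run
            if len(streak) == needed and streak[-1] - streak[0] <= window_seconds:
                return True
        else:
            streak = []  # a healthy sample breaks the run
    return False
```

Grouping samples by host IP before calling the function gives the "for the same hostIP" behavior of the rule.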

Example 2:  Excessive vMotion migration

For the same VMName, if

3 or more VM-Hot-Migration events or VM-Migration events in a 15 minute window,

then generate an incident (alert)
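Example 2 is a count-within-a-sliding-window rule grouped by VM name. A minimal Python sketch follows, assuming events arrive as (timestamp in seconds, VM name, event type) tuples sorted by time; that tuple layout and the function name are assumptions for illustration only.

```python
from collections import defaultdict, deque

MIGRATION_EVENTS = {"VM-Hot-Migration", "VM-Migration"}
WINDOW_SECONDS = 15 * 60
THRESHOLD = 3


def detect_excessive_migration(events):
    """Return (vm_name, timestamp) for each VM that migrates too often.

    `events` is an iterable of (ts, vm_name, event_type), sorted by ts.
    """
    recent = defaultdict(deque)  # vm_name -> timestamps inside the window
    incidents = []
    for ts, vm_name, event_type in events:
        if event_type not in MIGRATION_EVENTS:
            continue
        window = recent[vm_name]
        window.append(ts)
        # Drop migrations that have fallen out of the 15-minute window.
        while ts - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) >= THRESHOLD:
            incidents.append((vm_name, ts))
            window.clear()  # start counting afresh after raising an incident
    return incidents
```

Clearing the window after an incident is a design choice that avoids raising a new alert on every subsequent migration; a real system might suppress duplicates differently.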

Example 3: Excessive End user DNS queries to unauthorized DNS servers

For the same srcIP, if

TCP/UDP port = 53, the destination IP is not in the internal DNS server group, the source IP is in neither the management applications group nor the internal DNS server group, the source IP is from the inside, and this happens 10 times in a 5-minute window,

then generate an incident (alert).

Note that internal DNS server group and internal management application group are populated from auto-discovery.
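The same sliding-window idea applies to Example 3, with group membership as the filter. The sketch below is a hedged illustration in Python: the flow tuple layout, the function names, and the use of a list of internal networks to decide whether a source is "from the inside" are all assumptions, and the DNS and management groups are passed in as plain sets standing in for what auto-discovery would populate.

```python
import ipaddress
from collections import defaultdict, deque


def is_inside(ip, internal_networks):
    """True if `ip` belongs to any of the given ipaddress network objects."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in internal_networks)


def detect_unauthorized_dns(flows, internal_dns, mgmt_apps, internal_networks,
                            threshold=10, window_seconds=300):
    """Return (src_ip, ts) for hosts querying DNS servers outside the DNS group.

    `flows` is an iterable of (ts, src_ip, dst_ip, dst_port), sorted by ts.
    """
    recent = defaultdict(deque)  # src_ip -> timestamps of matching queries
    incidents = []
    for ts, src_ip, dst_ip, dst_port in flows:
        if dst_port != 53 or dst_ip in internal_dns:
            continue  # not DNS, or DNS to an authorized server
        if src_ip in internal_dns or src_ip in mgmt_apps:
            continue  # these sources are allowed to query external DNS
        if not is_inside(src_ip, internal_networks):
            continue
        window = recent[src_ip]
        window.append(ts)
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) >= threshold:
            incidents.append((src_ip, ts))
            window.clear()
    return incidents
```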

Example 4: User added as admin in the accounting application. Provide the identity of the user.

If a VPN login event is followed by a Windows server login event, followed by a user-added-to-global-admin-group event, all within a 15-minute window, and the following conditions are met

VPN login source IP = Windows server login source IP and

Windows server login user = user add event’s user and

Windows server login id  = user add event’s logon id and

reporting IP in accounting server group

then generate an incident (alert)
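Example 4 differs from the earlier ones because it correlates a sequence of three different events through shared attributes. The Python sketch below shows one way to express that chain. The event dictionaries, their keys (type, ts, user, src_ip, logon_id, reporting_ip) and the event type names are hypothetical, and it assumes the "reporting IP" condition applies to the user-add event.

```python
WINDOW_SECONDS = 15 * 60


def _find_chain(add_ev, vpn_logins, server_logins):
    """Find a VPN login and a Windows login that lead up to the user-add event."""
    for login in server_logins:
        if login["user"] != add_ev["user"] or login["logon_id"] != add_ev["logon_id"]:
            continue
        for vpn in vpn_logins:
            if (vpn["src_ip"] == login["src_ip"]
                    and vpn["ts"] <= login["ts"] <= add_ev["ts"]
                    and add_ev["ts"] - vpn["ts"] <= WINDOW_SECONDS):
                return vpn, login
    return None


def detect_vpn_admin_add(events, accounting_servers):
    """Return one incident per admin-group add that traces back to a VPN login.

    `events` is a list of dicts sorted by "ts"; `accounting_servers` is a set
    of IPs standing in for the accounting server group.
    """
    vpn_logins, server_logins, incidents = [], [], []
    for ev in events:
        if ev["type"] == "vpn_login":
            vpn_logins.append(ev)
        elif ev["type"] == "windows_login":
            server_logins.append(ev)
        elif ev["type"] == "admin_group_add":
            if ev["reporting_ip"] not in accounting_servers:
                continue
            chain = _find_chain(ev, vpn_logins, server_logins)
            if chain is not None:
                vpn, login = chain
                # The VPN user is the identity the rule is asked to provide.
                incidents.append({"vpn_user": vpn["user"],
                                  "windows_user": login["user"],
                                  "src_ip": vpn["src_ip"],
                                  "ts": ev["ts"]})
    return incidents
```

This version keeps every login in memory for simplicity; a production rule engine would expire logins older than the window.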

(2)  Define what makes up a business service, and how each of the problems defined in step (1) relates to that business service. Here is an example of one such scenario:

Example. If there are any incidents for the objects/components in a business service (devices, applications, users, etc.), generate an incident for that business service. (This requires nested rule support, i.e. a second level of rules that fires based on the first level of rules.)
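A nested rule of this kind can be pictured as a second pass over the incidents produced by the first-level rules. The Python sketch below is only an illustration under assumed data shapes: a business service is a named set of component identifiers, and a first-level incident is a (component, rule name) pair. Both shapes and the function name are hypothetical.

```python
def business_service_incidents(services, incidents):
    """Roll first-level incidents up to the business services they affect.

    `services` maps a service name to the set of components it depends on;
    `incidents` is a list of (component, rule_name) pairs from first-level rules.
    Returns a dict of service name -> list of contributing incidents.
    """
    affected = {}
    for name, components in services.items():
        hits = [inc for inc in incidents if inc[0] in components]
        if hits:
            affected[name] = hits  # second-level incident for this service
    return affected


if __name__ == "__main__":
    services = {
        "accounting": {"10.0.5.20", "acct-db", "jdoe"},
        "email": {"10.0.6.10", "mail-gw"},
    }
    first_level = [("acct-db", "Service health critical"),
                   ("vm-17", "Excessive vMotion migration")]
    print(business_service_incidents(services, first_level))
```

Because the service definition is just a set of components, keeping it populated from discovery means new devices are covered by the second-level rule without rewriting it.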

(3)  Let the user enter all of these definitions easily via the GUI, so that a user can define scenarios and behaviors based on his or her own IT knowledge without waiting for the software vendor to ship an upgrade that supports them.

So now you can see: without a good understanding of network, systems, applications, VMs, storage and security, without the capability to describe that understanding, and without the capability to monitor and detect exceptions, anomalies and problems based on it, there is no way to meet even the basic requirement of managing by business service.

So today’s datacenter and network management sets a much higher bar for management solutions. The existing siloed management solutions cannot cut it simply because they do not have a common analytical framework to handle all the data from disparate parts of the datacenter infrastructure. This, however, is what AccelOps does via a deep and wide understanding of the datacenter.

Compliance Management and Compliance Automation – How and How Efficient, Part 1

Compliance is all about implementing procedures and technologies that manage / reduce business risk, and efficiently validating that controls are working according to stated expectations that address mandates. In fact, given the complexity and multitude of industry and government regulated compliance mandates, it is better to orchestrate a governance strategy that leverages best practices and a top-down approach to set policies and incorporate guidelines that cost-effectively manage pertinent business risks and explicit compliance requirements.   Many leading analyst and audit firms recommend combining ISO, COBIT and PCI standards as a foundation for governance, risk and compliance management (GRC management).  No one overarching product delivers “compliance.”

For IT (as for other organizations), compliance management is as much (or more) about human commitment as it is about technology controls.  As a best practice, the human factors are key to advancing compliance management and are where making the right compliance automation decisions starts.  This will necessitate organizational (and individual) buy-in, assessment, documentation, accountability, adherence, attestation and education.  Once preliminary compliance policies and oversight guidelines have been established, it becomes that much easier to apply a bottom-up approach to incorporate information security (and other) controls and compensating controls.  This often relates to data center management, access management, configuration management and security information management that will support monitoring and documentation according to agreed-upon policies and discrete compliance requirements.

Beyond setting policy and procedures, many tools in an organization’s data center management portfolio can support compliance efforts.  The question then becomes finding the technologies that best streamline internal or external audit processes, as well as those that automate control verification and documentation.  Business process management (BPM) tools can automate and track a variety of GRC process checks and balances within an organization.  But these tools are more about tracking and measuring processes than about compliance management from the technical control perspective of data center infrastructure, business applications and user activity.  When it comes to technology controls, the primary concerns are access control, system and application controls, data integrity and protection, and operational resiliency (not to dismiss physical security, secure application development, etc.).

Compliance automation considerations, with regard to data center management and security information and event management (SIEM) tools, should include the means to:

  • Validate a broad set of information security policies across infrastructure technologies
  • Understand asset and identity relationships and be able to associate objects with compliance and audit requisites
  • Produce reports that adapt to existing security, governance and auditing processes and frameworks
  • Normalize compliance-relevant data across disparate systems
  • Address complex and rapidly changing environments
  • Meet auditing and data management standards leveraging out-of-the-box and user customizable controls
  • Maintain log management integrity and data retention: data capture consistency, audit records and availability
  • Facilitate investigations such as identity access control patterns and violations to accurately track identity, location and action
  • Reduce control gaps and incident response lag time (MTTR)
  • Diminish compliance liabilities and audit duration
  • Be easily extensible in terms of new and custom control coverage and reporting

A data center management platform that automates many tasks connected with compliance provides the tangible benefits of reduced business risk, faster response to operational and violation incidents, increased productivity, compliance-related documentation, and lower auditing expenditures.

Part II will examine AccelOps’ integrated data center management approach in the context of compliance automation – such as control verification, compliance documentation, incident management and investigation – with specific examples regarding PCI-DSS.  For more information, please visit http://www.accelops.net/product/siem.php.

SIEM – The Importance of Displaying Contextual Information with an IP address

In any Security Information and Event Management (SIEM) product, in order to get the full details about an incident or event, showing contextual information about the IP address is crucial.  The event or incident will include a source or destination IP, but the admin needs to know more: the hostname, OS information, version, owner, and if it’s a known server or client machine in the network.

In AccelOps’s integrated data center monitoring solution, we provide this extended information with a single click wherever IP address information is presented in the UI.

It’s also beneficial to see the performance of the server (CPU, memory, etc.) with a single click, so that it’s easy to figure out whether the issue is one of performance or availability.

For an external IP address, it may be crucial to get contextual information such as “whois information”, geographical location, whether it’s a part of already known spam databases, etc.
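The internal-versus-external split described above can be sketched in a few lines of Python. This is a simplified illustration, not the AccelOps implementation: the cmdb dictionary, its fields and the function name are hypothetical, private address space is used as a stand-in for "internal", and the external lookups are only named, not performed.

```python
import ipaddress


def ip_context(ip, cmdb):
    """Return the context an analyst would want to see next to an IP address.

    `cmdb` maps an IP string to a dict of discovered attributes
    (hostname, OS, owner, ...).
    """
    record = cmdb.get(ip)
    if record is not None:
        # Known device: show what discovery already knows about it.
        return {"ip": ip, "scope": "internal", "known": True, **record}
    if ipaddress.ip_address(ip).is_private:
        # Internal address space, but nothing in the CMDB yet.
        return {"ip": ip, "scope": "internal", "known": False}
    # External address: list the lookups worth running for it.
    return {"ip": ip, "scope": "external",
            "lookups": ["whois", "geolocation", "spam databases"]}


if __name__ == "__main__":
    cmdb = {"10.0.5.20": {"hostname": "acct-db", "os": "Windows Server"}}
    for address in ("10.0.5.20", "192.168.1.9", "8.8.8.8"):
        print(ip_context(address, cmdb))
```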

Data Center Management / Virtualization Management Dynamics

Virtualization software brings the tremendous benefits of easier provisioning, easier deployment, standardization on hardware architecture, and of course, server consolidation. Because of these advantages, the IT community is witnessing the rapid adoption and proliferation of VMs in the data center.

However, the byproducts of virtualization are VM sprawl, VM contention and a dynamic network architecture due to vMotion.  These increase data center management complexity by orders of magnitude.

Now, with applications running in a guest OS that constantly moves to a different vSwitch or content switching port due to load balancing and dynamic resource allocation, pinpointing which layer has a problem is extremely difficult.  Is an application problem causing the performance problem, or is it the OS layer? How about the ESX layer? Sometimes the problem could be in the vSwitch layer too! And because VM images are stored in enterprise network storage, troubleshooting also needs to consider the storage layer and the content switching layer.

Given the above virtualization management issues, it becomes more difficult to triage a virtualization issue in a timely fashion and readily know:  What is the true problem and where is it? If the root cause is a change, who and what triggered the change?  What resource limits have been reached? Where does the VM (and its respective OS and application) reside, and where and when did it move? And more importantly, what is the impact of excessive VM movement, VM problems or hardware issues on the delivery of IT services?

The AccelOps data center management solution cross-correlates information from all layers in the stack: applications, processes, guest OS, ESX, hardware, network devices (e.g. a content switch), storage, etc., across performance, availability, security and change management. By doing so, the solution can eliminate many other possibilities and get to the true root cause efficiently. In addition, AccelOps’ agent-less approach to data center management / virtualization management makes it non-intrusive and easy to implement.

AccelOps’ interactive virtualization dashboard puts critical ESX, hardware, VM, guest OS and application health, key metrics and events at the operator’s fingertips – providing immediate means to assess resources, preempt issues and pinpoint the root-cause of sophisticated problems in the dynamic virtual infrastructure.

Armed with this intelligence, AccelOps virtualization management capabilities answer such questions as:

  • What is our overall VM environment health; ESX server, VM, Guest OS and application performance, availability and security status, issues and trend?
  • Did a configuration change, patch or VM provision impact business services?
  • Are resource reservations resulting in VM contention?
  • Are network or hardware issues affecting ESX performance and availability?
  • Where is there excessive VM movement?
  • Are virtualized hardware investments accounted for and optimized?
  • How are VM and physical resources linked to business services?

To learn more about how AccelOps data center management software complements VMware vCenter and enables organizations to enhance virtualization management, visit http://www.accelops.net/product/virtualization-management.php or visit us in the VMware Virtual Appliance Marketplace @ http://www.vmware.com/appliances/directory/296113.

Names mentioned herein may be trademarks and properties of their respective owners.