06 Jul Are you protecting your Data on IBM i OS, a Company Asset?
Back up your LPAR profiles frequently; it takes only seconds to do. The Critical Console Data (CCD) backup is suitable for protecting the HMC settings and profile backups; however, the profile backup itself is quick and effective and protects against changes made to the profile.
These profile changes take effect once the LPAR/profile is restarted for any reason: a power failure, a reboot, and so on. If your Power System runs IBM i OS and is managed by a single HMC, profile backups are essential to keeping the system healthy. The profile is saved to the HMC profiles directory and on the Service Processor; these two copies should always be current mirror images of each other on a healthy system.
When a single HMC manages your systems, it is an excellent idea to set the profiles to autostart, so that if the HMC suffers a catastrophic failure and the managed system must be restarted after an unforeseen event such as a power failure, the partitions come back on their own. Saving the profile to a USB key for off-site storage is also a wise move. Finally, have a look at the wonderful HMC Scanner utility by Federico from IBM Italy: it is excellent for documenting system configuration, details how the HMC, VIOS, and profiles are structured, and produces a very useful Excel spreadsheet describing your system architecture layout.
Firmware/microcode runs and manages how the Service Processor works with the existing hardware. If the HMC manages your system, the main point is that the directed service maintenance procedures embedded in the HMC software work in conjunction with the Service Processor firmware. Keeping these two code levels in step is essential to protecting your system; it aids accurate problem determination and keeps the system running smoothly. Use IBM's support sites to check and maintain your HMC and Service Processor code, and aim to keep both at the maximum stability levels recommended by IBM for the best performance and smoothest operations for your business.
Single-level storage is an elegant logical abstraction in which every object is scattered across all of the disks (DASD, direct access storage devices) presented to the auxiliary storage pools (ASPs) on IBM i OS. This scattering can also present a problem: if you lose a disk to failure and there is no data protection (RAID or mirroring), the scattered object data is lost, resulting in a complete system loss and reload.
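To make the scattering risk concrete, here is a small Python sketch. It is purely illustrative, not IBM i internals: the object names, page counts, and round-robin placement are invented assumptions, but they show why one unprotected disk failure touches nearly every object.

```python
def scatter(objects, num_disks):
    """Map each page of each object onto disks round-robin (illustrative only)."""
    placement = {}
    for name, pages in objects.items():
        placement[name] = [p % num_disks for p in range(pages)]
    return placement

def objects_hit_by_failure(placement, failed_disk):
    """Every object with at least one page on the failed disk is damaged."""
    return sorted(n for n, disks in placement.items() if failed_disk in disks)

# Hypothetical ASP contents: object name -> number of pages.
asp = {"PAYROLL": 10, "ORDERS": 7, "QGPL_OBJ": 3}
layout = scatter(asp, num_disks=4)

# Losing disk 2 damages every object here -- hence the complete system
# reload when there is no RAID or mirroring underneath.
print(objects_hit_by_failure(layout, failed_disk=2))  # ['ORDERS', 'PAYROLL', 'QGPL_OBJ']
```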
It is imperative always to protect your data on disk using your method of choice, RAID or mirroring, as provided by the manufacturer. RAID is implemented in hardware, whereas mirroring is implemented by IBM i OS software. Compare this with AIX, which does not scatter-load data and uses a Journaled File System (JFS): there, a single drive failure may not cause a complete system loss, only a restore of the data that resided on that hdisk's volume group.
RAID 5 is widely used on IBM i OS to protect data; depending on your hardware, it is wise to implement RAID 5 with a hot spare at a minimum when architecting your data protection scheme. Mirroring is the ultimate protection methodology but requires double the disk, as every object is written twice. When architecting and planning your data protection methodology, also protect the cache data running through the disk/DASD I/O adapters (IOAs). The cache/DASD/IOA triad must remain healthy for the system to continue to run: if your system has a single disk IOA and suffers a run-time failure in the adapter cache circuitry before the cache has been successfully de-staged, system data loss is likely, and downtime is never a welcome visitor.
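The capacity trade-off between these schemes can be sketched as a back-of-the-envelope calculation. This is a simplification under stated assumptions: identical disk sizes, whole-disk parity equivalents, and no formatting overhead; the drive counts and sizes below are examples, not a recommendation.

```python
def usable_capacity_gb(num_disks, disk_size_gb, scheme, hot_spares=0):
    """Approximate usable capacity for common protection schemes (sketch only)."""
    data_disks = num_disks - hot_spares
    if scheme == "raid5":        # the capacity of one disk goes to parity
        data_disks -= 1
    elif scheme == "raid6":      # the capacity of two disks goes to parity
        data_disks -= 2
    elif scheme == "mirror":     # every object is written twice
        data_disks //= 2
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return data_disks * disk_size_gb

# Eight 283 GB drives: RAID 5 plus a hot spare vs. full mirroring.
print(usable_capacity_gb(8, 283, "raid5", hot_spares=1))  # 1698
print(usable_capacity_gb(8, 283, "mirror"))               # 1132
```

The numbers illustrate the article's point: mirroring costs half the raw capacity, while RAID 5 with a hot spare gives up only two disks' worth.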
If you are unsure how to verify that dual IOAs are implemented, use the IBM System Planning Tool to check your system; it will guide you through the cache protection setup. RAID 5, RAID 6, or mirroring protects the disks and data, but the disk adapter circuitry and the Licensed Internal Code (SLIC) always drive the cache hardware on the disk IOA, so the adapter's cache must be protected as well. There should always be redundancy for the cache, implemented with dual IOAs, to minimize cache data loss during a failure.
Some of the most feared codes on IBM i are System Reference Codes (SRCs) A6xx 0255 and A6xx 0266, which essentially indicate a loss of access to the primary load source disk; all the disks must communicate with this load source disk, designated unit #1 in the ASP. The most crucial takeaway is not to reboot the system as a knee-jerk reaction. Call your service provider: they will help you understand and capture all the relevant information associated with these SRCs and execute the correct problem isolation procedure for the best possible outcome.
Certain critical reports, accessed via the Dedicated Service Tools menus or the System Service Tools subset menus, are required by your hardware service provider to help them understand your hardware topology. Always keep a copy of your HMC system plan. Printing the System Configuration/Rack Configuration reports is beneficial; print them with the options for logicals and physicals, and print them for each LPAR/profile. These reports display the buses in decimal format and provide excellent references when cross-referencing SRC words 1–9 after a severe fault. SRCs are displayed in hex in specific words and nibbles, and the SRC format dictates which nibbles in which words contain the information we, as your provider, need to troubleshoot the issue.
Hardware reference codes are presented in eight formats, namely 13, 17, 27, 29, 60, 61, 62, and 63. Your service provider must understand these formats comprehensively to apply the correct Problem Isolation Procedure (PIP) when specific errors occur. It is imperative to capture words 1–9 correctly, via the HMC or via the operator panel if the system is not HMC-managed; these words help identify the failing subsystem. The Reference Code History option in the HMC task area gives useful information on these words, and Manage Serviceable Events will indicate the nature of the error and its cause.
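Since the formats describe which nibbles in which words carry the diagnostic detail, a tiny hypothetical helper can split captured words into nibble values for lookup against IBM's format tables (the tables themselves are not reproduced here, and the sample word is invented):

```python
def src_words_to_nibbles(words):
    """Split each captured SRC word (8 hex characters) into its nibble
    values, keyed by word number, for matching against the format tables."""
    nibbles = {}
    for idx, word in enumerate(words, start=1):
        w = word.strip().upper()
        if len(w) != 8 or any(c not in "0123456789ABCDEF" for c in w):
            raise ValueError(f"word {idx}: expected 8 hex characters, got {word!r}")
        nibbles[idx] = [int(c, 16) for c in w]
    return nibbles

# An invented example word, split into its 8 nibbles.
print(src_words_to_nibbles(["B7001720"])[1])  # [11, 7, 0, 0, 1, 7, 2, 0]
```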
Monitoring software of your choice is vital to ensure all errors are captured, documented, and presented to the service provider for analysis. The system operator message queue should ideally be in *BREAK mode so that no critical messages are overlooked. System reference codes ending in 8008 or 9031 should be reported to your provider immediately; these are cache battery failures and disk parity array failures. In place of hardware monitoring software, you can use the CLI command ‘PRTERRLOG *ALLSUM’ to get very useful output for capturing and understanding hardware errors; running it at a daily or weekly interval is a wise move. Another useful CLI command is ‘WRKPRB’, which shows all problems reported to the problem queue.
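If you collect SRC strings from a monitoring feed or from the error-log summary, a small filter can flag the two codes called out above. This is a hypothetical helper, and the sample log entries are invented:

```python
# Suffixes the article flags for immediate reporting:
# 8008 = cache battery failure, 9031 = disk parity array failure.
CRITICAL_SUFFIXES = ("8008", "9031")

def needs_immediate_report(srcs):
    """Return the SRCs whose last four characters flag a critical failure."""
    return [s for s in srcs if s.strip().upper().endswith(CRITICAL_SUFFIXES)]

sample_log = ["B7001720", "A6000255", "57358008", "29B49031"]
print(needs_immediate_report(sample_log))  # ['57358008', '29B49031']
```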
POWER, as in IBM POWER Systems, is an acronym for ‘Performance Optimization With Enhanced RISC’. These systems display two types of errors, device and platform. A power supply failure is an example of a platform error, which every LPAR/profile will report and log; a disk/DASD failure is an example of a device error, which only the affected partition/profile will log.
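The platform-versus-device distinction tells you which partitions' logs to check for a given fault. A minimal sketch of the rule (the partition names are illustrative):

```python
def partitions_logging(error_scope, all_partitions, owning_partition):
    """Which LPARs will log an error of the given scope?"""
    if error_scope == "platform":   # e.g. a power supply failure
        return list(all_partitions)
    if error_scope == "device":     # e.g. a disk/DASD failure
        return [owning_partition]
    raise ValueError(f"unknown scope: {error_scope}")

lpars = ["PROD", "DEV", "TEST"]
print(partitions_logging("platform", lpars, "PROD"))  # ['PROD', 'DEV', 'TEST']
print(partitions_logging("device", lpars, "PROD"))    # ['PROD']
```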
Dedicated Service Tools (DST) is a menu-driven service interface your provider uses to check and service errors; System Service Tools (SST) is a subset of the DST menus. The service tools passwords must be known and readily available so access to these tools can be granted, and the Advanced System Management Interface (ASMI) password must likewise be known and available on request. These interfaces and tools are critical for service.
The IBM i OS CLI commands outlined above will significantly assist in drilling into the what, where, who, and why of a failure.