<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Big Iron-y</title>
	<atom:link href="http://www.ldscio.org/2007/03/21/big-iron-y/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.ldscio.org/2007/03/21/big-iron-y/</link>
	<description>Chief Information Officer for The Church of Jesus Christ of Latter-day Saints</description>
	<lastBuildDate>Mon, 26 Jul 2010 01:33:29 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Mr. Blue</title>
		<link>http://www.ldscio.org/2007/03/21/big-iron-y/comment-page-1/#comment-990</link>
		<dc:creator>Mr. Blue</dc:creator>
		<pubDate>Thu, 03 May 2007 21:20:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.ldscio.org/2007/03/21/big-iron-y/#comment-990</guid>
		<description>&gt;&gt;Finally they tried to upsell us to a higher quality SAN at a deep discount.

You gotta love when they agree their own product is junk ;) Hope everything worked out.</description>
		<content:encoded><![CDATA[<p>&gt;&gt;Finally they tried to upsell us to a higher quality SAN at a deep discount.</p>
<p>You gotta love when they agree their own product is junk <img src='http://www.ldscio.org/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  Hope everything worked out.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Paul Hepworth</title>
		<link>http://www.ldscio.org/2007/03/21/big-iron-y/comment-page-1/#comment-978</link>
		<dc:creator>Paul Hepworth</dc:creator>
		<pubDate>Mon, 30 Apr 2007 02:00:03 +0000</pubDate>
		<guid isPermaLink="false">http://www.ldscio.org/2007/03/21/big-iron-y/#comment-978</guid>
		<description>The only thing worse than not having redundancy is thinking you have it and then relaxing.

Unless fail-over configurations are tested frequently they will not work when you need them.</description>
		<content:encoded><![CDATA[<p>The only thing worse than not having redundancy is thinking you have it and then relaxing.</p>
<p>Unless fail-over configurations are tested frequently they will not work when you need them.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mr. Blue</title>
		<link>http://www.ldscio.org/2007/03/21/big-iron-y/comment-page-1/#comment-967</link>
		<dc:creator>Mr. Blue</dc:creator>
		<pubDate>Thu, 26 Apr 2007 03:59:31 +0000</pubDate>
		<guid isPermaLink="false">http://www.ldscio.org/2007/03/21/big-iron-y/#comment-967</guid>
		<description>It certainly is worth it, if done right. The fact that it took 30 minutes to figure out it was the load balancer leads me to question both the competence of staff and your monitoring capabilities, or lack thereof. But I digress:

    * Written properly, monitoring scripts and scenarios can easily pinpoint the failure in your applications/systems. You need to monitor not only your balanced app, but also each individual app instance; doing so may have helped pinpoint the failure at the balancer.

    * How is moving to a new vendor going to solve a poorly-architected system? Most SANs I&#039;ve ever known of are crazily redundant out of the box. Either your datacenter was hit by a nuke, or a monkey architected your SAN, plain and simple.

&lt;em&gt;[Joel: Turns out the vendor agreed the SANs were bad. We tried new drivers, new hardware components, etc. Finally they tried to upsell us to a higher quality SAN at a deep discount.]&lt;/em&gt;</description>
		<content:encoded><![CDATA[<p>It certainly is worth it, if done right. The fact that it took 30 minutes to figure out it was the load balancer leads me to question both the competence of staff and your monitoring capabilities, or lack thereof. But I digress:</p>
<p>    * Written properly, monitoring scripts and scenarios can easily pinpoint the failure in your applications/systems. You need to monitor not only your balanced app, but also each individual app instance; doing so may have helped pinpoint the failure at the balancer.</p>
<p>    * How is moving to a new vendor going to solve a poorly-architected system? Most SANs I&#8217;ve ever known of are crazily redundant out of the box. Either your datacenter was hit by a nuke, or a monkey architected your SAN, plain and simple.</p>
<p><em>[Joel: Turns out the vendor agreed the SANs were bad. We tried new drivers, new hardware components, etc. Finally they tried to upsell us to a higher quality SAN at a deep discount.]</em></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark R. Bush</title>
		<link>http://www.ldscio.org/2007/03/21/big-iron-y/comment-page-1/#comment-964</link>
		<dc:creator>Mark R. Bush</dc:creator>
		<pubDate>Fri, 20 Apr 2007 12:35:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.ldscio.org/2007/03/21/big-iron-y/#comment-964</guid>
		<description>Testing!  You can plan, prevent, and pretend to protect all day long!  I&#039;ve been working in HA environments for nearly 10 years now, and if there is one lesson I&#039;ve learned, it&#039;s that you must test your failover/load-balancing mechanisms.  You typically test this behavior at the time of implementation.  But what about 60, 120, 240 days into it?  Regular, scheduled failure testing is the only way I know to minimize (notice: not eliminate completely) the risk in your HA environment.  The second thing I&#039;ve learned is that people are the biggest problem.  Your change management process needs to be disciplined, SAs need to trust and use it (one of the biggest challenges), and plans need to be made, tested, and followed.  Most technical types, IMHO, are process-oriented, and if you can get them to trust the change process and follow it, that too goes a long way.</description>
		<content:encoded><![CDATA[<p>Testing!  You can plan, prevent, and pretend to protect all day long!  I&#8217;ve been working in HA environments for nearly 10 years now, and if there is one lesson I&#8217;ve learned, it&#8217;s that you must test your failover/load-balancing mechanisms.  You typically test this behavior at the time of implementation.  But what about 60, 120, 240 days into it?  Regular, scheduled failure testing is the only way I know to minimize (notice: not eliminate completely) the risk in your HA environment.  The second thing I&#8217;ve learned is that people are the biggest problem.  Your change management process needs to be disciplined, SAs need to trust and use it (one of the biggest challenges), and plans need to be made, tested, and followed.  Most technical types, IMHO, are process-oriented, and if you can get them to trust the change process and follow it, that too goes a long way.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jacob</title>
		<link>http://www.ldscio.org/2007/03/21/big-iron-y/comment-page-1/#comment-952</link>
		<dc:creator>Jacob</dc:creator>
		<pubDate>Mon, 16 Apr 2007 06:16:47 +0000</pubDate>
		<guid isPermaLink="false">http://www.ldscio.org/2007/03/21/big-iron-y/#comment-952</guid>
		<description>I&#039;ve found that in our data center it is always the thing we least expect to fail that does.  That means that while redundant systems are good protection against foreseeable problems, they won&#039;t protect you against the problems you didn&#039;t anticipate.

I recommend really good centralized monitoring software, so that you can determine very quickly what the less-anticipated problems really are, fix them, and get back online as fast as possible.

You said: &quot;It took us 30 minutes to realize our applications were down because of the failed load balancer.&quot;  You could have reduced your downtime by a bunch of minutes if you had scripts or monitors watching your load balancer.  This sort of solution is generally much more affordable than the &quot;figure out if the load balancer is less than robust and replace it with a more robust system if it is&quot; approach.</description>
		<content:encoded><![CDATA[<p>I&#8217;ve found that in our data center it is always the thing we least expect to fail that does.  That means that while redundant systems are good protection against foreseeable problems, they won&#8217;t protect you against the problems you didn&#8217;t anticipate.</p>
<p>I recommend really good centralized monitoring software, so that you can determine very quickly what the less-anticipated problems really are, fix them, and get back online as fast as possible.</p>
<p>You said: &#8220;It took us 30 minutes to realize our applications were down because of the failed load balancer.&#8221;  You could have reduced your downtime by a bunch of minutes if you had scripts or monitors watching your load balancer.  This sort of solution is generally much more affordable than the &#8220;figure out if the load balancer is less than robust and replace it with a more robust system if it is&#8221; approach.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: R. Alvez</title>
		<link>http://www.ldscio.org/2007/03/21/big-iron-y/comment-page-1/#comment-900</link>
		<dc:creator>R. Alvez</dc:creator>
		<pubDate>Fri, 06 Apr 2007 01:56:23 +0000</pubDate>
		<guid isPermaLink="false">http://www.ldscio.org/2007/03/21/big-iron-y/#comment-900</guid>
		<description>I&#039;m a Unix guy and, as you probably know, Unix/Linux abounds in tools.
One tool I always run on my box is gkrellm, which incidentally has a plug-in to keep track of running servers. If one fails (as in a server becomes unavailable), I&#039;ll see it on the screen and get a pop-up, or I can set it to send me an e-mail.
So far it has never failed me; I&#039;m always the first one to know if something &quot;dies in action&quot; and can take countermeasures.
Sometimes the most effective tools are the simplest of them all ;-)</description>
		<content:encoded><![CDATA[<p>I&#8217;m a Unix guy and, as you probably know, Unix/Linux abounds in tools.<br />
One tool I always run on my box is gkrellm, which incidentally has a plug-in to keep track of running servers. If one fails (as in a server becomes unavailable), I&#8217;ll see it on the screen and get a pop-up, or I can set it to send me an e-mail.<br />
So far it has never failed me; I&#8217;m always the first one to know if something &#8220;dies in action&#8221; and can take countermeasures.<br />
Sometimes the most effective tools are the simplest of them all <img src='http://www.ldscio.org/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mario Hipol</title>
		<link>http://www.ldscio.org/2007/03/21/big-iron-y/comment-page-1/#comment-757</link>
		<dc:creator>Mario Hipol</dc:creator>
		<pubDate>Tue, 27 Mar 2007 21:06:03 +0000</pubDate>
		<guid isPermaLink="false">http://www.ldscio.org/2007/03/21/big-iron-y/#comment-757</guid>
		<description>I recently had a personal experience that taught me something about redundancy.  A while back I had an issue when installing a beta version of a &quot;newly released office system&quot; *ahem*.  During installation it asked me to reboot, and I blue-screened.  I was extremely distraught. I had all our family photos, etc. on that system, and I wasn&#039;t sure what to do. It just so happened that a month prior I had been testing a product that basically makes an ISO image of your system and restores it exactly to that point.  I searched all over and found that simple test file I had created on my system.  It was a lifesaver.

I have since come to realize that having things in place to recover information is extremely important, even on a personal level.</description>
		<content:encoded><![CDATA[<p>I recently had a personal experience that taught me something about redundancy.  A while back I had an issue when installing a beta version of a &#8220;newly released office system&#8221; *ahem*.  During installation it asked me to reboot, and I blue-screened.  I was extremely distraught. I had all our family photos, etc. on that system, and I wasn&#8217;t sure what to do. It just so happened that a month prior I had been testing a product that basically makes an ISO image of your system and restores it exactly to that point.  I searched all over and found that simple test file I had created on my system.  It was a lifesaver.</p>
<p>I have since come to realize that having things in place to recover information is extremely important, even on a personal level.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: WadeA</title>
		<link>http://www.ldscio.org/2007/03/21/big-iron-y/comment-page-1/#comment-742</link>
		<dc:creator>WadeA</dc:creator>
		<pubDate>Mon, 26 Mar 2007 19:35:53 +0000</pubDate>
		<guid isPermaLink="false">http://www.ldscio.org/2007/03/21/big-iron-y/#comment-742</guid>
		<description>People. Process. Technology. 

We&#039;ve all heard it before - I think - yet we often fail to put systems and methodologies in place that actually monitor all three. Sometimes it isn&#039;t about adding more protection and complexity; it&#039;s about simplifying and watching what exists more closely or more effectively, so you see the symptoms and recover more quickly.

You also have to be willing to invest in all three. Skimping on one usually means you&#039;ll be reinvesting in one of the other areas - skimp a few dollars on the technology and you&#039;ll likely invest in people and process to keep it in play. Neglect the growth of your staff and there&#039;s a good chance you&#039;ll be adding process or technology to compensate. Keep in mind that symptoms don&#039;t always point you to the area that is the REAL cause.

So what do you do to break the cycle? Take steps to understand where you are today. Are your existing processes effective in supporting your business objectives? Are your people able to use them effectively in their roles? Examine the source of your outages - I mean the REAL source. Just like any self-critique, you&#039;re going to learn things you don&#039;t like and that aren&#039;t fun to admit, but it&#039;s the only way to really get on the path to improvement.

I don&#039;t want to ramble on here, so lastly I&#039;ll just say: don&#039;t be afraid to get some outside help. Being the &quot;self-reliant&quot; culture that we are, it&#039;s not something that we&#039;re very good at doing or knowing when to do. That &quot;objective&quot; perspective can often provide valuable insight that can be learned in no other way.</description>
		<content:encoded><![CDATA[<p>People. Process. Technology. </p>
<p>We&#8217;ve all heard it before &#8211; I think &#8211; yet we often fail to put systems and methodologies in place that actually monitor all three. Sometimes it isn&#8217;t about adding more protection and complexity; it&#8217;s about simplifying and watching what exists more closely or more effectively, so you see the symptoms and recover more quickly.</p>
<p>You also have to be willing to invest in all three. Skimping on one usually means you&#8217;ll be reinvesting in one of the other areas &#8211; skimp a few dollars on the technology and you&#8217;ll likely invest in people and process to keep it in play. Neglect the growth of your staff and there&#8217;s a good chance you&#8217;ll be adding process or technology to compensate. Keep in mind that symptoms don&#8217;t always point you to the area that is the REAL cause.</p>
<p>So what do you do to break the cycle? Take steps to understand where you are today. Are your existing processes effective in supporting your business objectives? Are your people able to use them effectively in their roles? Examine the source of your outages &#8211; I mean the REAL source. Just like any self-critique, you&#8217;re going to learn things you don&#8217;t like and that aren&#8217;t fun to admit, but it&#8217;s the only way to really get on the path to improvement.</p>
<p>I don&#8217;t want to ramble on here, so lastly I&#8217;ll just say: don&#8217;t be afraid to get some outside help. Being the &#8220;self-reliant&#8221; culture that we are, it&#8217;s not something that we&#8217;re very good at doing or knowing when to do. That &#8220;objective&#8221; perspective can often provide valuable insight that can be learned in no other way.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bill Coulam</title>
		<link>http://www.ldscio.org/2007/03/21/big-iron-y/comment-page-1/#comment-721</link>
		<dc:creator>Bill Coulam</dc:creator>
		<pubDate>Sun, 25 Mar 2007 05:11:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.ldscio.org/2007/03/21/big-iron-y/#comment-721</guid>
		<description>The goal is certainly noble, and should be do-able.

But in practice, the sales pitches almost never measure up.

We once had some Sun clustering solution for our 12 CPU twin E4500s, and the only serious problems we ever really had with those machines were due to the clustering. We gave it more than its share of chances to measure up. Once we gave up and removed the cluster, we finally got to enjoy some of that legendary Sun reliability and never experienced unexpected downtime again.

Our next Tech Director purchased a Hitachi SAN that seemed to work alright.

Experiences with other HA solutions similar to yours seem to be the status quo, including horror stories around RAID 5 and tons of wasted effort and money on Oracle RAC. There is a semi-formal &quot;club&quot; of sorts, composed of many of the biggest names in Oracle, dedicated to stamping out all incarnations of RAID 5 (http://www.miracleas.com/BAARF/BAARF_members_sql.php).

So testing the brains out of one&#039;s HA solution seems to be the best recourse.</description>
		<content:encoded><![CDATA[<p>The goal is certainly noble, and should be do-able.</p>
<p>But in practice, the sales pitches almost never measure up.</p>
<p>We once had some Sun clustering solution for our 12 CPU twin E4500s, and the only serious problems we ever really had with those machines were due to the clustering. We gave it more than its share of chances to measure up. Once we gave up and removed the cluster, we finally got to enjoy some of that legendary Sun reliability and never experienced unexpected downtime again.</p>
<p>Our next Tech Director purchased a Hitachi SAN that seemed to work alright.</p>
<p>Experiences with other HA solutions similar to yours seem to be the status quo, including horror stories around RAID 5 and tons of wasted effort and money on Oracle RAC. There is a semi-formal &#8220;club&#8221; of sorts, composed of many of the biggest names in Oracle, dedicated to stamping out all incarnations of RAID 5 (<a href="http://www.miracleas.com/BAARF/BAARF_members_sql.php" rel="nofollow">http://www.miracleas.com/BAARF/BAARF_members_sql.php</a>).</p>
<p>So testing the brains out of one&#8217;s HA solution seems to be the best recourse.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dan Fitzpatrick</title>
		<link>http://www.ldscio.org/2007/03/21/big-iron-y/comment-page-1/#comment-717</link>
		<dc:creator>Dan Fitzpatrick</dc:creator>
		<pubDate>Sat, 24 Mar 2007 19:18:11 +0000</pubDate>
		<guid isPermaLink="false">http://www.ldscio.org/2007/03/21/big-iron-y/#comment-717</guid>
		<description>In my experience computer hardware is like race car hardware - the harder you push it the faster it breaks (specifically disks). My solution to reliability is to over-provision hardware so it rarely exceeds 50% utilization. Kind of like racing an F1 car at NASCAR. If disks are reading and writing at capacity 24/7, they will wear out much faster than if they run at less than 50% capacity. For disk reliability and scalability I prefer RAID 10 with disks from different manufacturers on each RAID 0 set.

I agree with Phillip Cox that fail-over testing is essential. It is our policy that we use our fail-over systems during all system upgrades. We fail over, upgrade the primary system, return the primary system to production, and then upgrade the fail-over system. You kill two birds with one stone by testing fail-over and upgrading in the same process. This basically trains the staff to deal with system failures as a normal upgrade (which would have to occur anyway to fix whatever broke).</description>
		<content:encoded><![CDATA[<p>In my experience computer hardware is like race car hardware &#8211; the harder you push it the faster it breaks (specifically disks). My solution to reliability is to over-provision hardware so it rarely exceeds 50% utilization. Kind of like racing an F1 car at NASCAR. If disks are reading and writing at capacity 24/7, they will wear out much faster than if they run at less than 50% capacity. For disk reliability and scalability I prefer RAID 10 with disks from different manufacturers on each RAID 0 set.</p>
<p>I agree with Phillip Cox that fail-over testing is essential. It is our policy that we use our fail-over systems during all system upgrades. We fail over, upgrade the primary system, return the primary system to production, and then upgrade the fail-over system. You kill two birds with one stone by testing fail-over and upgrading in the same process. This basically trains the staff to deal with system failures as a normal upgrade (which would have to occur anyway to fix whatever broke).</p>
]]></content:encoded>
	</item>
</channel>
</rss>
