Reliability Testing: The Key to Ensuring Data Center Accuracy and Performance

Developments in cloud computing and the Internet of Things, along with the proliferation of mobile devices in everyday technology, have all made innovative, highly efficient data centers more necessary than ever. Ensuring that today’s increasingly complex data centers are agile, adaptable, distributed, efficient and intelligent is a challenge for the ages. However, to achieve and maintain the highest levels of accuracy and performance in these almost futuristic structures running at high temperatures and high speeds, it is essential to carefully consider the often underappreciated but critical world of testing.

Data center testing is the only way to ensure optimal performance, but that leaves the question of what exactly needs to be tested. Individual devices within the data center often also have a role to play in wider environments, where they may be exposed to a greater range of temperatures, for example. The level of qualification for these devices is significantly different from that for products designed for a data center closet, where they are not exposed to extreme elements.

The products designed to live in the data center are responsible for the reliability of the overall site: they need to send and receive data with almost no latency and get those ‘ones’ and ‘zeros’ successfully transmitted from chip to chip, from box to box, and from switch to switch. They need to do so while avoiding downtime and optimizing capacity within the data center. These are the high-reliability products required to avoid racks that sit unused or are used inefficiently, and to ensure that data and signal integrity are preserved, with data transmitted at minimal latency and at speeds close to real time. In short, the reliability of connector and cable assemblies is crucial to a data center’s performance.

The high cost of data center failure
Data center reliability is focused on uptime and achieving low latency, so when designing these products, long-term reliability and efficiency are the goal. Being able to rely on every cable’s performance, and on the consistency of that performance, is a given, but for end users the question is what benefits can be derived. One of the simplest ways to demonstrate the benefit of testing is to look at how untested or poorly tested products affect operator efficiency.

For example, if a product designed into a server switch has been poorly tested and an intermittency issue is discovered only after a raft of servers has been plugged in, the cost of changing out or repairing components or racks once they are in place is immense, not just in terms of downtime but also in personnel costs. Guidelines state that engineers can only operate in the ‘hot aisle’ for 15 minutes, with work/rest schedules dictating at least an hour to recover. Multiply this across a site with up to 5000 racks, and the impact on uptime could be vast. Highly skilled operators who should be spending their time increasing capacity end up sidelined, dealing with issues that could and should have been addressed long before installation.
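
To see how the work/rest limit stretches a repair, the arithmetic can be sketched in a few lines of Python. This is a minimal illustration rather than an operational rule: the function name `elapsed_minutes` is hypothetical, the 15-minute work block and one-hour minimum rest come from the guideline above, and the sketch assumes a single technician who rests after every work block except the last.

```python
import math


def elapsed_minutes(hands_on_minutes, work_block=15, rest_block=60):
    """Lower bound on wall-clock minutes for one technician's hot-aisle task.

    work_block and rest_block default to the 15-minute limit and the
    one-hour minimum recovery cited in the text. Assumption: a rest follows
    every work block except the final one.
    """
    if hands_on_minutes <= 0:
        return 0
    blocks = math.ceil(hands_on_minutes / work_block)
    return hands_on_minutes + (blocks - 1) * rest_block


# One hour of hands-on work needs four 15-minute blocks and three rests.
print(elapsed_minutes(60))  # 240
```

Under these assumptions, an hour of actual repair work occupies at least four hours of a technician's day, before any time spent diagnosing the fault.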

The Open19 Project, an innovation initiative focused on an open hardware platform for data center and edge hardware, addresses how the efficiency of data center operations can be improved. Improving hardware interoperability underpins this initiative, with hardware standards being developed to enable computer, storage and network manufacturers to create differentiated hardware solutions while protecting their own competitive intellectual property. A more standardized approach such as that proposed under Open19 would typically focus on quick swap-out and bring-up times. Using a more innovative approach in place of traditional cables can reduce the time taken from around 6 hours to more like 75 minutes.

The ability to think laterally, and not just with a narrow product focus, has driven investigation into the possible benefits of liquid cooling systems for 112 Gbps. These innovative solutions make reliable product performance even more important.

Further, while the pursuit of greater bandwidth rules the communications marketplace, customers may struggle economically to cool their systems for that next data-rate jump. Forward-thinkers are also looking at ways to replace much of the PCBA structure within their systems, on the server side as well as the switch, to allow greater modularity and flexibility in system design. This approach could include removing some of the materials that impede air flow, adding to the range of possible techniques, like implementing Bipass, for reducing thermal issues.

Organizations that have adopted a “cooler is better” approach report overall savings of as much as 25% in energy consumption per rack. For a 10,000-rack data center, this could translate to seven-digit savings per year.
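
A rough calculation shows how a per-rack percentage scales to a site-wide figure. The sketch below is illustrative only: the 10,000 racks and the 25% saving come from the text, while the 2 kW average rack draw, the 0.10 USD per kWh tariff, and the function name `annual_savings_usd` are assumptions chosen for the example, not reported values.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days


def annual_savings_usd(racks, avg_rack_kw, usd_per_kwh, saving_fraction):
    """Yearly energy-cost saving for a site, given a per-rack saving fraction."""
    annual_kwh = racks * avg_rack_kw * HOURS_PER_YEAR
    return annual_kwh * usd_per_kwh * saving_fraction


# Hypothetical inputs: 2 kW average draw per rack and 0.10 USD per kWh.
# Only the 10,000 racks and the 25% saving come from the text.
savings = annual_savings_usd(10_000, 2.0, 0.10, 0.25)
print(f"{savings:,.0f} USD per year")  # 4,380,000 USD per year
```

With these assumed inputs the result lands in the seven-digit range. Because the formula is linear, a different rack power or tariff scales the figure proportionally.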

Designing for optimal signal integrity
The data center brings together myriad circuit and component interfaces. In this environment, an impedance shift or impedance mismatch will impact electrical performance. This means that manufacturing must be maintained at specified tolerances to consistently produce properly operating products.
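
The effect of an impedance mismatch can be quantified with the standard transmission-line reflection coefficient, Γ = (Z_load − Z_ref) / (Z_load + Z_ref). The Python sketch below applies that relation. The function names are hypothetical, and the 100-ohm and 110-ohm values are illustrative assumptions, not figures from any product specification.

```python
import math


def reflection_coefficient(z_load, z_ref):
    """Fraction of the incident voltage wave reflected at an impedance step."""
    return (z_load - z_ref) / (z_load + z_ref)


def return_loss_db(z_load, z_ref):
    """Return loss in dB; larger values mean less reflected energy."""
    gamma = abs(reflection_coefficient(z_load, z_ref))
    if gamma == 0:
        return math.inf  # perfectly matched: nothing is reflected
    return -20 * math.log10(gamma)


# Illustrative values only: a 100-ohm reference meeting a 110-ohm section.
print(round(reflection_coefficient(110, 100), 4))  # 0.0476
print(round(return_loss_db(110, 100), 1))          # 26.4
```

In this example a 10% impedance step reflects just under 5% of the incident voltage, which is why small mechanical deviations that shift impedance matter at high data rates.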

How can this be achieved? Molex continuously innovates in the processes, protocols and materials used to manufacture products. This is driven by the need to reduce product design complexity while complying with strictly specified tolerances.

When a chip sends a one or a zero to another chip within another box, the receiving chip must be able to read that one or zero unambiguously. Testing then verifies that the product will work as designed within the tolerances specified. This process ensures that the product will consistently perform to the customer requirement, from part to part and from lot to lot, without variation such as packet loss and intermittency that would disrupt the signaling.

Crunching more and more data
Chips are becoming more powerful, with greater bandwidth both on the Ethernet side and within the server. Between the network switch and the server, two different devices drive the data rates. The result is more processing power to crunch far more data in less time. It also means more signal transmission and throughput, raising the level of performance needed to run high-speed networks and applications.

The building blocks of a successful transmission for these accelerating data flows are the connections, whether cables, PCBs, or whatever else lies on the path between the processor and the end of the channel. The aim is to ensure a seamless transition between those building blocks, over the connectors and the cables, to the receiving (RX) side.

This in turn places considerable demands upon innovative interconnect specialists, who need to be very precise in how they handle and comply with the multitude of factors involved, including guaranteeing the tiny mechanical tolerance ranges of products. In the case of the data center, mechanical variation ultimately means electrical variation.

Looking at a data center macroscopically prompts several critical considerations. What happens if there are data continuity issues? From a quality perspective, as more and more parts, products, cables, and switches are progressively built into the rack, what exactly happens to the data center if flipping a switch leads to sudden intermittency issues or some sort of loss of signal issue?

This real-world scenario would typically require replacing bad servers and bad switches, working in the hot aisle to unplug and re-plug cables, and then testing the rack again. Clearly, data center productivity suffers when a technician must spend valuable time in the hot aisle rather than installing equipment and getting new assets and capacity up and running.

Valuable lessons learned
Avoiding the types of scenarios outlined above requires consistency in the connection product. Industry benchmarking has clearly defined the many issues that can arise from electrical inconsistency, and it helps mark the path forward.

Innovation in the manufacture of these critical products has never been more essential. Molex’s aggressive digital transformation initiative has resulted in a global manufacturing strategy that delivers consistency, especially for miniaturized components, while also allowing cooling elements to do their job. The company’s substantial multi-year investment delivers fine-tuned processes and resources to achieve both volume and consistency while implementing automated production.

The reality is that connectivity is what delivers integrity — signal integrity and power integrity above all. When the integrity of the data center is maintained, performance is maintained across a variety of complex systems.

Safe, effective data center solutions are only possible by instituting comprehensive testing that encompasses each signal pair of every cable assembly produced, for consistent industry-leading quality control. This allows customers to rely on each cable’s performance and the consistency of that performance.

The rigor of this testing saves customers from having to conduct such tests themselves. This is not just a question of time and expense; it means that technically complex questions, such as those surrounding optical cabling connectivity and termination, are left in the hands of experts—those Molex engineers who continue to build on the company’s legacy of unmatched connectivity experience. 

The upshot? Effectively managing implementation (enabling immediate installation while removing the risk of troubleshooting faulty cables later) optimizes uptime, saves the customer time and money, and accelerates both time-to-market and time-to-ROI.

Director, New Product Development