Memcached

The Experts below are selected from a list of 1287 Experts worldwide ranked by ideXlab platform

Thomas F. Wenisch - One of the best experts on this subject based on the ideXlab platform.

  • Thin servers with smart pipes: designing SoC accelerators for Memcached
    Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13), 2013
    Co-Authors: Kevin T. Lim, David Meisner, Ali G. Saidi, Parthasarathy Ranganathan, Thomas F. Wenisch
    Abstract:

    Distributed in-memory key-value stores, such as Memcached, are central to the scalability of modern internet services. Current deployments use commodity servers with high-end processors. However, given the cost-sensitivity of internet services and the recent proliferation of volume low-power System-on-Chip (SoC) designs, we see an opportunity for alternative architectures. We undertake a detailed characterization of Memcached to reveal performance and power inefficiencies. Our study considers both high-performance and low-power CPUs and NICs across a variety of carefully designed benchmarks that exercise the range of Memcached behavior. We discover that, regardless of CPU microarchitecture, Memcached execution is remarkably inefficient, saturating neither network links nor available memory bandwidth. Instead, we find performance is typically limited by the per-packet processing overheads in the NIC and OS kernel: long code paths limit CPU performance due to poor branch predictability and instruction fetch bottlenecks. Our insights suggest that neither high-performance nor low-power cores provide a satisfactory power-performance trade-off, and point to a need for tighter integration of the network interface. Hence, we argue for an alternate architecture, Thin Servers with Smart Pipes (TSSP), for cost-effective high-performance Memcached deployment. TSSP couples an embedded-class low-power core to a Memcached accelerator that can process GET requests entirely in hardware, offloading both network handling and data lookup. We demonstrate the potential benefits of our TSSP architecture through an FPGA prototyping platform, and show the potential for a 6X-16X power-performance improvement over conventional server baselines.
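
To make concrete what "processing a GET request" involves, here is a minimal software sketch of the GET path in the Memcached text protocol: parse the request line, look the key up in a table, and format the reply. It is not the TSSP accelerator or Memcached's own code; `handle_get` and the `store` dictionary are hypothetical names chosen for illustration, and the sketch handles only a single-key `get`.

```python
# Toy model of the GET path: parse a text-protocol request, look up the key,
# and format the reply. `handle_get` and `store` are hypothetical names used
# for illustration; this is not the accelerator design from the paper.

def handle_get(request: bytes, store: dict[bytes, tuple[int, bytes]]) -> bytes:
    """Serve one single-key text-protocol get request from an in-memory table."""
    line = request.rstrip(b"\r\n")
    parts = line.split(b" ")
    if len(parts) != 2 or parts[0] != b"get":
        return b"ERROR\r\n"              # anything else is out of scope here
    key = parts[1]
    hit = store.get(key)                 # the "data lookup" step
    if hit is None:
        return b"END\r\n"                # a miss is an empty result set
    flags, data = hit
    header = (b"VALUE " + key + b" " + str(flags).encode()
              + b" " + str(len(data)).encode() + b"\r\n")
    return header + data + b"\r\n" + b"END\r\n"


if __name__ == "__main__":
    table = {b"user:1": (0, b"alice")}   # value stored as (flags, data)
    print(handle_get(b"get user:1\r\n", table))
```

The logic is only a few steps of parsing and one table lookup, which is consistent with the abstract's observation that per-packet overheads in the NIC and kernel, rather than the lookup itself, tend to limit performance.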

Kevin T. Lim - One of the best experts on this subject based on the ideXlab platform.

  • Thin servers with smart pipes: designing SoC accelerators for Memcached
    Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13), 2013
    Co-Authors: Kevin T. Lim, David Meisner, Ali G. Saidi, Parthasarathy Ranganathan, Thomas F. Wenisch

  • FPGA - An FPGA Memcached appliance
    Proceedings of the ACM SIGDA international symposium on Field programmable gate arrays - FPGA '13, 2013
    Co-Authors: Sai Rahul Chalamalasetti, Kevin T. Lim, Parthasarathy Ranganathan, Mitch Wright, Alvin Auyoung, Martin Margala
    Abstract:

    Providing low-latency access to large amounts of data is one of the foremost requirements for many web services. To address these needs, systems such as Memcached have been created which provide a distributed, all in-memory key-value store. These systems are critical and often deployed across hundreds or thousands of servers. However, these systems are not well matched for commodity servers, as they require significant CPU resources to achieve reasonable network bandwidth, yet the core Memcached functions do not benefit from the high performance of standard server CPUs. In this paper, we demonstrate the design of an FPGA-based Memcached appliance. We take Memcached, a complex software system, and implement its core functionality on an FPGA. By leveraging the FPGA's design and utilizing its customizable logic to create a specialized appliance we are able to tightly integrate networking, compute, and memory. This integration allows us to overcome many of the bottlenecks found in standard servers. Our design provides performance on-par with baseline servers, but consumes only 9% of the power of the baseline. Scaled out, we see benefits at the data center level, substantially improving the performance-per-dollar while improving energy efficiency by 3.2X to 10.9X.

Dhabaleswar K. Panda - One of the best experts on this subject based on the ideXlab platform.

  • IPDPS - High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits
    2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016
    Co-Authors: Dipti Shankar, Wasi-ur-rahman, Nusrat Sharmin Islam, Dhabaleswar K. Panda
    Abstract:

    High-performance, distributed key-value store-based caching solutions, such as Memcached, have played a crucial role in enhancing the performance of many Online and Offline Big Data applications. The advent of high-performance storage (e.g. NVMe SSD) and interconnects (e.g. InfiniBand) on modern clusters has directed several efforts towards employing 'RAM+SSD' hybrid storage architectures for key-value stores running over RDMA, in order to achieve high data retention, while maintaining low latency and high throughput. In this paper, we first perform a detailed analysis of the behavior of hybrid Memcached designs, and identify two major bottlenecks: the client-side wait for request completion and the server-side SSD I/O overhead. Based on this analysis, we propose new non-blocking API extensions for Memcached Set and Get operations, to support high data retention while trying to achieve near in-memory speeds. We enhance the existing runtime designs on both the client and the server, and propose an adaptive slab manager with different I/O schemes for higher throughput. We demonstrate that LibMemcached-based applications can achieve high performance by exploiting the communication/computation overlap that is made possible by the proposed non-blocking API extensions, with either In-memory or SSD-assisted designs of RDMA-based Memcached. Performance evaluations show that the proposed extensions and designs can achieve up to 16x improvement for Memcached Set/Get latency over current hybrid design for RDMA-Memcached when not all data fits in memory, and up to 3.6x improvement over pure in-memory design of default Memcached over 'IP-over-IB' when all data can fit in memory.
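
The communication/computation overlap described above can be illustrated with a small sketch. This is not the paper's API or libmemcached: `ToyNonBlockingClient`, `iget`, `iset`, and `fetch_with_overlap` are hypothetical names, a thread pool stands in for the network, and a sleep stands in for network or SSD latency. The point is only the pattern: issue requests, keep computing, and wait for completion when the values are needed.

```python
# Sketch of non-blocking Set/Get with communication/computation overlap.
# All names here are hypothetical; a thread pool plus sleep() stands in for
# the real transport and storage latency.
import time
from concurrent.futures import Future, ThreadPoolExecutor


class ToyNonBlockingClient:
    """In-process stand-in for a key-value client with non-blocking calls."""

    def __init__(self, delay_s: float = 0.05) -> None:
        self._data: dict[str, bytes] = {}
        self._delay_s = delay_s                      # simulated I/O latency
        self._pool = ThreadPoolExecutor(max_workers=4)

    def _slow_get(self, key: str):
        time.sleep(self._delay_s)
        return self._data.get(key)

    def _slow_set(self, key: str, value: bytes) -> bool:
        time.sleep(self._delay_s)
        self._data[key] = value
        return True

    def iget(self, key: str) -> Future:
        """Issue a Get and return immediately with a completion handle."""
        return self._pool.submit(self._slow_get, key)

    def iset(self, key: str, value: bytes) -> Future:
        """Issue a Set and return immediately with a completion handle."""
        return self._pool.submit(self._slow_set, key, value)

    def close(self) -> None:
        self._pool.shutdown()


def fetch_with_overlap(client: ToyNonBlockingClient, keys: list[str]):
    pending = [client.iget(k) for k in keys]         # issue every request first
    local_work = sum(i * i for i in range(10_000))   # compute while they fly
    values = [f.result() for f in pending]           # wait only when needed
    return values, local_work


if __name__ == "__main__":
    client = ToyNonBlockingClient()
    client.iset("a", b"1").result()
    print(fetch_with_overlap(client, ["a", "missing"]))
    client.close()
```

A blocking client would pay the simulated latency once per key before starting any local work; here the waits overlap with each other and with the computation.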

  • ISPASS - Can RDMA benefit online data processing workloads on Memcached and MySQL
    2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2015
    Co-Authors: Dipti Shankar, Jithin Jose, Nusrat Sharmin Islam, Md Wasiurrahman, Dhabaleswar K. Panda
    Abstract:

    At the onset of the widespread usage of social networking services in the Web 2.0/3.0 era, leveraging a distributed and scalable caching layer like Memcached is often invaluable to application server performance. Since a majority of the existing clusters today are equipped with modern high speed interconnects such as InfiniBand, that offer high bandwidth and low latency communication, there is potential to improve the response time and throughput of the application servers, by taking advantage of advanced features like RDMA. We explore the potential of employing RDMA to improve the performance of Online Data Processing (OLDP) workloads on MySQL using Memcached for real-world web applications.

  • PABS@ICPE - Accelerating Big Data Processing on Modern Clusters
    Proceedings of the 1st Workshop on Performance Analysis of Big Data Systems - PABS '15, 2015
    Co-Authors: Dhabaleswar K. Panda
    Abstract:

    Modern clusters feature multi-/many-core architectures, high-performance RDMA-enabled interconnects, and SSD-based storage devices. The Hadoop framework is used extensively for Big Data processing, and the Spark framework is emerging for real-time analytics. Similarly, Memcached is used in data centers serving Web 2.0 environments. This talk will provide an overview of the challenges in accelerating Hadoop, Spark, and Memcached on modern clusters. An overview of RDMA-based designs for multiple components of Hadoop (HDFS, MapReduce, RPC, and HBase), Spark, and Memcached will be presented. Performance benefits of these designs on various cluster configurations will be shown. The talk will also address the need for designing benchmarks using a multi-layered and systematic approach, which can be used to evaluate the performance of this middleware.

  • Big Data - Benchmarking key-value stores on high-performance storage and interconnects for web-scale workloads
    2015 IEEE International Conference on Big Data (Big Data), 2015
    Co-Authors: Dipti Shankar, Wasi-ur-rahman, Nusrat Sharmin Islam, Dhabaleswar K. Panda
    Abstract:

    Leveraging a distributed key-value based caching layer has proven to be invaluable for scalable data-intensive web applications. With the emergence of high-performance storage (e.g. SSD) and interconnects (e.g. InfiniBand) on modern clusters, several efforts are being made to design high-performance key-value stores that can operate well with ‘RAM+SSD’ hybrid storage architecture. This has made it essential for us to design micro-benchmarks that are tailored to evaluate these upcoming, hybrid designs. In this paper, we study popular web-scale and cloud serving workloads, to identify different application-specific aspects, including commonly occurring data request distributions, update patterns, and environmental factors, that affect the performance of hybrid key-value stores. Based on these characterization studies, we propose a micro-benchmark suite that can be used to study high-performance, hybrid key-value stores on modern clusters, from the perspectives of both the application and the key-value store. We demonstrate its ease-of-use using database-integrated and stand-alone execution modes. Performance evaluations with different Memcached distributions, such as SSD-Assisted RDMA-Memcached, fatcache, and twemcache, over different networks/protocols, show that ‘SSD+RDMA’ can significantly enhance the performance of Memcached for various read-only and read-heavy workloads, that are representative of several common web-scale workloads.

Paul Lu - One of the best experts on this subject based on the ideXlab platform.

  • Low-Latency Caching for Cloud-Based Web Applications
    Work, 2011
    Co-Authors: Adam Wolfe Gordon, Paul Lu
    Abstract:

    Many Web applications are now hosted in elastic cloud environments where the unit of resource allocation is a virtual machine (VM) instance; entire VMs are added or removed to scale up or scale down. A variety of techniques can reduce the latency of communication between VMs co-located on the same server in, say, a private cloud. For example, paravirtualized network mechanisms (e.g., vhost and virtio in Linux KVM) can optimize the number of protection boundary crossings. Inter-VM shared memory can further reduce boundary crossings after setting up a shared region. We present the design, implementation, and an evaluation of Nahanni Memcached, a port of the well-known memcached that uses inter-VM shared memory instead of a virtual network for cache reads. As a widely deployed cache for back-end datastores and databases, Memcached's latency is important to the performance of many well-known web sites (e.g., Facebook, Twitter) and cloud platforms (e.g., Google's App Engine). Although using shared-memory IPC is a well-known strategy, the recent introduction of the ivshmem inter-VM shared memory mechanism (also known as Nahanni) to Linux KVM makes the strategy practical for virtual machines. Using the Yahoo Cloud Serving Benchmark, we confirm the intuition that Nahanni Memcached can reduce the latency of cache read operations by up to 86%, and that given reasonable hit rates, this can reduce the total latency of read-related operations for a workload by up to 45% compared to standard Memcached. When using the experimental paravirtualized vhost networking mechanism in Linux KVM, Nahanni Memcached offers a smaller, but still significant, advantage of 29%.
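
The dependence on hit rate mentioned above can be seen in a simple cache-aside latency model. This is an illustrative assumption, not the paper's methodology: every read pays the cache read latency, and a miss additionally pays the backend latency. The function names and the numbers in the usage lines are hypothetical and are not taken from the paper's measurements.

```python
# Simple cache-aside model (an assumption for illustration): every read pays
# the cache latency, and a miss additionally pays the backend latency.

def expected_read_latency(hit_rate: float, cache_us: float, backend_us: float) -> float:
    """Expected latency of one read, in microseconds."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return cache_us + (1.0 - hit_rate) * backend_us


def total_reduction(hit_rate: float, cache_us: float, backend_us: float,
                    cache_cut: float) -> float:
    """Fraction by which total read latency falls when the cache read
    latency alone is cut by the fraction `cache_cut`."""
    before = expected_read_latency(hit_rate, cache_us, backend_us)
    after = expected_read_latency(hit_rate, cache_us * (1.0 - cache_cut), backend_us)
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical latencies, chosen only to show the trend across hit rates.
    for h in (1.0, 0.9, 0.5):
        print(h, total_reduction(h, cache_us=100.0, backend_us=1000.0, cache_cut=0.86))
```

Under this model a faster cache read carries through in full only at a perfect hit rate; as misses grow, backend time dominates and the overall gain shrinks, which is why a large reduction in cache read latency translates into a smaller reduction in total latency.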

David Meisner - One of the best experts on this subject based on the ideXlab platform.

  • Thin servers with smart pipes: designing SoC accelerators for Memcached
    Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13), 2013
    Co-Authors: Kevin T. Lim, David Meisner, Ali G. Saidi, Parthasarathy Ranganathan, Thomas F. Wenisch