Associative Array

The experts below are selected from a list of 1515 experts worldwide, as ranked by the ideXlab platform.

Kepner Jeremy - One of the best experts on this subject based on the ideXlab platform.

  • Mathematics of Digital Hyperspace
    'Institute of Electrical and Electronics Engineers (IEEE)', 2021
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Jananthan Hayden, Davis Timothy, Milechin Lauren
    Abstract:

    Social media, e-commerce, streaming video, e-mail, cloud documents, web pages, traffic flows, and network packets fill vast digital lakes, rivers, and oceans that we each navigate daily. This digital hyperspace is an amorphous flow of data supported by continuous streams that stretch standard concepts of type and dimension. The unstructured data of digital hyperspace can be elegantly represented, traversed, and transformed via the mathematics of hypergraphs, hypersparse matrices, and Associative Array algebra. This paper explores a novel mathematical concept, the semilink, that combines pairs of semirings to provide the essential operations for graph analytics, database operations, and machine learning. The GraphBLAS standard currently supports hypergraphs, hypersparse matrices, the mathematics required for semilinks, and seamlessly performs graph, network, and matrix operations. With the addition of key based indices (such as pointers to strings) and semilinks, GraphBLAS can become a richer Associative Array algebra and be a plug-in replacement for spreadsheets, database tables, and data centric operating systems, enhancing the navigation of unstructured data found in digital hyperspace.

    Comment: 9 pages, 8 figures, 2 tables, accepted to GrAPL 2021. arXiv admin note: text overlap with arXiv:1807.03165, arXiv:2004.01181, arXiv:1909.05631, arXiv:1708.0293

  • AI Data Wrangling with Associative Arrays
    2020
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Jananthan Hayden, Milechin Lauren, Samsi Siddharth
    Abstract:

    The AI revolution is data driven. AI "data wrangling" is the process by which unusable data is transformed to support AI algorithm development (training) and deployment (inference). Significant time is devoted to translating diverse data representations supporting the many query and analysis steps found in an AI pipeline. Rigorous mathematical representations of these data enable data translation and analysis optimization within and across steps. Associative Array algebra provides a mathematical foundation that naturally describes the tabular structures and set mathematics that are the basis of databases. Likewise, the matrix operations and corresponding inference/training calculations used by neural networks are also well described by Associative Arrays. More surprisingly, a general denormalized form of hierarchical formats, such as XML and JSON, can be readily constructed. Finally, pivot tables, which are among the most widely used data analysis tools, naturally emerge from Associative Array constructors. A common foundation in Associative Arrays provides interoperability guarantees, proving that their operations are linear systems with rigorous mathematical properties, such as associativity, commutativity, and distributivity, that are critical to reordering optimizations.

    Comment: 3 pages, 2 figures, 23 references, accepted for Northeast Database Day (NEDB) 2020. arXiv admin note: text overlap with arXiv:1907.0421

  • TabulaROSA: Tabular Operating System Architecture for Massively Parallel Heterogeneous Compute Engines
    'Institute of Electrical and Electronics Engineers (IEEE)', 2019
    Co-Authors: Kepner Jeremy, Jananthan Hayden, Brightwell Ron, Edelman Alan, Jones Michael, Okhravi Hamed, Gadepally Vijay N., Madden Samuel R., Michaleas Peter W., Pedretti Kevin
    Abstract:

    The rise in computing hardware choices is driving a reevaluation of operating systems. The traditional role of an operating system controlling the execution of its own hardware is evolving toward a model whereby the controlling processor is distinct from the compute engines that are performing most of the computations. In this context, an operating system can be viewed as software that brokers and tracks the resources of the compute engines and is akin to a database management system. To explore the idea of using a database in an operating system role, this work defines key operating system functions in terms of rigorous mathematical semantics (Associative Array algebra) that are directly translatable into database operations. These operations possess a number of mathematical properties that are ideal for parallel operating systems by guaranteeing correctness over a wide range of parallel operations. The resulting operating system equations provide a mathematical specification for a Tabular Operating System Architecture (TabulaROSA) that can be implemented on any platform. Simulations of forking in TabulaROSA are performed using an Associative Array implementation and compared to Linux on a 32,000+ core supercomputer. Using over 262,000 forkers managing over 68,000,000,000 processes, the simulations show that TabulaROSA has the potential to perform operating system functions on a massively parallel scale. The TabulaROSA simulations show 20x higher performance as compared to Linux while managing 2000x more processes in fully searchable tables.

    Funding: United States. Department of Defense. Assistant Secretary of Defense for Research & Engineering (Air Force Contract No. FA8721-05-C-0002); United States. Department of Defense. Assistant Secretary of Defense for Research & Engineering (Air Force Contract No. FA8702-15-D-0001)

  • A Billion Updates per Second Using 30,000 Hierarchical In-Memory D4M Databases
    2019
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Samsi Siddharth, Milechin Lauren, Arcand William, Bestor David, Bergeron William, Byun Chansup, Hubbell Matthew, Houle Michael
    Abstract:

    Analyzing large scale networks requires high performance streaming updates of graph representations of these data. Associative Arrays are mathematical objects combining properties of spreadsheets, databases, matrices, and graphs, and are well-suited for representing and analyzing streaming network data. The Dynamic Distributed Dimensional Data Model (D4M) library implements Associative Arrays in a variety of languages (Python, Julia, and Matlab/Octave) and provides a lightweight in-memory database. Associative Arrays are designed for block updates. Streaming updates to a large Associative Array require a hierarchical implementation to optimize the performance of the memory hierarchy. Running 34,000 instances of hierarchical D4M Associative Arrays on 1,100 server nodes on the MIT SuperCloud achieved a sustained update rate of 1,900,000,000 updates per second. This capability allows the MIT SuperCloud to analyze extremely large streaming network data sets.

    Comment: Northeast Database Data 2019 (MIT)

  • Streaming 1.9 Billion Hypersparse Network Updates per Second with D4M
    'Institute of Electrical and Electronics Engineers (IEEE)', 2019
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Samsi Siddharth, Milechin Lauren, Arcand William, Bestor David, Bergeron William, Byun Chansup, Hubbell Matthew, Houle Michael
    Abstract:

    The Dynamic Distributed Dimensional Data Model (D4M) library implements Associative Arrays in a variety of languages (Python, Julia, and Matlab/Octave) and provides a lightweight in-memory database implementation of hypersparse Arrays that are ideal for analyzing many types of network data. D4M relies on Associative Arrays which combine properties of spreadsheets, databases, matrices, graphs, and networks, while providing rigorous mathematical guarantees, such as linearity. Streaming updates of D4M Associative Arrays put enormous pressure on the memory hierarchy. This work describes the design and performance optimization of an implementation of hierarchical Associative Arrays that reduces memory pressure and dramatically increases the update rate into an Associative Array. The parameters of hierarchical Associative Arrays rely on controlling the number of entries in each level in the hierarchy before an update is cascaded. The parameters are easily tunable to achieve optimal performance for a variety of applications. Hierarchical Arrays achieve over 40,000 updates per second in a single instance. Scaling to 34,000 instances of hierarchical D4M Associative Arrays on 1,100 server nodes on the MIT SuperCloud achieved a sustained update rate of 1,900,000,000 updates per second. This capability allows the MIT SuperCloud to analyze extremely large streaming network data sets.

    Comment: 6 pages; 6 figures; accepted to IEEE High Performance Extreme Computing (HPEC) Conference 2019. arXiv admin note: text overlap with arXiv:1807.05308, arXiv:1902.0084
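
The cascading idea in the last abstract above (each level holds a bounded number of entries, and overflow is pushed into the next, larger level) can be illustrated with a small Python sketch. This is not the D4M implementation or its API: the class name `HierarchicalAssocArray`, the `cutoffs` parameter, and the use of plain dictionaries keyed by `(row, col)` pairs are all assumptions made for illustration only.

```python
from collections import defaultdict


class HierarchicalAssocArray:
    """Toy hierarchical associative array built from dict-backed levels.

    Level 0 absorbs every update. When a level holds more entries than
    its cutoff, its contents are added into the next level and it is
    cleared (a "cascade"). All names here are hypothetical.
    """

    def __init__(self, cutoffs):
        # cutoffs[i] is the assumed entry limit for level i; the final
        # level (index len(cutoffs)) is unbounded.
        self.cutoffs = list(cutoffs)
        self.levels = [defaultdict(int) for _ in range(len(self.cutoffs) + 1)]

    def update(self, row, col, value=1):
        """Add value at (row, col) in level 0, then cascade if needed."""
        self.levels[0][(row, col)] += value
        self._cascade()

    def _cascade(self):
        for i, cutoff in enumerate(self.cutoffs):
            if len(self.levels[i]) <= cutoff:
                break  # this level still has room; deeper levels are untouched
            nxt = self.levels[i + 1]
            for key, val in self.levels[i].items():
                nxt[key] += val
            self.levels[i].clear()

    def total(self):
        """Sum all levels into one ordinary dict (the full array)."""
        out = defaultdict(int)
        for level in self.levels:
            for key, val in level.items():
                out[key] += val
        return dict(out)


# Stream a few hypothetical network edges (source, destination).
h = HierarchicalAssocArray(cutoffs=[2, 4])
for src, dst in [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b"), ("c", "d")]:
    h.update(src, dst)
print(h.total())
```

Because most updates touch only the small first level, the larger levels are rewritten rarely. The cutoffs are the tunable parameters in this sketch, and the values chosen here are arbitrary.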

Jeremy Kepner - One of the best experts on this subject based on the ideXlab platform.

  • TabulaROSA: Tabular Operating System Architecture for Massively Parallel Heterogeneous Compute Engines
    IEEE High Performance Extreme Computing Conference, 2018
    Co-Authors: Jeremy Kepner, Hayden Jananthan, Vijay Gadepally, Ron Brightwell, Alan Edelman, Michael Jones, Samuel Madden, Peter Michaleas, Hamed Okhravi, Kevin Pedretti

  • Genetic Sequence Matching Using D4M Big Data Approaches
    2016
    Co-Authors: Stephanie Dodson, Darrell O. Ricke, Jeremy Kepner
    Abstract:

    Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M), an Associative Array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and the implementation of an Apache Accumulo database to accelerate computations one hundredfold over other methods. Comparisons of the D4M method with the current gold standard for sequence analysis, BLAST, show the two are comparable in the alignments they find. This paper will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST.

  • Associative Array Model of SQL, NoSQL, and NewSQL Databases
    IEEE High Performance Extreme Computing Conference (HPEC), 2016
    Co-Authors: Jeremy Kepner, Hayden Jananthan, Timothy Mattson, Dylan Hutchison, Siddharth Samsi, Vijay Gadepally, Albert Reuther
    Abstract:

    The success of SQL, NoSQL, and NewSQL databases is a reflection of their ability to provide significant functionality and performance benefits for specific domains, such as financial transactions, internet search, and data analysis. The BigDAWG polystore seeks to provide a mechanism to allow applications to transparently achieve the benefits of diverse databases while insulating applications from the details of these databases. Associative Arrays provide a common approach to the mathematics found in different databases: sets (SQL), graphs (NoSQL), and matrices (NewSQL). This work presents the SQL relational model in terms of Associative Arrays and identifies the key mathematical properties that are preserved within SQL. These properties include associativity, commutativity, distributivity, identities, annihilators, and inverses. Performance measurements on distributivity and associativity show the impact these properties can have on Associative Array operations. These results demonstrate that Associative Arrays could provide a mathematical model for polystores to optimize the exchange of data and execution queries.
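
The algebraic properties named in the abstract above can be checked on a toy model. The following Python sketch treats an associative array as a dictionary from `(row, column)` string keys to numbers, with element-wise addition over the union of keys and element-wise multiplication over the intersection. The function names `add` and `mul` and the sample data are assumptions for illustration; they are not the D4M API and do not reproduce the paper's measurements.

```python
def add(a, b):
    """Element-wise sum over the union of keys; zero results are dropped."""
    out = dict(a)
    for key, val in b.items():
        out[key] = out.get(key, 0) + val
    return {k: v for k, v in out.items() if v != 0}


def mul(a, b):
    """Element-wise product over the intersection of keys."""
    return {k: a[k] * b[k] for k in a.keys() & b.keys() if a[k] * b[k] != 0}


# Hypothetical table fragments keyed by (row, column) strings.
A = {("alice", "age"): 2, ("bob", "age"): 3}
B = {("alice", "age"): 5, ("carol", "age"): 7}
C = {("alice", "age"): 1, ("bob", "age"): 4}

# Distributivity: A * (B + C) equals (A * B) + (A * C).
lhs = mul(A, add(B, C))
rhs = add(mul(A, B), mul(A, C))
print(lhs == rhs)

# Associativity and commutativity of addition.
print(add(add(A, B), C) == add(A, add(B, C)))
print(add(A, B) == add(B, A))
```

When such identities hold, a query planner is free to pick whichever side is cheaper to evaluate. In this example the left-hand side of the distributivity check performs one multiplication pass instead of two.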

Gadepally Vijay - One of the best experts on this subject based on the ideXlab platform.

  • Mathematics of Digital Hyperspace
    'Institute of Electrical and Electronics Engineers (IEEE)', 2021
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Jananthan Hayden, Davis Timothy, Milechin Lauren

  • AI Data Wrangling with Associative Arrays
    2020
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Jananthan Hayden, Milechin Lauren, Samsi Siddharth

  • A Billion Updates per Second Using 30,000 Hierarchical In-Memory D4M Databases
    2019
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Samsi Siddharth, Milechin Lauren, Arcand William, Bestor David, Bergeron William, Byun Chansup, Hubbell Matthew, Houle Michael

  • Streaming 1.9 Billion Hypersparse Network Updates per Second with D4M
    'Institute of Electrical and Electronics Engineers (IEEE)', 2019
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Samsi Siddharth, Milechin Lauren, Arcand William, Bestor David, Bergeron William, Byun Chansup, Hubbell Matthew, Houle Michael

  • TabulaROSA: Tabular Operating System Architecture for Massively Parallel Heterogeneous Compute Engines
    'Institute of Electrical and Electronics Engineers (IEEE)', 2018
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Jananthan Hayden, Brightwell Ron, Edelman Alan, Jones Michael, Madden Sam, Michaleas Peter, Okhravi Hamed, Pedretti Kevin
    Comment: 8 pages, 6 figures, accepted at IEEE HPEC 201

Samsi Siddharth - One of the best experts on this subject based on the ideXlab platform.

  • AI Data Wrangling with Associative Arrays
    2020
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Jananthan Hayden, Milechin Lauren, Samsi Siddharth

  • A Billion Updates per Second Using 30,000 Hierarchical In-Memory D4M Databases
    2019
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Samsi Siddharth, Milechin Lauren, Arcand William, Bestor David, Bergeron William, Byun Chansup, Hubbell Matthew, Houle Michael

  • Streaming 1.9 Billion Hypersparse Network Updates per Second with D4M
    'Institute of Electrical and Electronics Engineers (IEEE)', 2019
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Samsi Siddharth, Milechin Lauren, Arcand William, Bestor David, Bergeron William, Byun Chansup, Hubbell Matthew, Houle Michael

  • Associative Array Model of SQL, NoSQL, and NewSQL Databases
    'Institute of Electrical and Electronics Engineers (IEEE)', 2016
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Hutchison Dylan, Jananthan Hayden, Mattson Timothy, Samsi Siddharth, Reuther Albert
    Comment: 9 pages; 6 figures; accepted to IEEE High Performance Extreme Computing (HPEC) conference 201

Jananthan Hayden - One of the best experts on this subject based on the ideXlab platform.

  • Mathematics of Digital Hyperspace
    'Institute of Electrical and Electronics Engineers (IEEE)', 2021
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Jananthan Hayden, Davis Timothy, Milechin Lauren

  • AI Data Wrangling with Associative Arrays
    2020
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Jananthan Hayden, Milechin Lauren, Samsi Siddharth

  • TabulaROSA: Tabular Operating System Architecture for Massively Parallel Heterogeneous Compute Engines
    'Institute of Electrical and Electronics Engineers (IEEE)', 2018
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Jananthan Hayden, Brightwell Ron, Edelman Alan, Jones Michael, Madden Sam, Michaleas Peter, Okhravi Hamed, Pedretti Kevin

  • Associative Array Model of SQL, NoSQL, and NewSQL Databases
    'Institute of Electrical and Electronics Engineers (IEEE)', 2016
    Co-Authors: Kepner Jeremy, Gadepally Vijay, Hutchison Dylan, Jananthan Hayden, Mattson Timothy, Samsi Siddharth, Reuther Albert