First there was the invention of paper, and then… search engines! We can trace the origins of data storage, transfer, and presentation back thousands of years before the invention of paper, to when people wrote on bamboo slips with soot ink, or even to when cavemen made marks on pieces of ochre rock 77,000 years ago. But let’s start this exploration with the first “search engine,” invented by Emanuel Goldberg in 1927, which he called a “statistical machine.” See the US patent here.
One of these machines was said to be integrated into Goldberg’s desk, and a colleague of his was quoted as saying: “He was telling us that he was the only person in the world as far as he knew who had on his desk a document retrieval capability… He would dial a number, press a button and after three seconds [a microfilmed copy of] the document would be projected.” Fast forward to today, and almost everyone has a machine or two on their desk, and perhaps another in their pocket, with such a capability.
No matter how data is stored, transferred, and presented, and no matter where data is used, the value of data is a function of timeliness, relevance, and accuracy. When data is timely, accurate, and relevant, it reaches the user as information that can be put to immediate use. That is why we have long been making advances in information retrieval (the tracing and recovery of specific information from stored data) and continue to do so, moving the retrieval process from a manual one that depends heavily on humans to an ever more automated one built on the power of machines.
In modern business applications, different domains of data (e.g., materials, customers) go into SAP or similar business systems, often as independent data entries, and are stored in their respective tables and fields. Later, they are retrieved and combined into informative views for achieving business objectives. Retrieving data for business use requires knowledge of domain-specific rules, because business rules and logic define the meaning behind pieces of data.
In traditional systems that employed Boolean logic in their search models (and many still do today), most business knowledge was held by subject matter experts in specific domains. Thus, getting information for any use case was a task for the corresponding expert (or a team of experts, for the lucky ones). This is not dissimilar to when cavemen made geometric marks on pieces of rock, where only those who knew the meaning of the marks could retrieve information from a collection of marked rocks. In this model, from the perspective of generating business value from data, businesses are putting all their eggs in one basket: whether any data is timely, accurate, or relevant depends entirely on one person or a small group of people.
That’s not to say that it is bad for businesses to depend on humans, by any means. After all, we will always remain the benefactors and beneficiaries of business. While businesses may never become fully automated by machines, leveraging the strengths of machines to make up for our weaknesses has been shown to improve the efficacy of business.
The development of more complex search techniques, such as approximate string matching and TF-IDF weighting, gave birth to probabilistic search models that could produce not only exact but also close matches to a query. However, results from these models still may not contain the desired information, because the system does not understand the meaning behind a query. Both the Boolean and the probabilistic models revolve around matching patterns in text, which on its own does little to improve the relevance of the data retrieved.
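To make the contrast concrete, here is a minimal Python sketch using only the standard library. The material numbers, descriptions, and function names are hypothetical and chosen purely for illustration; this is not how SAP or any particular product implements search. It compares a strict Boolean AND match with a simple TF-IDF ranking, and uses `difflib.get_close_matches` as one possible form of approximate string matching for misspelled query terms.

```python
import math
from collections import Counter
from difflib import get_close_matches

# Hypothetical material master data, for illustration only.
DOCS = {
    "M-100": "steel hex bolt 10 mm",
    "M-200": "steel hex nut 10 mm",
    "M-300": "copper pipe fitting 15 mm",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def boolean_and(query, docs):
    """Boolean model: return only documents containing every query term."""
    terms = set(tokenize(query))
    return sorted(d for d, text in docs.items() if terms <= set(tokenize(text)))


def tfidf_rank(query, docs):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    tokenized = {d: tokenize(text) for d, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log(n_docs / df[term])  # rarer terms weigh more
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


def correct_terms(query, docs):
    """Approximate matching: map each query term to the closest known term."""
    vocabulary = sorted({t for text in docs.values() for t in tokenize(text)})
    corrected = []
    for term in tokenize(query):
        match = get_close_matches(term, vocabulary, n=1, cutoff=0.7)
        corrected.append(match[0] if match else term)
    return " ".join(corrected)


if __name__ == "__main__":
    print(boolean_and("steel bolt", DOCS))   # exact matches only
    print(tfidf_rank("steel bolt", DOCS))    # exact match first, then close ones
    print(correct_terms("coper pipe", DOCS)) # misspelling mapped to a known term
```

With this toy data, the Boolean query for "steel bolt" returns only the bolt, while the TF-IDF ranking also surfaces the steel nut as a lower-ranked close match. Neither approach knows what a bolt is, which is exactly the limitation described above.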
Today, systems that leverage machine learning combine Boolean and statistical algorithms for matching and ranking with techniques from natural language processing (NLP) to enhance the search experience for users. With businesses now having accumulated decades of data for analysis, NLP provides a new avenue for improving the value of that data. Using NLP and machine learning, the most meaningful information becomes readily available from business systems as soon as queries come in from the outside world, without human intervention in the input, translation, correction, and output of data, because the analysis needed to extract meaning from the data for a given context has already been performed. It is as if cavemen visiting from Sibudu Cave, over 1,500 km away (about the distance from Florida to New York City), could immediately understand the rock markings used by those at Blombos Cave, and could therefore quickly retrieve a piece of rock for immediate use.
NLP has been a hot topic of research for decades, and its implementation can range from the simple identification of synonyms and stems to the more complex tagging of parts of speech or text disambiguation. For a business, for example, a complex machine learning model for information retrieval may combine NLP, industry synonyms, and the stemming of technical terms (e.g., units of measure) with the analysis of customer demographics, purchase history, and seasonal data, so that the system can correctly interpret a customer’s query for “apple” as a demand for materials that contain “iPhone.” As more data from a wider variety of sources is brought into an integrated analysis in a machine learning model, the complexity of the solution increases, but so does the value of the data that resides in the business systems.
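The simpler end of that range can be sketched in a few lines of Python. Everything below is an assumption made for illustration: the unit synonym table, the sense table for “apple,” the naive plural-stripping stemmer, and the idea of picking a sense from the customer’s most frequent purchase category. A real system would rely on trained models and far richer context rather than hand-written lookup tables.

```python
from collections import Counter

# Hypothetical synonym table for units of measure.
UNIT_SYNONYMS = {
    "kilogram": "kg",
    "kilo": "kg",
    "piece": "ea",
    "pcs": "ea",
    "each": "ea",
}

# Hypothetical senses of an ambiguous term, keyed by purchase category.
SENSES = {
    "apple": {"electronics": "iphone", "grocery": "apple fruit"},
}


def stem(token):
    """Naive stemmer: strip a trailing plural 's' from longer tokens."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(query):
    """Lowercase, stem, and map unit synonyms to one canonical form."""
    tokens = [stem(t) for t in query.lower().split()]
    return [UNIT_SYNONYMS.get(t, t) for t in tokens]


def interpret(query, purchase_categories):
    """Resolve ambiguous terms using the customer's dominant purchase category."""
    counts = Counter(purchase_categories)
    dominant = counts.most_common(1)[0][0] if counts else None
    resolved = []
    for token in normalize(query):
        senses = SENSES.get(token)
        # Fall back to the literal token when no context-specific sense applies.
        resolved.append(senses.get(dominant, token) if senses else token)
    return " ".join(resolved)


if __name__ == "__main__":
    print(normalize("5 Kilograms apples"))
    print(interpret("apples", ["electronics", "electronics", "grocery"]))
    print(interpret("apples", ["grocery"]))
```

In this sketch, a customer whose history is dominated by electronics has “apples” resolved to “iphone,” while a grocery customer gets the fruit. The lookup tables stand in for what a model would learn from demographic, purchase, and seasonal data.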
Fortunately, with such a wide range of analytical tools available today, the machine learning team at DataXstream has been busy tackling these complexities by sorting through customer data, performing statistical and NLP analysis, and building machine learning models that bring more meaning to data. In the next part of this series exploring machine learning and information retrieval, we will take a technical dive into how search has evolved from SAP R/3 to SAP S/4HANA, and how businesses can leverage NLP and machine learning to take advanced search to the next level. In the meantime, I would love to hear from you about how your organization’s search functionality has evolved over time and what you envision for its future.