The information explosion produces more data every day than we can process, and everyday Internet use, even for the simplest things in life, adds to it. Much of this data is text, and most of it is scattered and unstructured. Analyzing unstructured text data is therefore a big part of Big Data Analytics. This article looks at text data in the light of Big Data Analytics.
What Is Unstructured Data?
Structured data, unlike unstructured data, is ordered and organized so that it is easy to access, which is especially convenient for Big Data Analytics. It has a defined length and format, so any insertion, search, and so forth made on the database is simple and straightforward. We commonly see this in relational databases, where the underlying engine provides the mechanisms through which the user accesses the data.
Unstructured data, on the other hand, is in a raw format with almost no ordering. This makes it difficult to handle and to derive meaningful information from, so Big Data Analytics requires more effort and resources to deal with it. Unfortunately, there is a lot more unstructured or semi-structured data available for a Big Data analyst to deal with. The text data found in Big Data Analytics comes from several sources, each with its own format. For example, the data in text documents reads like free-flowing script, even when it describes a financial transaction or a bank statement. Similarly, the data in e-mails, logs, tweets, or social network posts has very little ordering. Although some of it may follow some sort of template, it is mostly unstructured or at best semi-structured.
So, the primary challenge for text analytics is to organize this data before analyzing it.
The crux of text analytics is to organize unstructured texts, analyze them, extract meaningful and relevant information, and give this information a structure that can be leveraged for further analysis. Text data analytics uses several techniques to achieve that. These techniques are derived from multiple disciplines, such as Natural Language Processing (NLP), data mining, knowledge discovery, statistics, and computational linguistics, along with many other complementary tools. There is no hard and fast technique or tool. In fact, anything that helps the process of text analytics and improves its results can be a useful option for the analysis. The process is important, not the tools or techniques. Here, we describe only some common methods employed in text analytics.
One should note that text analytics does not mean only keyword search, although search is sometimes a part of it. There is a fundamental difference. In text analytics, the primary focus is on extracting relevant information without knowing in advance what we may get, whereas in a search we use keywords to retrieve relevant information and the kind of result is known or predictable. Text analytics works on the principle of information discovery. So, analytics uses search as a technique or means to categorize or classify documents and to get a gist of the content.
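The difference between search and discovery can be illustrated with a minimal Python sketch. The sample documents, the stop-word list, and the function names below are all invented for illustration, and counting document frequency is only one simple stand-in for the discovery techniques discussed in this article.

```python
import re
from collections import Counter

# Hypothetical sample documents, invented for illustration.
DOCS = [
    "Payment failed for invoice 1042, customer requested a refund.",
    "Customer reports refund delayed; payment gateway timed out.",
    "Shipping was fast and the packaging was excellent.",
]

# A tiny hand-picked stop-word list (an assumption; real lists are much larger).
STOP_WORDS = {"a", "and", "the", "for", "was", "out"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_search(docs, keyword):
    """Search: return the indices of documents containing a keyword we already know."""
    return [i for i, doc in enumerate(docs) if keyword in tokenize(doc)]


def discover_terms(docs, top_n=3):
    """Discovery: rank terms by the number of documents they appear in."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)) - STOP_WORDS)
    return counts.most_common(top_n)


print(keyword_search(DOCS, "refund"))
print(discover_terms(DOCS))
```

With keyword_search we must already know that "refund" matters. With discover_terms the recurring terms surface on their own, which is closer to the spirit of information discovery.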
Natural Language Processing (NLP), in combination with statistics, is commonly used to extract information from unstructured data. NLP is a complex and widely researched field, developed over several decades, that aims to derive meaningful information from text. It has historically been used in computational linguistics to identify meaningful sentences by using grammatical structure and parts of speech. Text analytics uses these techniques to identify the nature of the data. NLP analysis is performed on several levels, such as:
- Morphological level analysis: Analysis at this level deals with the structure of a word and its formation. Here, the focus is on the individual meaningful components of a word, referred to as morphemes. For example, the three morphemes of the word "unavoidable" are un/avoid/able (prefix/stem/suffix), each of which carries its own meaning. To increase recall, text analytics matches the morphological variants of words found in documents or unstructured texts.
- Lexical analysis: Lexical analysis is the study of words according to some dictionary or thesaurus. An individual unit of lexical meaning is termed a lexeme. For example, when analyzing documents, the words sales, sale, purchase, purchases, payment, and the like may add up to the idea that some form of transaction has occurred, which we may be interested in investigating further.
- Syntactic analysis: Syntactic analysis focuses on the grammatical aspect of the text. Here, the focus extends from individual words to phrases, clauses, and sentences. The main process is to parse the sentences and group their words into phrases and clauses. For example, if we analyze the sentence "Mother looked after the child", syntactic analysis would tag the phrases found in the sentence with part-of-speech tags.
- Semantic analysis: Semantic analysis determines the meaning of a sentence by examining its word order and structure, and disambiguates words according to the context of the sentence and the surrounding paragraph in the document. This increases query precision and can improve recall as well.
- Discourse level analysis: Here, the analysis goes beyond a single sentence. The focus is on the structure and meaning of the text as a whole, found by making connections among its words and sentences. This is the level where Anaphora Resolution (AR) takes place. AR is the problem of determining which earlier item in the discourse (the antecedent) a referring expression such as a pronoun (the anaphor) points to. It is a hard problem and an active area of research in computational linguistics.
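The first two levels above can be sketched in a few lines of Python. This is a deliberately naive illustration, not how production NLP libraries work: the affix lists, the lexicon, and the function names are hypothetical, and the affix stripping will mis-split many real words (for example, any word that merely happens to start with "un").

```python
# Hypothetical affix lists and lexicon, invented for illustration only.
PREFIXES = ("un", "re")
SUFFIXES = ("able", "ing")
LEXICON = {
    "sale": "transaction",
    "purchase": "transaction",
    "payment": "transaction",
}


def split_morphemes(word):
    """Naively split a word into (prefix, stem, suffix); a part is '' if absent."""
    word = word.lower()
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    stem = rest[:len(rest) - len(suffix)]
    return prefix, stem, suffix


def lexical_concept(word):
    """Map a word, or its form without a trailing 's', to a concept in the lexicon."""
    word = word.lower()
    candidates = [word]
    if word.endswith("s"):
        candidates.append(word[:-1])
    for candidate in candidates:
        if candidate in LEXICON:
            return LEXICON[candidate]
    return None


print(split_morphemes("unavoidable"))  # ('un', 'avoid', 'able')
print([lexical_concept(w) for w in ["sales", "purchases", "payment", "weather"]])
```

Matching on the stem rather than the surface word is what lets variants of a word count as the same term, and mapping several words to one concept is what lets sales, purchases, and payment all point to a transaction.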
Now, once the information is extracted, the challenge is to derive the meaning. This is done with the help of other statistical or linguistic techniques that automatically tag the text documents according to terms, named entities, facts or relationships, events, concepts, and sentiments. This provides an overall picture, and sometimes a finer understanding, of the information extracted.
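As a rough sketch of automatic tagging, the Python example below tags a document with a few entity types and a sentiment label. It is rule-based only to keep the example short. The patterns, the word lists, and the tag_document function are assumptions made for illustration, whereas real systems typically rely on trained statistical or linguistic models.

```python
import re

# Hypothetical patterns and word lists, invented for illustration.
ENTITY_PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:\.\d{2})?"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}
POSITIVE = {"great", "excellent", "happy"}
NEGATIVE = {"late", "broken", "unhappy"}


def tag_document(text):
    """Return the entities found in the text and a coarse sentiment label."""
    entities = {
        label: pattern.findall(text) for label, pattern in ENTITY_PATTERNS.items()
    }
    words = re.findall(r"[a-z]+", text.lower())
    # Score = number of positive words minus number of negative words.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        sentiment = "positive"
    elif score < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    return {"entities": entities, "sentiment": sentiment}


print(tag_document("Order paid on 2023-05-04 for $19.99 arrived late and broken."))
```

The output is already a small structured record, which is the point of tagging: free text goes in, and labeled fields come out.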
Taxonomy plays a crucial role in text analytics. Here, the information is organized into a hierarchical relationship.
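A hierarchical relationship of this kind can be represented very simply. The sketch below, in Python, stores a hypothetical taxonomy as a child-to-parent mapping. The category names and helper functions are invented for illustration and do not come from any standard taxonomy.

```python
# Hypothetical taxonomy: each term maps to its parent category (None marks the root).
PARENT = {
    "finance": None,
    "transaction": "finance",
    "payment": "transaction",
    "refund": "transaction",
    "account": "finance",
}


def path_to_root(term):
    """Return the chain from a term up to the root category."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]  # raises KeyError for terms outside the taxonomy
    return path


def is_a(term, category):
    """True if the term is the category itself or lies somewhere beneath it."""
    return category in path_to_root(term)


print(path_to_root("refund"))     # ['refund', 'transaction', 'finance']
print(is_a("refund", "finance"))  # True
```

With such a structure, a document tagged "refund" can also be retrieved when we ask for everything about transactions or finance.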
Unstructured to Structured Data
All these techniques are employed to organize unstructured data into structured data so that it can be combined with other structured data persisted in the data warehouse. Finally, it is time to apply data mining tools and business intelligence to get meaningful and relevant insight.
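To make this step concrete, here is a minimal Python sketch that extracts payment facts from free text and joins them with an existing structured table. An in-memory SQLite database stands in for the data warehouse, and the messages, the regular expression, the table layout, and the customer names are all hypothetical.

```python
import re
import sqlite3

# Hypothetical free-text messages, invented for illustration.
MESSAGES = [
    "Customer 17 paid $120.50 on 2023-03-01 via card.",
    "Refund issued. Customer 42 paid $15.00 on 2023-03-02.",
    "Customer 17 asked about delivery times.",  # no payment, so no row
]
PAYMENT = re.compile(r"Customer (\d+) paid \$(\d+\.\d{2}) on (\d{4}-\d{2}-\d{2})")


def extract_payments(messages):
    """Turn free text into (customer_id, amount, date) rows."""
    rows = []
    for text in messages:
        match = PAYMENT.search(text)
        if match:
            rows.append((int(match.group(1)), float(match.group(2)), match.group(3)))
    return rows


def load_and_join(rows):
    """Load the extracted rows and join them with an existing structured table."""
    conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", [(17, "Asha"), (42, "Ben")])
    conn.execute("CREATE TABLE payments (customer_id INTEGER, amount REAL, paid_on TEXT)")
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT c.name, p.amount, p.paid_on FROM payments p "
        "JOIN customers c ON c.id = p.customer_id ORDER BY p.paid_on"
    ).fetchall()
    conn.close()
    return result


print(load_and_join(extract_payments(MESSAGES)))
```

Once the text has been reduced to rows, ordinary SQL, data mining tools, and business intelligence tools can work on it like any other structured data.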
This article gave a very high-level overview of what is done in text analytics of big data, going into minimal technical depth. The primary idea is to structure the text using NLP and statistical techniques, combine it with other structured data, and finally put all the analytical tools to work to derive greater insights.