Introduction to Text Summarization

January 9, 2018 Author: virendra

With the dramatic growth of the Internet, people are overwhelmed by the tremendous amount of online information and documents. This expanding availability of documents has demanded exhaustive research in the area of automatic text summarization. Every day, people rely on a wide variety of sources to stay informed — from news stories to social media posts to search results. Being able to develop Machine Learning models that can automatically deliver accurate summaries of longer text can be useful for digesting such large amounts of information in a compressed form.

What is Text Summarization?




Text summarization is the problem of creating a short, accurate, and fluent summary of a longer text document. Summarization can also serve as an interesting reading comprehension test for machines. To summarize well, machine learning models need to be able to comprehend documents and distill the important information, tasks which are highly challenging for computers, especially as the length of a document increases.

The World Wide Web has brought us a vast amount of online information. As a result, a single Internet search returns many Web pages containing far more information than a person can read completely. With the widespread use of the Internet and the emergence of the information explosion era, quality text summarization is essential to condense this information effectively.

Text summarization condenses the source text into a shorter version while preserving its information content and overall meaning. Manually summarizing large documents is very difficult for human beings. The Internet normally provides more information than is needed, so a twofold problem arises: finding relevant documents among the overwhelming number available, and absorbing a large quantity of relevant information.

The goal of automatic text summarization is to condense the source text into a shorter version. Summaries may be classified by any of the following criteria:

  • Detail: indicative/informative
  • Granularity: specific events/overview
  • Technique: extraction/abstraction
  • Content: generalized/query-based

Definition of text summarization




Text summarization is the process of producing a shorter presentation of the original content that covers non-redundant and salient information extracted from a single document or multiple documents. A summary can be defined as a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s).
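The length criterion in this definition can be checked mechanically. The following is a minimal sketch, assuming length is measured in whitespace-separated words (the definition does not say which unit to use). The function names `compression_ratio` and `is_valid_length` are illustrative and do not come from the cited sources.

```python
def compression_ratio(source: str, summary: str) -> float:
    """Return summary length divided by source length, counted in words."""
    source_words = len(source.split())
    if source_words == 0:
        raise ValueError("source text is empty")
    return len(summary.split()) / source_words


def is_valid_length(source: str, summary: str, max_ratio: float = 0.5) -> bool:
    """Check the 'no longer than half of the original' criterion.

    Assumption: the threshold of 0.5 is applied to word counts.
    """
    return compression_ratio(source, summary) <= max_ratio


source = "The cat sat on the mat while the dog slept by the door"
summary = "The cat sat, the dog slept"
print(compression_ratio(source, summary))  # 6 words out of 13
print(is_valid_length(source, summary))
```

A ratio at or below 0.5 satisfies the definition; measuring in characters or sentences instead of words would give a different ratio for the same pair of texts.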

More specifically, text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).

When this is done by means of a computer, i.e. automatically, we call it Automatic Text Summarization. Although text summarization has traditionally focused on text input, the input to the summarization process can also be multimedia information, such as images, video or audio, as well as online information or hypertexts. Furthermore, we can talk about summarizing only one document or multiple ones. In the latter case, the process is known as Multi-document Summarization (MDS), and the source documents can be in a single language (monolingual) or in different languages (trans-lingual or multilingual).

Automatic Text Summarization




With the increasing amount of information, it has become difficult to extract concise information. It is therefore necessary to build systems that can produce human-quality summaries. Automatic text summarization is the technique in which a computer summarizes a text: a text is entered into the computer and a summarized text is returned, which is a non-redundant extract from the original text.

Traditionally, summarization has been decomposed into three main stages.

  • Interpretation of the source text to obtain a text representation,
  • Transformation of the text representation into a summary representation, and
  • Generation of the summary text from the summary representation.

One simple approach to automatic summarization works by first calculating the word frequencies for the entire text document. Then, the 100 most common words are stored and sorted. Each sentence is then scored based on how many high-frequency words it contains, with higher-frequency words being worth more. Finally, the top X sentences are taken and sorted by their position in the original text. By keeping things simple and general purpose, this algorithm is able to function in a variety of situations that other implementations might struggle with, such as documents containing foreign languages or unique word associations that aren't found in standard English-language corpora.
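The frequency-based procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the regex-based sentence splitter and tokenizer are simplifications, a sentence's score is assumed to be the sum of the frequencies of its top words, and the names `summarize`, `num_sentences` and `vocab_size` are chosen here for illustration.

```python
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; no stop-word removal is applied."""
    return re.findall(r"\w+", text.lower())


def summarize(text: str, num_sentences: int = 2, vocab_size: int = 100) -> str:
    sentences = split_sentences(text)
    # Step 1: word frequencies over the entire document.
    frequencies = Counter(tokenize(text))
    # Step 2: keep only the most common words (100 in the description).
    top_words = dict(frequencies.most_common(vocab_size))
    # Step 3: score each sentence; more frequent words are worth more.
    scores = [sum(top_words.get(w, 0) for w in tokenize(s)) for s in sentences]
    # Step 4: take the top X sentences, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


text = (
    "Text summarization condenses a document. "
    "The weather was pleasant yesterday. "
    "A good summary keeps the important information in the document. "
    "Summarization of a long document saves reading time."
)
print(summarize(text, num_sentences=2))
```

Because scores are plain sums, longer sentences and common function words such as "the" tend to dominate. Dividing by sentence length or filtering stop words are common refinements, but they go beyond the procedure described here.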

Figure 1: General Process of Text Summarization

Approaches to automatic text summarization involve the following:

  • Elimination of Redundancy: Sentences in the text that convey the same meaning are redundant and can be eliminated from the summary.
  • Identification of Significant Sentences: Because a summary is a shorter representation of the text, it should include only salient sentences from the original document.
  • Generation of Coherent Summaries: Sentences selected for the summary need to be ordered and grouped so that coherence and readability are maintained.
  • Metrics for Evaluating the Automatically Generated Summaries: In most cases the quality of a summary is judged by humans, which is costly, and hence automatic evaluation is a desirable feature.
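Two of the points above, redundancy elimination and automatic evaluation, lend themselves to a short sketch. The code below is an illustration under assumptions that are not taken from the cited sources: redundancy is approximated by Jaccard similarity between the word sets of two sentences, the threshold of 0.6 is arbitrary, and the evaluation score is a simple unigram recall against a human-written reference. It loosely resembles unigram-overlap metrics such as ROUGE-1 but is not an implementation of any standard metric.

```python
import re


def word_set(sentence: str) -> set[str]:
    """Distinct lowercase word tokens of a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_redundant(sentences: list[str], threshold: float = 0.6) -> list[str]:
    """Keep a sentence only if it is not too similar to one already kept."""
    kept: list[str] = []
    for sentence in sentences:
        words = word_set(sentence)
        if all(jaccard(words, word_set(k)) < threshold for k in kept):
            kept.append(sentence)
    return kept


def unigram_recall(candidate: str, reference: str) -> float:
    """Fraction of distinct reference words that also occur in the candidate."""
    ref = word_set(reference)
    return len(ref & word_set(candidate)) / len(ref) if ref else 0.0


sentences = [
    "The storm closed the airport on Monday.",
    "On Monday the storm closed the airport.",
    "Flights resumed on Tuesday.",
]
print(drop_redundant(sentences))  # the second sentence repeats the first
print(unigram_recall("The storm closed the airport.",
                     "A storm closed the airport on Monday."))
```

Word overlap only catches near-verbatim repetition. Two sentences that convey the same meaning in different words would pass this filter, so a practical system would need a stronger notion of similarity.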

Examples of Text Summaries

There are many reasons and uses for a summary of a larger document. One example that might come readily to mind is to create a concise summary of a long news article, but there are many more cases of text summaries that we may come across every day.

  • headlines (from around the world)
  • outlines (notes for students)
  • minutes (of a meeting)
  • previews (of movies)
  • synopses (soap opera listings)
  • reviews (of a book, CD, movie, etc.)
  • digests (TV guide)
  • biographies (resumes, obituaries)
  • abridgments (Shakespeare for children)
  • bulletins (weather forecasts/stock market reports)
  • sound bites (politicians on a current issue)
  • histories (chronologies of salient events)
