As artificial intelligence advances, sub-fields such as natural language processing, natural language generation, and computer vision have rapidly gained popularity thanks to their wide range of use cases. Optical character recognition (OCR) is a well-established and widely studied area of computer vision, with applications such as document digitization, handwriting recognition, and visual text recognition. Mathematical expression recognition is an area of OCR that has attracted considerable academic interest.
Much scientific knowledge is preserved in books or published in scholarly journals, and the Portable Document Format (PDF) is one of the most widely used formats for it. PDF is the second most used data format on the Internet, accounting for 2.4% of information, and is frequently used for document distribution. Despite this widespread use, extracting information from PDF files can be difficult, especially for highly specialized material such as scientific research articles. In particular, when these papers are converted to PDF, the semantic information of mathematical expressions is often lost.
To address these challenges, a team of researchers at Meta AI has introduced Nougat, which stands for “Neural Optical Understanding for Academic Documents”. Nougat is a Visual Transformer model that performs optical character recognition (OCR) on scientific documents. Its goal is to convert these documents into a markup language so that they are more accessible and machine-readable.
At its core, Nougat is a transformer-based model that converts images of document pages, especially pages from PDFs, into formatted markup text. To demonstrate the effectiveness of the approach, the team also created a new dataset of academic papers. The method offers a viable way to increase the accessibility of scientific knowledge in the digital age: it bridges the gap between written materials that are easy for people to read and text that computers can process and analyze. With Nougat, researchers, educators, and anyone interested in scientific literature can access and work with scientific papers more easily.
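To illustrate why markup output is easier for computers to work with than a rendered page, here is a minimal Python sketch that pulls the mathematical expressions out of a small piece of markup text. It assumes, as an illustration only, that inline math is delimited with `\( ... \)` and display math with `\[ ... \]`; the `extract_math` helper and the `sample` text are hypothetical and are not part of Nougat itself.

```python
import re

# Assumption: inline math is written as \( ... \) and display math as \[ ... \].
INLINE_MATH = re.compile(r"\\\((.+?)\\\)", re.DOTALL)
DISPLAY_MATH = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)


def extract_math(markup: str) -> dict[str, list[str]]:
    """Collect inline and display math expressions from markup text."""
    return {
        "inline": [m.strip() for m in INLINE_MATH.findall(markup)],
        "display": [m.strip() for m in DISPLAY_MATH.findall(markup)],
    }


# Hypothetical markup for one page, written by hand for this example.
sample = r"""
## Energy
Mass and energy are related by \(E = mc^2\).

\[ \int_0^1 x^2 \, dx = \frac{1}{3} \]
"""

result = extract_math(sample)
print(result["inline"])
print(result["display"])
```

Once the expressions are available as plain strings like this, they can be searched, indexed, or re-rendered, which is not possible when the formula survives only as glyph positions in a PDF.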
The team summarized their major contributions as follows:
- Release of a pre-trained model: The team has created a pre-trained model that can convert a PDF into a simple markup language. The model and the relevant code have been made public on GitHub, where the research community and anyone else can access them.
- Pipeline for dataset creation: The team introduces a method for creating datasets that pair PDF documents with their corresponding source code. This pipeline is important for testing and refining the Nougat model and may be useful for future document analysis research and applications.
- Dependence on the page image only: One of Nougat’s most useful features is that it needs only the image of a page as input. This makes it a flexible tool for extracting content from a variety of sources, even when the original document is not available as digital text. For example, it can process scanned papers and books.
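The dataset-creation idea, pairing each PDF page with the matching part of the document's source, can be sketched in a few lines of Python. This is a simplified, hypothetical illustration and not the team's actual pipeline: it assumes the text of each page has already been extracted, and it assigns every source paragraph to the page that shares the largest fraction of its words. The function names and sample data are invented for the example.

```python
import re


def words(text: str) -> set[str]:
    """Lower-cased alphanumeric tokens, ignoring markup punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def assign_paragraphs_to_pages(page_texts, source_paragraphs):
    """Return, for each page, the source paragraphs that best match it.

    Hypothetical heuristic: a paragraph goes to the page containing the
    largest fraction of the paragraph's words (ties go to the earlier page).
    """
    page_words = [words(text) for text in page_texts]
    pages = [[] for _ in page_texts]
    for paragraph in source_paragraphs:
        tokens = words(paragraph)
        if not tokens:
            continue  # nothing to match on
        overlaps = [len(tokens & pw) / len(tokens) for pw in page_words]
        pages[overlaps.index(max(overlaps))].append(paragraph)
    return pages


# Invented example: plain text per PDF page and paragraphs from the source.
page_texts = [
    "Introduction Optical character recognition converts images to text",
    "Results The model reaches high accuracy on equations",
]
source_paragraphs = [
    "# Introduction\nOptical character recognition converts images to text.",
    "# Results\nThe model reaches high accuracy on equations such as \\(E = mc^2\\).",
]

pairs = assign_paragraphs_to_pages(page_texts, source_paragraphs)
for page_number, paragraphs in enumerate(pairs, start=1):
    print(page_number, paragraphs)
```

Each resulting (page, markup) pair could then serve as one training or evaluation example, with the page rendered as an image and the markup as the target text. A real pipeline would need a more robust alignment than word overlap.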
Check out the Paper and GitHub. All credit for this research goes to the researchers on this project.
Tanya Malhotra is an undergraduate at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with good analytical and critical thinking skills, along with a keen interest in learning new skills, leading teams and managing work in an organized manner.