The CIA’s Office of the Chief Scientist, in collaboration with Northrop Grumman Corporation, has created an innovative system called Language NOW that incorporates technologies from NovoDynamics and AppTek, among others.
Important documents are usually available only in paper or image form. Searching for and retrieving relevant documents or passages from large collections is a difficult task.
The Central Intelligence Agency’s Office of the Chief Scientist provides strategic leadership, coordination, and expertise to support scientific innovation that solves our nation’s most pressing intelligence problems. To address those problems with an effective, automated natural language processing tool that helps analysts locate high-value electronic documents on hard drives, portable media, and other data repositories, the Office of the Chief Scientist, in collaboration with Northrop Grumman Corporation, created an innovative system called Language NOW.
Language NOW is a web-based integration of commercial off-the-shelf (COTS) and government off-the-shelf (GOTS) natural language processing tools that supports analysts in identifying foreign-language documents of interest. To build it, the Office of the Chief Scientist integrated numerous technologies, including AppTek’s machine translation, automatic speech recognition, and human language processing, and NovoDynamics’ advanced optical character recognition (OCR) and software capture technology, among others.
The CIA’s Office of the Chief Scientist has taken the Language NOW system under its wing, incorporating its own internal research into natural language processing (NLP) to meet advanced US government requirements. The CIA’s initial research project, called Arabic NOW, focused on integrating all of the NLP tools (file type identification, language identification, optical character recognition [OCR], translation, etc.) needed to process Arabic-script languages.
Optical Character Recognition (OCR)
NovoDynamics, an In-Q-Tel portfolio company, was a natural fit: it was already a preferred government software solution provider and specialized in processing large volumes of Arabic, Dari, Farsi/Persian, and Pashto documents. Its OCR technology gave the CIA high-accuracy results, even on yellowed pages or documents with stains and smudges. The system was such a success that the Office of the Chief Scientist embarked on additional projects focused on other languages, such as Chinese (dubbed Chinese NOW) and Russian (Russian NOW). “After we realized that the environment supported dozens of languages, we just started calling the system Language NOW, which became its official name,” said Preston Golson, CIA spokesperson for the Office of the Chief Scientist. Documents in all of the languages are processed daily, and no one language is used more than another; it varies from day to day.
The Language NOW system is simple to use and doesn’t require any advanced computer skills or language proficiency. “If you can order from Amazon.com, you can use Language NOW,” continued Golson. “There is a full manual online and context-sensitive help, but when we examine logs of systems returning after deployments, generally the manual has not been accessed. There are advanced and configurable features to meet the needs of ‘power users,’ but Language NOW can be used with default settings.”
Accessed from Multiple Devices and Locations
Language NOW can be hosted on enterprise servers or standalone laptops and used from Android mobile devices. It can be managed from a central location or deployed in the field at multiple locations. It also integrates with other systems and third-party applications.
“Yes, Language NOW has been integrated with other systems, on both the front end and back end, and has worked with third-party applications via the Language NOW application programmer interfaces (APIs),” said Golson.
After years of careful design planning and extensive experimentation, Language NOW can be used with ease by a single user, a team, or an entire organization. “We have a number of management and reporting features too that are designed to support team use,” Golson added. “On a single server, Language NOW can translate at over 100,000 words per minute (WPM), and on a recent-model laptop, between 8,500 and 10,000 WPM.”
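To put those throughput figures in perspective, the short Python sketch below estimates how long a batch of documents would take at each quoted rate. The 5-million-word workload and the function name are illustrative assumptions, not figures or code from the Language NOW project, and the calculation assumes a steady translation rate with no other overhead.

```python
# Throughput figures quoted in the article, in words per minute (WPM).
SERVER_WPM = 100_000               # "over 100,000" on a single server
LAPTOP_WPM_RANGE = (8_500, 10_000)  # recent-model laptop, low and high


def minutes_to_translate(word_count: int, words_per_minute: int) -> float:
    """Return the minutes needed to translate word_count words at a steady rate."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute


# Hypothetical workload (an assumption for illustration): 5 million words.
corpus_words = 5_000_000

# The server figure is a floor on its speed, so this is an upper bound on time.
server_minutes = minutes_to_translate(corpus_words, SERVER_WPM)
laptop_fast_minutes = minutes_to_translate(corpus_words, LAPTOP_WPM_RANGE[1])
laptop_slow_minutes = minutes_to_translate(corpus_words, LAPTOP_WPM_RANGE[0])

print(f"Server: at most {server_minutes:.0f} minutes")
print(f"Laptop: {laptop_fast_minutes / 60:.1f} to {laptop_slow_minutes / 60:.1f} hours")
```

Under these assumptions, the same workload that occupies a server for under an hour would keep a laptop busy for most of a working day, which is consistent with the roughly tenfold gap between the quoted rates.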
When it comes to scalable data processing, exploitation, and analysis, the high-value electronic documents that Language NOW processes vary not only in the level of intelligence being gathered but also in the kinds of devices from which the information is retrieved. “A hard drive, laptop, CD or USB key is found containing foreign language files,” said Golson. “The analyst would use Language NOW to triage and analyze material that they couldn’t otherwise understand.”
High-value documents are plentiful. Golson continued, “Value is always in the ‘eye of the beholder.’ I personally think that structured data (our spreadsheets for example) contain valuable information, and our scientific and technical publications can be considered highly valuable. We really do have to be prepared to convert any data into a form that can be used by analysts.”
The Cloud and System Design
Language NOW is an integrated system that uses powerful technologies such as those from NovoDynamics and AppTek, but it is designed so that the user interface, processing workflow, and service provider layers are abstracted and independent of one another, and none depends on any particular vendor’s product capabilities.
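The kind of layering described here can be sketched in a few lines of Python. This is only an illustration of the general idea of a vendor-independent provider layer, not the actual Language NOW design: every class and method name below is hypothetical, and the toy providers stand in for real vendor engines.

```python
from dataclasses import dataclass
from typing import Protocol


class LanguageIdentifier(Protocol):
    """Hypothetical provider interface: any language-ID engine can implement it."""
    def identify(self, text: str) -> str: ...


class Translator(Protocol):
    """Hypothetical provider interface: any translation engine can implement it."""
    def translate(self, text: str, source_lang: str) -> str: ...


class ScriptLanguageIdentifier:
    """Toy stand-in for a vendor engine: reports 'ru' if Cyrillic letters appear."""
    def identify(self, text: str) -> str:
        return "ru" if any("\u0400" <= ch <= "\u04ff" for ch in text) else "en"


class GlossaryTranslator:
    """Toy stand-in for a vendor engine: word-by-word glossary lookup."""
    def __init__(self, glossary: dict[str, str]) -> None:
        self._glossary = glossary

    def translate(self, text: str, source_lang: str) -> str:
        if source_lang == "en":
            return text  # already English, nothing to do
        # Unknown words are kept in brackets rather than dropped.
        return " ".join(self._glossary.get(w.lower(), f"[{w}]") for w in text.split())


@dataclass
class Workflow:
    """Workflow layer: depends only on the interfaces, never on a vendor class."""
    identifier: LanguageIdentifier
    translator: Translator

    def process(self, text: str) -> dict[str, str]:
        lang = self.identifier.identify(text)
        return {"language": lang, "english": self.translator.translate(text, lang)}


workflow = Workflow(
    ScriptLanguageIdentifier(),
    GlossaryTranslator({"привет": "hello", "мир": "world"}),
)
result = workflow.process("Привет мир")
print(result)
```

Because the workflow only sees the two interfaces, swapping one provider for another (or for a different vendor's engine) would not require changes to the workflow or to a user interface built on top of it.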
Like most emerging technologies, cloud computing offers compelling advantages, such as higher hardware utilization with simplified, centralized administration, but it can also have disadvantages, such as lower absolute performance. Golson agrees: “While we can improve system flexibility with these techniques, the virtualization approach at the heart of most cloud computing implementations significantly impacts translation speed.”
Language NOW can currently take advantage of a cloud computing environment. According to Golson, “The highly decoupled Language NOW internal architecture is very compatible with cloud computing concepts, and we have explored hosting Language NOW’s compute-intensive back-end services such as machine translation inside compute clouds.”
The system was developed using innovative algorithms, technologies, and linguistic resources, and today it handles machine translation, information extraction, optical character recognition, semantic analysis, and certain aspects of text summarization and document clustering.
Exponentially Growing ROI
The CIA’s Office of the Chief Scientist achieved a hefty ROI. According to Golson, “Yes, it exceeded our expectations. Language NOW returned more than ten times the original investment in its early years, and continues to provide additional value on a daily basis.”
The overall goal was achieved: developing automated tools that help human analysts judge, quickly and accurately, the relevance of individual documents in English or in one or more foreign languages, as well as of documents within clusters of topically related documents. The team can quickly identify specific facts or other items of interest from single documents or sets of documents.
The Office of the Chief Scientist is also reaching its longer-term goal of enabling analysts to perform accurate information analysis from an automatically produced English-language translation, or from an automatically produced, condensed English-language textual or non-textual rendition (summary) of a document or document cluster, instead of from the foreign-language original(s). This saves a great deal of time and has proven to provide a high level of accuracy.
Overall, Language NOW has surpassed the CIA’s expectations by fully automating important aspects of the CIA’s information extraction and intelligence-gathering process for text documents and other natural language-based sources. Government information analysts have come to rely on Language NOW and are able to fulfill their analytic tasks with confidence.