Times Insider explains who we are and what we do, and delivers behind-the-scenes insights into how our journalism comes together.
Reporters often spend hours sifting through documents.
A lengthy court document was the starting place when Willy Rashbaum and several other Times reporters chased down Client 9 in a prostitution investigation, eventually identifying Eliot Spitzer, the New York governor at the time, as the client referred to in the criminal complaint.
Ben covers the federal courts in Manhattan for The Times, routinely digesting lengthy court filings, often on deadline. He recently dug through unsealed court papers and indictments to piece together the story of a New York state residential treatment center where at-risk teenage residents were forced into a sex-trafficking ring. Nineteen people were charged in a case that led to the closing of the center.
So when a judge last week unsealed nearly 900 pages of search warrant applications and affidavits prepared by the federal authorities investigating Michael D. Cohen, President Trump’s former lawyer and fixer, the two reporters immediately dug into the documents, looking for news.
In these situations, reporters with a strong understanding of the case need to skim with a critical eye, looking for important details. One time-honored way to review documents involves printing them out and annotating them by hand. Some reporters find this is still the most effective way to digest and review a large number of pages, when there is ample time.
But when a deadline is looming and hundreds of pages await, it helps to have a tool that can speed up the process. At The Times, technologists like me on the Interactive News team work to identify recurring reporting patterns and develop tools that help reporters handle these tasks more efficiently.
In close collaboration with the newsroom staff, I recently led the development of a tool called DocumentHelper. The tool is used internally at The Times to quickly ingest large numbers of documents and make them searchable. Steps that reporters previously followed across a few different applications can be combined into the equivalent of a “one-stop shop.”
DocumentHelper isn’t a particularly novel tool. Its power comes from being tailored to The Times’s vocabulary and work flow. It relies on a technology well known to people who digitize documents: optical character recognition, commonly abbreviated as OCR. Like journalists, legal professionals rely on OCR to help manage huge sets of documents. Likewise, archivists and libraries use specialized scanning rigs to digitize their collections of text materials.
But OCR technology is found in all kinds of day-to-day tasks, like online banking and toll-road license plate scanning, as well as in website security Captchas and even mobile language-translation apps that “translate” photos taken by travelers. It turns printed notation of all shapes and styles back into digital content, so text can be copy/pasted or saved in digital form.
OCR works by isolating each individual letter, then comparing its extracted shape against mappings of letterforms across dozens of written notation systems, like languages or music. It does this for every letterform on a page, as well as for punctuation, formatting like italics and even meaningful white space. By preserving the order of matches, it creates a digital edition.
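To make the matching idea concrete, here is a toy sketch in Python. It is not how DocumentHelper or any production OCR engine works; the 3-by-5 letter shapes, the `TEMPLATES` table and the function names are all invented for illustration. Each isolated glyph is compared against every known letterform, the closest shape wins, and keeping the matches in order yields the text.

```python
# Toy letterform templates on a 3x5 grid: '#' is ink, '.' is blank paper.
# These shapes are made up for this example, not taken from a real typeface.
TEMPLATES = {
    "d": ("..#", "..#", "###", "#.#", "###"),
    "o": ("...", "...", "###", "#.#", "###"),
    "g": ("###", "#.#", "###", "..#", "###"),
}


def similarity(a, b):
    """Count the grid cells on which two glyph bitmaps agree."""
    return sum(
        cell_a == cell_b
        for row_a, row_b in zip(a, b)
        for cell_a, cell_b in zip(row_a, row_b)
    )


def recognize(glyph):
    """Return the letter whose template agrees with the glyph on the most cells."""
    return max(TEMPLATES, key=lambda letter: similarity(glyph, TEMPLATES[letter]))


def read_line(glyphs):
    """Recognize each isolated glyph and preserve the order of the matches."""
    return "".join(recognize(glyph) for glyph in glyphs)


# A "d" with one speck of ink missing, as might come off an aged page.
noisy_d = ("..#", "..#", "###", "#.#", "##.")
print(read_line([noisy_d, TEMPLATES["o"], TEMPLATES["g"]]))  # dog
```

Real engines work with far larger shape libraries and statistical models, but the principle of scoring a scanned shape against known letterforms is the same.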
But the process is not foolproof. Distorted letterforms, whether from skewed pages, aged paper, old-fashioned typefaces or even the vagaries of handwriting, sometimes cause the software to make imperfect matches: it might read “d” as “ol” and produce “olog” instead of “dog,” or mistake an archaic letter like the long s for “f.” So we need to be open-minded about how to search within documents that have been OCR’d.
(The long-s error provides amusing reading within certain eras of digitized works. Late-18th-century authors weren’t particularly foul-mouthed — modern software just struggles to read them right.)
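One way to be open-minded in a search, sketched below in Python, is to let each letter of the search term also match the mistakes OCR is known to make for it. This is an illustration under stated assumptions rather than a description of DocumentHelper: the `CONFUSIONS` table holds only the two errors mentioned above, and the function names are hypothetical.

```python
import re

# Hypothetical confusion table: each letter maps to the strings OCR might
# emit in its place. Only the two errors described above are included.
CONFUSIONS = {
    "d": ["d", "ol"],
    "s": ["s", "f", "ſ"],
}


def tolerant_pattern(term):
    """Build a regex that matches the term or its likely OCR misreadings."""
    parts = []
    for ch in term.lower():
        options = CONFUSIONS.get(ch, [ch])
        parts.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
    return re.compile("".join(parts), re.IGNORECASE)


def find_tolerant(term, text):
    """Return every stretch of text that matches the tolerant pattern."""
    return [m.group(0) for m in tolerant_pattern(term).finditer(text)]


print(find_tolerant("dog", "The olog barked at the dog."))  # ['olog', 'dog']
print(find_tolerant("blessed", "Bleffed are the meek"))     # ['Bleffed']
```

The trade-off is false positives: a tolerant search for “sit” would also turn up “fit,” so a reporter still has to read each hit in context.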
When the Cohen search warrant affidavits were unsealed last week, DocumentHelper came to the rescue.
“As Ben and I crashed our way through nearly 900 pages with little time to spare,” Willy Rashbaum recounted, “it served as equal parts power drill, spotlight, microscope and jackhammer.
“Almost like finding the proverbial needle in a haystack, it helped us locate useful and potentially newsworthy nuggets of information in a vast collection of court documents, which would have been an otherwise daunting task in the limited time we had to review it.”
The tool helped them zero in on specific details from the Cohen investigation in under 10 minutes once the documents were available and uploaded. When was the inquiry actually referred from the special counsel to federal prosecutors in Manhattan? Pop in the word “referral” and search. (Answer: February 2018.) What did the various warrants say about Mr. Cohen’s assets and liabilities? Search for “financial statement.” What specific charges were being looked at? Search for “money laundering” and “fraud.”
And where was that reference to Viktor Vekselberg, the Russian billionaire with apparent ties to the Kremlin? The tool quickly called up the seven pages with his name. And indeed, in Document 43-1, which alone ran 269 pages, there it was, the first reference, on Page 25.
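The kind of lookup described above, a phrase in and a list of documents and page numbers out, can be sketched in a few lines of Python. This is a minimal illustration, not DocumentHelper’s actual code: the `Page` record, the `search` function and the sample pages are all invented placeholders, and the text is assumed to have come out of OCR already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    document: str  # document label (placeholder values below)
    number: int    # page number within that document
    text: str      # OCR output for the page


def search(pages, phrase):
    """Return (document, page number) for each page containing the phrase.

    Case and line breaks are ignored, since OCR text often wraps mid-phrase.
    """
    needle = " ".join(phrase.lower().split())
    hits = []
    for page in pages:
        haystack = " ".join(page.text.lower().split())
        if needle in haystack:
            hits.append((page.document, page.number))
    return hits


# Made-up pages for demonstration; none of this text comes from real filings.
pages = [
    Page("sample-a", 1, "Placeholder text about a financial\nstatement."),
    Page("sample-a", 2, "Placeholder text with nothing relevant."),
    Page("sample-b", 7, "A referral is mentioned here."),
]

print(search(pages, "financial statement"))  # [('sample-a', 1)]
print(search(pages, "REFERRAL"))             # [('sample-b', 7)]
```

Scanning every page on each query is fine for a few hundred pages; a larger archive would more likely build an index up front.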
The Times is fortunate to be able to invest in and incorporate newsroom developers, who can create work flow-focused software for reporters and staff. The job is part coach and part coder, listening for pain points and evaluating possible solutions, whether those fixes take the shape of work flow adjustments or custom software. In this case, we have a dedicated tool that will continue to grow and help the newsroom wrangle all those last-minute documents.
Agustin Armendariz contributed reporting.