How The Declassification Engine Caught America's Most Redacted

[Image: President Eisenhower pointing and staring directly at the camera, black and white]

Eisenhower Edition

Methodology

We began with a set of over 117k documents from Gale Cengage’s U.S. Declassified Documents Online system (DDO). The collection includes most documents declassified at presidential libraries over the last forty years, including thousands of pages of top-level documents from the CIA, the State Department, and the Pentagon. They cover US foreign policy since World War I, but most are from the Cold War era. Gale used double-key entry to transcribe most of these documents, though more recently released materials were scanned using Optical Character Recognition (OCR).

This collection includes both “sanitized” and “unsanitized” versions of the same documents. One reason is that different departments and agencies redact different things depending on what they deem to be most sensitive. Bringing them together can reveal the people, places, and things that are most likely to be redacted, thus helping to correct the intrinsic bias in the public record: we only know what the government will let us know.

Sasha Rush, who was then a Ph.D. student in computer science at MIT, wrote a program that combined visual and textual analysis, and then ran it on the database. It enabled us to identify over five thousand examples of (un)redacted text. With this kind of data, never before available, we could identify which names are disproportionately likely to be blacked out relative to how often they appear in the rest of the corpus.
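To illustrate the textual half of this idea, here is a minimal sketch of how a sanitized and an unsanitized copy of the same passage could be aligned to recover what was blacked out. It is not the program described above: the function name, the "[REDACTED]" marker, and the sample sentences are all assumptions made up for the example, and the real system also relied on visual analysis.

```python
import difflib


def recover_redacted_spans(sanitized, unsanitized, marker="[REDACTED]"):
    """Return the unsanitized text that lines up with each redaction marker.

    Both arguments are plain strings. The marker is a hypothetical
    placeholder for a blacked-out region in the sanitized copy.
    """
    san_tokens = sanitized.split()
    unsan_tokens = unsanitized.split()
    matcher = difflib.SequenceMatcher(a=san_tokens, b=unsan_tokens, autojunk=False)

    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" block containing the marker is where the two
        # versions disagree because text was removed.
        if tag == "replace" and marker in san_tokens[i1:i2]:
            spans.append(" ".join(unsan_tokens[j1:j2]))
    return spans


# Invented sentences, for illustration only.
sanitized = "The envoy met [REDACTED] in the capital on Tuesday"
unsanitized = "The envoy met John Doe in the capital on Tuesday"
print(recover_redacted_spans(sanitized, unsanitized))
```

Aligning on whole words rather than characters keeps the recovered spans readable, at the cost of being sensitive to transcription differences between the two copies.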

To come up with the list, we first had to decide what period to focus on. The DDO has a lot more documents for the 1950s than for the 1980s, so a “most redacted” list for the whole Cold War would be misleading. We therefore decided to start with the period when Eisenhower was President.

We did not want just the absolute number, i.e. the names most often redacted overall, since that largely reflects which names appear most often in the collection, e.g. the Secretary of State, the Director of the CIA, etc. Instead, we wanted to get at relative sensitivity: the names of people disproportionately likely to be blacked out.

Finally, we had to find some way to pull the names themselves out of the 60k+ words of redacted text. We therefore ran a Named Entity Recognizer (NER) over the collection to extract person names. We then calculated several measures of how likely a name is to show up in the documents, the most important being:

  • Total Prob: The number of times a specific name appears divided by the total number of words in all the Eisenhower-era documents. This gives us the probability of the name appearing anywhere in these documents.
  • Redacted Prob: The number of times the name appears in redacted text divided by the total number of words in all the redactions. This gives us the probability that a name will appear in redacted text from the Eisenhower era.
  • Log Odds: The logarithm of the ratio of Total Prob to Redacted Prob, i.e. log(Total Prob / Redacted Prob). This tells us which names are most likely to appear in redactions compared to their likelihood of appearing in the whole collection (the lower the number, the higher the relative likelihood of the name appearing in a redaction).
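The three measures above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the analysis: the function name and the toy token lists are invented, each token stands in for one word, and Log Odds is taken as log(Total Prob / Redacted Prob), which is the reading consistent with lower values meaning more redaction.

```python
import math
from collections import Counter


def redaction_scores(all_tokens, redacted_tokens):
    """Compute (Total Prob, Redacted Prob, Log Odds) for each redacted token.

    all_tokens: every word in the corpus (assumed to include redacted text).
    redacted_tokens: only the words found inside redactions.
    """
    total_counts = Counter(all_tokens)
    redacted_counts = Counter(redacted_tokens)
    n_total = len(all_tokens)
    n_redacted = len(redacted_tokens)

    scores = {}
    for name, r_count in redacted_counts.items():
        t_count = total_counts.get(name, 0)
        if t_count == 0:
            continue  # cannot form a ratio without a corpus count
        total_prob = t_count / n_total
        redacted_prob = r_count / n_redacted
        scores[name] = (total_prob, redacted_prob,
                        math.log(total_prob / redacted_prob))
    return scores


# Toy data with invented names; not drawn from the collection.
all_tokens = ["jones"] * 8 + ["smith"] * 2 + ["memo"] * 10
redacted_tokens = ["smith", "smith", "jones", "memo"]

scores = redaction_scores(all_tokens, redacted_tokens)
# Sort ascending: the lowest Log Odds is the most disproportionately redacted.
ranked = sorted(scores, key=lambda name: scores[name][2])
print(ranked)
```

In the toy data, "jones" appears far more often overall, but "smith" ranks first because half of all redacted words are "smith" while it makes up only a tenth of the corpus.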

 

The Data

The data in Table 1 show the top terms that the NER tagger identified as person names. They are sorted by Log Odds, i.e. by how disproportionately likely they are to show up in redacted text.

Table 1: NER Most Redacted Person Names

However, there were also names relatively likely to appear in redacted text that were not tagged as names by NER. Here is a list of words most likely to be redacted that the NER could not categorize:

Table 2: Sample of cases where most redacted plain words were actually highly-ranked names

After combining these two sets, we scoured the actual redacted text containing these names to correct any errors. In some cases the count was too low because of misspellings. In others it had to be adjusted down because some redactions were double-counted. And there is always the possibility that the total count is off if, for instance, multiple people with the same name appear elsewhere in the collection. As we continue processing these collections, we will identify and correct any such errors.
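The corrections described here were made by reading the redacted text. As a hedged illustration of how undercounts from misspellings might be surfaced for that kind of review, the sketch below folds near-identical spellings into the most frequent form using fuzzy string matching. The function name, the 0.8 similarity cutoff, and the sample counts are all assumptions, and any merge it suggests would still need a human check, since two similar strings can be different people.

```python
import difflib


def merge_spelling_variants(counts, cutoff=0.8):
    """Fold likely misspellings into the most frequent spelling.

    counts: dict mapping a name string to how often it was seen.
    cutoff: similarity threshold (0 to 1) passed to difflib; an assumed value.
    """
    merged = {}
    # Visit frequent spellings first so they become the canonical form.
    for name, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        match = difflib.get_close_matches(name, list(merged), n=1, cutoff=cutoff)
        if match:
            merged[match[0]] += count
        else:
            merged[name] = count
    return merged


# Invented names and counts, for illustration only.
raw_counts = {"kowalski": 12, "kowalsky": 2, "hartmann": 5}
print(merge_spelling_variants(raw_counts))
```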

But for now this is the best intelligence we have about the people you don’t see enough of in the official history because they have been blacked out of the official record.

Table 3: Adjusted Final Top 10

 

Examples of redactions

"...more frequent. The opposition party maintains that the government is trying to have Mr. Inonu lynched. The Turkish Defense Minister recently remarked that the military leaders may have to intervene if the tension continues. If Inonu were killed, a revolt could take place in..."

"...He did not feel that Azzam Pasha was intrinsically evil, but rather that he could not be trusted to carry any messages. Mr. Lloyd said, however, that he agreed there was considerable room for maneuver with respect to the over-all problem..."

"...Mayor Brandt was a most interesting character and also a possible candidate for leader of the German Socialist Democrat Party in the future. He was often called "the Bastard of Berlin" because he had no known father. In any event, he was a self-made man and one to be reckoned with. Brandt was strongly on our side and it was our hope that he and Adenauer would be able to get together..."

"...JOXE REPLIED HE WAS IN FULL AGREEMENT WITH ME AND SAID HE HAD ALREADY TOLD PRESS RELATIONS OFFICIALS AT QUAI TO EMPHASIZE TRIPARTITE AND WESTERN UNITY IN REGARD TO EVENTS IN HUNGARY. HE SAID HE WOULD SPEAK TO THEM AGAIN WITH SPECIFIC EMPHASIS ON UNITED STATES ROLE. JOXE THEN SAID, APPARENTLY THINKING OF PINEAU'S STATEMENT, THAT FRENCH GOVERNMENT HAD ALWAYS REALIZED SERIOUSNESS WITH WHICH UNITED STATES VIEWED HUNGARIAN AFFAIR BUT HAD..."

"...Mr. Allen Dulles then turned to the situation in Iran, which was very disturbed owing to the highly-publicized trial of Mossadegh. Mr. Herbert Hoover, Jr., had returned from his first visit to Teheran with a pessimistic judgment as to the prospects for an oil settlement. Mr. Hoover had reported the Iranians very ignorant as to the facts of life with regard to their oil resources. Secretary Dulles said that Mr. Hoover had been rather more optimistic in reporting to him, and had expressed the view that something could be worked out over a period of time..."