Over the decades, the two companies strove to outdo each other in providing the best online legal research tools on the market. Each came up with increasingly sophisticated ways to access legal information online.
Though it focused aggressively on improving delivery, West never confused the vehicle with the content. The internet itself was never a product. Rather, West's product was its value-added legal information. When Opperman invested in Westlaw, he had no way of knowing that the internet would overtake print publishing. "I saw it as another way of selling our material," he says.
Turn words into math
At the end of World War I, a German engineer named Arthur Scherbius created a machine that could encrypt and decode messages. It was adopted by the Nazis and called the Enigma.
One of the devices was sent by mistake to the Biuro Szyfrów (Poland's codes bureau), where a Polish mathematician named Marian Rejewski applied mathematics to crack the cypher. For the next seven years, cryptologists at the Biuro Szyfrów regularly deciphered Enigma-encrypted messages.
Five weeks before the outbreak of World War II, the Poles shared the code-breaking method with their French and British allies. Cryptologists at Bletchley Park in England used the information to decode thousands of Nazi messages. The intelligence they collected became known by the code name ULTRA, for ultra-secret, and it has been credited with hastening the end of World War II by two years.
The story of Enigma and ULTRA was kept secret from the public until 1974, but Warren Weaver, who headed the Applied Mathematics Panel at the U.S. Office of Scientific Research and Development during the war, would surely have known of it. In July 1949, Weaver drafted a bold memo. He proposed that languages, like codes, could be cracked with math, using computers.
In the midst of the Cold War, Weaver's idea had vast appeal. His memo set off a renaissance in computer science, as researchers busily scratched out calculations in an attempt to translate Russian through the use of mathematical formulas.
But by about 1970, the flurry of research abruptly stopped, when scientists realized Weaver was fundamentally wrong. It turns out that languages can't be cracked through math because they aren't math-based. (In retrospect, Weaver should have known better. After all, the Allies used the Navajo language as an unbreakable code during the war.)
But there was a glimmer of the future in Weaver's idea: Powerful technologies can be created when words are treated like math. As the scientists worked with linguistic data, they discovered sophisticated mathematical formulas that could describe patterns in the data. These algorithms could be used to teach computers to recognize patterns, and once a computer understood a pattern, it could sort and categorize new data, even if it didn't technically understand what the language meant.
Westlaw's vast trove of legal documents turned out to be the perfect diet for the new technologies. The company's computer scientists designed a system of algorithms that they dubbed CARE, Categorization and Recommendation Engine. The computer uses a system of statistics, including Bayesian probability, to predict where documents should be categorized. CARE suggests key numbers for new cases, identifies cases affected by a new decision, and performs a host of other tasks. Before CARE, West had hired freelance attorneys to do this work. Now, a computer can do it more quickly and more accurately.
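To make the idea of categorizing documents by Bayesian probability concrete, here is a minimal sketch of a naive Bayes text categorizer. It is an illustration only, not CARE itself: the class name, the two topic labels, and the tiny training set are all hypothetical, and a real system would use far more data and many more signals.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesCategorizer:
    """Toy multinomial naive Bayes model (hypothetical; not West's CARE)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word counts
        self.doc_counts = Counter()              # category -> training docs seen
        self.vocab = set()

    def train(self, tokens, category):
        self.word_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocab.update(tokens)

    def rank(self, tokens):
        """Return categories sorted from most to least probable."""
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior: how common the category is in the training data.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log(
                    (counts[tok] + 1) / (total_words + len(self.vocab))
                )
            log_scores[category] = score
        return sorted(log_scores, key=log_scores.get, reverse=True)


# Hypothetical training data: token lists labeled with made-up topics.
model = NaiveBayesCategorizer()
model.train("debtor filed chapter bankruptcy petition".split(), "bankruptcy")
model.train("creditor claim bankruptcy trustee".split(), "bankruptcy")
model.train("driver negligence injury damages".split(), "torts")
model.train("injury duty breach negligence".split(), "torts")

suggestions = model.rank("trustee denied the bankruptcy petition".split())
print(suggestions[0])
```

The model never understands what "bankruptcy" means. It only counts which words tend to appear under which label, which is the sense in which a computer can sort new documents without understanding the language.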
Separate the signal from the noise
Type the word "jaguar" into Google's search engine and you'll get 64 million results. Some of the returns have to do with the animal; others refer to the luxury car. The jaguar problem is precisely the kind of search confusion that Westlaw tries to avoid.
"It's all about trying to find a needle in a haystack," says West CEO Peter Warwick.
The basic information-retrieval technology West uses is the same as the technology that underlies a search engine like Google. It's called TF/IDF, for term frequency/inverse document frequency. It measures how often a term appears in a document and weighs that against how rare the term is across the entire pool of documents in the system. A term that is frequent in one document but rare everywhere else marks that document as highly relevant to a search for it. But West's system has some important differences.
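The TF/IDF calculation can be sketched in a few lines. This is a textbook version under simple assumptions (term frequency as a share of the document's words, inverse document frequency as a natural logarithm), not the formula West or Google actually uses; the function name and the three sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(corpus):
    """Return one {term: score} dict per document.

    corpus is a list of documents, each a list of word tokens.
    """
    n_docs = len(corpus)
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        scores.append({
            # (share of this document's words) * log(rarity across the corpus)
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical three-document corpus.
corpus = [
    "the debtor filed for bankruptcy and the bankruptcy court agreed".split(),
    "the court heard the appeal".split(),
    "the jaguar is a large cat".split(),
]
scores = tf_idf_scores(corpus)
print(scores[0]["bankruptcy"], scores[0]["court"], scores[0]["the"])
```

In this sample, "the" appears in every document, so its score is zero. "Bankruptcy" is repeated in the first document and appears nowhere else, so it scores highest there.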
"Google knows how pages are linked, but it doesn't really know why," says Peter Jackson, chief scientist and head of research and development at Thomson Reuters.
West can return more targeted search results than Google for three reasons:
• The information in West's database is already connected through the key number system—the organizational structure that John West set up 100 years ago. West uses the connections between documents—citations as well as key numbers—to recommend search results the user might otherwise not have found.
• The pool of data is more limited because it is only legal information. West's database contains less irrelevant information than Google's massive database, which tries to index everything.
• The vocabulary in the pool of information is also more specific. Legal terms are by necessity uncreative. Rather than find a new word for the term "bankruptcy," an attorney will specifically use that term 20 times in a document, because it has a specific legal meaning that he is trying to convey. That repetition of terms makes legal information easier to search—and West's search technology includes a thesaurus that recognizes synonyms.