In Google we trust

Discussion in 'Cultural Discussions' started by timpeac, Aug 24, 2005.

  1. timpeac

    timpeac Senior Member

    England
    English (England)
    Hi

    I would like your opinions on how much trust we should put in google when searching for language usages. I am a firm believer in the fact that there are enough nutters in cyberspace for most things to have been written - rightly or wrongly - at some point and so the fact that you find a certain number of examples of a certain usage is not necessarily proof that it is correct/acceptable or even remotely admitted as a possibility by native speakers.

    But how many hits showing a certain usage are necessary for us to say "whatever books on grammar, vocabulary or usage may say, it is clear that x or y usage/word/idiom is certainly said by a reasonable number of speakers?" In the French forum I just googled the supposedly French word "méditeur" which doesn't exist and had 135 hits (mostly misspellings of "médiateur" = "mediator"). Or would you say that that number of hits is enough to say that the word "méditeur" exists?

    What are the pitfalls of googling? Should you try to exclude sites written by foreign speakers in your target language? How would you go about that?

    It seems to me that google is a hugely useful and relevant source. But as with all statistical evidence you can prove pretty much whatever you want. What is the best way to get "clean" evidence?
     
  2. Cath.S.

    Cath.S. Senior Member

    Bretagne, France
    français de France
  3. timpeac

    timpeac Senior Member

    England
    English (England)

    75,400 hits for "shakspeare"!! Oh that's just fantastic:D . Actually, I've just started a thread on the theme of googling in the culture forum. Could you move this message and yours there? Cheers muchly.
     
  4. astronauta Senior Member

    canada
    Spain. Spanish (ES, MX) English (UK, CA, US)
    I love it and hate it....

    It's highly commercial but useful.

    However, have you tried looking for the direct phone # or postal address of a hotel???

    It's almost impossible to find it unless you also google the area code as well; you get a zillion travel agent pages....

    About what you said, New Scientist magazine published an article about the zillion pages that contain misspellings, including pages from universities and things of the sort...

    I don't know, I would not know how to answer your question.
     
  5. Kelly B

    Kelly B Senior Member

    USA English
    I wish I could accurately remember the example I looked at recently. There was an ASTONISHING number of hits on a misspelled word. Google was smart enough, however, to ask "did you mean correct spelling?" One suggestion, then, is that it is very important to check for that little block. The fact that vast numbers of people use it incorrectly does not make it correct (yet!), particularly when the point here is to discuss correct spelling and grammar with people who care about that sort of thing.
    As you said, many Google entries are written by non-native speakers, increasing the likelihood of errors. They don't have it emblazoned across their websites any more than they have it written on their foreheads, and it's much harder to know, in the absence of an accent.... I would not be surprised to find that this is very often the case on university sites, given the demographics of those pursuing advanced degrees in the U.S.
     
  6. rob.returns

    rob.returns Senior Member

    Phil
    Philippines-English, tagalog, spanish, chavacano, tausog, visaya, ilonggo.
    From my point of view, if you do google, you need to verify that information on other sites. Separate fact from fiction. Ask questions. Do research. Know the key words. That way you get the truth and not trash on your topic.
    And know "HOW" to search...
     
  7. timpeac

    timpeac Senior Member

    England
    English (England)
    Hi Astronauta

    Your comments are interesting and true - however I need to make sure this thread stays on track -

    I am asking only about the use of google as a tool to get linguistic evidence.

    Please feel free to start a thread on the merits of google generally, I think that would be very interesting too!!:)

    Thanks! Tim
     
  8. timpeac

    timpeac Senior Member

    England
    English (England)
    I suppose I should give my opinion - I would say that a Google hit count of fewer than 500 is pretty meaningless in terms of verifying a usage. I would like to see at least 10,000 before saying something is common.

    However, of course, common does not equal right - as Egueule's googling of "Shakspeare" instead of "Shakespeare" returned 75,400 hits!! I imagine that is a bit of an exception since his name is very unusual but it goes to show some of the pitfalls.
     
  9. panjandrum

    panjandrum PongoMod

    Belfast, Ireland
    English-Ireland (top end)
    Use of Google - or any equivalent search - needs to be accompanied by intelligence. As clearly stated by others, it finds the characters you search for. This gives no value judgement on the sites.
    So you need to look at a good sample of the actual sites found. It is usually obvious whether or not they are credible - can you cite them as instances of legitimate use of the character string you searched for?

    Without care, you could find yourself endorsing usage that appears only in blogs.

    Useful refinements are to list only UK sites or sites in a particular language.

    I used Google recently to assess use of corpse v. body. Body outnumbered corpse by about 10:1 - but significantly, sites listed for "<words and> corpse" were fiction, sites for "<words and> body" were fact.
    As with all information analysis, a great deal of judgement must be applied. Careless use of raw counts from Google should be punished severely.

    You can get a much less random analysis of use of English by searching the British National Corpus.

    Hit counts of <100 can still be useful, if you know why. My body count was looking for a very specific set of key words. But sure enough, anything that comes up with a low hit count is likely to be due to your poor spelling:)

    I should add that I am an information sceptic. I never trust any source on its own; always look for confirmation; never trust the first analysis; pain in the butt really:eek:
     
  10. Cath.S.

    Cath.S. Senior Member

    Bretagne, France
    français de France
    There is none. Language is dirty in essence. :)

    But going back to your méditeur example, we know the word does not -yet - exist because that's not what the person who spelled it that way intended to write.
    I know it's a pretty hard question, but we must ask ourselves, every time we use Google, did the writer really intend to write what he did?

    In my book, a word or usage only exists if it is voluntary.
     
  11. germinal

    germinal Senior Member

    Bradford, England
    England English

    Interesting choice Shakespeare but you probably know that his name was often spelled in different ways: :)

    1. Introduction

    One of the most common articles of Oxfordian faith is that there is great significance in the various spellings of Shakespeare's name. The spelling "Shakespeare," according to most Oxfordians, was used to refer to the author of the plays and poems, while the spelling "Shakspere" (or "Shaksper," in the version sometimes promoted by more militant Oxfordians such as Charlton Ogburn) was used to refer to the Stratford man. A milder version of this claim acknowledges that Elizabethan spelling was not absolute, but still says that the usual and preferred spelling of the Stratford man's name was "Shaksper(e)," as opposed to the poet "Shakespeare." These claims about spelling are usually accompanied by an assertion that the two names were pronounced differently: "Shakespeare" with a long 'a' in the first syllable, as we are accustomed to pronouncing it today, but "Shakspere" with a "flat" 'a,' so that the first syllable sounds like "shack." A separate but related claim involves hyphenation: the name was occasionally hyphenated in print as "Shake-speare," a fact which Oxfordians say points to it being a pseudonym. These claims are given more or less prominence in different presentations of the Oxfordian theory, but they are virtually always present in one form or another. Indeed, they are vital for the Oxfordian scenario, since they make it easier for Oxfordians to believe that the "William Shakespeare" praised as a poet was some mysterious figure with no apparent connection to the glover's son and actor "William Shaksper" from Stratford-upon-Avon.

    www.shaksper.net/archives

    Germinal.

    .
     
  12. timpeac

    timpeac Senior Member

    England
    English (England)
    Ah, excellent quote Germinal. I had a feeling that the present spelling of Shakespeare was more by convention than anything else. But that is really interesting for trusting google - it may not tell us the most accepted form, but a hit rate as high as 75k is enough to show that the variant is far from unheard of!! Maybe this is just an example of how much most people couldn't care less if they are told to spell something a certain way, they will carry on expressing themselves as they see fit regardless.
     
  13. Cath.S.

    Cath.S. Senior Member

    Bretagne, France
    français de France
    Tim, be careful with the numbers Google quotes!
    The first number you get includes duplicates.
    In fact there are only 783 pages that contain the Shakspeare spelling. You cannot trust the number that appears on the first page of results, what you have to do in order to get the real number of instances is this:
    1. select advanced search
    2. set the number of results per page to 100
    3. go to the last page of results, at the bottom of the page you'll see a message like this one:
    In order to show you the most relevant results, we have omitted some entries very similar to the 783 already displayed.

    Edit
    Germinal, the information you give about alternate spellings of Shakespeare is truly fascinating! Thanks! :)
     
  14. cuchuflete

    cuchuflete Senior Member

    Maine, EEUU
    EEUU-inglés
    To Tim's original question...usage, when prolonged and widespread, finds its way into grammar books and dictionaries. The following logically meaningless atrocity is on its way into the English language, and may have arrived, despite my futile protests:

     
  15. timpeac

    timpeac Senior Member

    England
    English (England)
    Thanks Egueule - this is the sort of tip I was hoping for, along with the discussion of the merits of using google for linguistic information.
     
  16. timpeac

    timpeac Senior Member

    England
    English (England)
    Haha, it's a brave new world Cuchu.
     
  17. Edwin

    Edwin Senior Member

    Tampa, Florida, USA
    USA / Native Language: English
  18. timpeac

    timpeac Senior Member

    England
    English (England)
  19. lsp

    lsp Senior Member

    NY
    US, English
    I've questioned the use of Google in these forums many times. Spelling is a different case than grammar, because, as mentioned, Google will suggest an alternative. But I think that the numbers alone don't mean anything, anyway. They are not a percentage of anything meaningful. The sample itself is not pure. Our examples of misspellings and bad grammar serve to increase the instances in search results. Non-natives seeking to be corrected are included. Teen blogs, song lyrics, quizzes with multiple choice answers... and so on and so on. And there are plenty of sites that would likely add to the counts on the side of grammatical correctness but don't because they are subscription based or not well indexed for Google's spiders (ever notice the Wall St. Journal doesn't come up in any Google results?). It's too soon IMHO to declare the mere number of results a fair resource to guide our understanding of language usage.
     
  20. Edwin

    Edwin Senior Member

    Tampa, Florida, USA
    USA / Native Language: English
    LSP, in fact Google does give hits on WSJ articles. See
    "Wall Street Journal" Bush


    Although counts don't count for grammatical constructions, I find it sometimes helpful to search on expressions restricted to Spanish language pages and then actually open some of the pages and try to determine the author and type of site.
     
  21. lsp

    lsp Senior Member

    NY
    US, English
    Only one of the results was an article within the domain of the Wall St. Journal. The rest referred to the journal but were on other sites. I am surprised there was even one, but if you were to have googled a quote from today's paper, without adding WSJ to your search terms, you'd have gotten no results.
     
  22. panjandrum

    panjandrum PongoMod

    Belfast, Ireland
    English-Ireland (top end)
    Absolutely agree. Unless you do that for some of the links that are listed, you have no idea what you are really counting.
     
  23. timpeac

    timpeac Senior Member

    England
    English (England)
    Ah - this highlights a very important point for me - I do not try to use google to find the "correct" usage - 9 times out of 10 this will just be the usage of the people with most power at the time, and there are lots of (conflicting) books to refer to on the issue - I use it to find out what is really said by people.

    I agree you need to be really careful with using google to support usage - however, surely there is some number where you need look no further - it must be a normal spelling-usage. I don't know, if you had a million hits for something would you not be willing to bet a sizeable sum on that evidence alone that it is a normal usage?
     
  24. Merlin Senior Member

    Philippines
    Philippines - Tagalog/English
    Google, google, google. Some say it's not accurate. Some say it is. All I can say is it helps a lot. You just have to find a way to get the best ones. It's a great start if you're searching for something. We just have to wisely use it...
     
  25. lsp

    lsp Senior Member

    NY
    US, English
    Reluctantly, with caveats. I still feel we don't know enough about or consider seriously enough what is searched, what is not, etc., to take numbers - even really big ones - at face value, which so many already are doing. I feel the tide is overwhelming, which means I'm afraid that further explanations won't be coming anytime soon as people accept what they already get from google as the end all and be all of incontrovertible scientific research, rather than interesting anecdotal information.
     
  26. cuchuflete

    cuchuflete Senior Member

    Maine, EEUU
    EEUU-inglés
    It's curious that in all of this discussion about what Google is and is not good for, we have had such scant mention...one or maybe two posts...of the statistical sample from which Google data is extracted.

    Imagine the largest concentric circle: All speakers and writers, both native and non-native, of a language. Now remove the majority of them, who do not have a computer or access to one. That tends to leave the wealthiest, together with business and non-commercial organizations, including government agencies. This inner circle is the population from which web pages emerge.

    Perhaps in English and French one might argue that the computer users, and their written usage on line, are fairly representative of the population as a whole. I am not so sure about that. In Portuguese, a much smaller percentage of the total speakers have computer access.


    Google may be a useful tool in approximating the current usage of a significant sub-set of a population, but we need to take care in extrapolating to the total population, or we will risk some Google sized errors. Groups probably under-represented in the Google sample include those over 50 years of age, the poor, and people in more rural areas without telecommunications infrastructure.

    In other words, if you are a Maine lobsterman, Google may not reflect your usage, while if you work for a bank in London, your words carry extra weight.
     
  27. timpeac

    timpeac Senior Member

    England
    English (England)
    Very good points, Cuchu. You have underlined the exclusion of the usage of certain native speakers, so I will mention some other implications of this - the internet contains a lot of other evidence of usage other than current chatty usage. For a start it is going to favour the written over the spoken (so if you are someone who would always write "whom" but only ever say "who" you will be misrepresented). It will also contain quotes from people who haven't got access to the internet, the greatest majority of whom must be dead I suppose. So the fact that you can almost certainly find all of William Shakespeare's plays on line somewhere will influence our google-based decision on whether it is more usual to say "I am to bed" or "I'm off to bed".

    It is part of my question as to how we can best filter down our body of evidence before googling - eg should we try to exclude non-natives? This is not a black and white question for me, since I do not think per se that someone can, or should be able to, influence a language only if they speak it as a mother tongue. But obviously you don't want people who can't speak it to save their lives to have influence either. Particularly for English, or rather English in its globish form, it would seem relevant to look at how Italians speak to Portuguese in English and vice versa.
     
  28. panjandrum

    panjandrum PongoMod

    Belfast, Ireland
    English-Ireland (top end)
    Here is an illustrative example.
    Supposing someone in these forums asked:
    I am working in the elephant house in XXXXX zoo. I provide the elephant with food and water. I know I can say "I feed the elephant," but can I also say, "I water the elephant"?

    So, here we go a googling (among the leaves so green..........)
    ....pause. Whistle Greensleeves.....

    And I come back with this reply:
    Thank you for your really interesting question. I have completed an exhaustive search of the known literature and can confirm that "water the elephant" is perfectly normal usage. It is a little less common than "feed the elephant", of course, but that is not surprising because the keepers must actually provide food for the elephants whereas the water is normally available from an automatic source.
    Source: Google:
    "feed the elephant" = 1,310 hits
    "water the elephant" = 423 hits.

    But in fact, when you look closely, none of the first ten links actually use the phrase in the sense I was looking for. To save you the bother of doing it yourself, of the first ten links:
    5 = ...water. The elephant ...
    3 = ...to get water, the elephant...
    1 = ...sucking it full of water the elephant....
    1 = ...The water the elephant sprays...

    OK, so the hit rate is way less than timpeac's suggested million, but I think this makes the point, or some point, or at least wasn't totally pointless.

    My sceptic's point, I suppose, is that Google is one of many sources. Never trust any source - alone, and always make sure you know the quality and the characteristics of each.
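
The check described above, reading the hits rather than trusting the count, can be partly automated. What follows is only a rough sketch, assuming you have already copied the text snippets of the hits into a list by hand; the function name `classify_hits` and its category labels are made up for illustration and are not part of any Google tool. It separates snippets where the words run on unbroken from those where punctuation sits between them, which a punctuation-blind search still counts as a hit.

```python
import re


def classify_hits(snippets, phrase):
    """Sort snippets by how the searched phrase actually appears in them.

    Hypothetical helper for illustration only: 'unbroken' means the words
    are separated by whitespace alone; 'broken_by_punctuation' means a
    full stop, comma, etc. falls between them.
    """
    words = [re.escape(w) for w in phrase.split()]
    # Words separated only by whitespace.
    unbroken = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    # Words separated by any run of non-word characters (spaces, commas, full stops).
    loose = re.compile(r"\b" + r"\W+".join(words) + r"\b", re.IGNORECASE)

    result = {"unbroken": [], "broken_by_punctuation": [], "no_match": []}
    for snippet in snippets:
        if unbroken.search(snippet):
            result["unbroken"].append(snippet)
        elif loose.search(snippet):
            result["broken_by_punctuation"].append(snippet)
        else:
            result["no_match"].append(snippet)
    return result


if __name__ == "__main__":
    # Invented sample snippets modelled on the kinds of hit listed above.
    sample = [
        "They need water. The elephant drinks daily.",
        "To get water, the elephant walks for miles.",
        "The water the elephant sprays cools it down.",
        "I water the elephant every morning.",
    ]
    for label, items in classify_hits(sample, "water the elephant").items():
        print(label, len(items))
```

Note what this does not do: a snippet such as "The water the elephant sprays" still lands in the unbroken group although it is not the usage being asked about, so the unbroken hits still have to be read by a person.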
     
  29. timpeac

    timpeac Senior Member

    England
    English (England)
    Pan - your points are excellent as always!!:) Can I please just point out that I was not suggesting a million - I was only picking a number so ridiculously large that it is pretty much beyond argument that there is something behind the usage. I would love to be able to bring that figure down somehow to something we can all agree is reasonable - I don't think that is going to be very likely though!!
     
  30. panjandrum

    panjandrum PongoMod

    Belfast, Ireland
    English-Ireland (top end)
    Sorry Tim - I wasn't meaning to question your ridiculously large figure:p

    But even if you find a million of what you are looking for, you need to bear in mind that what Google counts is not necessarily what you think it counts - hence the need to inspect a good sample.
     
  31. cuchuflete

    cuchuflete Senior Member

    Maine, EEUU
    EEUU-inglés
    I'm enjoying this thread because we have been using Google counts to determine whether or not certain words and phrases should be added to the WR translation dictionaries.

    We use a secret number, determined with total and absolute objective integrity by one M.K., to decide if a word or phrase is worthy. Translators are allowed to ignore this for certain technical terms which are not going to be in Google with any great frequency, and we use the Google counts more as a guideline than a rule.

    I added a word with only about 400 citations, but it was very precise and correct in a narrow technical application. I also gave the more popular common speech version with its 50,000 appearances, and noted that the latter is colloquial.

    Where Google is of little use is determining which usages are current, and which are dated or obscure, as Panj pointed out with his mention of the dead writers.

    Ah, the joys of playing Results 1 - 10 of about 103,000 for lexicographer
     
  32. timpeac

    timpeac Senior Member

    England
    English (England)
    <<Spitting feathers>> Please! I don't mind not getting any recognition for my comments, but equating me with an Irish monkey...:eek:

    :D :D :D
     
  33. panjandrum

    panjandrum PongoMod

    Belfast, Ireland
    English-Ireland (top end)
    ~~~~Big Chuckle~~~~
    Oh - that was YOU writing about dead righters:)
    I spent ages looking back to find out what on earth cuchu was getting at.
    I had finally concluded he was harking back to my early post about corpses and bodies - although that seemed too elusive an allusion even for cuchu.
    I am so glad to have had that resolved that I will forgive your use of that word instead of orangutan - one day, maybe soon.
     
  34. cuchuflete

    cuchuflete Senior Member

    Maine, EEUU
    EEUU-inglés
    Apologies to all those I have not offended...

    "the greatest majority of whom must be dead I suppose..." was in fact the work of Pandemonium's altered ego .

    To Tim goes the credit for the astute observation, to Panj the credit for his orangargantuan patience in dealing with my misattribution.

    Reminds me that as a child, when we were being taught to use library resources, the English teacher kept referring to the D A B. Some sassy child finally mustered the courage to ask what it was. "Oh," she replied, "Dictionary of American Biography, but easier to remember if you think of it as DAB for Dead and Buried."
     
  35. timpeac

    timpeac Senior Member

    England
    English (England)
    Ouch - did I write "the greatest majority"!! OK OK I take it back, it was Panj it was Panj!!:eek:
     
  36. cuchuflete

    cuchuflete Senior Member

    Maine, EEUU
    EEUU-inglés
    They are out there, watching us. They have friends in the silentest majority.
    The smaller majorities are alive and well and counting their googles.
     
  37. Edwin

    Edwin Senior Member

    Tampa, Florida, USA
    USA / Native Language: English
    I imagine that everyone reading this thread is familiar with the various advanced search techniques available with Google. For example, at http://www.google.com/advanced_search there are a number of available options. In particular there are links to the following:

    A complete list of advanced search operators can be found at http://www.google.com/help/operators.html

    Also by clicking the various links on the Google site one can find out a lot about how to get the most from Google. For example, here http://www.google.com/help/features.html you will find out about many search features including the Spell Checking that Google does for each inquiry.

    Actually there is probably enough material on optimal use of Google to justify a course. Eventually, I suppose one will be able to get a PhD in Google Studies. Maybe someone already has?
     
  38. garryknight Senior Member

    Kent, UK
    UK, English
    That's almost as bad as the following:
     
  39. Edwin

    Edwin Senior Member

    Tampa, Florida, USA
    USA / Native Language: English
    That's just alot :cool: of people being cool! The Macquarie Book of Slang says

    alot
    (a common spelling error, but also used deliberately as a `cool' spelling) adverb 1. a great deal: The traffic has eased up and we're cruising alot faster now. --adjective 2. many; a great number or amount of: It was alot of fun.
     
  40. cuchuflete

    cuchuflete Senior Member

    Maine, EEUU
    EEUU-inglés
    And I suppose dr. google will give us a high number for alotment: the excessive use of that very unique word alot. Groannnnnnnnnn
     
  41. modgirl Senior Member

    USA English, French, Russian
    My favorite misspelling, :cross:definately :cross: receives

    "2,890,000 (hits) for definately"

    Whoa baby......

    It is true that several of those sites are correcting the spelling, but there are many in which the word is substituted for definitely!
     
  42. Edwin

    Edwin Senior Member

    Tampa, Florida, USA
    USA / Native Language: English
    You are right lsp. I was annoyed to find that out. Investigation shows that the WSJ is one of the many publications that are available online only for a price. However, many libraries provide access via such electronic databases as LexisNexis, Access World News, and ProQuest Newspapers. Unfortunately LexisNexis does not provide access to the WSJ. I am informed by the reference desk of my library that "The 'ProQuest newspapers' database searches the WSJ (as well as some other papers, though not as many as LexisNexis) back to 1982 and the new 'ProQuest newspapers (historical)' database provides full-text access to the WSJ back to 1889."

    Official description of Access World News: "Access world news from NewsBank provides full-text information and perspectives from over 600 U.S. and over 500 international sources, each with its own distinctive focus offering diverse viewpoints on local, regional and world issues. Each newspaper or wire service provides unique coverage of local and regional news, including specific information about local companies, politics, sports, industries, cultural activities, and the people in the community. Paid advertisements are excluded."

    Access World News might be useful for lexicographical purposes.

    There are, of course, many other examples of databases that are accessible only for a fee. Much of the scientific literature has this problem. But by using libraries one can often get free access.
     
  43. lsp

    lsp Senior Member

    NY
    US, English
    Exactly, and to my earlier point, whole universes of spellings and grammatical usages that would certainly skew results in one direction or another are invisible to Google.
     
