wesley tanaka

Speculative Explanation of Google's Infamous Error Message


Ruslan Abuzant noticed a fragment of a server status page on one of google's webpages. The error's been confirmed as "real", which has caused some speculation as to what it all might mean.

There are some problems with that particular speculation:

  1. It doesn't take into account the magnitude of the values in the first "server status" section.
  2. It doesn't seem to take into account google's workforce. Namely, it's probably reasonable to assume that google's software engineers range from reasonably good to great, and thus, the variable names and switches that appear will be short, concise, and consistently named.

Thus, I've created my own comments and speculation as to what this all means, trying to keep those two things in mind.

update: I just found this page (blocked in China). Had I seen it sooner, my weird motivation to write this up probably would have subsided. The other one didn't agree with me nearly as much. =)

The fragment as reported:

pacemaker-alarm-delay-in-ms-overall-sum 2341989

pacemaker-alarm-delay-in-ms-total-count 7776761
cpu-utilization 1.28
cpu-speed 2800000000
timedout-queries_total 14227
num-docinfo_total 10680907
avg-latency-ms_total 3545152552
num-docinfo_total 10680907
num-docinfo-disk_total 2200918
queries_total 1229799558
e_supplemental=150000 --pagerank_cutoff_decrease_per_round=100 --pagerank_cutoff_increase_per_round=500 --parents=12,13,14,15,16,17,18,19,20,21,22,23 --pass_country_to_leaves --phil_max_doc_activation=0.5 --port_base=32311 --production --rewrite_noncompositional_compounds --rpc_resolve_unreachable_servers --scale_prvec4_to_prvec --sections_to_retrieve=body+url+compactanchors --servlets=ascorer --supplemental_tier_section=body+url+compactanchors --threaded_logging --nouse_compressed_urls --use_domain_match --nouse_experimental_indyrank --use_experimental_spamscore --use_gwd --use_query_classifier --use_spamscore --using_borg"
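If the tail of this fragment really is a list of GNU-style command line switches (which is what I argue below), it can be pulled apart mechanically. The following is only a sketch: the function name `parse_switches` is made up for illustration, the "no" prefix being a boolean negation is my assumption, and only a subset of the reported switches is included to keep it short.

```python
# Sketch only: assumes the tail of the fragment is a space-separated list of
# GNU-style long options, and that a "no" prefix negates a boolean switch.
# Only a subset of the reported switches is reproduced here.
FRAGMENT = (
    "--pagerank_cutoff_decrease_per_round=100 "
    "--pagerank_cutoff_increase_per_round=500 "
    "--parents=12,13,14,15,16,17,18,19,20,21,22,23 "
    "--pass_country_to_leaves --port_base=32311 --production "
    "--sections_to_retrieve=body+url+compactanchors "
    "--nouse_compressed_urls --use_domain_match "
    "--nouse_experimental_indyrank --use_spamscore --using_borg"
)


def parse_switches(text):
    """Return a dict mapping switch name -> string value or bool."""
    switches = {}
    for token in text.split():
        if not token.startswith("--"):
            continue  # skip truncated pieces such as "e_supplemental=150000"
        body = token[2:]
        if "=" in body:
            name, value = body.split("=", 1)
            switches[name] = value
        elif body.startswith("no") and len(body) > 2:
            # Heuristic: "--nouse_x" is read as "use_x = False".
            switches[body[2:]] = False
        else:
            switches[body] = True
    return switches


switches = parse_switches(FRAGMENT)
parents = [int(p) for p in switches["parents"].split(",")]
```

Under those assumptions, `switches["use_compressed_urls"]` comes out as `False`, `switches["use_spamscore"]` as `True`, and `parents` as the integers 12 through 23.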

My thoughts

The first part (the bold parameters) appears to be a set of status variables, analogous to what might be displayed if you ran the query "SHOW STATUS" in MySQL. The second part, given the -- prefix and the use/nouse at the beginning of the names, is probably a list of command line switches in GNU-ish style passed to the server process. If you take those two assumptions to be true, then all of the values set for these parameters are specific to a single server process. With that in mind:

  • pacemaker-alarm-delay-in-ms-overall-sum 2341989: Assuming "ms" stands for milliseconds, that's 2341.989 seconds or 39 minutes.
  • pacemaker-alarm-delay-in-ms-total-count 7776761: Again assuming milliseconds, that's 7776.761 seconds or 129.6 minutes. However, given the naming of the previous variable (overall sum) and this one (total count), I think these two variables are used in conjunction to calculate an average, which would make this an integer "count" as opposed to a number of milliseconds. (2341.989/7776761 = .0003 sec = 0.3 milliseconds). An "alarm" usually refers to some kind of scheduled task which happens after a delay, and a pacemaker is a device which regulates the beating of the heart, so there's probably some kind of task that gets run irregularly but continuously on the server. Perhaps it's a spider that's downloading a page off the web every 0.3 milliseconds on average. Or perhaps the server farm is set up so that computers ask their "parents" for tasks when they have enough free CPU/network/memory to do work -- the task could be a spider task, a search query task, or something else. The server would fetch the data from gfs (see below). That would make the distributed system conceptually simpler than having the parents try to assign work -- there would be no special cases to deal with when a slave/leaf computer dies (but see rpc_resolve_unreachable_servers below).
  • cpu-utilization 1.28: Given the value 1.28, probably the load average on the particular server that served the page.
  • cpu-speed 2800000000: 2.8 billion -- probably means that this process is running on a 2.8 GHz CPU core.
  • timedout-queries_total 14227: if queries_total is talking about a database, I imagine that google has enough software engineering prowess that "query" would be consistently named here and also mean "database query" in this context. In any case, it seems a bit too low to be talking about web page fetches, and calling a web page fetch a "query" seems a bit... obtuse. So this could mean that 14,227 database queries failed due to timing out. That's 0.001% of the queries_total value, which seems reasonable. Given the number of servers google supposedly has, that 0.001% could be a database computer dying and the software needing to find a replacement. Update: after thinking about --use_query_classifier, I think "query" means search query (it even matches with the "q" in google's search results URLs), in which case these might be:
    • database queries that timed out, or
    • people that requested a google search, left the connection open, but never downloaded the results page. Neither of these should happen too often, which fits with the 0.001% value.
  • num-docinfo_total 10680907: if we keep with the theme that "total" means "count", that's 10 million occurrences of something.
  • avg-latency-ms_total 3545152552: again, assuming ms = milliseconds, that's a bit under 985 hours.
  • num-docinfo_total 10680907: Look! It's even duplicated in the output for some reason (bug within a bug?). In combination with num-docinfo-disk_total, this is probably a structure that's stored in virtual memory (like how many caches work). If there are 10 million of these docinfo structures with 2 million paged out to disk, that's still 10680907-2200918 = 8479989 docinfo's in memory. If we assume that a server has 4 x 10^9 bytes of memory, that's 471 bytes per docinfo, which is a pretty small data structure. It could be enough to store a "row" containing a single word and some integers. It would make sense that each server farm computer would be doing spidering, rank calculations and queries all at the same time -- spidering is network intensive and the rank calculations are probably memory and CPU intensive, and by doing them in parallel, you have a much better chance of saturating each server computer (and thus getting the most from your hardware budget).
  • num-docinfo-disk_total 2200918: Count (again total == count?) of swapped out docinfo records. I'll guess that because the other two reasonable interpretations for this don't seem likely. 2GB seems too small for a single process on a single commodity computer, and 2TB is too big.
  • queries_total 1229799558: That's 1.23 billion. That high of a number almost certainly means "database queries", though after encountering "query_classifier" below, I'm inclined to think it means "search queries", which is the only other thing which might get that high. Interestingly, if we could figure out the uptime of this server process, this would give some idea of how many search queries google is processing. I don't think that avg-latency-ms_total is comparable to the uptime. At 41+ days, that's 346.9 search queries per second (900 million per month per server in the farm) which is impossibly high.
  • e_supplemental=150000: I would argue that this variable is named surprisingly poorly compared to the rest of them. A new employee couldn't look at this and guess at its meaning without knowing what "e" and "supplemental" meant. At the rate google is hiring, the 10-20 minutes to change this switch name might cumulatively save them many hours in training down the road. Then again, maybe the google new employee briefing covers both of these. Update: There seems to be some kind of thing called a "supplemental index" or "supplemental results". I don't know if "e" makes more sense in that context, because I have no idea what either of those things are.
  • --pagerank_cutoff_decrease_per_round=100: These can't be percentages, because 100% does not imply a decrease. So they're likely either a divisor and multiplier or a subtrahend and addend.
  • --pagerank_cutoff_increase_per_round=500: Should this be parsed as pagerank (cutoff increase per round) or as (pagerank cutoff) increase per round?
    • "pagerank cutoff" per round: not likely, since it's a constant change per round and global for the server
    • So we might surmise that these limit the amount that pagerank can change in each "round". Like with other computer science problems, the relationship of the web is probably represented as a (very sparse) matrix, and updated by operating on that matrix, where each iteration might be called a "round." By adding these cutoffs, google can dampen pagerank motion (for example, if I add a million links to my page X in one day, this pagerank cutoff-increase will put the brakes on how quickly page X's pagerank rises). This could be a customer perception issue -- people seem to get mad when their rankings in google change, and one way to mollify them is to artificially make the change more gradual. It also would help prevent a site going down temporarily from affecting its rankings too quickly, and make it easier for the site to recover its rankings once it came back online. One question I have is why they didn't name it the arguably clearer pagerank_max_increase_per_round or pagerank_increase_per_round_limit?
  • --parents=12,13,14,15,16,17,18,19,20,21,22,23: since these are command line arguments to a server process, these would have to be the "parents" of the particular server. This implies that google's spidering or rank calculation server farm is organized into a tree or a directed graph of some sort. Perhaps this server is allowed to contact any of servers 12--23 to get new tasks.
  • --pass_country_to_leaves: Again suggests that the server farm is organized into a tree structure. I'm not sure why you *wouldn't* want to pass any country information to the leaf servers in the farm. Maybe it breaks something else.
  • --phil_max_doc_activation=0.5: another poorly named switch, though this one only requires knowledge of "phil" to guess at.
  • --port_base=32311: Lowest IP port in a range of ports available for use. This might suggest that several server processes might be running on the same network device (multi-cpu or multi-core servers with a single network card) and thus need to listen for messages from each other on different sets of ports.
  • --production: usually production and debug builds are separate binaries, but this could be doing something like not evaluating assertions and whatnot.
  • --rewrite_noncompositional_compounds. The page of guesses linked above quotes: non-compositional compounds (NCCs) such as “kick the bucket” and “hot dog.” NCCs are compound words whose meanings are a matter of convention and cannot be synthesized from the meanings of their space-delimited components. Google probably has a big list of compounds like "hot dog", "kick the bucket", "by and large", "ad hoc", and "palo alto" which, since this switch is on, will get re-written into a form that the indexer can recognize as one unit (maybe something like "hot_@#DEI$%^_dog"). That way, the indexer can distinguish between "hot dog" and "dog", or between "ad hoc" and "ad".
  • --rpc_resolve_unreachable_servers: RPC stands for remote procedure call, and refers to a message (and possibly return value) passed between two servers, both of which you usually control. In any case, to think of a "download" of a web page as a procedure call is a stretch at best, so this likely refers to inter-server-farm communication (intra-server farm communication probably doesn't run into unreachable servers *that* often). "resolve" usually means to look up a name in a directory. But I'm not sure what that buys you if the server you're trying to contact is unreachable. Perhaps this is part of the failover system -- you contact a server by a logical id number (see --parents) which returns some IP address. The address is cached, but maybe at some point, the parent server might die of some hardware failure, in which case a new piece of hardware takes on the id number. When that happens, the child server assumes the parent had a hardware failure (since it's unreachable) and re-resolves the parent's logical id.
  • --scale_prvec4_to_prvec: this probably expands to "scale page rank vector 4 to page rank vector". I wonder if page rank is actually a vector (multiple numbers) as opposed to a scalar (single number) like everyone assumes (and like is displayed by the toolbar). It would make sense -- the page rank for a page could store other aspects of the page, like how likely it is to be spam, in addition to an idea of how linked-to the page is. The page rank you see in the google toolbar would be some scalar function of the page rank vector.
  • --sections_to_retrieve=body+url+compactanchors: it's well known that google indexes text found in the body of a page, the url, and the text of links which point at the page. That's probably what's stated here. "Anchor" in this case probably refers to the <a> tag rather than the part of a URL that comes after the #. "Body" probably refers to the body of the HTTP result rather than the <body> tag, since the <title> tag is important and isn't contained in the <body> tag. "compactanchors" insinuates that if your link text is too long, it won't get used.
  • --servlets=ascorer: Interesting that they say "servlets", which has a specific meaning in Java. Given the other output below, this may have been under development at the time. Update Jul-21 08:26 CST: Could have something to do with accessible search.
  • --supplemental_tier_section=body+url+compactanchors: It's rather strange that the supplemental_tier_section is the same as sections_to_retrieve. The fact that there are two switches means that you might want to set them differently in some cases. This would have been more illuminating if it had been different. Update: Supplemental apparently does have a meaning w.r.t. google searches, which might provide more insight.
  • --threaded_logging: Do logging in a different thread(?), which would help better saturate CPU/Disk load on the server.
  • --nouse_compressed_urls: This uses "compressed" and not "compact", so "compressed" probably doesn't just mean "short." This might be some kind of internal storage space thing -- whether or not URLs are compressed in the database to save space. This is off, which means that google's server farm is probably limited by CPU rather than by disk space at the moment.
  • --use_domain_match: ?
  • --nouse_experimental_indyrank: There's a lot of speculation about this online. Given that it's experimental and off, there's not much to go on. However, this might support the idea that pagerank is a vector, with components named *rank.
  • --use_experimental_spamscore: Another thing which might be included in the pagerank vector.
  • --use_gwd: ?
  • --use_query_classifier:
    • my guess that "query" means database query is incorrect
    • or there's some kind of classifier for database queries
    • or they've overloaded the word "query" to talk about two different things (shame shame)
    • But most likely, "query" actually means "search query" instead of "database query". In that case, you'd run the query as entered by a user through a classifier to put it into one of several categories ("run time" would get classified differently from "run mend" or "run away", and the search results that you got back would come from the correct category).
  • --use_spamscore: Another part of the pagerank vector?
  • --using_borg: Believed to be this borg. I think it might be an acronym that somebody probably thought was funny when they added it at 4:43 in the morning.
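The arithmetic in the list above is easy to reproduce. This is just a sketch that redoes my guesses: the counter values are copied from the fragment, but the interpretations (the sum/count pairing, "ms" meaning milliseconds, 4 x 10^9 bytes of memory, and treating avg-latency-ms_total as if it were uptime) are the same assumptions made in the text, not known facts.

```python
# Values copied from the reported fragment. The interpretations below
# (sum/count pairs, milliseconds, 4e9 bytes of RAM) are guesses.
stats = {
    "pacemaker-alarm-delay-in-ms-overall-sum": 2341989,
    "pacemaker-alarm-delay-in-ms-total-count": 7776761,
    "timedout-queries_total": 14227,
    "num-docinfo_total": 10680907,
    "num-docinfo-disk_total": 2200918,
    "avg-latency-ms_total": 3545152552,
    "queries_total": 1229799558,
}

# Average alarm delay, if the two pacemaker counters are a sum and a count.
avg_alarm_delay_ms = (
    stats["pacemaker-alarm-delay-in-ms-overall-sum"]
    / stats["pacemaker-alarm-delay-in-ms-total-count"]
)

# Share of queries that timed out, as a percentage.
timeout_pct = 100 * stats["timedout-queries_total"] / stats["queries_total"]

# docinfo records still in memory, and bytes each could occupy
# on a hypothetical machine with 4e9 bytes of RAM.
docinfo_in_memory = stats["num-docinfo_total"] - stats["num-docinfo-disk_total"]
bytes_per_docinfo = 4e9 / docinfo_in_memory

# avg-latency-ms_total converted to hours and days.
latency_hours = stats["avg-latency-ms_total"] / 1000 / 3600
latency_days = latency_hours / 24

# Queries per second IF that latency total were the uptime (I doubt it is).
qps_if_uptime = stats["queries_total"] / (stats["avg-latency-ms_total"] / 1000)

print(f"average alarm delay: {avg_alarm_delay_ms:.3f} ms")
print(f"timed out queries:   {timeout_pct:.4f} %")
print(f"docinfo in memory:   {docinfo_in_memory}")
print(f"bytes per docinfo:   {bytes_per_docinfo:.1f}")
print(f"latency total:       {latency_hours:.1f} h ({latency_days:.1f} days)")
print(f"queries per second:  {qps_if_uptime:.1f}")
```

Changing the assumed memory size or swapping in a different guess for the uptime is a one-line edit, which makes it easy to see how sensitive the "impossibly high" queries-per-second figure is to those guesses.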

More Error Messages

./alloc/cachedir.task83/gfs-shadow~ro-mgt~home~mustang~www~4bbase~segments~2006-05-22-02-02~prod6-repos~mustang~L1~mustang-00096-of-00199.attachments.bodysource_split_1.0.0.2470573324.1148547087P X p ꇚ x 6 <;B /gfs-shadow/ro-mgt/home/mustang/www/4bbase/segments/2006-05-22-02-02/prod6-repos/mustang/L1/mustang-00096-of-00199.attachments.contentinfoJ
./alloc/cachedir.task83/gfs-shadow~ro-mgt~home~mustang~www~4bbase~segments~2006-05-22-02-02~prod6-repos~mustang~L1~mustang-00096-of-00199.attachments.contentinfo.0.0.13439660.1148545967P X p x 6 <;B /gfs-shadow/ro-mgt/home/mustang/www/4bbase/segments/2006-05-22-02-02/prod6-repos/mustang/L1/mustang-00096-of-00199.attachments.gwdcapsinfoJ
./alloc/cachedir.task83/gfs-shadow~ro-mgt~home~mustang~www~4bbase~segments~2006-05-22-02-02~prod6-repos~mustang~L1~mustang-00096-of-00199.attachments.gwdcapsinfo.0.0.556.1148544290P X p x 6 <;B /gfs-shadow/ro-mgt/home/mustang/www/4bbase/segments/2006-05-22-02-02/prod6-repos/mustang/L1/mustang-00096-of-00199.attachments.labelsJ
./alloc/cachedir.task83/gfs-shadow~ro-mgt~home~mustang~www~4bbase~segments~2006-05-22-02-02~prod6-repos~mustang~L1~mustang-00096-of-00199.attachments.labels.0.0.5294968.1148544245P X p x 6 <;B /gfs-shadow/ro-mgt/home/mustang/www/4bbase/segments/2006-05-22-02-02/prod6-repos/mustang/L1/mustang-00096-of-00199.attachments.legacyperdocdataJ
./alloc/cachedir.task83/gfs-shadow~ro-mgt~home~mustang~www~4bbase~segments~2006-05-22-02-02~prod6-repos~mustang~L1~mustang-00096-of-00199.attachments.legacyperdocdata.0.0.20135960.1148544293P X p x 6 <;B /gfs-shadow/ro-mgt/home/mustang/www/4bbase/segments/2006-05-22-02-02/prod6-repos/mustang/L1/mustang-00096-of-00199.attachments.metacapsinfoJ
./alloc/cachedir.task83/gfs-shadow~ro-mgt~home~mustang~www~4bbase~segments~2006-05-22-02-02~prod6-repos~mustang~L1~mustang-00096-of-00199.attachments.metacapsinfo.0.0.219016.1148545618P X p x 6 <;B /gfs-shadow/ro-mgt/home/mustang/www/4bbase/segments/2006-05-22-02-02/prod6-repos/mustang/L1/mustang-00096-of-00199.attachments.navboostJ
./alloc/cachedir.task83/gfs-shadow~ro-mgt~home~mustang~www~4bbase~segments~2006-05-22-02-02~prod6-repos~mustang~L1~mustang-00096-of-00199.attachments.navboost.0.0.33884.1148544360P X p܈ x ܈ 6 <;B /gfs-shadow/ro-mgt/home/mustang/www/4bbase/segments/2006-05-22-02-02/prod6-repos/mustang/L1/mustang-00096-of-00199.attachments.sitemapJ
./alloc/cachedir.task83/gfs-shadow~ro-mgt~home~mustang~www~4bbase~segments~2006-05-22-02-02~prod6-repos~mustang~L1~mustang-00096-of-00199.attachments.sitemap.0.0.36.1148544339P X p$x $ 6 <;By/gfs-shadow/ro-mgt/home/mustang/www/4bbase/segments/2006-05-22-02-02/prod6-repos/mustang/L1/mustang-00096-of-00199.docidsJ
./alloc/cachedir.task83/gfs-shadow~ro-mgt~home~mustang~www~4bbase~segments~2006-05-22-02-02~prod6-repos~mustang~L1~mustang-00096-of-00199.docids.0.0.15383092.1148545622P X p x 6 uptodate: yes dserve:mustang.ascorer@borg:1225422070:4bbase/segments/2006-05-22-02-02/prod6-repos/mustang/L1:96:199 5,0:1023,set:pre:/gfs-shadow/ro-mgt/home/mustang/www/4bbase/segments/2006-05-22-02-02/prod6-repos/mustang/L1/mustang-00096-of-00199.,docids:docids,attachments.anchorinfo:attachments.anchorinfo,attachments.basicinfo:attachments.basicinfo,
attachments.bodycapsinfo:attachments.bodycapsinfo,attachments.contentinfo:attachments.contentinfo,attachments.gwdcapsinfo:attachments.gwdcapsinfo,attachments.labels:attachments.labels,attachments.legacyperdocdata:attachments.legacyperdocdata,
attachments.metacapsinfo:attachments.metacapsinfo,attachments.navboost:attachments.navboost,attachments.sitemap:attachments.sitemap,attachments.bodysource_split_0:attachments.bodysource_split_0,attachments.bodysource_split_1:attachments.bodysource_split_1,
tokenspace.body+url+compactanchors:tokenspace.body+url+compactanchors,tokenspace.gwd:tokenspace.gwd,tokenspace.meta:tokenspace.meta,tokenspace.navboost:tokenspace.navboost,tokenspace.restricts:tokenspace.restricts fraccpu:mustang.ascorer@borg 1148807088:0.3065 <;B /gfs-shadow/ro-mgt/home/mustang/www/4bbase/segments/2006-05-22-02-02/prod6-repos/mustang/L1/mustang-00096-of-00199.tokenspace.body+url+compactanchorsJ
./alloc/cachedir.task83/gfs-shadow~ro-mgt~home~mustang~www~4bbase~segments~2006-05-22-02-02~prod6-repos~mustang~L1~mustang-00096-of-00199.tokenspace.body%2burl%2bcompactanchors.0.0.3095455312.1148547228P X p ̃x ̃7 <;B /gfs-shadow/ro-mgt/home/mustang/www/4bbase/segments/2006-05-22-02-02/prod6-repos/mustang/L1/mustang-00096-of-00199.tokenspace.gwdJ
./alloc/cachedir.task83/gfs-shadow~ro-mgt~home~mustang~www~4bbase~segments~2006-05-22-02-02~prod6-repos~mustang~L1~mustang-00096-of-00199.tokenspace.gwd.0.0.3095904.1148544361P X p x 6 <;B /gfs-shadow/ro-mgt/home/mustang/www/4bbase/segments/2006-05-22-02-02/prod6-repos/mustang/L1/mustang-00096-of-00199.tokenspace.metaJ
./alloc/cachedir.task83/gfs-shadow~ro-mgt~ho...

More Thoughts

This actually seems more revealing. I wonder if user "mustang" is in trouble, even if he or she was working hard at 2:02 in the morning. I sure hope not.

gfs probably stands for google file system, given that it's part of what looks like a file path. ro in the path would mean "read only", which would imply that you work on your files (/home/mustang/www/4bbase/segments/....) in your normal fashion in your home directory, and at some point, the google file system makes those files available (read only) across the entire server farm (including the one that produced this revealing output).
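The cache file names in the dump look like the gfs paths with "/" swapped for "~" and "+" percent-encoded as "%2b", followed by four numeric fields. Here is a rough sketch of undoing that encoding. The function name `decode_cache_name` is made up, and reading the last field as a Unix timestamp and the NNNNN-of-NNNNN part as a shard number are my guesses; I don't know what the other numeric fields mean, so the code just carries them along.

```python
import re
from datetime import datetime, timezone
from urllib.parse import unquote

# One cache file name copied from the dump above (joined across lines here).
CACHE_NAME = (
    "gfs-shadow~ro-mgt~home~mustang~www~4bbase~segments~"
    "2006-05-22-02-02~prod6-repos~mustang~L1~"
    "mustang-00096-of-00199.attachments.contentinfo"
    ".0.0.13439660.1148545967"
)


def decode_cache_name(name):
    """Guess at the structure of a cachedir file name (assumptions only)."""
    # Split off the four trailing dot-separated numeric fields.
    match = re.fullmatch(r"(.+)\.(\d+)\.(\d+)\.(\d+)\.(\d+)", name)
    if match is None:
        raise ValueError("name does not end in four numeric fields")
    base = match.group(1)
    fields = [int(match.group(i)) for i in range(2, 6)]

    # "~" appears to stand in for "/", and "%2b" for "+".
    path = "/" + unquote(base.replace("~", "/"))

    # Guess: "NNNNN-of-NNNNN" is a shard index and a shard count.
    shard = re.search(r"-(\d+)-of-(\d+)\.", path)
    # Guess: the segment directory name is a date and time.
    segment = re.search(r"/segments/(\d{4}-\d{2}-\d{2}-\d{2}-\d{2})/", path)

    return {
        "path": path,
        "shard": (int(shard.group(1)), int(shard.group(2))) if shard else None,
        "segment_time": (
            datetime.strptime(segment.group(1), "%Y-%m-%d-%H-%M")
            if segment else None
        ),
        "numeric_fields": fields,
        # Guess: the last field is seconds since the Unix epoch.
        "last_field_as_utc": datetime.fromtimestamp(fields[3], tz=timezone.utc),
    }


info = decode_cache_name(CACHE_NAME)
print(info["path"])
print(info["shard"], info["segment_time"], info["last_field_as_utc"])
```

If the timestamp guess is right, the last field lands a few days after the 2006-05-22 date in the segment directory name, which would fit a cache that was filled some time after the segment was built.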

Suggested Links

- pacemaker is related to high availability .. one server continuously pinging another to see if it's alive
- mustang is java 1.6
