How does HBase partition a table across region servers?
Please tell me how HBase partitions a table across region servers. For example, suppose my row keys are integers from 0 to 10M and I have 10 region servers.
Does this mean that the first region server will store all rows with keys 0-1M, the second 1M-2M, the third 2M-3M, ... the tenth 9M-10M?
I would like my row key to be a timestamp, but since most queries would apply to the latest dates, all queries would be processed by only one region server. Is this true?
Or maybe this data would be spread in a different way?
Or maybe I can somehow create more regions than I have region servers, so that (following the example above) server 1 would hold keys 0-0.5M and 3M-3.5M? That way my data would be spread more evenly. Is this possible?
Update
I have just found that there is an option called hbase.hregion.max.filesize. Do you think it will solve my problem?
With regard to partitioning, you can read Lars' post on HBase architecture, or Google's Bigtable paper, which HBase "clones".
If your row key is only a timestamp, then yes, the region with the biggest keys will always be hit with new requests (since a region is served by only one region server).
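As a rough illustration of why this happens, here is a minimal sketch of a sorted key range lookup. It assumes the ten evenly sized regions from the question; `REGION_START_KEYS` and `region_for` are hypothetical names, and real HBase row keys are byte arrays compared lexicographically rather than integers.

```python
from bisect import bisect_right

# Hypothetical start keys of ten regions covering integer row keys 0..10M.
# The first region starts at the lowest key and the last region has an
# open end, so it also receives every key above 10M.
REGION_START_KEYS = [i * 1_000_000 for i in range(10)]


def region_for(row_key: int) -> int:
    """Return the index of the region whose key range contains row_key."""
    return bisect_right(REGION_START_KEYS, row_key) - 1


# Older keys are spread over many regions ...
print(region_for(0))          # 0
print(region_for(2_500_000))  # 2
# ... but ever-growing keys all land in the last region.
print([region_for(k) for k in (9_999_998, 9_999_999, 10_000_000, 12_345_678)])
# [9, 9, 9, 9]
```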
Do you want to use timestamps in order to do short scans? If so, consider salting your keys (search Google for how Mozilla did it with Socorro).
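Salting means putting a small, deterministic prefix in front of the timestamp so that consecutive timestamps land in different parts of the key space. The sketch below is only one possible way to do it, not Mozilla's actual scheme: the bucket count, the CRC32-based salt, the key layout, and the names `salted_key` and `scan_ranges` are all assumptions made for illustration.

```python
import zlib

NUM_BUCKETS = 16  # hypothetical number of salt buckets


def salted_key(timestamp_ms: int, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix a timestamp key with a salt derived from the key itself."""
    # Zero-pad so that keys within one bucket still sort by time.
    ts = f"{timestamp_ms:013d}".encode("ascii")
    salt = zlib.crc32(ts) % num_buckets
    return f"{salt:02d}-".encode("ascii") + ts


def scan_ranges(start_ms: int, stop_ms: int, num_buckets: int = NUM_BUCKETS):
    """Return one (start key, stop key) pair per bucket for a time-range scan."""
    return [
        (f"{b:02d}-{start_ms:013d}".encode("ascii"),
         f"{b:02d}-{stop_ms:013d}".encode("ascii"))
        for b in range(num_buckets)
    ]
```

The price of salting is visible in `scan_ranges`: a short time-range scan turns into one scan per bucket, and the client has to merge the results.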
Can you prefix the timestamp with some ID? For example, if you only request data for specific users, then prefix the timestamp with that user ID and this will give you much better load distribution.
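A composite key of this kind might look like the sketch below. The `|` separator, the zero-padded millisecond timestamp, and the function names are assumptions for illustration; the separator must not occur inside user IDs.

```python
def user_row_key(user_id: str, timestamp_ms: int) -> bytes:
    """Build a composite row key: user ID, separator, zero-padded timestamp."""
    return f"{user_id}|{timestamp_ms:013d}".encode("utf-8")


def user_scan_range(user_id: str, start_ms: int, stop_ms: int):
    """Start (inclusive) and stop (exclusive) keys for one user's time range."""
    return user_row_key(user_id, start_ms), user_row_key(user_id, stop_ms)


start, stop = user_scan_range("alice", 1_000, 2_000)
# Rows of one user stay contiguous and time-ordered, so a short scan for
# that user still reads a single key range.
print(start <= user_row_key("alice", 1_500) < stop)  # True
print(start <= user_row_key("bob", 1_500) < stop)    # False
```

Writes from different users now go to different parts of the key space, while each user's rows remain scannable by time.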
If not, then use UUIDs or anything else that will randomly distribute your keys.
About hbase.hregion.max.filesize
Setting the max file size on that table (which you can do with the shell) does not ensure that each region is exactly X MB big (where X is the value you set). Suppose your row keys are all timestamps, which means that each new row key is bigger than the previous one. This means that every new row will always be inserted into the region with the empty end key (the last one). At some point, one of its files will grow bigger than the max file size (through compactions), and that region will be split around the middle. The lower keys will be in their own region and the higher keys in another one. But since your new row key is always bigger than the previous one, you will only ever write to that new region (and so on).
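The behaviour described above can be reproduced with a toy simulation. This is a simplified model, not HBase's actual split logic: region size is counted in rows instead of bytes, a split happens immediately when the limit is exceeded rather than after compactions, and `simulate` is a hypothetical helper.

```python
from bisect import bisect_right


def simulate(num_rows: int, max_rows_per_region: int):
    """Insert increasing keys 0..num_rows-1, splitting regions that grow too big.

    Returns (start_keys, sizes, write_targets), where write_targets[i] is the
    index of the region that received row i at the time of the write.
    """
    start_keys = [0]  # a single region at first, with an open end key
    regions = [[]]    # rows held by each region, in key order
    write_targets = []

    for key in range(num_rows):
        idx = bisect_right(start_keys, key) - 1
        regions[idx].append(key)
        write_targets.append(idx)

        if len(regions[idx]) > max_rows_per_region:
            rows = regions[idx]
            mid = len(rows) // 2
            # Split around the middle: the lower half keeps the old start
            # key, the upper half becomes a new region at the middle key.
            regions[idx:idx + 1] = [rows[:mid], rows[mid:]]
            start_keys.insert(idx + 1, rows[mid])

    return start_keys, [len(r) for r in regions], write_targets


start_keys, sizes, targets = simulate(num_rows=20, max_rows_per_region=4)
print(sizes)    # [2, 2, 2, 2, 2, 2, 2, 2, 4]
print(targets)  # the target index only ever moves to the newest region
```

In this model every closed region ends up about half full and never receives another write, while the newest region takes all of the load.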
tl;dr Even if you have more than 1000 regions, with this schema the region with the biggest row keys will always get the writes, which means that the region server hosting it will become a bottleneck.