I have a very simple table: Field A, Field B
No more data will be written to the table, only lookups are needed. The Table will consist of at least 500 million records but could possibly hit around 1 billion.
What would be the most efficient method in MySQL for performing these lookups? Multiple tables? One large table? Indexing strategies?
Field A would be unique.
This...
You will be looking up by A, not B, correct?
InnoDB, with PRIMARY KEY(A).
No advantage for splitting the table.
Please do
SHOW CREATE TABLE
so I can check for other issues.
CREATE TABLE `EMPLOYEE`.`IDENT` (
`SSN` varchar(11) NOT NULL,
`USERID` varchar(5) NOT NULL,
PRIMARY KEY (`SSN`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The USERID was a company identifier. It's not necessarily unique because USERIDs were recycled (e.g. a new employee might have used an ex-employee's USERID).
At 1B rows, plan on about 100GB of disk space needed. Because of the way BTrees work in InnoDB, and assuming innodb_buffer_pool_size is big enough, plan on 1 disk hit per random fetch.
innodb_buffer_pool_size should be about 70% of available RAM.
The equivalent MyISAM table would take about 35GB; about evenly split between data and index. Even so, it might be slower because it would ...
You're right, not all SSNs have been allocated. I actually only have around 10,000,000 but the integrity can't be vouched for. The reason it may go up is because I can generate SSNs and randomly assign UIDs for future projects. What I'm mainly interested in is using MySQL as a basic retrieval system for large data sets (not a full transactional or even relational system).
The company or ...
Look at it this way. Let's say you have 1GB for innodb_buffer_pool_size (a practical limit on a 2GB machine) and the data is 100GB. Now, let's say you want to fetch a _random_ SSN. I do mean random; you can't predict what value it will have. What are the odds of that one being in cache at the moment? 1GB/100GB = 1%. That is, fetch 100 _random_ SSNs, 99 will hit the disk, 1 will be in ram....
No names. Not even certain whether the SSNs are all valid. It doesn't matter, though. Let's just assume that it's a randomly generated anonymous SSN death index, which is probably closest to the truth at this point.
The memory analysis for a random fetch definitely made that clear. It just seems that no matter whether the SSN is random that if the SSN is sequentially ordered as a 9-...
All the 11122xxxx numbers were issued over the last century from Bangor Maine. 11123xxxx came from Portland, Maine over the same time period. (I am making up the specifics; but maybe you get the idea.)
If your searches are by geographic area, then they will nicely cluster. If they are by age (eg those born in 1899), they will be scattered "randomly".
If you need to look up by userid,...
No names. Not even certain whether the SSNs are all valid. It doesn't matter, though. Let's just assume that it's a randomly generated anonymous SSN death index, which is probably closest to the truth at this point. The memory analysis for a random fetch definitely made that clear. It just seems that no matter whether the SSN is random that if the SSN is sequentially ordered as a 9-digit number such that 111223333 <...
All the 11122xxxx numbers were issued over the last century from Bangor Maine. 11123xxxx came from Portland, Maine over the same time period. (I am making up the specifics; but maybe you get the idea.) If your searches are by geographic area, then they will nicely cluster. If they are by age (eg those born in 1899), they will be scattered "randomly". If you need to look up by userid, you need another index. That would add...
Thread profile page for "Very Large Non-Dynamic Table for Lookups" on http://www.mysql.com.
This report page is a snippet summary view from a single thread "Very Large Non-Dynamic Table for Lookups", located on the Message Board at http://www.mysql.com.
This thread profile page shows the thread statistics for: Total Authors, Total Thread Posts, and Thread Activity