name-hash.c: fix endless loop with core.ignorecase=true

With core.ignorecase=true, name-hash.c builds a case insensitive index of
all tracked directories. Currently, the existing cache entry structures are
added multiple times to the same hashtable (with different name lengths and
hash codes). However, there's only one dir_next pointer, which gets
completely messed up in case of hash collisions. In the worst case, this
causes an endless loop if ce == ce->dir_next (see t7062).

Use a separate hashtable and separate structures for the directory index
so that each directory entry has its own next pointer. Use reference
counting to track which directory entry contains files.

There are only slight changes to the name-hash.c API:
- new free_name_hash() used by read_cache.c::discard_index()
- remove_name_hash() takes an additional index_state parameter
- index_name_exists() for a directory (trailing '/') may return a cache
  entry that has been removed (CE_UNHASHED). This is not a problem as the
  return value is only used to check if the directory exists (dir.c) or to
  normalize casing of directory names (read-cache.c).

Getting rid of cache_entry.dir_next reduces memory consumption, especially
with core.ignorecase=false (which doesn't use that member at all).

With core.ignorecase=true, building the directory index is slightly faster
as we add / check the parent directory first (instead of going through all
directory levels for each file in the index). E.g. with WebKit (~200k
files, ~7k dirs), time spent in lazy_init_name_hash is reduced from 176ms
to 130ms.

Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
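
The approach described in the message (a separate table whose entries carry their own chain pointer, plus a reference count per directory) can be illustrated with a minimal standalone sketch. This is not the code from name-hash.c: the names dir_table, dir_entry, dir_add_file(), dir_remove_file() and dir_in_use(), the fixed bucket count, and the FNV-style hash are all assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define DIR_BUCKETS 64 /* hypothetical fixed size; a real table would grow */

/* One entry per directory, so each has a chain pointer of its own. */
struct dir_entry {
	struct dir_entry *next;   /* bucket chain, used only by this table */
	struct dir_entry *parent; /* NULL for a top-level directory */
	int nr;                   /* files and in-use subdirectories below */
	size_t namelen;
	char name[];              /* directory name including trailing '/' */
};

struct dir_table {
	struct dir_entry *bucket[DIR_BUCKETS];
};

/* Case-insensitive FNV-1a style hash (illustrative choice). */
static unsigned int hash_icase(const char *s, size_t len)
{
	unsigned int h = 0x811c9dc5u;
	while (len--) {
		h ^= (unsigned char)toupper((unsigned char)*s++);
		h *= 0x01000193u;
	}
	return h;
}

static int same_icase(const char *a, const char *b, size_t len)
{
	while (len--)
		if (toupper((unsigned char)*a++) != toupper((unsigned char)*b++))
			return 0;
	return 1;
}

/* Length of the parent directory of path[0..len), trailing '/' included. */
static size_t parent_len(const char *path, size_t len)
{
	while (len > 0 && path[len - 1] != '/')
		len--;
	return len;
}

static struct dir_entry *dir_find(struct dir_table *t, const char *name, size_t len)
{
	struct dir_entry *d = t->bucket[hash_icase(name, len) % DIR_BUCKETS];
	for (; d; d = d->next)
		if (d->namelen == len && same_icase(d->name, name, len))
			return d;
	return NULL;
}

/* Find or create the entry for the parent directory of path[0..len). */
static struct dir_entry *dir_get(struct dir_table *t, const char *path, size_t len)
{
	struct dir_entry *d;
	unsigned int b;

	len = parent_len(path, len);
	if (!len)
		return NULL;
	d = dir_find(t, path, len);
	if (!d) {
		d = calloc(1, sizeof(*d) + len + 1);
		if (!d)
			abort();
		memcpy(d->name, path, len);
		d->namelen = len;
		b = hash_icase(path, len) % DIR_BUCKETS;
		d->next = t->bucket[b];
		t->bucket[b] = d;
		/* len - 1 drops the trailing '/', so the next level up is found */
		d->parent = dir_get(t, path, len - 1);
	}
	return d;
}

/* Take a reference; a directory that becomes non-empty references its parent. */
static void dir_add_file(struct dir_table *t, const char *path)
{
	struct dir_entry *d = dir_get(t, path, strlen(path));
	while (d && !(d->nr++))
		d = d->parent;
}

/* Drop a reference; entries stay in the table even when the count hits zero. */
static void dir_remove_file(struct dir_table *t, const char *path)
{
	size_t len = parent_len(path, strlen(path));
	struct dir_entry *d = len ? dir_find(t, path, len) : NULL;
	while (d && d->nr && !(--d->nr))
		d = d->parent;
}

static int dir_in_use(struct dir_table *t, const char *dir)
{
	struct dir_entry *d = dir_find(t, dir, strlen(dir));
	return d && d->nr > 0;
}
```

In this sketch, adding "src/lib/a.c" creates entries for "src/lib/" and "src/" on first use and only touches the parent when a directory goes from empty to non-empty, which mirrors the "add / check the parent directory first" idea from the message. Because every directory has exactly one entry and one next pointer, a hash collision can only lengthen a chain; it cannot make an entry point at itself.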
author: Karsten Blees <karsten.blees@gmail.com> 2013-02-27 23:57:48 (GMT)
committer: Junio C Hamano <gitster@pobox.com> 2013-02-28 07:29:04 (GMT)
commit: 2092678cd5265e40dfb2b8e3adc8f2075faeece4 (patch)
tree: 3e697a301abb8500a1f88ea4f4ee278dfd13b000 /cache.h
parent: 15999998fbda60552742275570947431b57108ae (diff)
1 files changed, 3 insertions, 14 deletions
diff --git a/cache.h b/cache.h
index a58df84..2d938ea 100644
--- a/cache.h
+++ b/cache.h
@@ -131,7 +131,6 @@ struct cache_entry {
 	unsigned int ce_namelen;
 	unsigned char sha1[20];
 	struct cache_entry *next;
-	struct cache_entry *dir_next;
 	char name[FLEX_ARRAY]; /* more */
 };
 
@@ -267,25 +266,15 @@ struct index_state {
 	unsigned name_hash_initialized : 1,
 		 initialized : 1;
 	struct hash_table name_hash;
+	struct hash_table dir_hash;
 };
 
 extern struct index_state the_index;
 
 /* Name hashing */
 extern void add_name_hash(struct index_state *istate, struct cache_entry *ce);
-/*
- * We don't actually *remove* it, we can just mark it invalid so that
- * we won't find it in lookups.
- *
- * Not only would we have to search the lists (simple enough), but
- * we'd also have to rehash other hash buckets in case this makes the
- * hash bucket empty (common). So it's much better to just mark
- * it.
- */
-static inline void remove_name_hash(struct cache_entry *ce)
-{
-	ce->ce_flags |= CE_UNHASHED;
-}
+extern void remove_name_hash(struct index_state *istate, struct cache_entry *ce);
+extern void free_name_hash(struct index_state *istate);
 
 
 #ifndef NO_THE_INDEX_COMPATIBILITY_MACROS